Closed thiziri closed 4 years ago
Have you tried mz.pack? That's the function you should use to create your own DataPack. This function tries to guarantee your DataPack's correctness. After creating a DataPack this way, you may then insert your own columns in the desired places. The problem you are having is most likely due to an index mismatch.
Could you please give an example, or a link to the corresponding usage description? Thanks
Thank you @uduse it's solved!
I was trying to do the same thing, in order to avoid copying the text several times, because my document collection is very big.
Consider you have 100 queries and, say, 100K documents. If you use this (replicated from here):
dp = mz.DataPack(relation=relation, left=left, right=right)
this should allow you to store the 100K documents only once. The relation df still contains 100*100K rows (in my case actually fewer, because it depends on the initial retrieval step), but storing each document once instead of 100 times saves a lot of RAM. However, this always gives me this error:
KeyError: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8,\n 9,\n ...\n 49990, 49991, 49992, 49993, 49994, 49995, 49996, 49997, 49998,\n 49999],\n dtype='int64', length=50000)] are in the [index]"
At some point, I also got a different error (I cannot replicate it now, but I think the difference was the naming of the right, left and relation columns), which looked something like:
KeyError: "None of [Int64Index([ 1-1-1, 1-1-1, 1-1-1, 1-1-1, 1-1-1, 1-1-1, \n , \n ...\n 199-3-2, 199-3-2, 199-3-2, 199-3-2, 199-3-2\n dtype='int64', length=50000)] are in the [index]"
The only difference between the first and second errors is that the second shows the query id (left_id) from the left dataset (1-1-1,...,199-3-2 instead of 0,1,2,3...).
Is there some way of properly constructing a datapack/dataset without using data_pack.pack, which actually requires copying the document text in a dataframe multiple times for each query?
Thank you in advance!
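For reference, the general shape of the KeyError quoted above can be reproduced with pandas alone. The sketch below rests on an assumption: that somewhere an id such as 0, 1, 2 is looked up by label in a frame whose index holds different values. Whether DataPack performs exactly this lookup internally is not confirmed in this thread, and the frame and the lookup helper are made up for illustration.

```python
import pandas as pd

# Hypothetical "left" table, indexed by string query ids.
left = pd.DataFrame(
    {"text_left": ["query 1", "query 2"]},
    index=pd.Index(["qid1", "qid2"], name="id_left"),
)


def lookup(frame, labels):
    """Label-based lookup that returns the error text instead of raising."""
    try:
        return frame.loc[labels]
    except KeyError as err:
        return f"KeyError: {err}"


# Integer labels do not exist in the string index, so this lookup fails.
failed = lookup(left, [0, 1])

# Labels that match the index values succeed.
found = lookup(left, ["qid1", "qid2"])

print(failed)
print(found)
```

If the ids used for the lookup and the index of the frame come from the same column, the second case applies and no error is raised.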
Okay, in fact I think I figured out how to fix this.
Apparently, the first example in the documentation here is a bit misleading:
>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2
The following modifications have to be done:
Columns in the left, right and relation dfs should be named id_right, id_left, text_right, text_left.
The left and right datasets should be indexed by id_left and id_right respectively.
This brings us to the following working code of the example above:
dp = mz.DataPack(
    relation=relation_df.rename({0: 'id_left', 1: 'id_right', 2: 'label'}, axis=1),
    left=left.rename({0: 'id_left', 1: 'text_left'}, axis=1).set_index('id_left'),
    right=right.rename({0: 'id_right', 1: 'text_right'}, axis=1).set_index('id_right'),
)
dp.frame().head()
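The naming and indexing rules above can also be checked before building the DataPack. The following is a minimal pandas-only sketch; check_frames is a hypothetical helper written for this post, not part of MatchZoo, and it only encodes the conventions described in this thread.

```python
import pandas as pd


def check_frames(relation, left, right):
    """Return a list of problems; an empty list means the frames look consistent."""
    problems = []
    for col in ("id_left", "id_right"):
        if col not in relation.columns:
            problems.append(f"relation is missing column {col!r}")
    if left.index.name != "id_left":
        problems.append("left is not indexed by 'id_left'")
    if right.index.name != "id_right":
        problems.append("right is not indexed by 'id_right'")
    if "text_left" not in left.columns:
        problems.append("left is missing column 'text_left'")
    if "text_right" not in right.columns:
        problems.append("right is missing column 'text_right'")
    if not problems:
        # Every id used in the relation must exist in the corresponding index.
        if not relation["id_left"].isin(left.index).all():
            problems.append("some id_left values are not in left.index")
        if not relation["id_right"].isin(right.index).all():
            problems.append("some id_right values are not in right.index")
    return problems


# Frames as built in the documentation example (integer column names).
raw_relation = pd.DataFrame([["qid1", "did1", 1], ["qid2", "did2", 1]])
raw_left = pd.DataFrame([["qid1", "query 1"], ["qid2", "query 2"]])
raw_right = pd.DataFrame([["did1", "document 1"], ["did2", "document 2"]])

# The same frames after renaming and indexing as described above.
relation_df = raw_relation.rename({0: "id_left", 1: "id_right", 2: "label"}, axis=1)
left = raw_left.rename({0: "id_left", 1: "text_left"}, axis=1).set_index("id_left")
right = raw_right.rename({0: "id_right", 1: "text_right"}, axis=1).set_index("id_right")

print(check_frames(raw_relation, raw_left, raw_right))  # non-empty list of problems
print(check_frames(relation_df, left, right))           # empty list
```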
Can I somehow update the example in the documentation?
I also encountered another problem, which I am writing down here for the reference of others:
One should reset the index of the relation dataframe, to make sure it ranges from 0 to N-1 over its N rows. Otherwise it creates a problem!
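A short pandas sketch of what resetting the index looks like. The relation frame here is toy data invented for the example; the filter stands in for any step (such as an initial retrieval stage) that drops rows and leaves gaps in the index.

```python
import pandas as pd

relation = pd.DataFrame({
    "id_left": ["qid1", "qid1", "qid2", "qid2"],
    "id_right": ["did1", "did2", "did1", "did2"],
    "label": [1, 0, 0, 1],
})

# Filtering keeps the original row labels, so the index now has gaps.
filtered = relation[relation["label"] == 1]
print(list(filtered.index))  # [0, 3]

# drop=True discards the old labels instead of keeping them as a new column.
clean = filtered.reset_index(drop=True)
print(list(clean.index))  # [0, 1]
```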
@littlewine As I pointed out earlier in the post, you should try to use mz.pack for packing your own data pack, because it handles many of the indexing problems for you. I agree that the documentation is a bit misleading. Maybe we should encourage users to use mz.pack instead. Also, the documentation is generated from the source code, so if you want to add something to the docs, change the doctest and open a PR.
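A minimal sketch of the mz.pack route, for readers who want an example. It assumes mz.pack accepts a single flat DataFrame with one row per query-document pair and the column names used in this thread; check the signature in your installed MatchZoo version. The data and the pack_if_available wrapper are made up for illustration, and the import is guarded so the snippet also runs where MatchZoo is not installed.

```python
import pandas as pd

# One row per (query, document) pair, using the column names from this thread.
pairs = pd.DataFrame({
    "id_left": ["qid1", "qid1", "qid2", "qid2"],
    "text_left": ["query 1", "query 1", "query 2", "query 2"],
    "id_right": ["did1", "did2", "did1", "did2"],
    "text_right": ["document 1", "document 2", "document 1", "document 2"],
    "label": [1, 0, 0, 1],
})


def pack_if_available(df):
    """Call mz.pack on df if MatchZoo can be imported, otherwise return None."""
    try:
        import matchzoo as mz
    except ImportError:
        return None
    # Assumption: mz.pack takes the flat pair DataFrame as its first argument.
    return mz.pack(df)


dp = pack_if_available(pairs)
print(dp if dp is not None else "matchzoo is not installed; only the input frame was built")
```

Note that this input format repeats each document text once per query, which is the memory cost raised earlier in the thread.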
I created the different dataframes related to the DataPack, as follows:
df: a data frame having these columns:
Here is the code:
When I run the DataPack function, I get the following error message:
Could you please give me some information to help me solve the problem? Thanks