NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0
3.85k stars 900 forks source link

Problem of columns overlap when running the dp.frame() function #819

Closed thiziri closed 4 years ago

thiziri commented 4 years ago

I created the different dataframes related to the DataPack, as follows:

'qid1', 'qid2', 'question1', 'question2', 'cos', 'simple_sigmoid', 'awe_sim', 'sent2vec_w2v_sim', 'avg_all', 'sigmodid', 'is_duplicate', 'is_related'

here is the code:

left = []
right = []
relation = []

for _,row in df.iterrows():
    if [row['qid1'], row['question1']] not in left:
        left.append([row['qid1'], row['question1']])
    if [row['qid2'], row['question2']] not in right:
        right.append([row['qid2'], row['question2']])
    relation.append([row['qid1'], row['qid2'], row['is_duplicate']])

# define the corresponding dataframes
relation_df = pd.DataFrame(relation)
left_df = pd.DataFrame(left)
right_df = pd.DataFrame(right)

# set columns
left_df.columns = ['id_left', 'text_left']
right_df.columns = ['id_right', 'text_right']
relation_df.columns = ['id_left', 'id_right', 'label']

# define the data_pack
dp = mz.DataPack(
     relation=relation_df,
     left=left_df,
     right=right_df
) 

When I run the DataPack function, I've got the following error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-134-c2abddd25025> in <module>
      1 # dp.drop_invalid()
----> 2 dp.frame().head()

C:\ProgramData\Anaconda3\lib\site-packages\matchzoo\data_pack\data_pack.py in __call__(self)
    500         def __call__(self):
    501             """:return: A full copy. Equivalant to `frame[:]`."""
--> 502             return self[:]
    503 
    504 

C:\ProgramData\Anaconda3\lib\site-packages\matchzoo\data_pack\data_pack.py in __getitem__(self, index)
    490             right_df = dp.right.loc[
    491                 dp.relation['id_right'][index]].reset_index()
--> 492             joined_table = left_df.join(right_df)
    493             for column in dp.relation.columns:
    494                 if column not in ['id_left', 'id_right']:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
   7244         # For SparseDataFrame's benefit
   7245         return self._join_compat(
-> 7246             other, on=on, how=how, lsuffix=lsuffix, rsuffix=rsuffix, sort=sort
   7247         )
   7248 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   7267                 right_index=True,
   7268                 suffixes=(lsuffix, rsuffix),
-> 7269                 sort=sort,
   7270             )
   7271         else:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     81         validate=validate,
     82     )
---> 83     return op.get_result()
     84 
     85 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in get_result(self)
    646 
    647         llabels, rlabels = _items_overlap_with_suffix(
--> 648             ldata.items, lsuf, rdata.items, rsuf
    649         )
    650 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in _items_overlap_with_suffix(left, lsuffix, right, rsuffix)
   2009         raise ValueError(
   2010             "columns overlap but no suffix specified: "
-> 2011             "{rename}".format(rename=to_rename)
   2012         )
   2013 

ValueError: columns overlap but no suffix specified: Index(['index'], dtype='object')

Could you please, give a piece of information to help me solve the problem? Thanks

uduse commented 4 years ago

Have you tried mz.pack? That's the function you should use to create your own DataPack. This function tries to guarantee your DataPack's correctness. After creating a DataPack this way, you may then insert your own columns to desired places. The problem you are having is most likely due to some index mismatching.

thiziri commented 4 years ago

Could you please give an example? or a link to the corresponding usage description? Thanks

uduse commented 4 years ago

https://matchzoo.readthedocs.io/en/master/matchzoo.data_pack.html#module-matchzoo.data_pack.pack

thiziri commented 4 years ago

Thank you @uduse it's solved!

littlewine commented 4 years ago

I was trying to do the same thing, the reason being trying to avoid having to copy the text several times because my document collection is very big.

Consider you have 100 queries and say 100K documents. If you use this (replicated from here): dp = mz.DataPack(relation=relation, left = left, right = right ) this should allow you to save the 100K documents only once. The relation df still contains 100*100K rows (in my case actually less, because it depends on the initial retrieval step), however saving each document only once instead of 100 times allows you to save a lot of RAM. However, this always gives me this error: KeyError: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8,\n 9,\n ...\n 49990, 49991, 49992, 49993, 49994, 49995, 49996, 49997, 49998,\n 49999],\n dtype='int64', length=50000)] are in the [index]"

At some point, I also got a different error (I cannot replicate it now but I think the difference was the naming of the right, left and relation columns) which looked something like: KeyError: "None of [Int64Index([ 1-1-1, 1-1-1, 1-1-1, 1-1-1, 1-1-1, 1-1-1, \n , \n ...\n 199-3-2, 199-3-2, 199-3-2, 199-3-2, 199-3-2\n dtype='int64', length=50000)] are in the [index]" The only difference between the first and second error being, that the second shows the query id (left_id) from the left dataset (1-1-1,...,199-3-2 instead of 0,1,2,3...)

Is there some way of properly constructing a datapack/dataset without using data_pack.pack, which actually requires copying the document text in a dataframe multiple times for each query?

Thank you in advance!

littlewine commented 4 years ago

Okay, in fact I think I figured out how to fix this.

Apparently, the documentation is a bit misleading from the first example here:

>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2

The following modifications have to be done:

  1. Columns in left, right and relation dfs should be named id_right, id_left, text_right, text_left

  2. The left and right datasets should be indexed by id_left and id_right respectively.

This brings us to the following working code of the example above:

dp = mz.DataPack(
     relation=relation_df.rename({0:'id_left',
              1 : 'id_right',
              2 : 'label',
              },axis=1),

     left=left.rename({0:'id_left',
              1 : 'text_left'
              },axis=1).set_index('id_left'),

     right=right.rename({0:'id_right',
              1 : 'text_right'
              },axis=1).set_index('id_right'),
)

dp.frame().head()

Can I somehow update the documentation in the example?

littlewine commented 4 years ago

I also encountered another problem, I am writing it here for the reference of others: One should reset the relation dataframe index, to make sure it ranges from 0...N-1 rows. Otherwise it creates a problem!

uduse commented 4 years ago

@littlewine As I pointed out earlier in the post, you should try to use mz.pack for packing your own data pack because it handles many of the indexing problems for you. I agree that the documentation is a bit misleading. Maybe we should encourage users to use mz.pack instead. Also, the documentation is generated from the source code, so if you want to add some stuff to the docs, change the doctest and do a PR.