NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

Pair DataGenerator results in array of size (1,) #751

Closed mikelam14 closed 5 years ago

mikelam14 commented 5 years ago

Describe the Question

I have prepared my own dataframes for left, right, and relation. The left and right dataframes have columns text_{left/right} and length_{left/right}, and the index is id_{left/right}. All text has already been converted to indices and truncated/padded to length 200. When I use the DataGenerator with dev_generator = mz.DataGenerator(dev_processed, mode='pair', num_dup=1, num_neg=1, batch_size=20), I get the following warning and error:

Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative.

See the documentation here: https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
    left = self._left.loc[relation['id_left'].unique()]
data_shape (40, 200)
data_shape (40, 200)
data_shape (40, 1)
data_shape (40, 200)
data_shape (40, 200)
data_shape (40, 1)
data_shape (40, 1)
Traceback (most recent call last):
    File "", line 3, in
    File "/home/.conda/envs/matchzoo/lib/python3.6/site-packages/matchzoo/engine/base_model.py", line 276, in fit_generator
        verbose=verbose, **kwargs
    File "/home/.conda/envs/matchzoo/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
        return func(*args, **kwargs)
    File "/home/.conda/envs/matchzoo/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
        initial_epoch=initial_epoch)
    File "/home/.conda/envs/matchzoo/lib/python3.6/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
        class_weight=class_weight)
    File "/home/.conda/envs/matchzoo/lib/python3.6/site-packages/keras/engine/training.py", line 1211, in train_on_batch
        class_weight=class_weight)
    File "/home/.conda/envs/matchzoo/lib/python3.6/site-packages/keras/engine/training.py", line 751, in _standardize_user_data
        exception_prefix='input')
    File "/home/.conda/envs/matchzoo/lib/python3.6/site-packages/keras/engine/training_utils.py", line 138, in standardize_input_data
        str(data_shape))
ValueError: Error when checking input: expected text_left to have shape (200,) but got array with shape (1,)

I printed the shapes and the contents and found a (nan, nan) pair in batch_data_pack, which comes from batch_data_pack = self._data_pack[indices]. Could this be caused by reset_index? Thank you.
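One way such NaN rows can arise is sketched below, using pandas only. The toy tables and the stray id 7 are hypothetical; only the column and index names come from the issue. The FutureWarning quoted above is emitted when a label passed to .loc is missing from the index, and older pandas filled those rows with NaN. The sketch uses reindex(), which the warning itself recommends, to reproduce that NaN filling on any pandas version.

```python
import pandas as pd

# Hypothetical toy data: "left" holds two padded sequences, while "relation"
# refers to an id (7) that has no row in "left".
left = pd.DataFrame(
    {"text_left": [[1, 2, 3], [4, 5, 6]]},
    index=pd.Index([0, 1], name="id_left"),
)
relation = pd.DataFrame({"id_left": [0, 1, 7], "id_right": [0, 0, 1]})

# Older pandas returned NaN rows (plus the FutureWarning) for
# left.loc[[0, 1, 7]]; newer versions raise KeyError instead.
# reindex() fills the missing label with NaN regardless of version.
looked_up = left.reindex(relation["id_left"].unique())

# Labels whose text came back as NaN are the ids missing from "left".
is_missing = looked_up["text_left"].isna().to_numpy()
missing = looked_up.index[is_missing].tolist()
print(missing)
```

If the real data behaves the same way, a NaN standing in for a length-200 sequence would plausibly explain why the batch cannot be stacked into the shape Keras expects, although that link is an assumption and not something the traceback confirms.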

bwanglzu commented 5 years ago

@mikelam14 can you impute or remove the missing values and try again?

mikelam14 commented 5 years ago

The three dataframes (left, right, relation) have no missing values. The indices of left and right and the ids in relation are all integers. I combine the data with mz.DataPack(relation=rel, left=req, right=adv). However, pack.frame() then yields some NaNs, which certainly affects unpack, and I suspect it also affects the reset_index in data_generator. If I change the indices to strings, frame() does not work at all and reports that vectors with NaN values cannot be used as an index.
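Since none of the three tables contains NaN on its own, one thing worth checking is whether every id in relation actually has a matching row in left and right. The sketch below is a pandas-only diagnostic under that assumption; find_unmatched_ids is a hypothetical helper, not part of MatchZoo, and the toy ids are made up.

```python
import pandas as pd


def find_unmatched_ids(relation, left, right):
    """Return the relation ids that have no row in left / right.

    Hypothetical helper: it only assumes the id_left / id_right column
    names used in the issue and that left / right are indexed by those ids.
    """
    bad_left = relation.loc[~relation["id_left"].isin(left.index), "id_left"]
    bad_right = relation.loc[~relation["id_right"].isin(right.index), "id_right"]
    return sorted(bad_left.unique()), sorted(bad_right.unique())


# Toy example: id_left 12 does not exist in "left".
left = pd.DataFrame({"text_left": ["a", "b"]}, index=pd.Index([10, 11], name="id_left"))
right = pd.DataFrame({"text_right": ["x", "y"]}, index=pd.Index([20, 21], name="id_right"))
relation = pd.DataFrame({"id_left": [10, 11, 12], "id_right": [20, 21, 21]})

unmatched_left, unmatched_right = find_unmatched_ids(relation, left, right)
print(unmatched_left, unmatched_right)

# Comparing dtypes is also worthwhile: integer ids on one side and string
# ids on the other never match, even when they look identical when printed.
print(left.index.dtype, relation["id_left"].dtype)
```

Two empty lists would rule out mismatched ids as the source of the NaNs; anything else points at the rows to fix in relation or in the left/right tables.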

uduse commented 5 years ago

Could you provide a minimal working example that reproduces the bug so I can investigate further? Right now I suspect that you have mismatched indices somewhere, but I can't be sure. Try calling pack on your data without feeding indices (id_left, id_right) so that pack generates the correct ones for you.
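The idea behind letting the ids be generated can be sketched in plain pandas: when the ids are derived from a single flat table, relation cannot refer to an id that is absent from left or right. This is an illustration only and not MatchZoo's implementation of pack; the flat table and its contents are hypothetical, and raw strings are used because the already-indexed lists from the issue are not hashable.

```python
import pandas as pd

# Hypothetical flat table: one row per (left text, right text, label) pair.
flat = pd.DataFrame(
    {
        "text_left": ["q one", "q one", "q two"],
        "text_right": ["doc a", "doc b", "doc a"],
        "label": [1, 0, 1],
    }
)

# factorize assigns one integer code per distinct text, so the ids in the
# relation table and the indices of left / right come from the same source.
flat["id_left"], left_texts = pd.factorize(flat["text_left"])
flat["id_right"], right_texts = pd.factorize(flat["text_right"])

left = pd.DataFrame(
    {"text_left": left_texts.to_numpy()},
    index=pd.RangeIndex(len(left_texts), name="id_left"),
)
right = pd.DataFrame(
    {"text_right": right_texts.to_numpy()},
    index=pd.RangeIndex(len(right_texts), name="id_right"),
)
relation = flat[["id_left", "id_right", "label"]]

# Every id in relation now has a matching row by construction.
all_matched = bool(
    relation["id_left"].isin(left.index).all()
    and relation["id_right"].isin(right.index).all()
)
print(all_matched)
```

Building the tables by hand, as in the original report, skips this guarantee, which is why a mismatch between relation and the left/right indices is a reasonable first suspect.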

uduse commented 5 years ago

I hope things are working well for you now. I’ll go ahead and close this issue, but I’m happy to continue further discussion whenever needed.