NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0
3.83k stars, 899 forks

Index Error #723

Closed by voladorlu 5 years ago

voladorlu commented 5 years ago

I'm trying to run a matching algorithm with a CNN as the neural encoder. However, the following error message shows up.

Colocations handled automatically by placer.

Layer (type)                     Output Shape         Param #   Connected to
text_left (InputLayer)           (None, 30)           0
text_right (InputLayer)          (None, 30)           0
embedding (Embedding)            (None, 30, 300)      90000     text_left[0][0]
                                                                text_right[0][0]
conv1d_1 (Conv1D)                (None, 30, 32)       28832     embedding[0][0]
conv1d_2 (Conv1D)                (None, 30, 32)       28832     embedding[1][0]
matching_layer_1 (MatchingLayer  (None, 30, 30, 32)   0         conv1d_1[0][0]
                                                                conv1d_2[0][0]
conv2d_1 (Conv2D)                (None, 30, 30, 16)   4624      matching_layer_1[0][0]
max_pooling2d_1 (MaxPooling2D)   (None, 15, 15, 16)   0         conv2d_1[0][0]
conv2d_2 (Conv2D)                (None, 15, 15, 32)   4640      max_pooling2d_1[0][0]
max_pooling2d_2 (MaxPooling2D)   (None, 7, 7, 32)     0         conv2d_2[0][0]
flatten_1 (Flatten)              (None, 1568)         0         max_pooling2d_2[0][0]
dropout_1 (Dropout)              (None, 1568)         0         flatten_1[0][0]
dense_1 (Dense)                  (None, 1)            1569      dropout_1[0][0]

Total params: 158,497
Trainable params: 158,497
Non-trainable params: 0


WARNING:tensorflow:From /Users/lyu02/anaconda2/envs/py3-env/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/10
Traceback (most recent call last):
  File "src/mz-arcii.py", line 109, in <module>
    model.fit(x, y, batch_size=1024, epochs=10)
  File "/Users/lyu02/anaconda2/envs/py3-env/lib/python3.7/site-packages/matchzoo/engine/base_model.py", line 250, in fit
    verbose=verbose, **kwargs)
  File "/Users/lyu02/anaconda2/envs/py3-env/lib/python3.7/site-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/Users/lyu02/anaconda2/envs/py3-env/lib/python3.7/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "/Users/lyu02/anaconda2/envs/py3-env/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/Users/lyu02/anaconda2/envs/py3-env/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/Users/lyu02/anaconda2/envs/py3-env/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/Users/lyu02/anaconda2/envs/py3-env/lib/python3.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[438,0] = 320 is not in [0, 300)
  [[{{node embedding_1/embedding_lookup}}]]

Input data format

I did not exactly follow the input format when organizing the data. In particular, each right_text is formatted as "Document_id". I'm not sure whether this is the cause. Here is a snapshot of the input file. Could you help me with this issue?

,id_left,text_left,id_right,text_right,label
0,Q46772,michael kors whitney large tote,D91392,tommy hilfiger tie neck blouse apparel accessory clothe top shirt,0.0
1,Q46772,michael kors whitney large tote,D9742,g guess destynn dress boot apparel accessory shoe,0.0
2,Q46772,michael kors whitney large tote,D21,mikasa dinnerware antique white cover casserole home garden kitchen din tableware serveware,0.0
3,Q46772,michael kors whitney large tote,D135194,muddyfox man tour 200 low waterproof cycle shoe apparel accessory shoe,0.0
4,Q46772,michael kors whitney large tote,D67611,adidas man original camo print fleece jogger apparel accessory clothe pant,0.0
5,Q46772,michael kors whitney large tote,D6686,mix chick leave in conditioner 10 oz from purebeauty salon spa health beauty personal care hair care,0.0
6,Q46772,michael kors whitney large tote,D119842,avanti iron work embroider bath towel home garden linen towel bath towel washcloth,0.0
7,Q46772,michael kors whitney large tote,D113699,tommy hilfiger man fit th flex collar performance stretch print dress shirt apparel accessory clothe top,0.0

bwanglzu commented 5 years ago

@voladorlu matchzoo's datapack structure is flexible, but in order to properly train your model, please follow the id_left, text_left, id_right, text_right, label format. You can refer to this tutorial.

I'm not quite sure this is the cause, but please format your input data and run the code first. If that does not solve it, just post another message here.
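
As a quick sanity check before training, the header of the input file can be compared against the column names listed above. The sketch below uses only Python's standard `csv` module; the helper name `missing_columns` and the shortened sample row are illustrative assumptions, not part of MatchZoo.

```python
import csv
import io

# Column names from the format described in this thread.
REQUIRED = ("id_left", "text_left", "id_right", "text_right", "label")


def missing_columns(handle):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(handle), [])  # an empty file yields an empty header
    return [name for name in REQUIRED if name not in header]


# Hypothetical in-memory sample; the text_right value is shortened for brevity.
sample = io.StringIO(
    ",id_left,text_left,id_right,text_right,label\n"
    "0,Q46772,michael kors whitney large tote,D21,mikasa dinnerware,0.0\n"
)
print(missing_columns(sample))  # an empty list means every required column is present
```

The unnamed leading column in the snapshot (a saved row index) does not affect this check, since only the presence of the required names is tested.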

voladorlu commented 5 years ago

Yes, I followed the example exactly to format the data, but I do not know why the index error still shows up. Maybe I'm not setting up the model correctly? The initialization is just copied from the tutorial.

bwanglzu commented 5 years ago

@voladorlu can you upload your notebook somewhere on github?

voladorlu commented 5 years ago

@bwanglzu I think I found the reason. It's because of the parameter "embedding_input_dim". It is not set automatically from the corpus; it has to be set manually. It's a little inconvenient. :-)
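
The error `indices[438,0] = 320 is not in [0, 300)` says that a token index (320) is larger than the number of rows in the embedding table (300), so "embedding_input_dim" has to be at least the largest token index plus one. The sketch below illustrates that relationship in plain Python on a hypothetical toy batch; the helper names are made up for illustration and are not MatchZoo functions. In MatchZoo itself the vocabulary size is typically taken from the fitted preprocessor, but the exact attribute should be checked against the installed version.

```python
def out_of_range(batch, input_dim):
    """List (row, column, index) for every token index outside [0, input_dim)."""
    return [
        (row, col, idx)
        for row, seq in enumerate(batch)
        for col, idx in enumerate(seq)
        if not 0 <= idx < input_dim
    ]


def required_input_dim(batch):
    """Smallest embedding row count that covers every token index in the batch."""
    return max(idx for seq in batch for idx in seq) + 1


# Hypothetical padded token-index sequences (assumption: 0 is the padding index).
text_left = [[12, 7, 0], [320, 5, 0]]

print(out_of_range(text_left, 300))    # index 320 does not fit a 300-row table
print(required_input_dim(text_left))   # value to use for embedding_input_dim here
```

With the dimension returned by `required_input_dim`, `out_of_range` reports no offending indices for the same batch.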

voladorlu commented 5 years ago

But this problem does not occur when running DSSM without setting this parameter. It always fails when I try to run CNN-based methods like ArcII and MatchPyramid, so I have to set it manually.

bwanglzu commented 5 years ago

@voladorlu yes, sometimes you need to set parameters manually. This is because different people might have different dimensionality of their embeddings, so it's non-trivial to set it to a fixed number. You can refer to this markdown to see the tunable parameters.

voladorlu commented 5 years ago

@bwanglzu That's really helpful. Thank you so much.