elliottd / GroundedTranslation

Multilingual image description
https://staff.fnwi.uva.nl/d.elliott/GroundedTranslation/
BSD 3-Clause "New" or "Revised" License

IndexError in predict mode #16

Closed: evanmiltenburg closed this issue 8 years ago

evanmiltenburg commented 8 years ago

After using commit c27b2c1ebfe1f9a1bf8985b66ebb03852cf59c38 to fix #15, I get the error below. Generation breaks down because there is no reference length in predict mode.

ubuntu@emiels_machine:~/GroundedTranslation$ python generate.py --model_checkpoints checkpoints/fixed_seed-eng256mlm --dataset /home/ubuntu/image_features --without_scores --mode predict
INFO:data_generator:Initialising data generator
INFO:data_generator:Train/val dataset: /home/ubuntu/image_features
INFO:data_generator:Input gold descriptions
INFO:__main__:Best checkpoint: checkpoints/fixed_seed-eng256mlm/050-11012016-193333
INFO:data_generator:Initialising vocabulary from pre-defined model
INFO:models:Building Keras model...
INFO:models:Using image features: True
INFO:models:Using source language features: False
INFO:models:... visual: adding image features as input features
INFO:models:... with weights defined in checkpoints/fixed_seed-eng256mlm/050-11012016-193333
INFO:__main__:Generating val descriptions
INFO:data_generator:Making generation data for val
Traceback (most recent call last):
  File "generate.py", line 485, in <module>
    w.generationModel()
  File "generate.py", line 87, in generationModel
    self.generate_sentences(self.args.checkpoint, val=not self.args.test)
  File "generate.py", line 115, in generate_sentences
    generation=self.args.use_predicted_tokens)
  File "generate.py", line 312, in make_generation_arrays
    self.use_sourcelang, self.use_image)
  File "/home/ubuntu/GroundedTranslation/data_generator.py", line 360, in get_generation_data_by_split
    arrays[0][d_idx, :, :] = self.format_sequence(d.split())
  File "/home/ubuntu/GroundedTranslation/data_generator.py", line 732, in format_sequence
    seq_array[0, self.word2index[BOS]] += 1  # BOS token at zero timestep
IndexError: index 0 is out of bounds for axis 0 with size 0
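
The last frame of the traceback is consistent with the one-hot sequence array having zero timesteps. Below is a minimal sketch of that failure mode, assuming the array is allocated with max_seq_length rows and one column per vocabulary entry (an assumption, not confirmed from data_generator.py). Plain lists stand in for the NumPy array, and format_sequence_sketch, the BOS string, and the toy vocabulary are all hypothetical.

```python
BOS = "<S>"  # placeholder token string; the real constant is defined in the repo


def format_sequence_sketch(tokens, word2index, max_seq_length):
    """One-hot encode tokens with BOS at timestep zero (illustrative only)."""
    vocab_size = len(word2index)
    # Assumed layout: one row per timestep, one column per vocabulary entry.
    seq_array = [[0] * vocab_size for _ in range(max_seq_length)]
    seq_array[0][word2index[BOS]] += 1  # raises IndexError if there are zero rows
    # Remaining tokens fill the timesteps after BOS, truncated to fit.
    for t, token in enumerate(tokens[: max_seq_length - 1], start=1):
        seq_array[t][word2index[token]] += 1
    return seq_array


word2index = {BOS: 0, "a": 1, "dog": 2}  # toy vocabulary

try:
    format_sequence_sketch(["a", "dog"], word2index, max_seq_length=0)
    failed = False
except IndexError:
    failed = True  # there is no row to write the BOS token into

encoded = format_sequence_sketch(["a", "dog"], word2index, max_seq_length=4)
```

With a zero length the BOS write fails in the same place as in the traceback. With any positive length the same call succeeds.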

evanmiltenburg commented 8 years ago

So format_sequence should probably be avoided altogether in prediction mode. get_generation_data_by_split looks like magic to me (I don't fully understand what it does). But I guess this is the required rewrite you were talking about in https://github.com/elliottd/GroundedTranslation/issues/15#issuecomment-196314081.

EDIT: I understand it a little better now!

evanmiltenburg commented 8 years ago

OK, I just added a hack to set self.data_gen.max_seq_length to 30 (an arbitrary value). That works!
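
A sketch of the same idea as a fallback rather than a hard-coded assignment, assuming max_seq_length ends up as 0 in predict mode because there are no reference descriptions to measure. resolve_max_seq_length and DEFAULT_MAX_SEQ_LENGTH are hypothetical names, not part of the repository. The value 30 is the arbitrary number from the hack above.

```python
DEFAULT_MAX_SEQ_LENGTH = 30  # arbitrary cap, taken from the hack described above


def resolve_max_seq_length(reference_descriptions, default=DEFAULT_MAX_SEQ_LENGTH):
    """Return the longest reference length in tokens, or a default if none exist."""
    lengths = [len(description.split()) for description in reference_descriptions]
    longest = max(lengths, default=0)  # 0 when there are no references at all
    return longest if longest > 0 else default


predict_mode_length = resolve_max_seq_length([])  # no gold descriptions available
train_mode_length = resolve_max_seq_length(["a dog runs", "two dogs play outside"])
```

This keeps the measured length whenever references exist and only falls back to the default when they are missing. Whether the real code also reserves extra timesteps for boundary tokens is not shown here.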

elliottd commented 8 years ago

This is also related to completely rethinking the data_generator. Your quick fix will work in the short term, but we still need a long-term rethink of how data_generator should work.