Unbabel / OpenKiwi

Open-Source Machine Translation Quality Estimation in PyTorch
https://unbabel.github.io/OpenKiwi/
GNU Affero General Public License v3.0

IndexError: tuple index out of range while using sentence-level Predictor-Estimator to train #29

Closed BUAAers closed 5 years ago

BUAAers commented 5 years ago

After successfully training the sentence-level Predictor and then using it to train the Estimator, I get an error, IndexError: tuple index out of range. The details are as follows:

/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/torch/nn/modules/loss.py:443: UserWarning: Using a target size (torch.Size([1])) that is different to the input size (torch.Size([])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
Traceback (most recent call last):
  File "/home/zwc/python-virtual-environments/env/bin/kiwi", line 10, in <module>
    sys.exit(main())
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/__main__.py", line 22, in main
    return kiwi.cli.main.cli()
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/cli/main.py", line 71, in cli
    train.main(extra_args)
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/cli/pipelines/train.py", line 141, in main
    train.train_from_options(options)
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/lib/train.py", line 123, in train_from_options
    trainer = run(ModelClass, output_dir, pipeline_options, model_options)
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/lib/train.py", line 204, in run
    trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 78, in run
    self.checkpointer(self, valid_iterator, epoch=epoch)
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/trainers/callbacks.py", line 105, in __call__
    eval_stats_summary = trainer.eval_epoch(valid_iterator)
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 151, in eval_epoch
    self.stats.update(batch=batch, **outputs)
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/metrics/stats.py", line 137, in update
    metric.update(**kwargs)
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/metrics/metrics.py", line 310, in update
    predictions = self.get_predictions_flat(model_out, batch)
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/metrics/metrics.py", line 104, in get_predictions_flat
    print("^^",predictions.shape[-1])
IndexError: tuple index out of range

In order to find the reason, I looked at /home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/metrics/metrics.py and found the code around line 104.

The original code is:

def get_predictions_flat(self, model_out, batch):
    predictions = self.get_predictions(model_out).contiguous()
    predictions_flat = predictions.view(-1, predictions.shape[-1]).squeeze()
    token_indices = self.get_token_indices(batch)
    return predictions_flat[token_indices]
I printed some debugging information by inserting print() calls:
def get_predictions_flat(self, model_out, batch):
    predictions = self.get_predictions(model_out).contiguous()
    print("**", predictions)
    print("^^", predictions.shape[-1])
    predictions_flat = predictions.view(-1, predictions.shape[-1]).squeeze()
    token_indices = self.get_token_indices(batch)
    print("++", token_indices)
    print("--", predictions_flat[token_indices])
    return predictions_flat[token_indices]
Then I ran it again, and the output was like this:

** tensor([0.0949, 0.0955, 0.0961, 0.0940, 0.0974], device='cuda:1', grad_fn=)
^^ 5
++ 5
-- tensor([0, 1, 2, 3, 4], device='cuda:1')

That is all fine, but at the end of the epoch, during "Saving training state to runs/0/ec661049f6364cfbb4c2e7a9dd1abe9d/epoch_1" and "Saving sentence_scores predictions to runs/0/ec661049f6364cfbb4c2e7a9dd1abe9d/epoch_1/sentence_scores", the error occurred because of the following:

** tensor(0.1439, device='cuda:1')
  File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/metrics/metrics.py", line 104, in get_predictions_flat
    print("^^",predictions.shape[-1])

In other words, when predictions is not a tensor with at least one dimension but a single scalar element, predictions.shape[-1] raises an error. Why is that, and how can I resolve it? I have attached my configuration files, train_predictor.yaml and train_estimator.yaml.
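As a minimal standalone sketch (plain PyTorch, independent of OpenKiwi) of what seems to happen: the shape of a 0-dim tensor is an empty tuple, so indexing it with -1 raises exactly this error, while a 1-D tensor of scores works fine:

import torch

# Normal case: a batch of sentence scores is a 1-D tensor with shape (5,).
preds = torch.tensor([0.0949, 0.0955, 0.0961, 0.0940, 0.0974])
print(preds.shape[-1])   # 5

# Degenerate case: a single squeezed score is a 0-dim tensor with shape ().
scalar = torch.tensor(0.1439)
print(scalar.shape)      # torch.Size([])
print(scalar.shape[-1])  # IndexError: tuple index out of range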

Attachments: train_estimator - 副本.yaml.txt, train_predictor - 副本.yaml.txt

And another question: how can I train with two GPUs instead of just one, e.g. gpu-id: 1 -> gpu-id: 1,0?

captainvera commented 5 years ago

Hello @BUAAers

First of all, thanks for your interest in OpenKiwi! I'm going to address your last question first, as it is the simplest one. Currently, we do not support multi-gpu setups, although it is on our roadmap for the future.

As for the reported bug, thanks for reporting it and providing very thorough information! We will look into it and get back to you!

captainvera commented 5 years ago

Hello again,

I managed to reproduce your bug with a hand-made scalar tensor and I'm working on a fix for it. However, I couldn't find a way to reproduce it with any of the QE datasets I have available.

This seems to happen when, because of torchtext's bucket sampling, there is a batch containing only one sentence. We flatten the predictions to simplify comparison with the target labels. However, when this batch-of-1 case occurs, somewhere in the pipeline the tensor gets squeezed down to a 0-dim tensor holding only a scalar, which causes this crash.
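One possible way to guard against this (just a sketch of the idea for get_predictions_flat, not the final fix) is to restore the batch dimension before flattening, leaving the rest of the method unchanged:

def get_predictions_flat(self, model_out, batch):
    predictions = self.get_predictions(model_out).contiguous()
    if predictions.dim() == 0:
        # A batch of a single sentence: the score was squeezed down to a
        # scalar upstream, so restore the batch dimension before indexing.
        predictions_flat = predictions.unsqueeze(0)
    else:
        predictions_flat = predictions.view(-1, predictions.shape[-1]).squeeze()
    token_indices = self.get_token_indices(batch)
    return predictions_flat[token_indices]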

Does this always happen even if you change the random seed? (which should change sampling order)

@BUAAers could you provide a minimal dataset where you can reproduce the problem? If the problem disappears when you reduce the dataset, I would appreciate it if you could point me in the direction of the data you're working with.

BUAAers commented 5 years ago

Sorry, I didn't respond to you in time. Thank you very much for your work. I have sent the data to your email.

captainvera commented 5 years ago

No worries, I'll look into this and get back to you!

captainvera commented 5 years ago

Hello @BUAAers,

Something must have gone wrong with the compression (or uploading) of these files. I can't open them with either terminal utilities or the macOS unarchiver.

This is the output of unzip:

Archive:  data.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of data.zip or
        data.zip.zip, and cannot find data.zip.ZIP, period.

This normally indicates a corrupt file or a zip that is missing another part (multi-part archives). Could you please try to re-compress and re-upload your data?

Thanks in advance!

captainvera commented 5 years ago

@BUAAers Sorry for the delay in my response

This data is working perfectly! I will get back to you soon with a fix and an explanation of the issue. Thanks for your collaboration!

captainvera commented 5 years ago

@BUAAers

Unfortunately, I was unable to reproduce the error that you report. I used your yaml and your data and successfully completed 10 epochs of training. 😕

The only thing I changed in your config was the predictor: you load a pre-trained one, and I just initialised a new one. This should have no effect on this particular issue.

As I mentioned previously, my best guess is that this error occurs by chance due to the bucket sampling algorithm from torchtext. For now, you can try setting a different random seed (with the seed flag) and running the program that way.
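For example (an excerpt only; this assumes the seed option sits at the top level of your train_estimator.yaml like the other options, and any value different from the current one is fine):

# train_estimator.yaml (excerpt)
# A different seed changes torchtext's bucket sampling order,
# which may avoid the single-sentence batch.
seed: 123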

We'll keep this issue open as we try to track the source of this error.

captainvera commented 5 years ago

Hey @BUAAers , since we're not able to reproduce this issue I'll close it for now. If you find any way to reproduce it (or if this is reproducible in your system and you're still unable to finish training), please feel free to re-open this!

BUAAers commented 5 years ago

@captainvera First of all, thank you very much for your work. Secondly, my problem appears at the very end of each epoch (around 99%) rather than during the training process, which is also very strange. So, would it be convenient for you to share your train_estimator.yaml and train_predictor.yaml, and the size of your parallel data (train-source and train-target)? I want to see whether it is a problem with some parameter settings.


BUAAers commented 5 years ago

As you said, I set different parameters and the problem was solved. Thanks very much again.