Closed BUAAers closed 5 years ago
Hello @BUAAers
First of all, thanks for your interest in OpenKiwi! I'm going to address your last question first, as it is the simplest one. Currently, we do not support multi-GPU setups, although it is on our roadmap for the future.
As for the reported bug, thanks for reporting it and providing very thorough information! We will look into it and get back to you!
Hello again,
I managed to reproduce your bug with a hand-made scalar tensor and I'm working on a fix for it. However, I couldn't find a way to reproduce your bug with any of the QE datasets I have available.
This seems to happen when, because of torchtext's bucket sampling, there is a batch of only one sentence. We intended to flatten the predictions as a way to simplify comparison with the target labels. However, when a batch contains a single sentence, somewhere in the pipeline the tensor is squeezed and becomes a 0-dim tensor holding only a scalar, which causes this crash.
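The failure mode described above can be sketched in a few lines of PyTorch. This is only an illustration, not OpenKiwi's actual code: the tensor values and variable names are made up, and reshape(-1) is just one possible way to keep the result 1-D.

```python
import torch

# Hypothetical batch containing a single sentence: one score, shape (1, 1).
single = torch.tensor([[0.1439]])

# squeeze() drops every size-1 dimension, so the result is a 0-dim tensor.
squeezed = single.squeeze()
print(squeezed.dim())  # number of dimensions left after squeezing

# A 0-dim tensor has an empty shape, so indexing the shape fails.
try:
    squeezed.shape[-1]
    failed = False
except IndexError as err:
    failed = True
    print("IndexError:", err)

# reshape(-1) always returns a 1-D tensor, even for a 0-dim input,
# so code that reads shape[-1] afterwards keeps working.
flat = squeezed.reshape(-1)
print(tuple(flat.shape))
```

With a batch of two or more sentences, squeeze() leaves a 1-D tensor and the problem never shows up, which would explain why the crash depends on how the data happens to be batched.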
Does this always happen even if you change the random seed? (which should change sampling order)
@BUAAers could you provide a minimal dataset where you can reproduce the problem? If the problem disappears when you reduce the dataset, I would appreciate it if you could point me in the direction of the data you're working with.
Sorry, I didn't respond to you in time. Thank you very much for your work. I have sent the data to your email.
No worries, I'll look into this and get back to you!
Hello @BUAAers,
Something must have gone wrong with the compression (or uploading) of these files. I can't open them with either terminal utilities or the macOS unarchiver.
This is the output of unzip:
Archive: data.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of data.zip or
data.zip.zip, and cannot find data.zip.ZIP, period.
This normally indicates a corrupt file or a zip that is missing another part (multi-part archives). Could you please try to re-compress and re-upload your data?
Thanks in advance!
@BUAAers Sorry for the delay in my response.
This data is working perfectly! I will get back to you soon with a fix and an explanation of the issue. Thanks for your collaboration.
@BUAAers
Unfortunately, I was unable to reproduce the error that you report. I used your yaml and your data and successfully completed 10 epochs of training. 😕
The only thing I changed in your config was the predictor: you load a pre-trained one, and I just initialised a new one. This should have no effect on this particular issue.
As I mentioned previously, my best guess is that this error is caused by chance, due to the bucket sampling algorithms from torchtext. For now, you can try setting a different random seed (with the seed flag) and running the program that way.
We'll keep this issue open as we try to track the source of this error.
Hey @BUAAers, since we're not able to reproduce this issue, I'll close it for now. If you find any way to reproduce it (or if this is reproducible on your system and you're still unable to finish training), please feel free to re-open this!
@captainvera First of all, thank you very much for your work. Secondly, my problem appears at the very end of each epoch (at about 99%) rather than during training, which is also very strange. Could you share your train_estimator.yaml and train_predictor.yaml, along with the size of your parallel data (train-source and train-target)? I want to check whether the problem comes from some of my parameter settings.
As you said, I set different parameters and this problem was solved. Thanks very much again.
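A crash that shows up only at the very end of an epoch is consistent with the last batch holding a single sentence. The sketch below checks for that under a strong simplifying assumption: examples are split into batches in order, which is not how torchtext's bucket sampling actually works. The function names and dataset sizes are hypothetical. The batch size of 5 matches the five predictions printed in the report.

```python
def batch_sizes(num_examples, batch_size):
    """Return the sizes of consecutive batches when examples are split in order."""
    full, remainder = divmod(num_examples, batch_size)
    sizes = [batch_size] * full
    if remainder:
        # The leftover examples form one final, smaller batch.
        sizes.append(remainder)
    return sizes


def leaves_singleton_batch(num_examples, batch_size):
    """True if the final batch would contain exactly one example."""
    sizes = batch_sizes(num_examples, batch_size)
    return bool(sizes) and sizes[-1] == 1


# Hypothetical dataset of 1001 sentences.
print(leaves_singleton_batch(1001, 5))  # 1001 = 200 * 5 + 1
print(leaves_singleton_batch(1001, 6))  # 1001 = 166 * 6 + 5
```

Under this assumption, changing the batch size or the number of training sentences so that the remainder is not 1 would avoid the one-sentence batch. That may be why different parameter settings made the error disappear.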
After successfully training the sentence-level Predictor and then using it to train the Estimator, I get an error named IndexError: tuple index out of range. The details are as follows:
To find the reason, I looked at /home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/metrics/metrics.py and found the code around line 104.
The original code is:
I printed some messages by inserting print():
Then I ran it again, and the messages look like this:
** tensor([0.0949, 0.0955, 0.0961, 0.0940, 0.0974], device='cuda:1', grad_fn=)
^^ 5
++ 5
-- tensor([0, 1, 2, 3, 4], device='cuda:1')
That is all fine, but at the end of the epoch, after "Saving training state to runs/0/ec661049f6364cfbb4c2e7a9dd1abe9d/epoch_1" and "Saving sentence_scores predictions to runs/0/ec661049f6364cfbb4c2e7a9dd1abe9d/epoch_1/sentence_scores", the error occurred, as follows:
** tensor(0.1439, device='cuda:1')
File "/home/zwc/python-virtual-environments/env/lib/python3.6/site-packages/kiwi/metrics/metrics.py", line 104, in get_predictions_flat
print("^^",predictions.shape[-1])
In other words, when predictions is a single element rather than a list, predictions.shape[-1] raises an error. Why does this happen, and how can I resolve it? I have attached my configuration files, train_predictor.yaml and train_estimator.yaml:
train_estimator - 副本.yaml.txt train_predictor - 副本.yaml.txt
One more question: how can I train with two GPUs instead of just one, for example gpu-id: 1 -> gpu-id: 1,0?