Unbabel / OpenKiwi

Open-Source Machine Translation Quality Estimation in PyTorch
https://unbabel.github.io/OpenKiwi/
GNU Affero General Public License v3.0

Loading the trained PredEst model on a different server (issue #53)

Closed: HanchengDeng closed this issue 4 years ago

HanchengDeng commented 4 years ago

Hello! @captainvera

I trained a PredEst model on a server, and I could run "kiwi predict --config predict.yaml" there to make predictions without any problem.

Recently, I wanted to use the trained PredEst model to make predictions on a new server, so I simply copied the trained best_model.torch from the old server to the new one. However, prediction does not work there: the log shows "[kiwi.lib.predict run:100] Predict with the PredEst (Predictor-Estimator) model" followed by "Killed", and the line "[kiwi.data.utils load_vocabularies_to_fields:126] Loaded vocabularies from models/best_model.torch" that normally appears is missing.

I am wondering whether I missed a step that is needed to make the trained model work on the new server. Looking forward to your response. Thank you very much!

captainvera commented 4 years ago

Hello @HanchengDeng,

The behaviour you're describing is unexpected. There's no extra operation to make the trained model work on a new machine.

What could be happening is an unintended consequence of the saving behaviour we implemented in OpenKiwi. When you're training a model and a validation run produces one of the best scores so far, the model is saved immediately. In that case, to avoid using too much disk space, best_model.torch is saved as a pointer to your actual best model (which has a different file name). So if the best model is also the very last validation checkpoint, best_model.torch is only a link to the file where the model is actually stored. You can check this behaviour in trainers/callbacks.py (save_latest).

Of course, this works while you're on the machine that has all the other checkpoints, but it breaks when you move best_model.torch to a new server on its own.

Can you check if this is the case?
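For example, a quick standalone check along these lines (plain Python, nothing OpenKiwi-specific; adjust run_dir to wherever the run's checkpoints live) would show whether best_model.torch is only a small pointer or symlink rather than a full checkpoint:

    from pathlib import Path
    import torch

    # Hypothetical location: the directory best_model.torch was copied into.
    run_dir = Path("models")
    best = run_dir / "best_model.torch"

    # A pointer-style save is either a symlink or a file far smaller than a real
    # checkpoint. A broken symlink will raise FileNotFoundError below, which is
    # itself a telltale sign on a freshly set-up server.
    print(f"symlink: {best.is_symlink()}")
    print(f"size:    {best.stat().st_size / 1e6:.2f} MB")
    for ckpt in sorted(run_dir.glob("*.torch")):
        print(f"  {ckpt.name}: {ckpt.stat().st_size / 1e6:.2f} MB")

    # Loading on CPU shows whether the file holds an actual model or just a reference.
    obj = torch.load(best, map_location="cpu")
    print("loaded object type:", type(obj))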

HanchengDeng commented 4 years ago

Thank you for your response. The issue has already been solved: I can now make the trained model work on a new machine with no extra steps. Ironically, it was a CPU limitation of my cloud server that caused the problem. I upgraded the cloud server's CPU resources, and the trained model now works perfectly.

Thank you very much!

captainvera commented 4 years ago

Hmmm, I can't really understand how a CPU limitation could prevent the trained model from predicting.

I'm glad your issue is solved, please let me know if you gain some insight into why this happened!
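One thing that would explain it is memory rather than CPU: a bare "Killed" with no Python traceback usually comes from the Linux out-of-memory killer, and resizing a cloud instance typically increases its RAM along with its CPUs. If you ever want to confirm this, a rough check along these lines (plain Python; psutil is a third-party package, and the checkpoint path is a placeholder) compares the checkpoint size with the memory available at load time:

    import os
    import psutil  # third-party: pip install psutil

    checkpoint = "models/best_model.torch"  # placeholder path
    size_mb = os.path.getsize(checkpoint) / 1e6
    avail_mb = psutil.virtual_memory().available / 1e6

    # Loading a checkpoint can need several times its on-disk size in RAM,
    # so a small margin here is a warning sign.
    print(f"checkpoint on disk: {size_mb:.0f} MB")
    print(f"memory available:   {avail_mb:.0f} MB")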

captainvera commented 4 years ago

This issue has been solved.

Feel free to re-open if you still have problems.