hrdxwandg opened this issue 3 years ago
Thanks for your feedback. Could you try more epochs? LightSeq can't match the PyTorch implementation exactly, because the low-level implementation is totally different and optimized for speed.
We have tested on our in-house pretraining codebase, and after a few days of training, the losses are close.
Thanks for your reply. If I increase the number of epochs, it will take more time to reach the target loss. Isn't that at odds with being "optimized for speed" and reducing training time?
It depends. Speed should be measured as the end-to-end time to convergence, not per-epoch perplexity; many factors can influence perplexity within a single epoch.
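To make the point above concrete, here is a minimal sketch (with made-up, hypothetical loss curves and per-epoch times, not measurements from LightSeq or the HF script) of comparing end-to-end time to a target loss rather than per-epoch quality:

```python
def time_to_target_loss(loss_curve, target_loss, seconds_per_epoch):
    """Return the wall-clock time (seconds) until the loss first
    reaches the target, or None if it never does."""
    for epoch, loss in enumerate(loss_curve, start=1):
        if loss <= target_loss:
            return epoch * seconds_per_epoch
    return None

# Hypothetical numbers: the faster implementation needs one more epoch
# to hit the same loss, yet still converges sooner in wall-clock time.
baseline = time_to_target_loss([2.9, 2.5, 2.2, 2.0], target_loss=2.0,
                               seconds_per_epoch=100)  # 4 epochs * 100 s
faster   = time_to_target_loss([3.1, 2.7, 2.3, 2.1, 2.0], target_loss=2.0,
                               seconds_per_epoch=60)   # 5 epochs * 60 s
print(baseline, faster)  # 400 300
```

So a per-epoch perplexity gap alone does not show which implementation reaches a given loss faster.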
I ran the PyTorch script run_mlm_no_trainer.py with and without LightSeq, and the results differ a lot.
Run on my own data without LightSeq (https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm_no_trainer.py):
epoch 0: perplexity: 2.847743110923294
Run with LightSeq:
epoch 0: perplexity: 3.1440967518031444
GPU utilization is higher with LightSeq, but training takes more time. I followed the run_ner_no_trainer.py settings.
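For context on how small the gap is in loss terms, the HF example reports perplexity as the exponential of the mean evaluation loss. A minimal sketch of that relationship (the helper name is mine, not from the script):

```python
import math

def perplexity_from_losses(eval_losses):
    """Perplexity computed as exp(mean cross-entropy loss), the way the
    run_mlm_no_trainer.py example derives its reported perplexity."""
    mean_loss = sum(eval_losses) / len(eval_losses)
    return math.exp(mean_loss)

# The two reported perplexities (2.8477 vs 3.1441) correspond to mean
# losses of log(2.8477) ~ 1.046 vs log(3.1441) ~ 1.146 nats per token,
# i.e. a gap of roughly 0.1 nats after one epoch.
print(perplexity_from_losses([1.046]))
```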