bytedance / lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

Error when training and running inference with master-branch code #354

Open dongdaoking opened 2 years ago

dongdaoking commented 2 years ago

Hi, I want to try some new features in LightSeq, so I followed the instructions here for compiling from source on the master branch. But when I train and run inference following the example, it doesn't work. During training, something seems to go wrong in ls_transformer.py.

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 270, in distributed_main
    main(args, **kwargs)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq_cli/train.py", line 68, in main
    model = task.build_model(args)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/tasks/translation.py", line 327, in build_model
    model = super().build_model(args)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/tasks/fairseq_task.py", line 547, in build_model
    model = models.build_model(args, self)
  File "/Enviroment/Anaconda/anaconda3/envs/pytorch/lib/python3.7/site-packages/fairseq/models/__init__.py", line 58, in build_model
    return ARCH_MODEL_REGISTRY[model_cfg.arch].build_model(model_cfg, task)
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 170, in build_model
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 237, in build_decoder
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 374, in __init__
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 374, in <listcomp>
  File "/tmp/tmp50gvsy1t/fairseq_user_dir_13687/ls_transformer.py", line 409, in build_decoder_layer
ModuleNotFoundError: No module named 'fairseq_user_dir_13687.ls_fs_transformer_decoder_layer'

I tried to fix the problem and training works afterwards. But when I run inference with the resulting checkpoint, I get BLEU=0, so it seems LightSeq doesn't work.

So here are my questions:

  1. Is the code on the master branch correct?
  2. How can I fix the code or the workflow?
neopro12 commented 2 years ago

I need to clarify two questions:

  1. Is the BLEU score correct during evaluation?
  2. Do you run inference with PyTorch (or do you export the model to a LightSeq proto)? If so, you can check what differs between the evaluation and inference setups.
neopro12 commented 2 years ago

You can run inference this way after training: https://github.com/bytedance/lightseq/tree/master/examples/inference/python
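For reference, a minimal sketch of that Python inference path, assuming the trained model has already been exported to a LightSeq proto file (the file name and token ids below are placeholders, not values from this thread):

import lightseq.inference as lsi

# Load an exported LightSeq Transformer proto; the second argument is the max batch size.
# "lightseq_transformer.pb" is a placeholder for whatever the export step produced.
model = lsi.Transformer("lightseq_transformer.pb", 8)

# One source sentence as a batch of token ids (placeholder values).
result = model.infer([[63, 47, 65, 1507, 88, 74, 10, 2057, 362, 9, 284, 6, 2, 1]])
print(result)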

dongdaoking commented 2 years ago

Hi, I checked the training log.

  1. The BLEU score is wrong during evaluation.
  2. I ran inference following the example.

As I described above, why can't I train from the master branch directly?

neopro12 commented 2 years ago

The master branch works fine: https://github.com/bytedance/lightseq/blob/master/examples/training/fairseq/ls_fairseq_wmt14en2de.sh Can you give us some details about how you fixed the "No module named 'fairseq_user_dir_13687.ls_fs_transformer_decoder_layer'" error?

dongdaoking commented 2 years ago

Hi, thanks for your reply. I ran the command cp lightseq/training/cli/fs_modules/ls_fs_transformer_decoder_layer.py lightseq/training/ops/pytorch/ and pointed the import to this path:

diff --git a/lightseq/training/cli/fs_modules/ls_transformer.py b/lightseq/training/cli/fs_modules/ls_transformer.py
index a6832ed..015f2fa 100644
--- a/lightseq/training/cli/fs_modules/ls_transformer.py
+++ b/lightseq/training/cli/fs_modules/ls_transformer.py
@@ -406,7 +406,7 @@ class LSTransformerDecoder(FairseqIncrementalDecoder):
                 TransformerDecoderLayer,
             )
         else:
-            from .ls_fs_transformer_decoder_layer import (
+            from lightseq.training.ops.pytorch.ls_fs_transformer_decoder_layer import (
                 LSFSTransformerDecoderLayer as TransformerDecoderLayer,
             )

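For what it's worth, a minimal sanity check of that patched import (this assumes the file was copied into the installed package as described above, so the absolute path actually contains the module):

# If this import succeeds outside fairseq's temporary user-dir copy,
# the absolute import in the patch should also resolve during training.
from lightseq.training.ops.pytorch.ls_fs_transformer_decoder_layer import (
    LSFSTransformerDecoderLayer,
)
print(LSFSTransformerDecoderLayer)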
Also, I want to make sure our environments are the same. Can you provide a base Docker image? Here is my current environment:

base Docker image: nvidia/cuda:11.6.0-cudnn8-devel-ubuntu20.04
PyTorch 1.8.0 (compiled from source)
CMake 3.20 (compiled from source)
protobuf and HDF5 installed following https://github.com/bytedance/lightseq/blob/master/docs/inference/build.md
git clone --recursive https://github.com/bytedance/lightseq.git

With this setup I can run LightSeq, but I hit the error above.
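For completeness, a quick import check for an environment like this (just a sketch; it only verifies that the compiled extensions load, not that training or inference behaves correctly):

import torch

# Both submodules are built when LightSeq is compiled from source together with the
# inference dependencies; an ImportError here would point to a build/installation
# problem rather than to the training code itself.
import lightseq.training
import lightseq.inference

print(torch.__version__, torch.version.cuda)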