danliu2 / caat

MIT License
34 stars 2 forks

Error when training #6

Closed sarapapi closed 2 years ago

sarapapi commented 2 years ago

Dear authors, I have installed both Fairseq with your repository and your version of warprnnt but the following error occurs when I launch the training code:

Traceback (most recent call last):
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/train.py", line 16, in <module>
    cli_main()
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq_cli/train.py", line 475, in cli_main
    parser = options.get_training_parser()
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq/options.py", line 36, in get_training_parser
    parser = get_parser("Trainer", default_task)
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq/options.py", line 216, in get_parser
    utils.import_user_module(usr_args)
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq/utils.py", line 478, in import_user_module
    importlib.import_module(module_name)
  File "/home/spapi/anaconda3/envs/caat_env/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq/../examples/rain/__init__.py", line 1, in <module>
    from . import tasks, models, data, layers
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq/../examples/rain/tasks/__init__.py", line 3, in <module>
    from . import s2s_task, transducer_task
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq/../examples/rain/tasks/s2s_task.py", line 20, in <module>
    import rain.models.transducer as transducer
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq/../examples/rain/models/__init__.py", line 1, in <module>
    from . import posemb_transformer
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq/../examples/rain/models/posemb_transformer.py", line 15, in <module>
    from rain.layers.rand_pos import PositionalEmbedding
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq/../examples/rain/layers/__init__.py", line 6, in <module>
    from .attention_transducer import TransducerMHADecoder
  File "/home/spapi/fairseq_simul/simul_CAAT/fairseq/fairseq/../examples/rain/layers/attention_transducer.py", line 42, in <module>
    from warprnnt_pytorch import DelayTLoss
  File "/home/spapi/caat/warp_transducer/pytorch_binding/warprnnt_pytorch/__init__.py", line 6, in <module>
    from .warp_rnnt import *
ImportError: /home/spapi/caat/warp_transducer/pytorch_binding/warprnnt_pytorch/warp_rnnt.cpython-38-x86_64-linux-gnu.so: undefined symbol: get_delay_workspace_size

Have you experienced something similar? I searched online but found nothing. Thank you

EDIT: I tried installing the original warprnnt library and everything works; I can import its modules without any error. Thus, I hypothesize that it is something related to your modified version.
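An "undefined symbol" ImportError like the one above usually means Python is loading a different build of the extension than the one you think you installed. A quick, generic way to check which file a package would actually be imported from is a sketch like this (run it inside the `caat_env` environment and substitute `warprnnt_pytorch` for the placeholder module name):

```python
import importlib.util

def module_origin(name: str) -> str:
    """Return the file path a module would be imported from, or '' if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return ""
    return spec.origin or ""

# In the CAAT environment you would check "warprnnt_pytorch" here; "os" is
# just a stand-in so the snippet runs anywhere.
print(module_origin("os"))
```

If the printed path points at the original warp-rnnt installation rather than the CAAT fork's `pytorch_binding`, the wrong package is shadowing the modified one.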

sarapapi commented 2 years ago

For some reason unknown to me, I solved it by copying warp_transducer/src/attent_entrypoint.cu into the same folder under the name attent_entrypoint.cpp; after recompiling with cmake, everything works.

MenggeLiu commented 1 year ago

I had the same problem and tried copying the warp_transducer/src/attent_entrypoint.cu file and recompiling, but the bug was not resolved. Is installing the original warprnnt library necessary? @sarapapi @danliu2

danliu2 commented 1 year ago

Sorry for replying so late. You must not install the raw warp-rnnt: my code shares the same package name with it, which may cause a version conflict between the Python packages. According to the error message you gave, it looks like you are actually using the original version of warp-rnnt, which does not provide the interfaces I added (get_delay_workspace_size etc.). Please check the location of the .so file and the functions it exports (with objdump etc.) to verify this; the test_delay tool may also help. I hope it gets solved smoothly, and I apologize for my rough code.
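Besides objdump/nm, the symbol check can be done from Python itself via `ctypes`, which resolves symbols through `dlsym`. A minimal sketch (POSIX only; `has_symbol` is a hypothetical helper, and in the CAAT setup you would point it at the `warp_rnnt.cpython-38-x86_64-linux-gnu.so` path from the traceback and the symbol `get_delay_workspace_size`):

```python
import ctypes

def has_symbol(library_path, symbol: str) -> bool:
    """True if the shared object exports `symbol` (i.e. dlsym can resolve it)."""
    lib = ctypes.CDLL(library_path)  # passing None loads the main program's handle
    return hasattr(lib, symbol)

# Demo against the symbols already loaded into the process (libc on Linux);
# for the real check, pass the .so path and "get_delay_workspace_size".
print(has_symbol(None, "printf"))
```

If this returns False for `get_delay_workspace_size` on the installed .so, the file was built from the unmodified warp-rnnt sources. Note that this only works for symbols with C linkage; C++-mangled names would need `nm -C` instead.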

MenggeLiu commented 1 year ago

> Sorry for replying so late. You must not install the raw warp-rnnt: my code shares the same package name with it, which may cause a version conflict between the Python packages. According to the error message you gave, it looks like you are actually using the original version of warp-rnnt, which does not provide the interfaces I added (get_delay_workspace_size etc.). Please check the location of the .so file and the functions it exports (with objdump etc.) to verify this; the test_delay tool may also help. I hope it gets solved smoothly, and I apologize for my rough code.

Thanks for your reply. I checked the compilation details, reinstalled the local CUDA, and solved the installation problem. But after training for 80 epochs on the MuST-C v2 en-zh dataset, the delay loss and prob_loss are still 'nan', and the PPL and loss are still high (I skipped the MT distillation step and used ASR pretraining as in the example). Is there any possible solution? @danliu2 [training log screenshot]

danliu2 commented 1 year ago

> Thanks for your reply. I checked the compilation details, reinstalled the local CUDA, and solved the installation problem. But after training for 80 epochs on the MuST-C v2 en-zh dataset, the delay loss and prob_loss are still 'nan', and the PPL and loss are still high (I skipped the MT distillation step and used ASR pretraining as in the example). Is there any possible solution? @danliu2 [training log screenshot]

It should not be NaN at all. You are still getting a meaningful nll, which means the loss used for the backward pass is not NaN; otherwise a single backward step would turn all parameters to NaN. So I guess it is some kind of exceptional case where you get NaN and the model is simply not updated at that step, e.g. a sequence shorter than one block, and so on. My suggestions:

  1. Add a NaN detector for every loss, and print the ids of the NaN samples.
  2. Print the current-step loss instead of the moving-average loss.
  3. Filter out those illegal samples if they exist.
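Suggestion 1 can be sketched without any framework code. Assuming each batch exposes sample ids alongside per-sample loss values (the names `sample_ids` and `losses` are hypothetical; in fairseq you would hook this into the criterion and use `torch.isnan` on the loss tensor instead), the idea is:

```python
import math

def find_nan_samples(sample_ids, losses):
    """Return the ids whose loss is NaN, so the offending samples
    can be logged and later filtered out of the dataset."""
    return [sid for sid, loss in zip(sample_ids, losses) if math.isnan(loss)]

# Hypothetical batch where the delay loss of sample 17 blew up:
ids = [3, 17, 42]
losses = [2.31, float("nan"), 1.98]
print(find_nan_samples(ids, losses))  # → [17]
```

Logging the ids per step (rather than a moving average, per suggestion 2) makes it possible to inspect exactly which utterances, e.g. ones shorter than one block, trigger the NaN, and to drop them from training.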