NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[FastPitch/PyTorch] train.py fails with exception (reason: old DLLogger version in container) #1231

Open itakatz opened 1 year ago

itakatz commented 1 year ago

Related to FastPitch/PyTorch

Describe the bug

When running train.py by following the README recipe, the code fails with this exception:

Traceback (most recent call last):
    File "train.py", line 559, in <module>
      main()
    File "train.py", line 306, in main
      logger.init(log_fpath, args.output, enabled=(args.local_rank == 0),
    File "/workspace/fastpitch/FastPitch/common/tb_dllogger.py", line 90, in init
      JSONStreamBackend(Verbosity.DEFAULT, log_fpath, append=True),
TypeError: __init__() got an unexpected keyword argument 'append'   

The reason (+solution)

The container ships with DLLogger version 0.1, whose JSONStreamBackend constructor does not accept the "append" argument. I had to upgrade DLLogger to the latest version (1.0) from the git repo:

pip uninstall DLLogger
pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger

After upgrading DLLogger, train.py runs without this error.
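For anyone who cannot upgrade the package, the mismatch could in principle be guarded against in code. This is only a minimal sketch, not the repository's code: it uses inspect.signature to check whether a constructor accepts the "append" keyword before passing it. OldBackend and NewBackend are hypothetical stand-ins for the 0.1 and 1.0 JSONStreamBackend constructors (their exact signatures are assumed), and make_backend is a helper name invented for illustration.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs catch-all would also accept the keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins; the real constructor signatures are an assumption.
class OldBackend:
    def __init__(self, verbosity, filename):
        self.verbosity = verbosity
        self.filename = filename
        self.append = None  # the old constructor has no such option


class NewBackend:
    def __init__(self, verbosity, filename, append=False):
        self.verbosity = verbosity
        self.filename = filename
        self.append = append


def make_backend(backend_cls, verbosity, filename):
    """Pass append=True only when the constructor supports it."""
    if accepts_kwarg(backend_cls, "append"):
        return backend_cls(verbosity, filename, append=True)
    return backend_cls(verbosity, filename)
```

With the real library, backend_cls would be dllogger's JSONStreamBackend. Note that falling back silently changes logging behavior (the log file may be overwritten rather than appended to), so upgrading DLLogger remains the cleaner fix.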

To Reproduce

Steps to reproduce the behavior:

  1. Follow the instructions in the README.md file under "FastPitch#quick-start-guide".
  2. Run bash scripts/train.sh

Expected behavior

The training process should start running.

Environment

itakatz commented 1 year ago

I believe this can also be solved by updating the Docker image version in the Dockerfile from 21.05 to 21.12, but I have not tried it.
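Whichever route is taken, the installed DLLogger version can be checked from inside the container before starting training. A small sketch using the standard-library importlib.metadata (Python 3.8+); the distribution name "DLLogger" is an assumption taken from the pip command in the original report, and installed_version is a helper name invented here.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumed distribution name; adjust if pip lists it differently.
print("DLLogger version:", installed_version("DLLogger"))
```

If this reports 0.1, the "append" error above is expected. Whether the 21.12 image actually ships a newer DLLogger is not confirmed in this thread.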