Issues with saving checkpoints

NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

Apache License 2.0


Issues with saving checkpoints #3181

Closed maevdokimov closed 2 years ago

maevdokimov commented 3 years ago

Hi! I'm trying to train Tacotron 2 from scratch.

Running the following command on the current main branch with pytorch-lightning==1.5.0:

PYTHONPATH="$(pwd)" python examples/tts/tacotron2.py \
train_dataset=train_data.json \
validation_datasets=validation_data.json \
+exp_manager.checkpoint_callback_params.save_top_k=1 \
+exp_manager.checkpoint_callback_params.monitor=loss

Training runs correctly, but no checkpoints are saved. Downgrading to pytorch-lightning==1.4.2 fixes the issue.
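For readers unfamiliar with the two overrides in the command: save_top_k=1 together with monitor=loss asks the checkpoint callback to keep only the single best checkpoint, ranked by the logged metric named loss. The following is a minimal sketch of that ranking idea in plain Python. It is not PyTorch Lightning's or NeMo's implementation; the function name keep_top_k and the example scores are hypothetical, and it assumes lower values are better (mode "min").

```python
def keep_top_k(scores, k, mode="min"):
    """Return the names of the k best checkpoints.

    scores maps a (hypothetical) checkpoint name to the monitored metric value,
    or to None if the metric was never logged for that checkpoint.
    """
    # Only checkpoints with a recorded metric value can be ranked.
    ranked = [(value, name) for name, value in scores.items() if value is not None]
    # mode="min": lowest value first; mode="max": highest value first.
    ranked.sort(reverse=(mode == "max"))
    return [name for _, name in ranked[:k]]


# Illustrative values only, not taken from a real run.
example_scores = {"epoch0": 1.2, "epoch1": 0.8, "epoch2": 0.9}
best = keep_top_k(example_scores, k=1)
```

In this toy version, a checkpoint whose metric was never logged is simply skipped. How the real callback behaves in that situation depends on the library version and is not covered here.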

blisc commented 3 years ago

Is there any reason you are tracking training loss as opposed to validation loss? Do you run your experiments for enough epochs that validation happens at least once? Can you also try with pytorch-lightning==1.5.1?

Can you upload lightning_logs.txt from your experiment directory?
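A quick way to gather what is being asked for here is to scan the experiment directory for the log file and for any checkpoint files. The sketch below uses only the Python standard library. The function name summarize_experiment_dir is hypothetical, and the assumption that checkpoints end in .ckpt or .nemo and that the log is named lightning_logs.txt should be checked against your own experiment directory.

```python
from pathlib import Path


def summarize_experiment_dir(exp_dir):
    """List checkpoint files and lightning_logs.txt files found under exp_dir."""
    root = Path(exp_dir)
    if not root.is_dir():
        # Nothing to scan: the directory does not exist.
        return {"checkpoints": [], "logs": []}
    # Assumption: checkpoints are stored with a .ckpt or .nemo suffix.
    checkpoints = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in {".ckpt", ".nemo"}
    )
    # Search recursively, since the log may sit in a per-run subdirectory.
    logs = sorted(p for p in root.rglob("lightning_logs.txt") if p.is_file())
    return {"checkpoints": checkpoints, "logs": logs}
```

Calling summarize_experiment_dir on the experiment directory and finding a log file but an empty checkpoint list would match the behavior described in the report: training ran, but nothing was saved.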