NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment
Apache License 2.0
628 stars 78 forks

SFT is broken with container 24.01.01 #131

Open odelalleau opened 8 months ago

odelalleau commented 8 months ago

Describe the bug

A user reported a crash with 24.01.01 and SFT (while things work fine with 24.01):

File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 215, in main
    init_using_ptl(trainer, ptl_model, train_dataloader, train_ds)
  File "/opt/NeMo-Aligner/nemo_aligner/utils/train_script_utils.py", line 103, in init_using_ptl
    call._call_setup_hook(ptl_trainer)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 86, in _call_setup_hook
    _call_lightning_module_hook(trainer, "setup", stage=fn)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 145, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 1372, in setup
    self._reconfigure_val_batches()
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 340, in _reconfigure_val_batches
    val_len_in_micro_batches = len(self._validation_dl)
TypeError: object of type 'NoneType' has no len()
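The crash comes from `len()` being called on `self._validation_dl` while it is still `None`, i.e. the validation dataloader was never attached before the `setup` hook ran. A minimal sketch reproducing the failure mode, with a hypothetical `None` guard (this `ModelStub` is illustrative only, not the actual NeMo-Aligner fix):

```python
# Illustrative stub, not NeMo code: shows why len(None) raises TypeError
# and how a defensive check avoids the crash.
class ModelStub:
    def __init__(self, validation_dl=None):
        # In the traceback, _validation_dl was still None when
        # _reconfigure_val_batches() ran during the setup hook.
        self._validation_dl = validation_dl

    def reconfigure_val_batches(self):
        # Hypothetical guard: skip reconfiguration when no validation
        # dataloader exists instead of calling len() on None.
        if self._validation_dl is None:
            return None
        return len(self._validation_dl)

model = ModelStub()
assert model.reconfigure_val_batches() is None  # no crash without a val dataloader

model_with_dl = ModelStub(validation_dl=[1, 2, 3])
assert model_with_dl.reconfigure_val_batches() == 3
```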

AtsunoriFujita commented 7 months ago

Thank you. I tested with the 24.03 container and it has been fixed.