NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment
Apache License 2.0

Multiple training file support #207

Open seanliu96 opened 2 weeks ago

seanliu96 commented 2 weeks ago

Describe the bug

An exception is raised when trying to use model.data.train_ds.file_names to provide multiple data files in SFT and DPO.

Steps/Code to reproduce bug

Setting model.data.train_ds.file_names to multiple training files, instead of using file_path, raises an exception because https://github.com/NVIDIA/NeMo-Aligner/blob/main/nemo_aligner/data/nlp/builders.py#L267 only considers file_path and assumes it is not None.

Expected behavior

The build_sft_dataset function and similar builders should check whether cfg.file_names is specified and, if so, build the datasets from those files.
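
To make the expected behavior concrete, here is a rough sketch of the check I have in mind. It is not NeMo-Aligner code: resolve_data_files is a hypothetical helper, and the only assumption is that the data config behaves like a mapping with optional file_names and file_path keys. How the resulting files would be combined into one dataset is left open.

```python
from typing import Any, List, Mapping


def resolve_data_files(data_cfg: Mapping[str, Any]) -> List[str]:
    """Return the data files named by a dataset config.

    Hypothetical helper, not part of NeMo-Aligner: prefers `file_names`
    (multiple files) and falls back to `file_path` (single file).
    """
    file_names = data_cfg.get("file_names")
    file_path = data_cfg.get("file_path")

    if file_names:
        # Accept a single string as well as a list of strings.
        if isinstance(file_names, str):
            return [file_names]
        return [str(name) for name in file_names]

    if file_path is not None:
        return [str(file_path)]

    # Neither key is set: fail with a clear message instead of a None error.
    raise ValueError("Either `file_names` or `file_path` must be set in the data config.")


if __name__ == "__main__":
    print(resolve_data_files({"file_names": ["a.jsonl", "b.jsonl"]}))
    print(resolve_data_files({"file_path": "train.jsonl"}))
```

A builder could call something like this first and then construct one dataset per returned file, so that existing configs using only file_path keep working unchanged.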

seanliu96 commented 2 weeks ago

Sorry for the wrong label. It should be a feature request.