NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

Configuring datasets using `--train-data-path`, `--valid-data-path`, and `--test-data-path` results in training errors #841

Open · Eisenhower opened this issue 1 month ago

Eisenhower commented 1 month ago

**Describe the bug**
When I configure datasets for a training task using `--train-data-path`, `--valid-data-path`, and `--test-data-path`, running the task fails with an assertion error. The error message is shown in the Stack trace/logs section below.

File "/home/kas/kas_workspace/dataset/zrh/pai-megatron-patch/Pai-Megatron-Patch/Megatron-LM-240405/megatron/core/datasets/blended_megatron_dataset_config.py", line 72, in __post_init__ assert self.split is None, "split and blend_per_split are incompatible" AssertionError assert self.split is None, "split and blend_per_split are incompatible": split and blend_per_split are incompatible

**To Reproduce**
Configure the training datasets using `--train-data-path`, `--valid-data-path`, and `--test-data-path`, then launch training.
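
For context, here is a hedged sketch of how these flags typically end up in the dataset config: the three per-split paths populate `blend_per_split`, while `--split` silently keeps a non-None default, which is exactly the combination the `__post_init__` assertion rejects. The file paths, the `'969, 30, 1'` default, and the variable names below are illustrative assumptions, not the actual Megatron-LM launcher code.

```python
# Illustrative sketch only -- not the actual Megatron-LM launcher code.
# It mimics how the three per-split flags are commonly forwarded into the
# dataset config while --split keeps its default value.
from argparse import Namespace

# Hypothetical parsed arguments; the data paths are placeholders and the
# '969, 30, 1' default for --split is an assumption about the arg parser.
args = Namespace(
    train_data_path=["/data/my_corpus_train_text_document"],
    valid_data_path=["/data/my_corpus_valid_text_document"],
    test_data_path=["/data/my_corpus_test_text_document"],
    split="969, 30, 1",  # default is kept even though per-split paths are given
)

# The per-split paths become blend_per_split, but split is still non-None,
# which is the combination the dataset config's __post_init__ rejects.
dataset_config_kwargs = dict(
    blend=None,
    blend_per_split=[args.train_data_path, args.valid_data_path, args.test_data_path],
    split=args.split,
)
print(dataset_config_kwargs)
```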

**Expected behavior**
Training should run successfully when the datasets are configured using `--train-data-path`, `--valid-data-path`, and `--test-data-path`.

**Stack trace/logs**

```
File "/home/kas/kas_workspace/dataset/zrh/pai-megatron-patch/Pai-Megatron-Patch/Megatron-LM-240405/megatron/core/datasets/blended_megatron_dataset_config.py", line 72, in __post_init__
    assert self.split is None, "split and blend_per_split are incompatible"
AssertionError: split and blend_per_split are incompatible
```
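
To make the failure mode concrete, here is a minimal, self-contained stand-in for the guard in `__post_init__`. Only the assertion and the field names that appear in the traceback are taken from the report; the class itself is a simplified sketch, not the real `BlendedMegatronDatasetConfig`.

```python
# Minimal stand-in for the split/blend_per_split guard; simplified for illustration.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MiniDatasetConfig:
    """Simplified sketch; only mirrors the split/blend_per_split check."""
    blend: Optional[List[str]] = None
    blend_per_split: Optional[List[Optional[List[str]]]] = None
    split: Optional[str] = None

    def __post_init__(self):
        if self.blend_per_split is not None and any(self.blend_per_split):
            # The guard that fires in the report: per-split blends and a
            # global split string must not be given at the same time.
            assert self.split is None, "split and blend_per_split are incompatible"


# Reproduces the reported error: per-split paths plus a (default) split string.
try:
    MiniDatasetConfig(
        blend_per_split=[["train_doc"], ["valid_doc"], ["test_doc"]],
        split="969, 30, 1",
    )
except AssertionError as exc:
    print(f"AssertionError: {exc}")

# Passes: split is left as None when blend_per_split is supplied.
MiniDatasetConfig(blend_per_split=[["train_doc"], ["valid_doc"], ["test_doc"]])
```

The second construction also shows the immediate workaround the assertion implies: when `blend_per_split` is supplied, `split` must stay unset.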

**Environment (please complete the following information):**
Not provided; the paths in the stack trace point to a Megatron-LM-240405 checkout used through Pai-Megatron-Patch.

**Proposed fix**
https://github.com/NVIDIA/Megatron-LM/pull/840
