foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

bug: `columns` is not a valid keyword argument of `load_dataset` #301

Closed HarikrishnanBalagopal closed 3 months ago

HarikrishnanBalagopal commented 3 months ago

https://github.com/foundation-model-stack/fms-hf-tuning/blob/34362ae61f0a03b3505f0a357aceae7a92ff5304/tuning/config/configs.py#L83

https://github.com/foundation-model-stack/fms-hf-tuning/blob/34362ae61f0a03b3505f0a357aceae7a92ff5304/tuning/config/configs.py#L56-L62

ValueError: BuilderConfig JsonConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['/data/mydataset/train/train.jsonl']}, description=None, features=None, encoding='utf-8', encoding_errors=None, field=None, use_threads=True, block_size=None, chunksize=10485760, newlines_in_values=None) doesn't have a 'columns' key.

https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset

Testing manually:

>>> d1=datasets.load_dataset(path=s1, data_files=s, columns=['input'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/u/haribala/code/my-conda-envs/foo/lib/python3.11/site-packages/datasets/load.py", line 2594, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/u/haribala/code/my-conda-envs/foo/lib/python3.11/site-packages/datasets/load.py", line 2303, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/u/haribala/code/my-conda-envs/foo/lib/python3.11/site-packages/datasets/builder.py", line 374, in __init__
    self.config, self.config_id = self._create_builder_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/haribala/code/my-conda-envs/foo/lib/python3.11/site-packages/datasets/builder.py", line 622, in _create_builder_config
    raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig JsonConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['/data/mydataset/train/train.jsonl']}, description=None, features=None, encoding='utf-8', encoding_errors=None, field=None, use_threads=True, block_size=None, chunksize=10485760, newlines_in_values=None) doesn't have a 'columns' key.

kmehant commented 3 months ago

We handle this gracefully: when loading with config_kwargs fails, we load the dataset again without them. This way we don't need any type-specific handling; we simply fall back when this happens.
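
For illustration, the fallback described above might look roughly like the sketch below. It is not the code from this repository: the helper name `load_with_fallback`, the `loader` parameter, and the choice to catch only `ValueError` (the exception type in the traceback above) are assumptions made for the example.

```python
from typing import Any, Callable, Dict, Optional


def load_with_fallback(
    loader: Callable[..., Any],
    *args: Any,
    config_kwargs: Optional[Dict[str, Any]] = None,
    **kwargs: Any,
) -> Any:
    """Call `loader` with extra builder kwargs, retrying without them on ValueError.

    Hypothetical helper: `loader` stands in for a function such as
    `datasets.load_dataset`; it is passed in so the sketch does not
    depend on any particular library.
    """
    config_kwargs = config_kwargs or {}
    try:
        # First attempt: pass the extra builder options (e.g. columns=[...]).
        return loader(*args, **kwargs, **config_kwargs)
    except ValueError:
        # Nothing to drop, so the error has a different cause: re-raise it.
        if not config_kwargs:
            raise
        # Fallback: the builder rejected the extra options, so load without them.
        return loader(*args, **kwargs)
```

With this shape, a call such as `load_with_fallback(datasets.load_dataset, path=s1, data_files=s, config_kwargs={"columns": ["input"]})` would retry without `columns` when the builder rejects that key. Two trade-offs are worth keeping in mind: the retry loads every column rather than only the requested ones, and catching `ValueError` this broadly can also hide unrelated failures from the first attempt.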