efeslab / Atom

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

issue with `c4` dataset for eval #2

Closed HamidShojanazeri closed 8 months ago

HamidShojanazeri commented 8 months ago

Thanks for the great work, guys! When I try to run the W4A4 perplexity evaluation, HF datasets complains with "ValueError: BuilderConfig 'allenai--c4' not found". Removing 'allenai--c4' from [datautils.py](https://github.com/efeslab/Atom/blob/main/model/datautils.py#L49) and keeping only 'allenai/c4' lets the script complete the run. Would it be OK to remove it, or am I missing something?

Traceback (most recent call last):
  File "/data/home/hamidnazeri/Atom/model/llama.py", line 232, in <module>
    dataloader, testloader = get_loaders(
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/Atom/model/datautils.py", line 175, in get_loaders
    return get_c4(nsamples, seed, seqlen, model, tokenizer)
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/Atom/model/datautils.py", line 51, in get_c4
    traindata = load_dataset(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/load.py", line 1852, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/builder.py", line 539, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']
model,bit,wiki2,ptb,c4,ptb-new,c4-new
/data/home/hamidnazeri/PiPPy/

happierpig commented 8 months ago

@HamidShojanazeri Hi,

Thanks for your interest in Atom! Regarding the issue: it's fine to remove allenai--c4 from datautils.py directly; we have already fixed this in the repo. It appears to be outdated usage from an older version of datasets. See ref: https://github.com/huggingface/datasets/issues/6559.
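
For anyone hitting the same error, here is a minimal sketch of a C4 load call that does not pass the 'allenai--c4' config name. It is not the exact code in datautils.py: the helper names `c4_load_kwargs` and `load_c4_shard` are hypothetical, and the shard file names are assumptions based on the layout commonly used in GPTQ-style evaluation scripts, so check them against the repo before relying on them.

```python
# Hypothetical helpers; names and shard paths are assumptions, not Atom's code.
C4_SHARDS = {
    "train": "en/c4-train.00000-of-01024.json.gz",
    "validation": "en/c4-validation.00000-of-00008.json.gz",
}


def c4_load_kwargs(split="train"):
    """Build load_dataset keyword arguments for a single C4 shard.

    No config name ('allenai--c4') is included: the shard is selected
    through data_files only, which is what avoids the BuilderConfig error.
    """
    return {
        "path": "allenai/c4",
        "data_files": {split: C4_SHARDS[split]},
        "split": split,
    }


def load_c4_shard(split="train"):
    """Download and load one C4 shard (needs network access)."""
    # Imported lazily so the kwargs helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(**c4_load_kwargs(split))
```

Calling `load_c4_shard("train")` and `load_c4_shard("validation")` would then stand in for the two `load_dataset` calls that previously passed 'allenai--c4' as the second positional argument.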