HazyResearch / m2

Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture"
Apache License 2.0
532 stars 43 forks

Unable to use 'convert_dataset.py' to load data #36

Open sandeep-krutrim opened 2 months ago

sandeep-krutrim commented 2 months ago

I am getting a server disconnected error when I use `convert_dataset.py`, even for the bookcorpus or wikipedia dataset. If I set `stream=False` in the code, I get the following error:

```
Downloading data: 100%|██████████| 41/41 [03:06<00:00, 4.55s/files]
Generating train split: 100%|██████████| 6458670/6458670 [01:06<00:00, 96995.20 examples/s]
Loading dataset shards: 100%|██████████| 41/41 [00:00<00:00, 1805.01it/s]
Traceback (most recent call last):
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 524, in <module>
    main(parse_args())
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 489, in main
    loader = build_dataloader(dataset=dataset, batch_size=512)
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 397, in build_dataloader
    num_workers = min(64, dataset.hf_dataset.n_shards)  # type: ignore
AttributeError: 'Dataset' object has no attribute 'n_shards'
```

Please help to resolve this, as I am stuck on reproducing the training pipeline.
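For context, the AttributeError is consistent with the two return types of HuggingFace `datasets.load_dataset`: with `streaming=True` it returns an `IterableDataset`, which exposes an `n_shards` property, while with `streaming=False` it returns a materialized `Dataset`, which does not. A minimal sketch, assuming the `datasets` library and using bookcorpus purely as an example:

```python
from datasets import load_dataset

# streaming=True yields an IterableDataset, which exposes n_shards
streamed = load_dataset("bookcorpus", split="train", streaming=True)
print(streamed.n_shards)  # works

# streaming=False (the default) yields a materialized Dataset, which has no
# n_shards attribute -- the same AttributeError as in the traceback above
materialized = load_dataset("bookcorpus", split="train")
print(materialized.n_shards)  # AttributeError: 'Dataset' object has no attribute 'n_shards'
```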

DanFu09 commented 2 months ago

This seems like a change across HuggingFace API versions. What version of HuggingFace transformers are you using?
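One possible workaround, offered only as a sketch since the exact `datasets` version in use here is unknown: guard the `n_shards` access so `build_dataloader` handles both the streaming and the materialized case. The helper name `pick_num_workers` and the fallback of 8 workers are illustrative choices, not code from the repo:

```python
def pick_num_workers(hf_dataset, cap: int = 64, fallback: int = 8) -> int:
    """Choose a DataLoader worker count that works for both dataset types.

    A streaming IterableDataset exposes n_shards; a materialized Dataset does
    not, so fall back to a fixed count in that case (8 is arbitrary here).
    """
    n_shards = getattr(hf_dataset, "n_shards", None)
    return min(cap, n_shards) if n_shards is not None else fallback
```

With a guard like this, the failing line in `build_dataloader` would read `num_workers = pick_num_workers(dataset.hf_dataset)` instead of accessing `n_shards` directly.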
