bigscience-workshop / promptsource

Toolkit for creating, sharing and using natural language prompts.
Apache License 2.0

downloading error for sent_comp #645

Closed srulikbd closed 2 years ago

srulikbd commented 2 years ago

I'm trying to view sent_comp for the current sprint, but I get the following error:

NonMatchingChecksumError: Checksums didn't match for dataset source files: ['https://github.com/google-research-datasets/sentence-compression/raw/master/data/sent-comp.train03.json.gz']
Traceback:
File "/home/srulikbd/.local/lib/python3.7/site-packages/streamlit/script_runner.py", line 338, in _run_script
    exec(code, module.__dict__)
File "/home/srulikbd/promptsource/promptsource/app.py", line 259, in <module>
    dataset = get_dataset(dataset_key, str(conf_option.name) if conf_option else None)
File "/home/srulikbd/.local/lib/python3.7/site-packages/streamlit/caching.py", line 573, in wrapped_func
    return get_or_create_cached_value()
File "/home/srulikbd/.local/lib/python3.7/site-packages/streamlit/caching.py", line 557, in get_or_create_cached_value
    return_value = func(*args, **kwargs)
File "/home/srulikbd/promptsource/promptsource/utils.py", line 49, in get_dataset
    builder_instance.download_and_prepare()
File "/home/srulikbd/.local/lib/python3.7/site-packages/datasets/builder.py", line 608, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/home/srulikbd/.local/lib/python3.7/site-packages/datasets/builder.py", line 680, in _download_and_prepare
    self.info.download_checksums, dl_manager.get_recorded_sizes_checksums(), "dataset source files"
File "/home/srulikbd/.local/lib/python3.7/site-packages/datasets/utils/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))

I'm running the latest main branch of promptsource from source, on WSL 2, Windows 11, Python 3.7. I can view other datasets without any problem.

zaidalyafeai commented 2 years ago

Works fine on WSL/Windows 10. It seems this error is related to datasets rather than promptsource, because other people are facing similar issues: https://github.com/huggingface/datasets/issues/3269

VictorSanh commented 2 years ago

I suspect something went wrong during the download: the size of the download does not match its expected value. Could you try removing the cache and re-downloading?
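For anyone who wants to check a downloaded file by hand, here is a minimal sketch of that kind of comparison. It assumes the recorded checksum is a SHA-256 hex digest; the helper names `size_and_sha256` and `matches_expected` are made up for illustration and are not part of datasets or promptsource.

```python
import hashlib


def size_and_sha256(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def matches_expected(path, expected_size, expected_sha256):
    """True only if both the size and the checksum equal the recorded values."""
    size, checksum = size_and_sha256(path)
    return size == expected_size and checksum == expected_sha256
```

If `matches_expected` returns False for the cached sent-comp.train03.json.gz, the local copy is likely truncated or corrupted and a fresh download is the thing to try.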

srulikbd commented 2 years ago

I tried deleting the cache and downloading again, but the same error appears.

VictorSanh commented 2 years ago

I just tried again and couldn't reproduce it. Could you share a few more details about your setup?

Could you try a load_dataset("sent_comp", download_mode="force_redownload")?
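Spelled out as a small script, that suggestion might look like the sketch below. The wrapper `force_redownload` and its `loader` parameter are hypothetical conveniences so the call can be exercised without network access; the only real API used is `datasets.load_dataset` with `download_mode="force_redownload"`.

```python
def force_redownload(name="sent_comp", loader=None):
    """Reload a dataset while ignoring any previously cached download."""
    if loader is None:
        # Imported lazily so the function can be defined without `datasets` installed.
        from datasets import load_dataset as loader
    # "force_redownload" tells datasets to fetch the source files again
    # instead of reusing what is already in the local cache.
    return loader(name, download_mode="force_redownload")
```

Calling `force_redownload()` from a Python shell in the same environment as promptsource should fetch the sent_comp files again; after that, the app can be restarted to pick up the fresh copy.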

tianjianjiang commented 2 years ago

I also tried and it worked. Yet, AFAIK, GitHub has had many incidents recently. I encountered two different symptoms with c4, but their root cause seems to be network or file corruption (git-lfs)?

srulikbd commented 2 years ago

OK, with @VictorSanh's suggestion it's now working! Thanks.