allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0
4.48k stars, 449 forks

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory #563

Open Jimmy-Yang1217 opened 5 months ago

Jimmy-Yang1217 commented 5 months ago

🐛 Describe the bug

I am new to OLMo and I want to retrain (i.e., fine-tune) several of the checkpoints listed in the CSV under checkpoints/official. I followed the instructions in the README and downloaded the checkpoint via the link, but 'RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory' is always thrown.

According to answers to similar questions on Stack Overflow, this error might be caused by a corrupted checkpoint file or the wrong torch version. I tried different checkpoints and torch versions from 2.0.0 to 2.3.0, but the error is still there. Also, the checkpoint download appears to finish, reaching 100%, so the ckpt files should not be corrupted.

Here is my terminal command:

torchrun --nproc_per_node=1 scripts/train.py configs/official/OLMo-1B.yaml --load_path=https://olmo-checkpoints.org/ai2-llm/olmo-small/46zc5fly/step369000-unsharded --save_folder=/opt/data/private/OLMo/olmo/step369000 --wandb=null

And the error: RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
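For context on what this message usually means: recent PyTorch versions save checkpoints as zip archives, and a zip archive keeps its central directory at the end of the file. A download that stops early therefore leaves a file that still begins like a zip but has no central directory. The following is a minimal sketch of that effect using only the Python standard library; it does not call torch, and the helper name `make_zip_bytes` is made up for illustration.

```python
import io
import zipfile


def make_zip_bytes(payload: bytes) -> bytes:
    """Build a small in-memory zip archive holding one uncompressed entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("data.bin", payload)
    return buf.getvalue()


complete = make_zip_bytes(b"\x00" * 4096)

# Simulate an interrupted download: keep only the first half of the bytes.
truncated = complete[: len(complete) // 2]

# The truncated copy still starts with the zip local-header magic ...
print(truncated[:4] == b"PK\x03\x04")

# ... but only the complete copy has the end-of-central-directory record
# that zipfile.is_zipfile (and any other zip reader) looks for.
print(zipfile.is_zipfile(io.BytesIO(complete)))
print(zipfile.is_zipfile(io.BytesIO(truncated)))
```

So a progress bar reaching 100% is not proof by itself that the file on disk is a complete archive; checking the file directly is more reliable.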

Versions

Python 3.8.10 accelerate==0.25.0 ai2-olmo==0.3.1 aiofiles==23.2.1 aiohttp==3.8.6 aiosignal==1.3.1 albumentations==1.3.1 altair==5.1.2 annotated-types==0.6.0 antlr4-python3-runtime==4.9.3 anyio==4.0.0 apache-beam==2.55.1 async-timeout==4.0.3 attrs==23.1.0 backports.zoneinfo==0.2.1 beautifulsoup4==4.12.3 bitsandbytes==0.41.3.post2 boto3==1.34.93 botocore==1.34.93 braceexpand==0.1.7 cached-path==1.6.2 cachetools==5.3.3 certifi==2019.11.28 chardet==3.0.4 charset-normalizer==3.3.1 click==8.1.7 clip==0.2.0 clip-benchmark==1.5.0 cloudpickle==2.2.1 cmake==3.27.7 contourpy==1.1.1 crcmod==1.7 cycler==0.12.1 dataclasses==0.6 datasets==2.14.5 dbus-python==1.2.16 dill==0.3.1.1 dnspython==2.6.1 docker-pycreds==0.4.0 docopt==0.6.2 exceptiongroup==1.1.3 ExifRead-nocycle==3.0.1 fastapi==0.104.1 fastavro==1.9.4 fasteners==0.19 ffmpy==0.3.1 filelock==3.12.4 fire==0.4.0 fonttools==4.44.0 frozenlist==1.4.0 fsspec==2023.9.2 ftfy==6.1.1 gdown==5.1.0 gitdb==4.0.11 GitPython==3.1.40 google-api-core==2.18.0 google-auth==2.29.0 google-cloud-core==2.4.1 google-cloud-storage==2.16.0 google-crc32c==1.5.0 google-resumable-media==2.7.0 googleapis-common-protos==1.63.0 gradio==3.39.0 gradio-client==0.7.0 grpcio==1.62.2 h11==0.14.0 hdfs==2.7.3 httpcore==1.0.2 httplib2==0.22.0 httpx==0.25.1 huggingface-hub==0.22.2 idna==2.8 imageio==2.31.6 img2dataset==1.42.0 importlib-resources==6.1.1 Jinja2==3.1.2 jmespath==1.0.1 joblib==1.3.2 Js2Py==0.74 jsonpickle==3.0.4 jsonschema==4.20.0 jsonschema-specifications==2023.11.1 kiwisolver==1.4.5 lazy-loader==0.3 lightning-utilities==0.11.2 linkify-it-py==2.0.2 lit==17.0.3 loguru==0.7.2 loralib==0.1.2 markdown-it-py==3.0.0 MarkupSafe==2.1.3 matplotlib==3.7.3 mdit-py-plugins==0.3.3 mdurl==0.1.2 mpmath==1.3.0 multidict==6.0.4 multiprocess==0.70.15 networkx==3.1 numpy==1.21.0 nvidia-cublas-cu11==11.10.3.66 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-nvrtc-cu12==12.1.105 
nvidia-cuda-runtime-cu11==11.7.99 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu11==8.5.0.96 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu11==10.9.0.58 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu11==10.2.10.91 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu11==11.7.4.91 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu11==2.14.3 nvidia-nccl-cu12==2.19.3 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu11==11.7.91 nvidia-nvtx-cu12==12.1.105 objsize==0.7.0 omegaconf==2.3.0 open-clip-torch==2.23.0 openai-clip==1.0.1 opencv-python-headless==4.8.1.78 orjson==3.9.10 packaging==23.2 pandas==1.5.3 pathtools==0.1.2 peft==0.7.1 Pillow==9.1.1 pkgutil-resolve-name==1.3.10 promise==2.3 proto-plus==1.23.0 protobuf==3.20.3 psutil==5.9.6 pyarrow==10.0.1 pyarrow-hotfix==0.6 pyasn1==0.6.0 pyasn1-modules==0.4.0 pycocoevalcap==1.2 pycocotools==2.0.7 pydantic==2.5.1 pydantic-core==2.14.3 pydot==1.4.2 pydub==0.25.1 pygments==2.17.2 PyGObject==3.36.0 pyjsparser==2.7.1 pymongo==4.7.0 pyparsing==3.1.1 PySocks==1.7.1 python-apt==2.0.0+ubuntu0.20.4.7 python-dateutil==2.8.2 python-multipart==0.0.6 pytz==2023.3.post1 PyWavelets==1.4.1 PyYAML==6.0.1 quant-cuda==0.0.0 qudida==0.0.4 referencing==0.31.0 regex==2023.10.3 requests==2.31.0 requests-unixsocket==0.2.0 rich==13.7.1 rpds-py==0.13.0 rsa==4.9 s3transfer==0.10.1 safetensors==0.4.3 scikit-image==0.21.0 scikit-learn==1.3.2 scipy==1.10.1 semantic-version==2.10.0 sentencepiece==0.1.99 sentry-sdk==1.33.1 setproctitle==1.3.3 shortuuid==1.0.11 six==1.14.0 smmap==5.0.1 sniffio==1.3.0 soupsieve==2.5 starlette==0.27.0 sympy==1.12 termcolor==2.3.0 threadpoolctl==3.2.0 tifffile==2023.7.10 timm==0.9.10 tokenizers==0.19.1 toolz==0.12.0 torch==2.0.0 torch-summary==1.4.5 torchaudio==0.9.0 torchmetrics==1.3.2 torchvision==0.15.2+cu117 tqdm==4.66.1 transformers==4.40.1 triton==2.0.0 typing-extensions==4.11.0 tzdata==2023.3 tzlocal==5.2 uc-micro-py==1.0.2 urllib3==1.26.18 uvicorn==0.24.0.post1 
wandb==0.12.21 wcwidth==0.2.8 webdataset==0.2.72 websockets==11.0.3 xxhash==3.4.1 yarl==1.9.2 zipp==3.17.0 zstandard==0.22.0

dumitrac commented 5 months ago

@Jimmy-Yang1217 - could you please include the log before the error occurs? I'm curious when exactly the error is thrown. Thank you!

Jimmy-Yang1217 commented 4 months ago

Hi, I have already solved this problem: some of the downloaded files were broken, and the solution is to go to the cache and make sure the specific downloaded ckpt file is complete.

Here is another problem I would like your help with. Because the checkpoint models differ from the released hf-OLMo, I found it hard to evaluate a checkpoint directly with lm-evaluation-harness (lm_eval): there is no suitable model type that the OLMo checkpoint fits into. So how do you evaluate OLMo checkpoints on downstream tasks? Is there a fast way? Thanks for your time!
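The cache check mentioned in this comment could be sketched roughly as follows. This is not code from the OLMo repository: `find_truncated_zips` is a hypothetical helper, and it relies on the heuristic that a file starting with the zip magic bytes but lacking a central directory was most likely cut off during download.

```python
from pathlib import Path
import zipfile

# First four bytes of every zip local file header.
ZIP_MAGIC = b"PK\x03\x04"


def find_truncated_zips(cache_dir):
    """Return files under cache_dir that begin like a zip archive
    but cannot be opened as one (hypothetical helper)."""
    suspects = []
    for path in sorted(Path(cache_dir).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            head = fh.read(4)
        # Files that never looked like a zip (metadata, lock files,
        # YAML configs) are skipped instead of being reported.
        if head == ZIP_MAGIC and not zipfile.is_zipfile(path):
            suspects.append(path)
    return suspects
```

Assuming the downloads live in the default cache directory of the `cached-path` package (believed to be `~/.cache/cached_path`, but worth confirming on your machine), calling `find_truncated_zips(Path.home() / ".cache" / "cached_path")` lists the files to delete and download again.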
