allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

After pip install dolma, running dolma --help raises zipfile.BadZipFile: File is not a zip file #201

Closed XiaozhuLove closed 2 months ago

XiaozhuLove commented 2 months ago
(dolma) whzhu_st@PowerEdge-R740:~$ dolma --help
Traceback (most recent call last):
  File "/home/whzhu_st/anaconda3/envs/dolma/bin/dolma", line 5, in <module>
    from dolma.cli.__main__ import main
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/site-packages/dolma/__init__.py", line 15, in <module>
    from .taggers import *  # noqa: E402
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/site-packages/dolma/taggers/__init__.py", line 1, in <module>
    from . import (
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/site-packages/dolma/taggers/jigsaw.py", line 12, in <module>
    from ..core.ft_tagger import BaseFastTextTagger, Prediction
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/site-packages/dolma/core/ft_tagger.py", line 20, in <module>
    from .utils import split_paragraphs, split_sentences
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/site-packages/dolma/core/utils.py", line 25, in <module>
    nltk.data.find("tokenizers/punkt")
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/site-packages/nltk/data.py", line 551, in find
    return find(modified_name, paths)
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/site-packages/nltk/data.py", line 538, in find
    return ZipFilePathPointer(p, zipentry)
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/site-packages/nltk/data.py", line 391, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/site-packages/nltk/data.py", line 1020, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/zipfile.py", line 1258, in __init__
    self._RealGetContents()
  File "/home/whzhu_st/anaconda3/envs/dolma/lib/python3.10/zipfile.py", line 1325, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
(dolma) whzhu_st@PowerEdge-R740:~$

Looking forward to your reply. Thank you very much!!!

soldni commented 2 months ago

Could you list the dependencies in your environment? pip list > dependencies.txt 2>&1

XiaozhuLove commented 2 months ago

Output of pip list:

Package                   Version
aiobotocore               2.15.1
aiohappyeyeballs          2.4.0
aiohttp                   3.10.5
aioitertools              0.12.0
aiosignal                 1.3.1
antlr4-python3-runtime    4.9.3
anyascii                  0.3.2
async-timeout             4.0.3
attrs                     24.2.0
blingfire                 0.1.8
boto3                     1.35.23
botocore                  1.35.23
cached_path               1.6.3
cachetools                5.5.0
certifi                   2024.8.30
charset-normalizer        3.3.2
click                     8.1.7
dolma                     1.0.12
fasttext-wheel            0.9.2
filelock                  3.13.4
frozenlist                1.4.1
fsspec                    2024.9.0
google-api-core           2.20.0
google-auth               2.35.0
google-cloud-core         2.4.1
google-cloud-storage      2.18.2
google-crc32c             1.6.0
google-resumable-media    2.7.2
googleapis-common-protos  1.65.0
huggingface-hub           0.23.5
idna                      3.10
jmespath                  1.0.1
joblib                    1.4.2
markdown-it-py            3.0.0
mdurl                     0.1.2
msgspec                   0.18.6
multidict                 6.1.0
necessary                 0.4.3
nltk                      3.9.1
numpy                     1.26.4
omegaconf                 2.3.0
packaging                 24.1
pip                       24.2
platformdirs              4.3.6
proto-plus                1.24.0
protobuf                  5.28.2
pyasn1                    0.6.1
pyasn1_modules            0.4.1
pybind11                  2.13.6
Pygments                  2.18.0
python-dateutil           2.9.0.post0
PyYAML                    6.0.2
regex                     2024.9.11
requests                  2.32.3
requirements-parser       0.11.0
rich                      13.8.1
rsa                       4.9
s3fs                      2024.9.0
s3transfer                0.10.2
setuptools                75.1.0
six                       1.16.0
smart-open                7.0.4
tokenizers                0.19.1
tqdm                      4.66.5
types-setuptools          75.1.0.20240917
typing_extensions         4.12.2
uniseg                    0.8.1
urllib3                   2.2.3
wheel                     0.44.0
wrapt                     1.16.0
yarl                      1.11.1
zstandard                 0.23.0

Thank you very much!!!

XiaozhuLove commented 2 months ago

Hello, could this be caused by the versions of the installed dependencies? I created a new virtual environment and ran pip install dolma directly, so I would not expect any issues. Looking forward to your reply, thank you.
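
The traceback above fails inside nltk.data.find("tokenizers/punkt") while opening a zip archive. One possible explanation, not confirmed anywhere in this thread, is a truncated or otherwise corrupted punkt.zip in one of the nltk data directories rather than a dependency version conflict. A minimal sketch for checking that assumption follows; the helper names find_corrupt_zips and nltk_search_dirs are hypothetical and not part of dolma or nltk.

```python
import os
import zipfile


def find_corrupt_zips(search_dirs):
    """Return sorted paths of *.zip files under search_dirs that are not valid zip archives."""
    corrupt = []
    for base in search_dirs:
        if not os.path.isdir(base):
            continue  # nltk lists several candidate directories; many do not exist
        for root, _dirs, files in os.walk(base):
            for name in files:
                path = os.path.join(root, name)
                if name.endswith(".zip") and not zipfile.is_zipfile(path):
                    corrupt.append(path)
    return sorted(corrupt)


def nltk_search_dirs():
    """Directories nltk searches for data; an empty list if nltk is not installed."""
    try:
        import nltk.data
    except ImportError:
        return []
    return list(nltk.data.path)


if __name__ == "__main__":
    for path in find_corrupt_zips(nltk_search_dirs()):
        print("not a valid zip:", path)
```

If this reports a file such as a punkt.zip, deleting it and re-downloading the resource (for example with python -m nltk.downloader punkt) may resolve the error, assuming the corrupted archive is indeed the cause.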