allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Error message occurred: “zipfile.BadZipFile: File is not a zip file” #5710

Closed Hoyyyaard closed 1 year ago

Hoyyyaard commented 1 year ago

Description

The error “zipfile.BadZipFile: File is not a zip file” occurs when I run the following code:

from transformers import PreTrainedTokenizerBase, AutoTokenizer
from allennlp.predictors.predictor import Predictor
from allennlp.common import Params

from allennlp.common.model_card import ModelCard
import tarfile
from pathlib import Path
from typing import List, Union, Dict, Iterable, Sequence
from allennlp.common.plugins import import_plugins

predictor = Predictor.from_path('./elmo-constituency-parser-2020.02.10.tar.gz')

elmo-constituency-parser-2020.02.10.tar.gz has already been downloaded (about 678M).

## Environment

OS: Linux

Python version: 3.8

```
absl-py 1.0.0
aiohttp 3.8.1
aiosignal 1.2.0
allennlp 2.9.1
allennlp-models 2.9.0
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.0.8
astunparse 1.6.3
async-timeout 4.0.2
attrs 22.1.0
autocommand 2.2.1
backcall 0.2.0
backports.csv 1.0.7
base58 2.1.1
beautifulsoup4 4.11.1
bleach 5.0.1
blis 0.7.8
boto3 1.24.69
botocore 1.27.69
cached-path 1.1.5
cachetools 5.0.0
catalogue 2.0.8
certifi 2022.6.15
cffi 1.15.1
charset-normalizer 2.1.0
checklist 0.0.11
cheroot 8.6.0
CherryPy 18.8.0
click 8.0.4
clip 1.0
commonmark 0.9.1
conllu 4.4.1
cryptography 38.0.1
cycler 0.11.0
cymem 2.0.6
datasets 1.18.4
debugpy 1.6.3
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.5.1
docker-pycreds 0.4.0
easydict 1.9
EasyProcess 1.1
entrypoints 0.4
et-xmlfile 1.1.0
executing 0.10.0
fairscale 0.4.5
fastjsonschema 2.16.1
feedparser 6.0.10
filelock 3.6.0
flatbuffers 2.0
fonttools 4.34.4
frozenlist 1.3.1
fsspec 2022.8.2
ftfy 6.1.1
future 0.18.2
gitdb 4.0.9
GitPython 3.1.27
google-api-core 2.10.0
google-auth 2.6.5
google-auth-oauthlib 0.4.6
google-cloud-core 2.3.2
google-cloud-storage 2.5.0
google-crc32c 1.5.0
google-pasta 0.2.0
google-resumable-media 2.3.3
googleapis-common-protos 1.56.4
grpcio 1.44.0
gym 0.10.9
h5py 2.10.0
huggingface-hub 0.8.1
idna 3.3
importlib-metadata 4.11.3
importlib-resources 5.9.0
inflect 6.0.0
iniconfig 1.1.1
ipykernel 6.15.1
ipython 8.4.0
ipython-genutils 0.2.0
ipywidgets 8.0.2
iso-639 0.4.5
jaraco.classes 3.2.2
jaraco.collections 3.5.2
jaraco.context 4.1.2
jaraco.functools 3.5.1
jaraco.text 3.9.1
jedi 0.18.1
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.1.0
json5 0.9.10
jsonlines 2.0.0
jsonnet 0.18.0
jsonschema 4.15.0
jupyter 1.0.0
jupyter-client 7.3.4
jupyter-console 6.4.4
jupyter-core 4.11.1
jupyterlab-pygments 0.2.2
jupyterlab-widgets 3.0.3
keras 2.8.0
Keras-Preprocessing 1.1.2
kiwisolver 1.4.2
langcodes 3.3.0
libclang 13.0.0
lmdb 1.3.0
lxml 4.9.1
Markdown 3.3.6
MarkupSafe 2.1.1
matplotlib 3.5.2
matplotlib-inline 0.1.6
mistune 2.0.4
more-itertools 8.14.0
multidict 6.0.2
multiprocess 0.70.13
munch 2.5.0
murmurhash 1.0.8
nbclient 0.6.7
nbconvert 7.0.0
nbformat 5.4.0
nest-asyncio 1.5.5
networkx 2.5.1
nltk 3.6.7
nn-builder 1.0.5
notebook 6.4.12
numpy 1.20.3
oauthlib 3.2.0
openai 0.20.0
opencv-python 4.6.0.66
openpyxl 3.0.10
opt-einsum 3.3.0
packaging 21.3
pandas 1.4.3
pandas-stubs 1.4.3.220704
pandocfilters 1.5.0
parso 0.8.3
path 16.4.0
pathtools 0.1.2
pathy 0.6.2
patternfork-nosql 3.6
pdfminer.six 20220524
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.1.1
pip 21.2.4
pkgutil_resolve_name 1.3.10
pluggy 1.0.0
portend 3.1.0
preshed 3.0.7
prometheus-client 0.14.1
promise 2.3
prompt-toolkit 3.0.20
protobuf 3.20.1
psutil 5.9.2
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.11.0
py-rouge 1.1
pyarrow 9.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.21
pydantic 1.8.2
pyglet 1.5.23
Pygments 2.13.0
pyparsing 3.0.9
pyrsistent 0.18.1
pytest 7.1.3
python-dateutil 2.8.2
python-docx 0.8.11
pytz 2022.1
PyVirtualDisplay 0.2.1
PyYAML 6.0
pyzmq 23.2.1
qtconsole 5.3.2
QtPy 2.2.0
regex 2022.6.2
requests 2.28.1
requests-oauthlib 1.3.1
responses 0.18.0
rich 12.5.1
rsa 4.8
s3transfer 0.6.0
sacremoses 0.0.53
scikit-learn 1.1.2
scipy 1.8.0
Send2Trash 1.8.0
sentencepiece 0.1.97
sentry-sdk 1.9.8
setproctitle 1.3.2
setuptools 63.4.1
sgmllib3k 1.0.0
Shapely 1.7.1
shortuuid 1.0.9
six 1.16.0
smart-open 5.2.1
smmap 5.0.0
soupsieve 2.3.2.post1
spacy 3.2.4
spacy-legacy 3.0.10
spacy-loggers 1.0.3
sqlitedict 2.0.0
srsly 2.4.4
stack-data 0.4.0
stanfordcorenlp 3.9.1.1
tempora 5.0.2
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.4.1
tensorflow-io-gcs-filesystem 0.24.0
termcolor 1.1.0
terminado 0.15.0
tf-estimator-nightly 2.8.0.dev2021122109
thinc 8.0.17
threadpoolctl 3.1.0
tinycss2 1.1.1
tokenizers 0.10.3
tomli 2.0.1
torch 1.8.1+cu101
torchaudio 0.8.1
torchvision 0.9.1+cu101
tornado 6.2
tqdm 4.64.1
traitlets 5.3.0
transformers 4.12.5
typer 0.4.2
typing_extensions 4.1.1
urllib3 1.26.12
wandb 0.12.21
wasabi 0.10.1
wcwidth 0.2.5
webencodings 0.5.1
Werkzeug 2.1.1
wheel 0.37.1
widgetsnbextension 4.0.3
word2number 1.1
wrapt 1.14.0
xxhash 3.0.0
yarl 1.8.1
zc.lockfile 2.0
zipp 3.8.0
```

Hoyyyaard commented 1 year ago

Traceback (most recent call last):
  File "debug.py", line 15, in <module>
    predictor = Predictor.from_path('/mnt/cephfs/home/zhihongyan/VLN/DUET/datasets/elmo')
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/allennlp/predictors/predictor.py", line 364, in from_path
    plugins.import_plugins()
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/allennlp/common/plugins.py", line 79, in import_plugins
    import_module_and_submodules(
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/allennlp/common/util.py", line 362, in import_module_and_submodules
    import_module_and_submodules(subpackage, exclude=exclude)
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/allennlp/common/util.py", line 351, in import_module_and_submodules
    module = importlib.import_module(package_name)
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/allennlp/commands/__init__.py", line 32, in <module>
    from allennlp.commands.checklist import CheckList
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/allennlp/commands/checklist.py", line 22, in <module>
    from allennlp.confidence_checks.task_checklists.task_suite import TaskSuite
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/allennlp/confidence_checks/task_checklists/__init__.py", line 4, in <module>
    from allennlp.confidence_checks.task_checklists.task_suite import TaskSuite
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/allennlp/confidence_checks/task_checklists/task_suite.py", line 9, in <module>
    from checklist.perturb import Perturb
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/checklist/perturb.py", line 7, in <module>
    from pattern.en import tenses
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/pattern/text/en/__init__.py", line 61, in <module>
    from pattern.text.en.inflect import (
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/pattern/text/en/__init__.py", line 80, in <module>
    from pattern.text.en import wordnet
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/pattern/text/en/wordnet/__init__.py", line 57, in <module>
    nltk.data.find("corpora/" + token)
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/nltk/data.py", line 555, in find
    return find(modified_name, paths)
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/site-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/mnt/cephfs/home/zhihongyan/anaconda3/envs/vlnduet_probe/lib/python3.8/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

epwalsh commented 1 year ago

I believe we've encountered this before. It's an issue with race conditions in NLTK when it tries to download the same corpora from two processes at the same time. You end up with a corrupted download. Try this:

# Remove existing corrupted NLTK downloads
rm -rf ~/nltk_data
# Re-download everything
python -c 'import nltk; [nltk.download(p) for p in ("wordnet", "wordnet_ic", "sentiwordnet", "omw", "omw-1.4")]'
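
To see which archives are actually damaged before (or instead of) deleting the whole directory, a small check along these lines may help. This is only a sketch under stated assumptions: `find_corrupt_zips` is a hypothetical helper name, not part of NLTK or AllenNLP, and the script assumes the data lives in `~/nltk_data`, the same directory the commands above remove. It uses only the standard-library `zipfile` module, which is what raises the `BadZipFile` error in the traceback.

```python
import zipfile
from pathlib import Path


def find_corrupt_zips(data_dir):
    """Return the .zip files under data_dir that cannot be read as zip archives.

    Hypothetical helper for illustration; not an NLTK or AllenNLP API.
    """
    corrupt = []
    for path in sorted(Path(data_dir).expanduser().rglob("*.zip")):
        try:
            with zipfile.ZipFile(path) as archive:
                # testzip() returns the name of the first member with a bad
                # CRC or header, or None if every member checks out.
                if archive.testzip() is not None:
                    corrupt.append(path)
        except zipfile.BadZipFile:
            # Same exception type as in the traceback: the file is truncated
            # or is not a zip archive at all.
            corrupt.append(path)
    return corrupt


if __name__ == "__main__":
    # Assumption: NLTK data is stored in the per-user directory ~/nltk_data.
    for bad in find_corrupt_zips("~/nltk_data"):
        print(f"corrupt archive: {bad}")
```

Any path it prints is a candidate for deletion and re-download with `nltk.download`. If the data is stored somewhere else, pass that directory to the function instead.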
github-actions[bot] commented 1 year ago

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇