allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Can't load models with .zip extension #5718

Closed serenalotreck closed 1 year ago

serenalotreck commented 1 year ago

Description

When I use the Python API to load a pretrained model whose file has the extension .zip, loading fails because the archive is opened with the "r:gz" mode specified here. The models in question are from the PURE repository.

Python traceback:

```
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
~/anaconda3/envs/dygiepp/lib/python3.7/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1645 try:
-> 1646 t = cls.taropen(name, mode, fileobj, **kwargs)
   1647 except OSError:

~/anaconda3/envs/dygiepp/lib/python3.7/tarfile.py in taropen(cls, name, mode, fileobj, **kwargs)
   1622 raise ValueError("mode must be 'r', 'a', 'w' or 'x'")
-> 1623 return cls(name, mode, fileobj, **kwargs)
   1624

~/anaconda3/envs/dygiepp/lib/python3.7/tarfile.py in __init__(self, name, mode, fileobj, format, tarinfo, dereference, ignore_zeros, encoding, errors, pax_headers, debug, errorlevel, copybufsize)
   1485 self.firstmember = None
-> 1486 self.firstmember = self.next()
   1487

~/anaconda3/envs/dygiepp/lib/python3.7/tarfile.py in next(self)
   2288 try:
-> 2289 tarinfo = self.tarinfo.fromtarfile(self)
   2290 except EOFHeaderError as e:

~/anaconda3/envs/dygiepp/lib/python3.7/tarfile.py in fromtarfile(cls, tarfile)
   1093 """
-> 1094 buf = tarfile.fileobj.read(BLOCKSIZE)
   1095 obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)

~/anaconda3/envs/dygiepp/lib/python3.7/gzip.py in read(self, size)
    286 raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 287 return self._buffer.read(size)
    288

~/anaconda3/envs/dygiepp/lib/python3.7/_compression.py in readinto(self, b)
     67 with memoryview(b) as view, view.cast("B") as byte_view:
---> 68 data = self.read(len(byte_view))
     69 byte_view[:len(data)] = data

~/anaconda3/envs/dygiepp/lib/python3.7/gzip.py in read(self, size)
    473 self._init_read()
--> 474 if not self._read_gzip_header():
    475 self._size = self._pos

~/anaconda3/envs/dygiepp/lib/python3.7/gzip.py in _read_gzip_header(self)
    421 if magic != b'\037\213':
--> 422 raise OSError('Not a gzipped file (%r)' % magic)
    423

OSError: Not a gzipped file (b'PK')

During handling of the above exception, another exception occurred:

ReadError Traceback (most recent call last)
/tmp/local/63761026/ipykernel_27406/2934808558.py in
----> 1 scierc_pure = Model.from_archive('https://nlp.cs.princeton.edu/projects/pure/scierc_models/ent-scib-ctx300.zip')

~/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/model.py in from_archive(cls, archive_file, vocab)
    480 from allennlp.models.archival import load_archive # here to avoid circular imports
    481
--> 482 model = load_archive(archive_file).model
    483 if vocab:
    484 model.vocab.extend_from_vocab(vocab)

~/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/archival.py in load_archive(archive_file, cuda_device, overrides, weights_file)
    218 serialization_dir = resolved_archive_file
    219 else:
--> 220 with extracted_archive(resolved_archive_file, cleanup=False) as tempdir:
    221 serialization_dir = tempdir
    222

~/anaconda3/envs/dygiepp/lib/python3.7/contextlib.py in __enter__(self)
    110 del self.args, self.kwds, self.func
    111 try:
--> 112 return next(self.gen)
    113 except StopIteration:
    114 raise RuntimeError("generator didn't yield") from None

~/anaconda3/envs/dygiepp/lib/python3.7/site-packages/allennlp/models/archival.py in extracted_archive(resolved_archive_file, cleanup)
    299 tempdir = tempfile.mkdtemp()
    300 logger.info(f"extracting archive file {resolved_archive_file} to temp dir {tempdir}")
--> 301 with tarfile.open(resolved_archive_file, "r:gz") as archive:
    302 archive.extractall(tempdir)
    303 yield tempdir

~/anaconda3/envs/dygiepp/lib/python3.7/tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1591 else:
   1592 raise CompressionError("unknown compression type %r" % comptype)
-> 1593 return func(name, filemode, fileobj, **kwargs)
   1594
   1595 elif "|" in mode:

~/anaconda3/envs/dygiepp/lib/python3.7/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1648 fileobj.close()
   1649 if mode == 'r':
-> 1650 raise ReadError("not a gzip file")
   1651 raise
   1652 except:

ReadError: not a gzip file
```

Environment

OS: CentOS Linux release 7.9.2009 Linux Kernel 3.10.0-1160.36.2.el7.x86_64

Python version: Python 3.6.4

Output of pip freeze:

``` absl-py==0.5.0 alabaster==0.7.12 alembic==1.7.5 allennlp==1.1.0 allennlp-models==1.1.0 anyio==3.6.1 appdirs==1.4.3 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 artemis==0.1.4 ase==3.17.0 asn1crypto==0.24.0 astor==0.7.1 atomicwrites==1.3.0 attrs==19.1.0 autopage==0.4.0 awscli==1.18.75 Babel==2.9.1 backcall==0.1.0 backports.csv==1.0.7 bcrypt==3.1.4 beartype==0.3.2 beautifulsoup4==4.8.1 biom-format==2.1.7 biopython==1.72 bitstring==3.1.5 bleach==3.1.4 blessings==1.7 blis==0.4.1 blist==1.3.6 bokeh==2.3.2 BoltzTraP2==20.7.1 boto3==1.20.20 botocore==1.23.20 brewer2mpl==1.4.1 BUSCO==3.1.0 bz2file==0.98 cachetools==4.2.2 catalogue==1.0.0 certifi==2020.6.20 cffi==1.14.3 cftime==1.3.1 chardet==3.0.4 charset-normalizer==2.0.12 checksumdir==1.2.0 cli-helpers==0.2.3 click==7.1.2 cliff==3.10.0 cmaes==0.8.2 cmd2==2.3.3 colorama==0.4.3 colored==1.4.2 colorlog==6.6.0 commonmark==0.9.1 configobj==5.0.6 conllu==4.1 contextvars==2.4 cryptography==2.1.4 css-html-js-minify==2.5.5 cupy-cuda100==6.0.0 cutadapt==3.7 cycler==0.10.0 cymem==2.0.3 Cython==0.27.3 dataclasses==0.8 deap==1.2.2 decorator==4.4.2 deepTools==3.1.3 defusedxml==0.6.0 deprecation==2.1.0 dnaio==0.7.1 docopt==0.6.2 docutils==0.14 doqu==0.28.2 ecdsa==0.13 en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz entrypoints==0.3 enum34==1.1.10 fastrlock==0.4 filelock==3.0.12 Flask==1.0.3 Forthon==0.8.49 frozenlist==1.2.0 ftfy==6.0.3 future==0.16.0 gast==0.2.0 ghp-import==2.0.2 git-filter-repo==2.34.0 gitdb==4.0.9 GitPython==3.1.18 globus-cli==2.1.0 globus-sdk==2.0.1 google-auth==1.32.1 gpustat==0.5.0 greenlet==1.1.2 GridDataFormats==0.5.0 grpcio==1.15.0 gudhi==3.4.1.post1 gurobipy==8.0.1 h5py==2.8.0 HTSeq==0.12.4 huggingface-hub==0.4.0 humanize==3.4.1 idna==2.10 idr==2.0.2 imagesize==1.1.0 iminuit==2.4.0 immutables==0.19 importlib-metadata==4.8.2 importlib-resources==5.4.0 intervaltree==3.0.2 ipykernel==5.2.1 ipython==7.2.0 ipython-genutils==0.2.0 
ipywidgets==7.5.1 isal==0.11.1 itsdangerous==1.1.0 jedi==0.13.1 Jinja2==3.0.3 jmespath==0.10.0 joblib==0.17.0 json5==0.9.10 jsonnet==0.17.0 jsonpickle==2.0.0 jsonschema==3.2.0 jupyter==1.0.0 jupyter-client==7.0.6 jupyter-console==6.1.0 jupyter-core==4.8.1 jupyter-server==1.13.1 jupyterlab==3.2.9 jupyterlab-server==2.10.3 kaleido==0.0.3.post1 kiwisolver==1.0.1 lapels==1.1.1 liac-arff==2.1.1 llvmlite==0.36.0 lxml==4.6.4 Mako==1.1.6 mappy==2.20 Markdown==3.3.6 MarkupSafe==2.0.1 matlabengineforpython===R2019b matplotlib==3.3.2 matplotlib-venn==0.11.5 mergedeep==1.3.4 MicrobeCensus==1.1.0 misopy==0.5.4 mistune==0.8.4 mkdocs==1.2.3 mkdocs-git-revision-date-localized-plugin==0.11.1 mkdocs-material==8.1.7 mkdocs-material-extensions==1.0.3 mkdocs-redirects==1.0.4 mmtf-python==1.1.2 mock==2.0.0 modtools==1.0.2 more-itertools==6.0.0 mpi4py==3.0.0 msgpack==1.0.0 multidict==5.2.0 murmurhash==1.0.2 NanoComp==1.15.0 NanoFilt==2.7.1 nanoget==1.15.0 NanoLyse==1.2.0 nanomath==1.0.1 nanopack==1.1.0 NanoPlot==1.38.0 nanoQC==0.9.4 NanoStat==1.5.0 natsort==7.1.1 nbclassic==0.3.5 nbconvert==5.6.1 nbformat==4.4.0 ncbi-genome-download==0.2.9 nest-asyncio==1.5.1 netaddr==0.7.19 netCDF4==1.5.5.1 netifaces==0.10.6 networkx==2.5 nltk==3.6.5 nmslib==2.0.6 nose==1.3.7 notebook==6.0.3 numba==0.53.1 numpy==1.19.2 numpydoc==0.8.0 nvidia-ml-py3==7.352.0 oauthlib==3.1.1 opencv-python==4.4.0.44 optuna==2.10.0 overrides==3.1.0 packaging==21.3 pandas==0.25.2 pandocfilters==1.4.2 paramiko==2.4.0 parso==0.3.1 pauvre==0.2 paycheck==1.0.2 pbr==3.1.1 pexpect==4.6.0 pickleshare==0.7.5 Pillow==7.2.0 plac==1.1.3 plotly==4.10.0 pluggy==0.9.0 preshed==3.0.2 prettytable==2.4.0 prometheus-client==0.7.1 prompt-toolkit==2.0.7 protobuf==3.17.3 psutil==5.6.1 ptyprocess==0.6.0 py==1.8.0 py-enigma==0.1 py-rouge==1.1 py2bit==0.3.0 pyarrow==1.0.1 pyasn1==0.4.8 pyasn1-modules==0.2.8 pybedtools==0.8.0 pyBigWig==0.3.12 pybind11==2.5.0 pycairo==1.19.1 pycparser==2.20 pycrypto==2.6.1 pygist==2.2.0 Pygments==2.11.2 PyJWT==1.7.1 
pymdown-extensions==9.1 PyNaCl==1.2.1 pyparsing==2.2.0 pyperclip==1.8.2 pyrsistent==0.18.0 pysam==0.15.1 pysbd==0.2.3 pytest==4.3.1 python-dateutil==2.8.1 Python-Deprecated==1.1.0 python-igraph==0.8.0 python-Levenshtein==0.12.2 python-magic==0.4.18 pytz==2017.3 PyYAML==5.3.1 pyyaml_env_tag==0.1 pyzmq==19.0.0 qtconsole==4.7.3 QtPy==1.9.0 regex==2021.11.10 requests==2.25.1 requests-oauthlib==1.3.0 retrying==1.3.3 rich==9.13.0 rsa==3.4.2 s3cmd==2.1.0 s3transfer==0.5.0 sacremoses==0.0.46 schematics==2.1.0 scikit-learn==0.23.1 scipy==1.5.4 scispacy==0.2.4 screed==1.0 seaborn==0.10.1 Send2Trash==1.5.0 sentencepiece==0.1.96 seqmagick==0.7.0 six==1.16.0 slurm-gpustat==0.0.7 smmap==5.0.0 sniffio==1.2.0 snowballstemmer==1.2.1 sortedcontainers==2.1.0 soupsieve==2.3.1 sourmash==3.5.0 spacy==2.3.7 spglib==1.16.0 Sphinx==1.8.3 sphinxcontrib-websupport==1.1.0 SQLAlchemy==1.4.27 sqlparse==0.2.4 srsly==1.0.2 statistics==1.0.3.5 stevedore==3.5.0 suspenders==0.2.6 tensorboard==1.10.0 tensorboard-plugin-wit==1.8.0 tensorboardX==2.4.1 tensorflow==1.10.1 termcolor==1.1.0 terminado==0.8.3 terminaltables==3.1.0 testpath==0.4.4 texttable==1.6.2 Theano==1.0.3 thinc==7.4.5 threadpoolctl==2.1.0 tigmint==1.1.2 tinydb==3.13.0 tokenizers==0.8.1rc1 torch==1.6.0 torchvision==0.8.2 tornado==6.1 tqdm==4.46.1 traitlets==4.3.2 transformers==3.0.2 trash-cli==0.17.1.14.post0 typing-extensions==3.7.4.3 umi-tools==1.0.0 urllib3==1.26.7 virtualenv==15.1.0 wasabi==0.6.0 watchdog==2.1.6 wcwidth==0.1.7 webencodings==0.5.1 websocket-client==1.3.1 Werkzeug==0.14.1 widgetsnbextension==3.5.1 word2number==1.1 xopen==1.4.0 yapf==0.31.0 yarl==1.7.2 zipp==3.1.0 ```

Steps to reproduce

Example source:

```
from allennlp.models.model import Model

scierc_pure = Model.from_archive('https://nlp.cs.princeton.edu/projects/pure/scierc_models/ent-scib-ctx300.zip')
```

dirkgr commented 1 year ago

Zip files are not tar files. This part of the code expects a tar file. Why would you expect this to work? Where did you get an AllenNLP model stored as a zip file?
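To illustrate the distinction, here is a minimal sketch that identifies an archive by its contents rather than its file extension and extracts it with the matching standard-library module. The helper names `detect_archive_kind` and `extract_any_archive` are hypothetical and are not part of AllenNLP. Extracting a zip file this way does not make it an AllenNLP archive; it only avoids the gzip error shown in the traceback.

```python
import tarfile
import tempfile
import zipfile


def detect_archive_kind(path):
    """Return "zip", "tar", or "unknown" based on the file's contents, not its name."""
    path = str(path)
    if zipfile.is_zipfile(path):
        return "zip"
    if tarfile.is_tarfile(path):  # also recognizes compressed tars such as .tar.gz
        return "tar"
    return "unknown"


def extract_any_archive(path, dest=None):
    """Extract a zip or tar archive into dest (a fresh temp dir by default).

    Hypothetical helper for illustration only. extractall() should only be
    used on archives from a source you trust.
    """
    path = str(path)
    dest = dest or tempfile.mkdtemp()
    kind = detect_archive_kind(path)
    if kind == "zip":
        with zipfile.ZipFile(path) as archive:
            archive.extractall(dest)
    elif kind == "tar":
        # "r:*" lets tarfile pick the compression instead of assuming gzip.
        with tarfile.open(path, "r:*") as archive:
            archive.extractall(dest)
    else:
        raise ValueError(f"{path} is neither a zip nor a tar archive")
    return dest
```

Calling `detect_archive_kind` on a locally downloaded copy of the file should report `"zip"`, which is consistent with the `b'PK'` magic bytes in the error message.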

serenalotreck commented 1 year ago

I was looking at models from the PURE project. I had assumed I could use AllenNLP to load them because some of the code that builds their models uses AllenNLP methods. Is it possible that the models were actually created with another library (e.g. torch) and I need to use that library to load them? I tried torch.load as a cursory attempt, but it failed.
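One way to narrow down which library produced the files is to list what the zip actually contains before trying any loader. The sketch below uses only the standard library. Both function names are hypothetical, and the check for `config.json` plus `weights.th` rests on an assumption about the usual AllenNLP archive layout, so treat a match as a hint rather than proof.

```python
import zipfile


def list_zip_members(path):
    """Return the sorted member names of a zip archive without extracting it."""
    with zipfile.ZipFile(str(path)) as archive:
        return sorted(archive.namelist())


def looks_like_allennlp_archive(names):
    """Heuristic check on a list of archive member names.

    Assumption (not confirmed in this thread): an AllenNLP archive stores
    a config.json and a weights.th file. This returns True if both base
    names appear anywhere in the archive.
    """
    base_names = {name.rsplit("/", 1)[-1] for name in names}
    return {"config.json", "weights.th"} <= base_names
```

For a locally downloaded copy, `list_zip_members("ent-scib-ctx300.zip")` would show the stored file names, and those names usually indicate which library's loading function applies.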