Open srolskyi opened 7 months ago
Hey @srolskyi and @bheinzerling,
I debugged that issue and debug-printed the path for self.emb_file:
$ ls -hl /home/stefan/.cache/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin
-rw-rw-r-- 1 stefan stefan 3,7M Mär 15 16:34 /home/stefan/.cache/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin
It was downloaded from this URL: https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz
However, when I download the archive manually and extract it, it has the following size:
$ ls -hl ~/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin
-rw-r--r-- 1 stefan stefan 3,9M Mär 19 2018 /home/stefan/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin
With this file I can load the vectors without any problem:
In [1]: from gensim.models import KeyedVectors
In [2]: vecs = KeyedVectors.load_word2vec_format("/home/stefan/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin", binary=True)
In [3]: vecs
Out[3]: <gensim.models.keyedvectors.KeyedVectors at 0x71cbac1c0410>
So I strongly suspect that the unpacking routine is currently not working and that a "broken" word embeddings file is then loaded, which causes the error.
After some more debugging and reading the code:
stefan@ae-13412:~$ curl -LI https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz
HTTP/1.1 301 Moved Permanently
Date: Fri, 15 Mar 2024 15:43:13 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.2.34
Location: https://bpemb.h-its.org/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz
Content-Type: text/html; charset=iso-8859-1
HTTP/2 200
server: nginx
date: Fri, 15 Mar 2024 15:43:14 GMT
content-type: application/gzip
content-length: 3784656
last-modified: Mon, 09 Apr 2018 22:27:16 GMT
etag: "39bfd0-56971e878b900"
accept-ranges: bytes
strict-transport-security: max-age=15768000
At the end, you can see that the redirected request has an application/gzip content type.
However, the current code expects an application/x-gzip content type header.
This is the reason why the archive is not properly extracted.
@bheinzerling I think the best option here is to check whether gzip is found in the content type header, e.g.:
if "gzip" in headers.get("Content-Type", ""):
Then the archive is properly downloaded, extracted and loaded :)
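To illustrate the idea, here is a minimal sketch of a more tolerant content-type check followed by an in-memory extraction. This is not the actual bpemb code: the function names `looks_like_gzip` and `extract_member` are made up for this example, and it assumes the archive has already been downloaded into a bytes object.

```python
import io
import tarfile
from typing import Optional


def looks_like_gzip(content_type: Optional[str]) -> bool:
    """Accept both 'application/gzip' and 'application/x-gzip'."""
    # A missing header arrives as None; treat it as "not gzip" instead of raising.
    return "gzip" in (content_type or "").lower()


def extract_member(archive_bytes: bytes, member_name: str) -> bytes:
    """Read a single file out of an in-memory .tar.gz archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        extracted = tar.extractfile(member_name)
        if extracted is None:
            raise ValueError(f"{member_name} is not a regular file in the archive")
        return extracted.read()
```

Matching on the substring rather than the exact header value means the check should keep working whichever of the two gzip media types the server sends.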
Thank you @stefan-it for your investigation! @bheinzerling, can we expect a fix in the near future? It seems to be a global issue and no one can download these files.
@bheinzerling @stefan-it, thanks for the investigation. Right now our production is not working because we depend on this package.
1) I know there were no changes to this package, so it seems the resource we download the archive from (https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz) changed its content type to application/gzip, whereas the code checks for application/x-gzip. Was there any change in the resource we are accessing? I'm just trying to understand what caused this issue so suddenly.
2) Can you please suggest a temporary workaround?
I created a PR for a fix. In the meantime you should be able to use this fixed version with:
git+https://github.com/stefan-it/bpemb.git@52ceabf4ca8bde1030be43f71f1f3cb292f4beca
in a requirements.txt file or via pip:
pip3 install --upgrade git+https://github.com/stefan-it/bpemb.git@52ceabf4ca8bde1030be43f71f1f3cb292f4beca
When the fix is accepted/merged into upstream here, then @bheinzerling only needs to release a new version.
@srolskyi Thanks for reporting this issue! @stefan-it Thanks even more for debugging and creating a fix!
My guess is that the admins of the server on which BPEmb is hosted updated or migrated something. In any case, thanks to Stefan's fix everything seems to be working again.
I released a new version on PyPI that includes the fix and should resolve this issue:
pip install --upgrade bpemb
Leaving this issue open a bit for visibility
What version is the fix in? 0.3.5? I'm using version 0.3.0. Same error.
Fresh installation in a new environment (Python 3.9.18 or 3.12):
serg: ~ : python3 -m venv new_env
serg: ~ : source new_env/bin/activate
(new_env) serg: ~ : pip install bpemb gensim
Collecting bpemb
Downloading bpemb-0.3.4-py3-none-any.whl.metadata (19 kB)
Collecting gensim
Using cached gensim-4.3.2-cp312-cp312-macosx_10_9_universal2.whl
Collecting numpy (from bpemb)
Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
Collecting requests (from bpemb)
Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting sentencepiece (from bpemb)
Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting tqdm (from bpemb)
Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
Collecting scipy>=1.7.0 (from gensim)
Downloading scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (217 kB)
Collecting smart-open>=1.8.1 (from gensim)
Downloading smart_open-7.0.1-py3-none-any.whl.metadata (23 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
Downloading wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (6.6 kB)
Collecting charset-normalizer<4,>=2 (from requests->bpemb)
Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests->bpemb)
Downloading idna-3.6-py3-none-any.whl.metadata (9.9 kB)
Collecting urllib3<3,>=1.21.1 (from requests->bpemb)
Downloading urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)
Collecting certifi>=2017.4.17 (from requests->bpemb)
Downloading certifi-2024.2.2-py3-none-any.whl.metadata (2.2 kB)
Downloading bpemb-0.3.4-py3-none-any.whl (19 kB)
Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
Downloading scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl (31.4 MB)
Downloading smart_open-7.0.1-py3-none-any.whl (60 kB)
Downloading requests-2.31.0-py3-none-any.whl (62 kB)
Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl (1.2 MB)
Downloading tqdm-4.66.2-py3-none-any.whl (78 kB)
Downloading certifi-2024.2.2-py3-none-any.whl (163 kB)
Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl (119 kB)
Downloading idna-3.6-py3-none-any.whl (61 kB)
Downloading urllib3-2.2.1-py3-none-any.whl (121 kB)
Downloading wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl (38 kB)
Installing collected packages: sentencepiece, wrapt, urllib3, tqdm, numpy, idna, charset-normalizer, certifi, smart-open, scipy, requests, gensim, bpemb
Successfully installed bpemb-0.3.4 certifi-2024.2.2 charset-normalizer-3.3.2 gensim-4.3.2 idna-3.6 numpy-1.26.4 requests-2.31.0 scipy-1.12.0 sentencepiece-0.2.0 smart-open-7.0.1 tqdm-4.66.2 urllib3-2.2.1 wrapt-1.16.0
Then I ran:
python3 -c "from bpemb import BPEmb; bpemb_en = BPEmb(lang='en', dim=100)"
and got this error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/serg/new_env/lib/python3.12/site-packages/bpemb/bpemb.py", line 191, in __init__
self.emb = load_word2vec_file(self.emb_file, add_pad=add_pad_emb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/bpemb/util.py", line 78, in load_word2vec_file
vecs = KeyedVectors.load_word2vec_format(word2vec_file, binary=binary)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/models/keyedvectors.py", line 1719, in load_word2vec_format
return _load_word2vec_format(
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/models/keyedvectors.py", line 2058, in _load_word2vec_format
header = utils.to_unicode(fin.readline(), encoding=encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/utils.py", line 365, in any2unicode
return str(text, encoding, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Any ideas where I am making a mistake?
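One thing worth checking here: gzip streams start with the two magic bytes 0x1f 0x8b, and the traceback complains about byte 0x8b at position 1. That would be consistent with the cached .bin file still being the unextracted gzip archive, though this is only a guess from the error message. A minimal sketch for testing that hypothesis (the helper name `is_still_gzipped` is made up for this example, and the cache path in the usage note is assumed to match the one printed earlier in this thread):

```python
from pathlib import Path

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_still_gzipped(path) -> bool:
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Assumed cache location, taken from the ls output earlier in the thread.
cached_file = Path.home() / ".cache/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin"
if cached_file.exists():
    print(cached_file, "still gzipped:", is_still_gzipped(cached_file))
```

If this reports that the file is still gzipped, it may be a leftover from a download made before the fix; deleting the cached file so that it gets downloaded and extracted again could be worth a try.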