bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License
1.18k stars, 101 forks

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte #66

Open srolskyi opened 7 months ago

srolskyi commented 7 months ago

Fresh installation in a newly created environment (Python 3.9.18 or 3.12):

serg: ~ : python3 -m venv new_env
serg: ~ : source new_env/bin/activate
(new_env) serg: ~ : pip install bpemb gensim
Collecting bpemb
  Downloading bpemb-0.3.4-py3-none-any.whl.metadata (19 kB)
Collecting gensim
  Using cached gensim-4.3.2-cp312-cp312-macosx_10_9_universal2.whl
Collecting numpy (from bpemb)
  Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.1/61.1 kB 949.1 kB/s eta 0:00:00
Collecting requests (from bpemb)
  Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting sentencepiece (from bpemb)
  Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting tqdm (from bpemb)
  Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 2.6 MB/s eta 0:00:00
Collecting scipy>=1.7.0 (from gensim)
  Downloading scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (217 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 217.9/217.9 kB 3.3 MB/s eta 0:00:00
Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-7.0.1-py3-none-any.whl.metadata (23 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
  Downloading wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (6.6 kB)
Collecting charset-normalizer<4,>=2 (from requests->bpemb)
  Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests->bpemb)
  Downloading idna-3.6-py3-none-any.whl.metadata (9.9 kB)
Collecting urllib3<3,>=1.21.1 (from requests->bpemb)
  Downloading urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)
Collecting certifi>=2017.4.17 (from requests->bpemb)
  Downloading certifi-2024.2.2-py3-none-any.whl.metadata (2.2 kB)
Downloading bpemb-0.3.4-py3-none-any.whl (19 kB)
Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.7/13.7 MB 67.8 MB/s eta 0:00:00
Downloading scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl (31.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.4/31.4 MB 59.3 MB/s eta 0:00:00
Downloading smart_open-7.0.1-py3-none-any.whl (60 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.8/60.8 kB 3.6 MB/s eta 0:00:00
Downloading requests-2.31.0-py3-none-any.whl (62 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.6/62.6 kB 4.4 MB/s eta 0:00:00
Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl (1.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 42.6 MB/s eta 0:00:00
Downloading tqdm-4.66.2-py3-none-any.whl (78 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB 7.3 MB/s eta 0:00:00
Downloading certifi-2024.2.2-py3-none-any.whl (163 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 163.8/163.8 kB 12.8 MB/s eta 0:00:00
Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl (119 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 119.4/119.4 kB 10.6 MB/s eta 0:00:00
Downloading idna-3.6-py3-none-any.whl (61 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.6/61.6 kB 3.9 MB/s eta 0:00:00
Downloading urllib3-2.2.1-py3-none-any.whl (121 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.1/121.1 kB 10.1 MB/s eta 0:00:00
Downloading wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl (38 kB)
Installing collected packages: sentencepiece, wrapt, urllib3, tqdm, numpy, idna, charset-normalizer, certifi, smart-open, scipy, requests, gensim, bpemb
Successfully installed bpemb-0.3.4 certifi-2024.2.2 charset-normalizer-3.3.2 gensim-4.3.2 idna-3.6 numpy-1.26.4 requests-2.31.0 scipy-1.12.0 sentencepiece-0.2.0 smart-open-7.0.1 tqdm-4.66.2 urllib3-2.2.1 wrapt-1.16.0

(new_env) serg: ~ : python3 --version
Python 3.12.2

Then I ran python3 -c "from bpemb import BPEmb; bpemb_en = BPEmb(lang='en', dim=100)"

and got this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/serg/new_env/lib/python3.12/site-packages/bpemb/bpemb.py", line 191, in __init__
    self.emb = load_word2vec_file(self.emb_file, add_pad=add_pad_emb)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/serg/new_env/lib/python3.12/site-packages/bpemb/util.py", line 78, in load_word2vec_file
    vecs = KeyedVectors.load_word2vec_format(word2vec_file, binary=binary)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/models/keyedvectors.py", line 1719, in load_word2vec_format
    return _load_word2vec_format(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/models/keyedvectors.py", line 2058, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/utils.py", line 365, in any2unicode
    return str(text, encoding, errors=errors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Any idea where I'm making a mistake?

stefan-it commented 7 months ago

Hey @srolskyi and @bheinzerling ,

I debugged that issue and debug-printed the path for self.emb_file:

$ ls -hl /home/stefan/.cache/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin
-rw-rw-r-- 1 stefan stefan 3,7M Mär 15 16:34 /home/stefan/.cache/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin

And it was downloaded from https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz.

However, when I download the archive manually and extract it, it has the following size:

$ ls -hl ~/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin
-rw-r--r-- 1 stefan stefan 3,9M Mär 19  2018 /home/stefan/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin

With this file I can load the vectors without any problem:

In [1]: from gensim.models import KeyedVectors

In [2]: vecs = KeyedVectors.load_word2vec_format("/home/stefan/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin", binary=True)

In [3]: vecs
Out[3]: <gensim.models.keyedvectors.KeyedVectors at 0x71cbac1c0410>

So I strongly suspect that the unpacking routine is currently not working, and that a "broken" word embeddings file is then being loaded, which causes the error.
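
The byte 0x8b at position 1 in the error message matches the second byte of the gzip magic number (1f 8b), which is consistent with the cached file still being compressed. A minimal sketch for checking this on any file follows; `looks_gzipped` is a helper name made up for this illustration and is not part of bpemb or gensim, and the demo files are throwaway stand-ins rather than real embeddings.

```python
import gzip
import os
import tempfile

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Demonstration on two temporary files: one gzip-compressed, one plain.
with tempfile.TemporaryDirectory() as tmp:
    packed = os.path.join(tmp, "packed.bin")
    with gzip.open(packed, "wb") as f:
        f.write(b"10000 100\n")

    plain = os.path.join(tmp, "plain.bin")
    with open(plain, "wb") as f:
        f.write(b"10000 100\n")

    packed_result = looks_gzipped(packed)
    plain_result = looks_gzipped(plain)

print(packed_result, plain_result)
```

Pointing `looks_gzipped` at the cached `.w2v.bin` file would show whether it was left unextracted.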

stefan-it commented 7 months ago

After some more debugging and reading the code:

stefan@ae-13412:~$ curl -LI https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz
HTTP/1.1 301 Moved Permanently
Date: Fri, 15 Mar 2024 15:43:13 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.2.34
Location: https://bpemb.h-its.org/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz
Content-Type: text/html; charset=iso-8859-1

HTTP/2 200 
server: nginx
date: Fri, 15 Mar 2024 15:43:14 GMT
content-type: application/gzip
content-length: 3784656
last-modified: Mon, 09 Apr 2018 22:27:16 GMT
etag: "39bfd0-56971e878b900"
accept-ranges: bytes
strict-transport-security: max-age=15768000

At the end, you can see that the redirected request has an application/gzip content type.

However, the current code is expecting:

https://github.com/bheinzerling/bpemb/blob/1c630358f6fd522925008aa749eccd01ca5633af/bpemb/util.py#L54

an application/x-gzip content type header.

This is the reason why the archive is not properly extracted.

@bheinzerling I think the best option here is to check whether gzip is found in the content type header, e.g.:

if "gzip" in headers.get("Content-Type"):

Then the archive is properly downloaded, extracted and loaded :)
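
A slightly more defensive variant of that check could look like the sketch below. It assumes the headers are available as a plain dict; `is_gzip_content_type` is a hypothetical helper name and not bpemb's actual code. It also tolerates a missing Content-Type header, where `headers.get("Content-Type")` would return None and the `in` test would raise a TypeError.

```python
def is_gzip_content_type(headers):
    """Return True if the Content-Type header mentions gzip.

    Hypothetical helper for illustration; accepts both
    "application/gzip" and "application/x-gzip".
    """
    # Fall back to an empty string when the header is absent or None.
    content_type = headers.get("Content-Type") or ""
    return "gzip" in content_type.lower()


print(is_gzip_content_type({"Content-Type": "application/gzip"}))
print(is_gzip_content_type({"Content-Type": "application/x-gzip"}))
print(is_gzip_content_type({}))
```

Matching on the substring rather than the exact value keeps the check working if the server switches between the two gzip content types again.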

srolskyi commented 7 months ago

Thank you @stefan-it for your investigation! @bheinzerling, can we expect a fix in the near future? It seems to be a global issue, and no one can download these files.

mahiforu commented 7 months ago

@bheinzerling @stefan-it, thanks for the investigation. Right now our production is not working because we depend on this package.

1) I know there were no changes in this package, so the resource "https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz" that we download the archive from must have changed its content type to application/gzip, whereas the code checks for application/x-gzip. Was there any change in the resource we are accessing? I'm just trying to understand what change is suddenly causing this issue.

2) Can you please suggest a temporary solution to fix it?

stefan-it commented 7 months ago

I created a PR for a fix. In the meantime you should be able to use this fixed version with:

git+https://github.com/stefan-it/bpemb.git@52ceabf4ca8bde1030be43f71f1f3cb292f4beca

in a requirements.txt file or via pip:

pip3 install --upgrade git+https://github.com/stefan-it/bpemb.git@52ceabf4ca8bde1030be43f71f1f3cb292f4beca

Once the fix is accepted and merged upstream here, @bheinzerling only needs to release a new version.

bheinzerling commented 7 months ago

@srolskyi Thanks for reporting this issue! @stefan-it Thanks even more for debugging and creating a fix!

My guess is that the admins of the server on which BPEmb is hosted updated or migrated something. In any case, thanks to Stefan's fix everything seems to be working again.

I released a new version on PyPI that includes the fix and should resolve this issue:

pip install --upgrade bpemb

Leaving this issue open a bit for visibility

psydok commented 1 month ago

Which version is the fix in? 0.3.5? I'm using version 0.3.0 and get the same error.
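
To confirm which bpemb release is actually active in an environment, a minimal sketch using the standard library's `importlib.metadata` is shown below; `installed_version` is a helper name chosen for this example. The original report in this thread hit the error on 0.3.4, so the fixed release has to be newer than that.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed bpemb version, or None if bpemb is not installed.
print(installed_version("bpemb"))
```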