bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

EOFError: Compressed file ended before the end-of-stream marker was reached #44

Closed: aimanmutasem closed this issue 4 years ago

aimanmutasem commented 4 years ago

Dear @all,

I'm trying to load the German (lang="de") BPEmb model with a vocabulary size of 50k and 100-dimensional embeddings.

bpemb_de = BPEmb(lang="de", vs=50000)

I got an EOFError:

EOFError Traceback (most recent call last)

in
      1 import bpemb
      2 from bpemb import BPEmb
----> 3 bpemb_de = BPEmb(lang="de", vs=50000)

~/anaconda3/lib/python3.6/site-packages/bpemb/bpemb.py in __init__(self, lang, vs, dim, cache_dir, preprocess, encode_extra_options, add_pad_emb, vs_fallback, segmentation_only, model_file, emb_file)
    188 else:
    189     emb_file = self.emb_tpl.format(lang=lang, vs=vs, dim=dim)
--> 190 self.emb_file = self._load_file(emb_file, archive=True)
    191 self.emb = load_word2vec_file(self.emb_file, add_pad=add_pad_emb)
    192 self.most_similar = self.emb.most_similar

~/anaconda3/lib/python3.6/site-packages/bpemb/bpemb.py in _load_file(self, file, archive, cache_dir)
    226 file_url = self.base_url + file + suffix
    227 print("downloading", file_url)
--> 228 return http_get(file_url, cached_file, ignore_tardir=True)
    229
    230 def __repr__(self):

~/anaconda3/lib/python3.6/site-packages/bpemb/util.py in http_get(url, outfile, ignore_tardir)
     47 import tarfile
     48 tf = tarfile.open(fileobj=temp_file)
---> 49 members = tf.getmembers()
     50 if len(members) != 1:
     51     raise NotImplementedError("TODO: extract multiple files")

~/anaconda3/lib/python3.6/tarfile.py in getmembers(self)
   1759 self._check()
   1760 if not self._loaded:    # if we want to obtain a list of
-> 1761     self._load()        # all members, we first have to
   1762                         # scan the whole archive.
   1763 return self.members

~/anaconda3/lib/python3.6/tarfile.py in _load(self)
   2356 """
   2357 while True:
-> 2358     tarinfo = self.next()
   2359     if tarinfo is None:
   2360         break

~/anaconda3/lib/python3.6/tarfile.py in next(self)
   2287 # Advance the file pointer.
   2288 if self.offset != self.fileobj.tell():
-> 2289     self.fileobj.seek(self.offset - 1)
   2290     if not self.fileobj.read(1):
   2291         raise ReadError("unexpected end of data")

~/anaconda3/lib/python3.6/gzip.py in seek(self, offset, whence)
    366 elif self.mode == READ:
    367     self._check_not_closed()
--> 368     return self._buffer.seek(offset, whence)
    369
    370 return self.offset

~/anaconda3/lib/python3.6/_compression.py in seek(self, offset, whence)
    141 # Read and discard data until we reach the desired position.
    142 while offset > 0:
--> 143     data = self.read(min(io.DEFAULT_BUFFER_SIZE, offset))
    144     if not data:
    145         break

~/anaconda3/lib/python3.6/gzip.py in read(self, size)
    480     break
    481 if buf == b"":
--> 482     raise EOFError("Compressed file ended before the "
    483                    "end-of-stream marker was reached")
    484

EOFError: Compressed file ended before the end-of-stream marker was reached

Any suggestions on how to fix this issue would be appreciated!

bheinzerling commented 4 years ago

I just checked and there doesn't seem to be anything wrong with the file on the server.

The error message "EOFError: Compressed file ended before the end-of-stream marker was reached" indicates that the file wasn't downloaded completely.
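
To confirm that a downloaded archive really is truncated, one option is to read its gzip stream to the end and see whether it fails. The following is only a minimal sketch using the Python standard library; the helper name `is_complete_gzip` is hypothetical and not part of bpemb, and the path to check is whatever archive file was actually saved locally.

```python
import gzip
import zlib


def is_complete_gzip(path, chunk_size=1 << 20):
    """Return True if the gzip stream at `path` can be read to its end.

    Hypothetical helper, not part of bpemb. A missing or unreadable file
    also yields False, because opening it raises an OSError.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the decompressed data chunk by chunk;
            # a truncated download raises EOFError partway through.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

If this returns False for the cached archive, the download was most likely interrupted and the file needs to be fetched again.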

Can you try deleting the cache? It's in your home directory: ~/.cache/bpemb/de
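
Deleting the cache can also be done from Python. This is a rough sketch that assumes the cache location mentioned above (~/.cache/bpemb/de); the function `clear_bpemb_cache` and its `cache_root` parameter are hypothetical names for illustration, not part of the bpemb API.

```python
import shutil
from pathlib import Path


def clear_bpemb_cache(lang, cache_root=None):
    """Delete the cached files for `lang`; return True if anything was removed.

    Hypothetical helper, not part of bpemb.
    """
    # Assumption: the cache lives under ~/.cache/bpemb/<lang> unless a
    # different root directory is given.
    root = Path(cache_root) if cache_root is not None else Path.home() / ".cache" / "bpemb"
    lang_dir = root / lang
    if not lang_dir.is_dir():
        # Nothing cached at this location for the given language.
        return False
    shutil.rmtree(lang_dir)
    return True
```

After `clear_bpemb_cache("de")`, calling `BPEmb(lang="de", vs=50000)` again should trigger a fresh download. If a custom `cache_dir` was passed to the constructor, the files would be under that directory instead, so pass it as `cache_root`.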

aimanmutasem commented 4 years ago

Thank you @bheinzerling for your response :)

Unfortunately, there is no directory '~/.cache/bpemb/de' on my machine. Do you know of another way to remove the cached "de" files?

JulesBelveze commented 3 years ago

@aimanmutasem I'm actually facing the same problem, did you manage to fix it?

aimanmutasem commented 3 years ago

> @aimanmutasem I'm actually facing the same problem, did you manage to fix it?

Hello @JulesBelveze, try changing vs to 30000. Hopefully that will work. :)