amir-zeldes / RFTokenizer

A character-wise tokenizer for morphologically rich languages

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory #14

Closed: ZvikaZ closed this issue 3 months ago

ZvikaZ commented 3 months ago

Hi. I tried this simple code:

from rftokenizer import RFTokenizer

my_tokenizer = RFTokenizer(model="heb")    # I also tried heb.sm3
tokenized = my_tokenizer.rf_tokenize('שלום וברכה')
print(tokenized)

but it failed with:

C:\Users\Zvika\AppData\Local\pypoetry\Cache\virtualenvs\parser-aJ2KWzVO-py3.12\Scripts\python.exe C:\Zvika\PycharmProjects\milon\parser\temp.py 
Traceback (most recent call last):
  File "C:\Zvika\PycharmProjects\milon\parser\temp.py", line 4, in <module>
    tokenized = my_tokenizer.rf_tokenize('שלום וברכה')
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zvika\AppData\Local\pypoetry\Cache\virtualenvs\parser-aJ2KWzVO-py3.12\Lib\site-packages\rftokenizer\tokenize_rf.py", line 923, in rf_tokenize
    self.load()
  File "C:\Users\Zvika\AppData\Local\pypoetry\Cache\virtualenvs\parser-aJ2KWzVO-py3.12\Lib\site-packages\rftokenizer\tokenize_rf.py", line 540, in load
    self.bert = FlairTagger(seg=True)
                ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zvika\AppData\Local\pypoetry\Cache\virtualenvs\parser-aJ2KWzVO-py3.12\Lib\site-packages\rftokenizer\flair_pos_tagger.py", line 49, in __init__
    self.model = SequenceTagger.load(model_dir + lang_prefix + ".seg")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zvika\AppData\Local\pypoetry\Cache\virtualenvs\parser-aJ2KWzVO-py3.12\Lib\site-packages\flair\models\sequence_tagger_model.py", line 1036, in load
    return cast("SequenceTagger", super().load(model_path=model_path))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zvika\AppData\Local\pypoetry\Cache\virtualenvs\parser-aJ2KWzVO-py3.12\Lib\site-packages\flair\nn\model.py", line 555, in load
    return cast("Classifier", super().load(model_path=model_path))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zvika\AppData\Local\pypoetry\Cache\virtualenvs\parser-aJ2KWzVO-py3.12\Lib\site-packages\flair\nn\model.py", line 179, in load
    state = load_torch_state(model_file)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zvika\AppData\Local\pypoetry\Cache\virtualenvs\parser-aJ2KWzVO-py3.12\Lib\site-packages\flair\file_utils.py", line 352, in load_torch_state
    return torch.load(f, map_location="cpu")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zvika\AppData\Local\pypoetry\Cache\virtualenvs\parser-aJ2KWzVO-py3.12\Lib\site-packages\torch\serialization.py", line 1004, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zvika\AppData\Local\pypoetry\Cache\virtualenvs\parser-aJ2KWzVO-py3.12\Lib\site-packages\torch\serialization.py", line 456, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Process finished with exit code 1

If it's relevant, I'm using Python 3.12.3, and this is the output of pip list:

Package                            Version
---------------------------------- -----------
accelerate                         0.31.0
beautifulsoup4                     4.12.3
boto3                              1.34.127
botocore                           1.34.127
bpemb                              0.3.5
certifi                            2024.6.2
charset-normalizer                 3.3.2
cloudpickle                        3.0.0
colorama                           0.4.6
conllu                             4.5.3
contourpy                          1.2.1
cycler                             0.12.1
Deprecated                         1.2.14
filelock                           3.15.1
flair                              0.13.0
fonttools                          4.53.0
fsspec                             2024.6.0
ftfy                               6.2.0
future                             1.0.0
gdown                              5.2.0
gensim                             4.3.2
huggingface-hub                    0.23.4
hyperopt                           0.2.7
idna                               3.7
intel-openmp                       2021.4.0
Janome                             0.5.0
Jinja2                             3.1.4
jmespath                           1.0.1
joblib                             1.3.2
kiwisolver                         1.4.5
langdetect                         1.0.9
lxml                               5.2.2
MarkupSafe                         2.1.5
matplotlib                         3.9.0
mkl                                2021.4.0
more-itertools                     10.3.0
mpld3                              0.5.10
mpmath                             1.3.0
networkx                           3.3
numpy                              1.26.4
packaging                          24.1
pandas                             2.1.2
pillow                             10.3.0
pip                                23.1
pptree                             3.1
protobuf                           5.27.1
psutil                             5.9.8
py4j                               0.10.9.7
pyparsing                          3.1.2
PySocks                            1.7.1
python-dateutil                    2.9.0.post0
pytorch_revgrad                    0.2.0
pytz                               2024.1
PyYAML                             6.0.1
regex                              2024.5.15
requests                           2.32.3
rftokenizer                        2.2.0
s3transfer                         0.10.1
safetensors                        0.4.3
scikit-learn                       1.3.2
scipy                              1.12.0
segtok                             1.5.11
semver                             3.0.2
sentencepiece                      0.2.0
setuptools                         67.6.1
six                                1.16.0
smart-open                         7.0.4
soupsieve                          2.5
sqlitedict                         2.1.0
sympy                              1.12.1
tabulate                           0.9.0
tbb                                2021.12.0
threadpoolctl                      3.5.0
tokenizers                         0.19.1
torch                              2.3.1
tqdm                               4.66.4
transformer-smaller-training-vocab 0.4.0
transformers                       4.41.2
typing_extensions                  4.12.2
tzdata                             2024.1
urllib3                            1.26.18
wcwidth                            0.2.13
wheel                              0.40.0
Wikipedia-API                      0.6.0
wrapt                              1.16.0
xgboost                            2.0.3

ZvikaZ commented 3 months ago

My mistake: there was a problem downloading the .seg file, so the copy on disk was incomplete. I deleted it, it was downloaded again on the next run, and now everything works fine.
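
For anyone who hits the same error: "failed finding central directory" usually means the model file is not a complete zip archive, which is what a truncated download looks like. Below is a minimal sketch of a check that deletes such a file so it can be fetched again. The helper name `remove_if_corrupt` is made up for this example and is not part of RFTokenizer. You have to supply the path to the .seg file yourself (the traceback above shows it is built from `model_dir + lang_prefix + ".seg"`).

```python
import zipfile
from pathlib import Path


def remove_if_corrupt(model_path):
    """Delete model_path if it is not a readable zip archive.

    Hypothetical helper, not part of RFTokenizer.
    Returns True if the file was deleted, False otherwise.
    """
    path = Path(model_path)
    if not path.is_file():
        # Nothing on disk, so there is nothing to check or delete.
        return False
    if zipfile.is_zipfile(str(path)):
        # The zip end-of-central-directory record was found,
        # so the archive at least looks complete.
        return False
    # No central directory: most likely a truncated or broken download.
    path.unlink()
    return True
```

If the function returns True, running the tokenizer again should trigger a fresh download, as described above. One caveat: this assumes the model was saved in PyTorch's zip-based format. A checkpoint saved in the older non-zip format would also fail this check, so confirm that before deleting anything you cannot easily get back.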