Dadmatech / DadmaTools

DadmaTools is a Persian NLP toolkit developed by Dadmatech Co.
Apache License 2.0

Cannot load fa_tokenizer.pt #45

Closed NightMachinery closed 8 months ago

NightMachinery commented 1 year ago

I have downloaded fa_tokenizer.pt manually from https://www.dropbox.com/s/bajpn68bp11o78s/fa_ewt_tokenizer.pt?dl=1. It is 636 KB, and its md5 is:

2097a125c5f85b36d569857bd60d51b7  fa_tokenizer.pt
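For reference, the checksum can be recomputed locally like this (a minimal `hashlib` sketch; the path is simply wherever the file was saved):

```python
import hashlib

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks so that
    large model files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the hash above before trusting the download:
# md5_of("fa_tokenizer.pt") should equal "2097a125c5f85b36d569857bd60d51b7"
```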

It cannot be loaded, however:

import dadmatools.pipeline.language as language

# here the lemmatizer and POS tagger will be loaded;
# the tokenizer is the default tool, so it is loaded even if it is not listed
pips = 'tok,lem,pos,dep,chunk,cons,spellchecker,kasreh' 
nlp = language.Pipeline(pips)

# you can see the pipeline with this code
print(nlp.analyze_pipes(pretty=True))

# doc is a spaCy object
doc = nlp('از قصهٔ کودکیشان که می‌گفت، گاهی حرص می‌خورد!')
Model fa_tokenizer exists in /Users/evar/.pernlp/fa_tokenizer.pt
2022-11-21 09:05:41,580 Cannot load model from /Users/evar/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/saved_models/fa_tokenizer/fa_tokenizer.pt
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [5], in <cell line: 6>()
      3 # here lemmatizer and pos tagger will be loaded
      4 # as tokenizer is the default tool, it will be loaded as well even without calling
      5 pips = 'tok,lem,pos,dep,chunk,cons,spellchecker,kasreh' 
----> 6 nlp = language.Pipeline(pips)
      8 # you can see the pipeline with this code
      9 print(nlp.analyze_pipes(pretty=True))

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/pipeline/language.py:258, in Pipeline.__new__(cls, pipeline)
    257 def __new__(cls, pipeline):
--> 258     language = NLP('fa', pipeline)
    259     nlp = language.nlp
    260     return nlp

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/pipeline/language.py:64, in NLP.__init__(self, lang, pipelines)
     58 # if 'def-norm' in pipelines:
     59 #     global normalizer_model
     60 #     normalizer_model = normalizer.load_model()
     61 #     self.nlp.add_pipe('normalizer', first=True)
     63 global tokenizer_model
---> 64 tokenizer_model = tokenizer.load_model()
     65 self.nlp.add_pipe('tokenizer')
     67 global mwt_model

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenizer.py:125, in load_model()
    123 mwt_dict = load_mwt_dict(args['mwt_json_file'])
    124 use_cuda = args['cuda'] and not args['cpu']
--> 125 trainer = Trainer(model_file=args['save_dir'], use_cuda=use_cuda)
    126 loaded_args, vocab = trainer.args, trainer.vocab
    128 for k in loaded_args:

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenization/trainer.py:19, in Trainer.__init__(self, args, vocab, model_file, use_cuda)
     16 self.use_cuda = use_cuda
     17 if model_file is not None:
     18     # load everything from file
---> 19     self.load(model_file)
     20 else:
     21     # build model from scratch
     22     self.args = args

File ~/anaconda/envs/p310/lib/python3.10/site-packages/dadmatools/models/tokenization/trainer.py:85, in Trainer.load(self, filename)
     83 def load(self, filename):
     84     try:
---> 85         checkpoint = torch.load(filename, lambda storage, loc: storage)
     86     except BaseException:
     87         logger.error("Cannot load model from {}".format(filename))

File ~/anaconda/envs/p310/lib/python3.10/site-packages/torch/serialization.py:713, in load(f, map_location, pickle_module, **pickle_load_args)
    711             return torch.jit.load(opened_file)
    712         return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)

File ~/anaconda/envs/p310/lib/python3.10/site-packages/torch/serialization.py:938, in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
    936 assert key in deserialized_objects
    937 typed_storage = deserialized_objects[key]
--> 938 typed_storage._storage._set_from_file(
    939     f, offset, f_should_read_directly,
    940     torch._utils._element_size(typed_storage.dtype))
    941 if offset is not None:
    942     offset = f.tell()

RuntimeError: unexpected EOF, expected 312321 more bytes. The file might be corrupted.

I am using dadmatools==1.5.2, Python 3.10, macOS 12.2.1.
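The "unexpected EOF" above is characteristic of a truncated download. One way a loader could avoid failing forever on the same partial file is to purge the cached copy when deserialization fails. A minimal sketch, illustrated with `pickle` so it runs anywhere (in DadmaTools the actual call is `torch.load`, and the function name here is hypothetical):

```python
import os
import pickle

def load_or_purge(path: str):
    """Try to deserialize a cached model file; if it is truncated or
    corrupted, delete it so the next run re-downloads it instead of
    repeatedly hitting the same EOF error."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (EOFError, pickle.UnpicklingError):
        os.remove(path)
        raise RuntimeError(
            f"{path} was corrupted and has been removed; please re-download it"
        )
```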

NightMachinery commented 1 year ago

It seems the issue is that the other models are not downloaded, but DadmaTools thinks they are because this single model has been downloaded. (Update: I no longer think this is the case. The problem is probably with the environment versions.)

Considering how unreliable internet access is in Iran, you should use a more robust download manager for fetching these files than the one you are using now; it simply stops in the middle of a download. aria2 is a good choice (though it might not work on Windows), and recent versions of curl also handle interrupted downloads better than your current approach.
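Until the built-in downloader is hardened, something along these lines can resume an interrupted transfer instead of restarting it (a minimal `urllib` sketch; a library could equally shell out to `aria2c -c` or `curl -C -`):

```python
import os
import urllib.request

def resume_download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Download url to dest, resuming from a partial file if the server
    honors HTTP Range requests; otherwise restart from scratch."""
    pos = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={pos}-"} if pos else {}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        if resp.status == 206:      # 206 Partial Content: resume where we left off
            mode = "ab"
        else:                       # Range ignored (plain 200): rewrite from zero
            mode = "wb"
        with open(dest, mode) as f:
            while chunk := resp.read(chunk_size):
                f.write(chunk)
```

A checksum comparison after the download (as with the md5 above) would then confirm the file is complete before it is ever passed to `torch.load`.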

NightMachinery commented 1 year ago

Trying to run DadmaTools on Colab, it loads fa_tokenizer.pt, but the pipeline still fails with:

HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/nevise/resolve/main/config.json

The above exception was the direct cause of the following exception:

RepositoryNotFoundError                   Traceback (most recent call last)
RepositoryNotFoundError: 404 Client Error. (Request ID: KbvMewrMkS8N1oT3CYw5S)

Repository Not Found for url: https://huggingface.co/nevise/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If the repo is private, make sure you are authenticated.

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/utils/hub.py in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only, subfolder, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash)
    423     except RepositoryNotFoundError:
    424         raise EnvironmentError(
--> 425             f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
    426             "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to "
    427             "pass a token having permission to this repo with `use_auth_token` or log in with "

OSError: nevise is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.

I have logged in to Hugging Face using

from huggingface_hub import notebook_login
!git config --global credential.helper store

notebook_login()

but it doesn't help.

sadeghjafari5528 commented 8 months ago

fa_tokenizer.pt now loads properly.