Crivella / ocr_translate

Django based web server for running OCR + Translation of incoming images
GNU General Public License v3.0

Unable to load TSL model #27

Closed mrzhnex closed 6 months ago

mrzhnex commented 6 months ago

I'm not sure what to write here, since I seem to be the only one getting this error. I tried different settings, source languages, and models. The model keeps downloading to the directory, but when it's time to actually load it, an error is thrown. What kind of permissions does this app need?

Windows 10. Symlinked .ocr_translate directory. Currently using the CPU version (getting the same error with the GPU version as well).

2024-04-27 17:39:30,023 - INFO - django.server:basehttp - "GET /get_active_options/ HTTP/1.1" 200 64
2024-04-27 17:39:30,049 - INFO - django.server:basehttp - "GET / HTTP/1.1" 200 3776
2024-04-27 17:39:40,772 - INFO - ocr.general:views - SET LANG: {'lang_src': 'ja', 'lang_dst': 'en'}
2024-04-27 17:39:40,775 - INFO - django.server:basehttp - "POST /set_lang/ HTTP/1.1" 200 2
2024-04-27 17:39:40,798 - INFO - django.server:basehttp - "GET /get_active_options/ HTTP/1.1" 200 64
2024-04-27 17:39:40,829 - INFO - django.server:basehttp - "GET / HTTP/1.1" 200 3926
2024-04-27 17:39:51,882 - INFO - django.server:basehttp - "GET /get_active_options/ HTTP/1.1" 200 64
2024-04-27 17:39:51,893 - INFO - django.server:basehttp - "GET / HTTP/1.1" 200 3926
2024-04-27 17:39:54,027 - INFO - ocr.general:views - SET LANG: {'lang_src': 'ja', 'lang_dst': 'en'}
2024-04-27 17:39:54,030 - INFO - django.server:basehttp - "POST /set_lang/ HTTP/1.1" 200 2
2024-04-27 17:39:54,037 - INFO - django.server:basehttp - "GET /get_active_options/ HTTP/1.1" 200 64
2024-04-27 17:39:54,047 - INFO - django.server:basehttp - "GET / HTTP/1.1" 200 3926
2024-04-27 17:39:59,675 - INFO - ocr.general:views - LOAD MODELS: {'box_model_id': 'easyocr', 'ocr_model_id': 'tesseract', 'tsl_model_id': 'facebook/m2m100_1.2B'}
2024-04-27 17:39:59,676 - INFO - ocr.general:box - Loading BOX model: easyocr
2024-04-27 17:40:02,111 - INFO - plugin:plugin - Loading BOX model: easyocr
Using CPU. Note: This module is much faster with a GPU.
2024-04-27 17:40:05,808 - INFO - ocr.general:ocr - Loading OCR model: tesseract
2024-04-27 17:40:05,819 - INFO - ocr.general:tsl - Loading TSL model: facebook/m2m100_1.2B
2024-04-27 17:40:06,495 - INFO - plugin:plugin - Loading TSL model: facebook/m2m100_1.2B
2024-04-27 17:53:03,187 - ERROR - ocr.general:views - Failed to load models: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
2024-04-27 17:53:05,137 - INFO - django.server:basehttp - - Broken pipe from ('127.0.0.1', 60097)

Please tell me if I need to provide other information.

P.S. Using the latest release from 17.12.2023 (same error with the version from 29.10.2023).

Crivella commented 6 months ago

This seems to be related to

but in both cases I do not think the root cause is clear (the error you see is the server reporting an exception raised inside transformers).

The server does not require any particular permissions, just normal read/write access, which you should already have since the database was created successfully.
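For context, "normal read/write access" here just means the server process can read and write its working directory. A minimal stdlib check could look like this (the helper name is illustrative; ocr_translate does not ship this function):

```python
import os
from pathlib import Path

def has_rw_access(path) -> bool:
    """Return True if the current process can read and write `path`.

    Illustrative helper only, not part of the ocr_translate codebase.
    Note that os.access() on a nonexistent path simply returns False.
    """
    return os.access(path, os.R_OK | os.W_OK)

# Example: the server's working directory should pass this check.
workdir = Path.home() / ".ocr_translate"
```

If this check fails for the `.ocr_translate` directory (or its symlink target), the problem is filesystem permissions rather than anything inside the app.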

A few troubleshooting steps I would suggest that might solve the problem or give us more clues:

As a reference, when downloading staka/fugumt-ja-en (I just tried) you should see something like:

config.json: 100%|████████████████████████████████████████████████████████████████████████| 1.03k/1.03k [00:00<?, ?B/s]
pytorch_model.bin: 100%|████████████████████████████████████████████████████████████| 121M/121M [00:01<00:00, 76.5MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████| 289/289 [00:00<?, ?B/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████| 42.0/42.0 [00:00<?, ?B/s]
source.spm: 100%|███████████████████████████████████████████████████████████████████| 797k/797k [00:00<00:00, 2.23MB/s]
vocab.json: 100%|███████████████████████████████████████████████████████████████████| 861k/861k [00:00<00:00, 2.37MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████| 74.0/74.0 [00:00<?, ?B/s]
mrzhnex commented 6 months ago


Yes, I used PowerShell to create the symlink. First, I tried a random model (from the ones offered in the browser extension), but I got an error. Then I tried others, getting error after error, until my disk space ran out. So I switched to D:/, created the symlink, and tried the rest of the models.

This screenshot is from a fresh start (no symlink), just now. Maybe there is a hint in "could not find image processor class", but I don't really think so... Could this be something about CUDA and Python in general?

Crivella commented 6 months ago

Python by itself should not access the CUDA API directly; that is usually done by C/C++ code under the hood. The "... processor class ..." message has nothing to do with it. Interestingly, loading the VED model did not give you any error; only the SEQ2SEQ one seems to.

I am not sure what the problem here could be, it might require some digging through the transformers library.

As another possible stopgap, could you also try the manual download approach?

One more thing: could you try running the code from PowerShell after setting the following environment variables? Whether the code is being loaded from manually stored files is reported as a debug message.

$env:DJANGO_DEBUG="true"
$env:DJANGO_LOG_LEVEL="DEBUG"

and then either type the path to the EXE, or, depending on your terminal, drag and drop the file onto it and it will fill in the file path automatically.

You could also try playing around with the TRANSFORMERS_CACHE environment variable to tell transformers where to store its files. For more details, see the docs and where the variable is used in the plugin enabling HuggingFace models: https://github.com/Crivella/ocr_translate-hugging_face/blob/d6ae9d8f0b6f48b201bfc2ff74a8383909c7a680/ocr_translate_hugging_face/plugin.py#L115
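As a rough sketch of the kind of lookup involved (the actual precedence in transformers is more involved, with HF_HOME and other variables also playing a role, so treat this as an approximation rather than the real implementation):

```python
import os
from pathlib import Path

def guess_model_cache_dir() -> Path:
    """Approximate where HuggingFace model files end up on disk.

    An explicit TRANSFORMERS_CACHE wins; otherwise fall back to the
    default hub cache under the user's home directory. This loosely
    mirrors the documented behaviour and is NOT the actual
    transformers lookup code.
    """
    explicit = os.environ.get("TRANSFORMERS_CACHE")
    if explicit:
        return Path(explicit)
    return Path.home() / ".cache" / "huggingface" / "hub"
```

The practical takeaway: setting TRANSFORMERS_CACHE before starting the server redirects all model downloads and loads to a directory of your choosing.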

mrzhnex commented 6 months ago

Found the source of the problem: it was Cyrillic characters in the user name. I changed the name to Latin characters (a hell of a job, I would say), or however that procedure is called...

Now it works like a charm, like clockwork. Possible upgrades for future versions: 1) add support for non-Latin characters in paths; 2) add the possibility to change the default folder (user/.ocr_translate). I believe there may already be something for this in a config or startup args, but I could not find it; please tell me if it exists.

I'm curious: why did the application manage to access /user/.ocr_translate with Cyrillic characters in the user name, but throw an error when it was time to load the translation model? Maybe different load/access methods?

In any case, the problem is gone, at least for me, and you are now aware of it for the future. Should I close the issue? I suppose I should.

Thank you for your support, great work overall!

Crivella commented 6 months ago

Nice that you were able to figure it out.

It is already possible to control where models are stored using the TRANSFORMERS_CACHE environment variable; see here for more details.

I am not 100% sure (I should investigate), but I think the problem with the non-Latin characters is inside the transformers library, since the problem is with the models and my code was able to create the database. I might open an issue or PR with them if I can pinpoint where things go wrong.
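A quick way to test this hypothesis on any machine is to check whether the model cache path round-trips as pure ASCII; if it does not, pointing TRANSFORMERS_CACHE at an ASCII-only directory is a workaround. A sketch (not project code; the function name is made up for illustration):

```python
def path_is_ascii(path: str) -> bool:
    """Return True if every character in `path` is plain ASCII.

    Non-ASCII path components (e.g. a Cyrillic Windows user name)
    are a plausible trouble spot for native file-loading code that
    does not handle Unicode paths correctly.
    """
    return path.isascii()

# Example: a Cyrillic user name fails the check, so redirecting the
# cache to an ASCII-only path like D:\hf_cache would sidestep it.
```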

Thanks, I hope you enjoy the tool ;)