Closed: mrzhnex closed this issue 6 months ago
This seems to be related to another issue, but in both cases I do not think it is clear what the root cause is (the error you see is the server reporting that an exception happened inside `transformers`).
The server does not require any particular permission, just normal R/W access, which you should have since I imagine the database has already been created.
A few troubleshooting steps I would suggest that might solve the problem or give us more clues:

- Delete the `.ocr_translate` folder and start fresh without using a symlink (also, out of curiosity, did you use the command prompt/PowerShell to create the symlink? Creating a shortcut will not do the trick.)
- Try the `staka/fugumt` model (see also above, in case it is a problem of space, either disk or RAM).

As a reference, when downloading `staka/fugumt-ja-en` I just tried it and you should see something like:
```
config.json: 100%|████████████████████████████████████████████████████████████████████████| 1.03k/1.03k [00:00<?, ?B/s]
pytorch_model.bin: 100%|████████████████████████████████████████████████████████████| 121M/121M [00:01<00:00, 76.5MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████| 289/289 [00:00<?, ?B/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████| 42.0/42.0 [00:00<?, ?B/s]
source.spm: 100%|███████████████████████████████████████████████████████████████████| 797k/797k [00:00<00:00, 2.23MB/s]
vocab.json: 100%|███████████████████████████████████████████████████████████████████| 861k/861k [00:00<00:00, 2.37MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████| 74.0/74.0 [00:00<?, ?B/s]
```
Yes, to create the symlink I used PowerShell. First, I tried a random model (from those presented in the browser extension), but I got an error. Then I tried others, getting error after error, until my disk space ran out. So I switched to D:/, created the symlink, and tried the rest of the models.
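For what it's worth, a real symlink (as created via PowerShell, e.g. `New-Item -ItemType SymbolicLink`, or `mklink`) is followed transparently by ordinary file APIs, unlike a `.lnk` shortcut, which is just a data file the OS shell interprets. A minimal Python sketch of that transparency (the paths here are made up for illustration):

```python
import os
import tempfile

# Create a fake "models" directory with one file in it.
base = tempfile.mkdtemp()
target = os.path.join(base, "models")
os.mkdir(target)
with open(os.path.join(target, "config.json"), "w") as f:
    f.write("{}")

# A symlink pointing at the directory; on Windows this is what
# PowerShell's New-Item -ItemType SymbolicLink (or mklink /D) creates.
link = os.path.join(base, "ocr_translate_link")
os.symlink(target, link)

# Reading through the link behaves exactly like reading the target,
# which is why a symlinked .ocr_translate normally works.
with open(os.path.join(link, "config.json")) as f:
    print(f.read())  # -> {}
```

A shortcut created from Explorer would not pass this test: opening it yields the `.lnk` file's own bytes, not the target's contents.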
This screenshot is from a fresh start (no symlink), just now. Maybe there is some hint in "could not find image processor class", but I don't really think so... Is there something about CUDA and Python in general?
Python by itself should not access the CUDA API directly; that is usually done by C/C++ code under the hood. The `... processor class ...` message has nothing to do with it.
Interestingly, loading the VED model did not give you any error; only the SEQ2SEQ one seems to. I am not sure what the problem could be here; it might require some digging through the `transformers` library.
As another possible patchwork solution, could you also try the manual download approach?

- Inside the `.ocr_translate` folder, create a folder named `staka`
- Inside `staka`, create a folder named `fugumt-ja-en`
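The folder layout described above can be sketched in a few lines (a hedged illustration: the real base folder would be your `.ocr_translate` directory, here replaced by a temp dir for safety):

```python
from pathlib import Path
import tempfile

# Stand-in for the user's home directory; in practice this would be
# Path.home(), so that base == ~/.ocr_translate
base = Path(tempfile.mkdtemp()) / ".ocr_translate"

# The HuggingFace model id "staka/fugumt-ja-en" maps to a nested
# <owner>/<model> folder pair under the cache root.
model_dir = base / "staka" / "fugumt-ja-en"
model_dir.mkdir(parents=True)

print(model_dir.is_dir())  # -> True
```

The downloaded model files (config.json, pytorch_model.bin, tokenizer files, etc.) would then go inside that innermost folder.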
Another thing: to run the code, could you try doing it from PowerShell after setting the following environment variables? The info on whether the code is being loaded from manually stored files is shown as a debug message:
```powershell
$env:DJANGO_DEBUG="true"
$env:DJANGO_LOG_LEVEL="DEBUG"
```
Then either write the path to the EXE, or, depending on your terminal, you can drag and drop the file onto it and it will auto-write the file path.
You could even try playing around with the `TRANSFORMERS_CACHE` environment variable to tell `transformers` where to store stuff. For more details, see the docs and where the variable is used in the plugin enabling HuggingFace models: https://github.com/Crivella/ocr_translate-hugging_face/blob/d6ae9d8f0b6f48b201bfc2ff74a8383909c7a680/ocr_translate_hugging_face/plugin.py#L115
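The kind of cache lookup referenced above could look roughly like this. This is a hedged sketch, not the plugin's actual code: the function name and the fallback default path are my assumptions for illustration.

```python
import os

def model_cache_dir() -> str:
    """Hypothetical helper: decide where model files are stored.

    Mirrors the common pattern of honouring TRANSFORMERS_CACHE when set
    and otherwise falling back to a per-user default (the default path
    below is an assumption, not the plugin's real one).
    """
    default = os.path.join(os.path.expanduser("~"), ".ocr_translate", "models")
    return os.environ.get("TRANSFORMERS_CACHE", default)

# Setting the variable redirects where models would be stored:
os.environ["TRANSFORMERS_CACHE"] = "/tmp/my_model_cache"
print(model_cache_dir())  # -> /tmp/my_model_cache
```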
Found the source of the problem: it was Cyrillic characters in the user name. I changed the name to Latin characters (a hell of a job, I would say), or whatever it is all called...
Now it works like a charm, like clockwork. Possible upgrades for future versions: 1) add support for non-Latin characters in paths; 2) add the possibility to change the default folder (user/.ocr_translate). I believe there is some way via the config or startup args, but I am blind or something; tell me if it is there, please.
I am kinda curious why the application manages to access /user/.ocr_translate with Cyrillic characters in the username, but when it is time to load the translation model it gives an error. Maybe different load/access methods?
But the problem is gone, at least for me, and you are now aware of the possibility of it for the future. Should I close the issue? I suppose I should.
Thank you for your support, great work overall!
Nice that you were able to figure it out.
It is already possible to control where models are stored using the `TRANSFORMERS_CACHE` environment variable; see here for more details.
I am not 100% sure (I should investigate), but I think the problem with the non-Latin characters is inside the `transformers` library, since the problem is with the models and my code was able to create the database. I might open an issue or PR with them in case I can pinpoint where things are going wrong.
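As an illustration of this bug class (explicitly not the actual `transformers` code), unicode-aware path handling copes fine with a Cyrillic username, while any layer that round-trips the path through a legacy 8-bit codec, as some C extensions effectively do, fails on it. The path below is made up:

```python
# Hypothetical Windows profile path with a Cyrillic username.
path = r"C:\Users\Пользователь\.ocr_translate\model.bin"

# Plain unicode string handling (what the database/ORM layer does): fine.
assert "Пользователь" in path

# A naive ascii/latin-1 encode of the same path (what a lower-level
# loader might attempt internally): raises UnicodeEncodeError.
try:
    path.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # -> True
```

This would explain why the database in `.ocr_translate` worked while loading model files through the lower-level library did not: different code paths, different path handling.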
Thanks, hope you will enjoy the tool ;)
I don't really know what to write about it, since it seems I am the only one getting this error. I was trying different settings, source languages, models... It keeps downloading to the directory, but when it's time to actually load the model, it throws an error. What kind of permission does this app need?
Windows 10. Symlinked .ocr_translate directory. CPU version currently (getting the same error with the GPU version as well).
```
2024-04-27 17:39:30,023 - INFO - django.server:basehttp - "GET /get_active_options/ HTTP/1.1" 200 64
2024-04-27 17:39:30,049 - INFO - django.server:basehttp - "GET / HTTP/1.1" 200 3776
2024-04-27 17:39:40,772 - INFO - ocr.general:views - SET LANG: {'lang_src': 'ja', 'lang_dst': 'en'}
2024-04-27 17:39:40,775 - INFO - django.server:basehttp - "POST /set_lang/ HTTP/1.1" 200 2
2024-04-27 17:39:40,798 - INFO - django.server:basehttp - "GET /get_active_options/ HTTP/1.1" 200 64
2024-04-27 17:39:40,829 - INFO - django.server:basehttp - "GET / HTTP/1.1" 200 3926
2024-04-27 17:39:51,882 - INFO - django.server:basehttp - "GET /get_active_options/ HTTP/1.1" 200 64
2024-04-27 17:39:51,893 - INFO - django.server:basehttp - "GET / HTTP/1.1" 200 3926
2024-04-27 17:39:54,027 - INFO - ocr.general:views - SET LANG: {'lang_src': 'ja', 'lang_dst': 'en'}
2024-04-27 17:39:54,030 - INFO - django.server:basehttp - "POST /set_lang/ HTTP/1.1" 200 2
2024-04-27 17:39:54,037 - INFO - django.server:basehttp - "GET /get_active_options/ HTTP/1.1" 200 64
2024-04-27 17:39:54,047 - INFO - django.server:basehttp - "GET / HTTP/1.1" 200 3926
2024-04-27 17:39:59,675 - INFO - ocr.general:views - LOAD MODELS: {'box_model_id': 'easyocr', 'ocr_model_id': 'tesseract', 'tsl_model_id': 'facebook/m2m100_1.2B'}
2024-04-27 17:39:59,676 - INFO - ocr.general:box - Loading BOX model: easyocr
2024-04-27 17:40:02,111 - INFO - plugin:plugin - Loading BOX model: easyocr
Using CPU. Note: This module is much faster with a GPU.
2024-04-27 17:40:05,808 - INFO - ocr.general:ocr - Loading OCR model: tesseract
2024-04-27 17:40:05,819 - INFO - ocr.general:tsl - Loading TSL model: facebook/m2m100_1.2B
2024-04-27 17:40:06,495 - INFO - plugin:plugin - Loading TSL model: facebook/m2m100_1.2B
2024-04-27 17:53:03,187 - ERROR - ocr.general:views - Failed to load models: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
2024-04-27 17:53:05,137 - INFO - django.server:basehttp - - Broken pipe from ('127.0.0.1', 60097)
```
Please tell me if I need to provide other information.
P.S. Using the latest release from 17.12.2023 (same error with the version from 29.10.2023).