VikParuchuri / surya

OCR, layout analysis, reading order, line detection in 90+ languages
https://www.datalab.to
GNU General Public License v3.0

UnicodeEncodeError: 'charmap' codec can't encode character #38

Open RandomInternetPreson opened 6 months ago

RandomInternetPreson commented 6 months ago

First of all, really great project!! This is a fantastic OCR model!

I have a test PDF called webpage.pdf.

I can read all the text of the PDF using the UI provided; the text on each page is read correctly. However, when I try running the entire PDF through the command prompt using this command:

surya_ocr webpage.pdf --results_dir C:\Users\myself\Desktop\WebSearchExtension --langs hi,en

Several pages will process just fine until it gets to the end of page 4. I then get the following error in the command prompt and there are no subsequent pages read:

(surya) PS C:\Users\myself\Desktop\WebSearchExtension> surya_ocr webpage.pdf --results_dir C:\Users\myself\Desktop\WebSearchExtension --langs hi,en
C:\Users\myself\miniconda3\envs\surya\lib\site-packages\transformers\utils\generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
C:\Users\myself\miniconda3\envs\surya\lib\site-packages\transformers\utils\generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
Loading detection model vikp/surya_det on device cuda with dtype torch.float16
Loading recognition model vikp/surya_rec on device cuda with dtype torch.float16
Detecting bboxes: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.38it/s]
Recognizing Text: 0%| | 0/1 [00:00<?, ?it/s]C:\Users\myself\miniconda3\envs\surya\lib\site-packages\transformers\generation\utils.py:1518: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version.
Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
Recognizing Text: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.96s/it]
Traceback (most recent call last):
  File "C:\Users\myself\miniconda3\envs\surya\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\myself\miniconda3\envs\surya\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\myself\miniconda3\envs\surya\Scripts\surya_ocr.exe\__main__.py", line 7, in <module>
  File "C:\Users\myself\miniconda3\envs\surya\lib\site-packages\ocr_text.py", line 75, in main
    json.dump(out_preds, f, ensure_ascii=False)
  File "C:\Users\myself\miniconda3\envs\surya\lib\json\__init__.py", line 180, in dump
    fp.write(chunk)
  File "C:\Users\myself\miniconda3\envs\surya\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25aa' in position 1: character maps to <undefined>

Thanks again, I have tried a lot of OCR models and this one works well!

VikParuchuri commented 6 months ago

Try changing line 74 in C:\Users\myself\miniconda3\envs\surya\lib\site-packages\ocr_text.py to:

with open(os.path.join(result_path, "results.json"), "w+", encoding="utf-8") as f:

(that is, add encoding="utf-8" to the open call). I think your system defaults to another encoding (the traceback shows cp1252), which is causing the error.

If this works for you, I'll make a fix to the code.
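
For anyone who wants to see the failure and the fix side by side, here is a minimal sketch that does not use surya at all. It assumes the problem is only the file encoding: `out_preds` is a made-up stand-in for the real OCR results, the output goes to a temporary directory, and cp1252 is passed explicitly to mimic the Windows default seen in the traceback.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the OCR results; "\u25aa" is the character
# named in the traceback above.
out_preds = {"webpage": [{"text": "\u25aa bullet item"}]}

# Temporary directory used here instead of a real --results_dir.
result_path = tempfile.mkdtemp()
out_file = os.path.join(result_path, "results.json")

# Passing cp1252 explicitly mimics a system whose default encoding is cp1252.
# With ensure_ascii=False the raw character is written, and cp1252 has no
# mapping for U+25AA, so the write fails.
try:
    with open(out_file, "w+", encoding="cp1252") as f:
        json.dump(out_preds, f, ensure_ascii=False)
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# The suggested change: open the file as UTF-8, which can encode any character.
with open(out_file, "w+", encoding="utf-8") as f:
    json.dump(out_preds, f, ensure_ascii=False)

# Read the file back to confirm the character survived the round trip.
with open(out_file, encoding="utf-8") as f:
    round_tripped = json.load(f)
```

Passing the encoding explicitly keeps the output independent of the machine's locale. Enabling Python's UTF-8 mode (setting the `PYTHONUTF8=1` environment variable) should also make `open` default to UTF-8 without editing the installed file, though that was not tested in this thread.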

RandomInternetPreson commented 6 months ago

Frick! Worked perfectly <3 Thank you!

Again really great stuff! This is a great OCR model!!!