choose/adapt encoding when converting pdf in Python

alwil commented 1 year ago

Tried the following:

- encode and decode in UTF-8: `text = page.get_text().encode("utf_8").decode("utf_8")`
   + **RESULT**: illegible characters

- encode and decode with every pair of available encodings:
   + **RESULT**: illegible characters


all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 
'utf_8', 'utf_8_sig']

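import fitz  # PyMuPDF

def find_codec(file):
    # Try every encode/decode pair on the text of the second page (index 1).
    with fitz.open(file) as doc:
        text = doc[1].get_text()
    for i in all_codecs:
        for j in all_codecs:
            try:
                converted = text.encode(i).decode(j)
            except (UnicodeError, LookupError):
                continue  # this codec pair cannot handle the text
            print('conversion from', i, 'to', j, 'successful:', converted[:10])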
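A note on why these round trips were unlikely to help: `page.get_text()` already returns a Python `str`, and encoding a `str` and decoding it with the same codec gives back the same string. Re-decoding only repairs text whose original bytes were decoded with the wrong codec, which is a different failure from a font that lacks a Unicode mapping. A minimal stdlib-only sketch, using a made-up sample string rather than text from the problematic PDF:

```python
# Illustrative sample; not taken from the problematic PDF.
original = "Z\u00fcrich"  # "Zürich"

# Same codec both ways: a no-op for any str the codec can encode.
assert original.encode("utf_8").decode("utf_8") == original

# Classic mojibake: UTF-8 bytes wrongly decoded as cp1252 ...
garbled = original.encode("utf_8").decode("cp1252")

# ... can be undone by reversing exactly that wrong step.
repaired = garbled.encode("cp1252").decode("utf_8")

print(repr(garbled), "->", repr(repaired))
```

If the extracted text is wrong because glyph codes were never mapped to Unicode, no codec pair can reverse it, since the information is absent rather than mis-decoded.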


- check the encoding with the `chardet` package:
   + **RESULT**: No encoding detected

import chardet

# Read the raw bytes of the file and let chardet guess their encoding.
with open(file, 'rb') as f:
    data = f.read()

encoding = chardet.detect(data)['encoding']
encoding


- check the encoding with the `charset_normalizer` package:
   + **RESULT**: No encoding detected

from charset_normalizer import from_path

results = from_path('./my_subtitle.srt')

print(str(results.best()))


- checked the problematic file's encoding properties in Adobe Acrobat Reader: `File -> Properties -> Fonts -> Encoding` (alternative in Python: `page.get_fonts()`)
   + **RESULT**: the encoding seems to be "Identity-H"

- I briefly searched for the term `Identity-H`: [here](https://community.adobe.com/t5/acrobat-discussions/identity-h-encoding/td-p/10400841) and [here](https://tex.stackexchange.com/questions/526157/what-is-identity-h-encoding-should-it-be-avoided-and-if-so-how)
   + **RESULT**: It seems to be a common encoding; however, extracting text from a PDF that uses it requires a “ToUnicode CMap”. This is normally provided unless non-Unicode fonts are used in InDesign or similar software:
    > Exporting from InDesign or using Acrobat PDFMaker for Word should get this right, unless non-Unicode fonts are used. Don’t use such fonts.

- I tested whether there is a “ToUnicode CMap” by copy-pasting text from the PDF into various text editors. According to the linked discussion, you cannot see directly whether a file has a “ToUnicode CMap”, but if it does, a simple copy-paste should work fine:
> To extract text when this encoding is used, the PDF also needs a “ToUnicode CMap”. You cannot see if one of these is present.

> An easy test to see if a ToUnicode CMap is present for a Identity-H font: If copy&paste mostly works, you certainly have a ToUnicode CMap
   + **RESULT**: It seems that the mapping is missing. 
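The font check and the copy-paste test can also be approximated in code. This is a minimal sketch, assuming PyMuPDF's `Page.get_fonts()` returns tuples of the form `(xref, ext, type, basefont, name, encoding)` and that `Document.xref_get_key()` reports the type `"null"` for a missing key. The function names `relies_on_tounicode` and `fonts_missing_tounicode` are made up for illustration, and the result has not been verified against the problematic file:

```python
def relies_on_tounicode(encoding):
    """Identity-H/V map character codes straight to glyph IDs, not to characters,
    so text extraction for such fonts depends on a /ToUnicode CMap."""
    return encoding in {"Identity-H", "Identity-V"}


def fonts_missing_tounicode(pdf_path, page_number=0):
    """Return (basefont, encoding) for fonts on a page without a /ToUnicode entry."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    missing = []
    with fitz.open(pdf_path) as doc:
        for font in doc[page_number].get_fonts():
            xref, basefont, encoding = font[0], font[3], font[5]
            if xref <= 0:
                continue  # assumption: skip fonts without their own PDF object
            # xref_get_key returns a (type, value) pair for the requested key
            key_type, _ = doc.xref_get_key(xref, "ToUnicode")
            if key_type == "null":
                missing.append((basefont, encoding))
    return missing
```

Fonts that appear in the returned list and for which `relies_on_tounicode(encoding)` is true would be the likely source of illegible output, for example via `fonts_missing_tounicode("problem.pdf", 1)` with a hypothetical file name.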

There is some discussion of how to deal with this issue on the PyMuPDF GitHub page:
- https://github.com/pymupdf/PyMuPDF/issues/365
- https://github.com/pymupdf/PyMuPDF/issues/530

However, focusing on this issue would likely divert us from the main objectives of the project.
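If the problem is set aside, it may still be useful to flag affected files automatically so they can be counted or skipped. The sketch below is one possible heuristic, not something used in the project: it assumes that unmapped glyphs surface in the extracted text as the replacement character U+FFFD or as private-use, unassigned, or control characters. The function name `suspicious_ratio` and any threshold applied to its result are hypothetical.

```python
import unicodedata


def suspicious_ratio(text):
    """Fraction of non-whitespace characters that look like failed glyph mappings."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Co = private use, Cn = unassigned, Cc = control; U+FFFD is checked explicitly
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in {"Co", "Cn", "Cc"}
    )
    return bad / len(chars)
```

A page whose ratio is well above zero could then be logged for later inspection instead of being passed on to the rest of the pipeline.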

alwil commented 1 year ago

@cforgaci, please see my comment above, which explains the steps taken to work on the encoding issue and the possible root of the problem. I suggest not focusing on this issue at the moment and coming back to it if more PDF files turn up with a similar problem.

If you agree, please feel free to mark the issue as closed.

cforgaci commented 1 year ago

Thank you @alwil for following up on this. I agree we put this on hold and focus our attention on the main objectives of the project. I am closing this issue, as you suggest.

UD3-Lab / mintEMU

choose/adapt encoding when converting pdf in Python #18