alwil closed 1 year ago
@cforgaci, please see my comment explaining the steps taken to work on the encoding issue and the possible root of the problem. I suggest not focusing on this issue at the moment and coming back to it if more PDF files show a similar problem.
If you agree, please feel free to mark the issue as closed.
Thank you, @alwil, for following up on this. I agree that we should put this on hold and focus our attention on the main objectives of the project. I am closing this issue, as you suggest.
Tried the following:
text = page.get_text().encode("utf_8").decode("utf_8")
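import fitz  # PyMuPDF

# all_codecs: list of codec names to try (not defined in this snippet)
def find_codec(file):
    with fitz.open(file) as doc:
        page = doc[1]  # second page of the PDF
        for i in all_codecs:
            for j in all_codecs:
                try:
                    text = page.get_text().encode(i).decode(j)
                    print('conversion from', i, 'to', j, 'successful', text[:10])
                except Exception:
                    # skip codec pairs that fail to encode or decode
                    pass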
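The `all_codecs` list used in the loop above is not shown in the thread. A minimal sketch, assuming it was meant to hold the codec names known to Python's standard library, and using a plain string instead of a PDF page so it can be run without PyMuPDF. The helper name `find_roundtrips` and the choice to drop results identical to the input are illustrative assumptions, not part of the original attempt.

```python
import encodings.aliases

# Assumption: all_codecs holds every codec name the standard library
# registers an alias for (includes a few non-text codecs, which are skipped
# later because str.encode() rejects them with LookupError).
all_codecs = sorted(set(encodings.aliases.aliases.values()))


def find_roundtrips(text, codecs):
    """Return (encode_codec, decode_codec, result) for every codec pair
    that re-encodes and decodes `text` into something different."""
    hits = []
    for enc in codecs:
        try:
            raw = text.encode(enc)
        except (UnicodeError, LookupError):
            continue  # text not representable, or not a text codec
        for dec in codecs:
            try:
                result = raw.decode(dec)
            except (UnicodeError, LookupError):
                continue  # bytes not valid in this codec
            if result != text:
                hits.append((enc, dec, result))
    return hits


# Example: text that was UTF-8 but got decoded as Latin-1 somewhere upstream.
garbled = "é".encode("utf_8").decode("latin_1")
candidates = find_roundtrips(garbled, ["latin_1", "utf_8", "ascii", "cp1252"])
```

Running the helper over the full `all_codecs` list produces many candidate pairs, so the results still need to be inspected by eye, as in the original attempt.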
import chardet
with open(file, 'rb') as f:
    data = f.read()
encoding = chardet.detect(data)['encoding']
encoding
from charset_normalizer import from_path
results = from_path('./my_subtitle.srt')
print(str(results.best()))