ashutoshvarma / pyxpdf

Fast and memory-efficient Python PDF Parser based on xpdf sources
https://pyxpdf.readthedocs.io/

Text breaking when used for Gurmukhi (Punjabi) script #42

Open anagha-choudhari19 opened 2 years ago

anagha-choudhari19 commented 2 years ago

I want to extract text from a PDF written in Gurmukhi script (the Punjabi language),
but the characters are read incorrectly when the text is extracted.

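```python
# Imports assumed; they were not shown in the original snippet
from pyxpdf import Document
from pyxpdf.xpdf import TextControl

pdf_path = '/content/Punjab2_new.pdf'
doc = Document(pdf_path)

text_control = TextControl("physical", insert_bom=True)
for page in range(len(doc)):
    # Extract text from a fixed page area using the "physical" layout mode
    out_res = doc[page].text((0, 90, 155, 700), text_control)
    print('\n_New_page_output___\n')
    print(out_res)
```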

Here are images of my expected and actual results. The expected image is a sample of my input:

expected_text

With the `text` function, the characters come out wrong:

actual_output
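
To make the mismatch easier to describe, a small check along these lines might help. It is only a sketch using the Python standard library, not part of pyxpdf: the helper names `gurmukhi_ratio` and `describe` and the sample string are hypothetical, and the only fact it relies on is that the Gurmukhi Unicode block spans U+0A00 to U+0A7F. It reports what fraction of the extracted characters fall in that block and lists the code points of the first few characters.

```python
import unicodedata

# Gurmukhi Unicode block: U+0A00..U+0A7F
GURMUKHI_START, GURMUKHI_END = 0x0A00, 0x0A7F
BOM = "\ufeff"  # may be present at the start when insert_bom=True is used


def gurmukhi_ratio(text):
    """Return the fraction of non-whitespace, non-BOM characters in the Gurmukhi block."""
    chars = [ch for ch in text if not ch.isspace() and ch != BOM]
    if not chars:
        return 0.0
    hits = sum(GURMUKHI_START <= ord(ch) <= GURMUKHI_END for ch in chars)
    return hits / len(chars)


def describe(text, limit=20):
    """Return (code point, Unicode name) pairs for the first `limit` characters."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text[:limit]
    ]


if __name__ == "__main__":
    # Hypothetical sample; in practice this would be the string returned by page.text(...)
    sample = "ਪੰਜਾਬੀ"
    print(gurmukhi_ratio(sample))
    for code, name in describe(sample):
        print(code, name)
```

If the ratio for the extracted text is high but the output still looks wrong, the code points are probably correct and only their order or combining is off. If the ratio is low, the characters are likely being mapped to the wrong code points altogether.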

PDF download.pdf

It would be a great help if any pyxpdf parameters could solve this issue.