axa-group / Parsr

Transforms PDF, Documents and Images into Enriched Structured Data
Apache License 2.0

No text detected in pdf #620

Open rafaeldepablo opened 1 year ago

rafaeldepablo commented 1 year ago

SABADELL_GOBIERNO_CORPORATIVO_2022.pdf

Summary

Great software.

I'm running into strange behavior with some PDFs: apparently Parsr is not finding any text except on the first page.

The PDF files are normal; it is possible to copy and search the text.

However, it does find the tables, even though their text is blank.

Steps To Reproduce

Load the PDF and try to process it.

Expected behavior

The text is processed

Actual behavior

No text is identified
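One way to confirm which pages came back without text is to inspect the JSON result directly. The sketch below is only an illustration, not part of Parsr: it assumes the JSON output has a top-level `pages` list, that each page has `pageNumber` and `elements`, and that text is nested down to elements of type `word` whose `content` is a string. The helper names `count_words` and `pages_without_text` are hypothetical.

```python
from typing import Any, Dict, List


def count_words(element: Dict[str, Any]) -> int:
    """Recursively count non-blank 'word' elements under one element.

    Assumption: container elements (paragraph, line, table, row, cell)
    keep their children in a list under the 'content' key.
    """
    if element.get("type") == "word":
        content = element.get("content")
        # A word only counts if it carries non-blank text.
        return 1 if isinstance(content, str) and content.strip() else 0
    children = element.get("content")
    if isinstance(children, list):
        return sum(count_words(c) for c in children if isinstance(c, dict))
    return 0


def pages_without_text(document: Dict[str, Any]) -> List[int]:
    """Return the page numbers that contain no non-blank words."""
    empty: List[int] = []
    for index, page in enumerate(document.get("pages", []), start=1):
        words = sum(count_words(e) for e in page.get("elements", []))
        if words == 0:
            # Fall back to the positional index if 'pageNumber' is missing.
            empty.append(page.get("pageNumber", index))
    return empty
```

To use it, load the downloaded JSON result with `json.load` and pass the resulting dictionary to `pages_without_text`. A page that holds only tables with blank cells would be reported as empty, which matches the behavior described above.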

Screenshots

image

Environment

sudo docker run -p 3001:3001 axarev/parsr:latest

Thanks in advance

NgoDuyVu1993 commented 1 year ago

Hi @rafaeldepablo, ignore my comment if you find it irrelevant. I am not on the Parsr team; I had some problems with table detection, so I looked around to see if anyone had the same issue. I tried running your document, and Parsr detects the text in it fine.

image

image

I think you may have missed something in the settings when you uploaded the document. Here is how I configured it:

image

rafaeldepablo commented 1 year ago

Thanks

I tried again and it crashed, but when I retried once more it worked.

Regards

rafa