Add function for text chunks extraction.

dkaluza commented 1 month ago

As mentioned in the issue, the current extract_text implementation returns Err even when multiple text fragments could be extracted properly.

This PR aims to improve on that by implementing a more robust extract_text_chunks, which returns text fragments even if some errors occurred.

Additionally, some basic text extraction tests have been added.

Roba1993 commented 1 month ago

Hey, thanks for the effort!

3 out of 4 of my test .pdf files are now working perfectly.

The attached jspdf still produces the error "ToUnicode CMap error: Could not parse ToUnicodeCMap: Error!" pdf2.pdf

dkaluza commented 1 month ago

Works quite OK for me. Are you using the proposed extract_text_chunks? The output I received:

[Err(ToUnicodeCMap(Parse(Error))), Ok("Features: \n"), Ok("- different "), Ok("font "), Ok("styling "), Ok("options \n"), Ok("-Images(JPEGs,otherPDFs) \n"), Ok("-Tables(fixedlayout,headerrow) \n"), Ok("-AFMfontsand \n"), Ok("\n"), Ok("-AddexistingPDFs(mergethemoraddthemaspagetemplates) \n"), Ok("Formoreinformationvisitthe Documentation \n"), Ok("2 \n"), Ok("pc. \n"), Ok("ArticleA "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. \n"), Ok("500.00€ \n"), Ok("1000.00€ \n"), Ok("1 \n"), Ok("pc. \n"), Ok("ArticleB "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. \n"), Ok("250.00€ \n"), Ok("250.00€ \n"), Ok("2 \n"), Ok("pc. \n"), Ok("ArticleC "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. \n"), Ok("330.00€ \n"), Ok("660.00€ \n"), Ok("3 \n"), Ok("pc. \n"), Ok("ArticleD "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. \n"), Ok("1220.00 € \n"), Ok("3660.00€ \n"), Ok("2 \n"), Ok("pc. \n"), Ok("ArticleE "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. 
\n"), Ok("120.00€ \n"), Ok("240.00€ \n"), Ok("250.00€ \n"), Ok("50.00€ \n"), Ok("pc. \n"), Ok("5 \n"), Ok("ArticleF "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid \n"), Ok("fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. \n")]

The Err(ToUnicodeCMap(Parse(Error))) is still there because the CMap does not follow the spec, but the text that was parsed properly is available in the other vector entries.
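For callers who only want the recovered text, the vector above can be split into the successful fragments and the errors. The following is a minimal sketch, not the code from this PR: `ChunkError` and `split_chunks` are hypothetical names introduced for illustration, and the stand-in error type only mimics the shape of the output shown above so that the example compiles without lopdf.

```rust
/// Hypothetical stand-in for the library's error type (assumption: the real
/// type comes from lopdf and is richer than this).
#[derive(Debug, Clone, PartialEq)]
pub enum ChunkError {
    ToUnicodeCMap(String),
}

/// Concatenates every successfully extracted fragment and collects the
/// errors separately, so that one bad chunk does not discard the rest.
pub fn split_chunks(chunks: Vec<Result<String, ChunkError>>) -> (String, Vec<ChunkError>) {
    let mut text = String::new();
    let mut errors = Vec::new();
    for chunk in chunks {
        match chunk {
            // Keep the fragment exactly as extracted, including newlines.
            Ok(fragment) => text.push_str(&fragment),
            // Remember the failure so the caller can log or inspect it.
            Err(e) => errors.push(e),
        }
    }
    (text, errors)
}
```

With the first few entries of the output above, this would yield the text "Features: \n- different font " plus a single collected error. Keeping the errors instead of dropping them lets the caller decide whether a partially extracted page is acceptable.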

There is some issue with spaces in this parsed PDF (I have heard that some PDF writers position text with offsets between text operators instead of emitting regular space characters, which might be the case here), but it looks to me like an entirely separate issue that would require additional work on determining text chunk positions. (The same problem appeared when I manually fixed the CMap to follow the spec and parsed the text with extract_text from main.)
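To illustrate the kind of additional work this might involve, here is a rough sketch of one common heuristic, under the unconfirmed assumption that the gaps are encoded as numeric adjustments inside TJ arrays. In a TJ array, numbers are expressed in thousandths of a text-space unit and are subtracted from the current position, so a sufficiently negative value opens a visible gap. `TjItem`, `join_with_inferred_spaces`, and the threshold are hypothetical and not part of lopdf or this PR.

```rust
/// Hypothetical model of one TJ array element: a text run or a numeric
/// position adjustment in thousandths of a text-space unit.
#[derive(Debug, Clone, PartialEq)]
pub enum TjItem {
    Text(String),
    Adjust(f32),
}

/// Joins the text runs and inserts a single space wherever an adjustment
/// moves the next glyph right by at least `gap_threshold`.
/// The threshold is a tuning assumption; a real value depends on the font.
pub fn join_with_inferred_spaces(items: &[TjItem], gap_threshold: f32) -> String {
    let mut out = String::new();
    for item in items {
        match item {
            TjItem::Text(s) => out.push_str(s),
            TjItem::Adjust(a) => {
                // Negative adjustments shift right; treat large ones as a word gap.
                let gap = -*a;
                if gap >= gap_threshold && !out.is_empty() && !out.ends_with(' ') {
                    out.push(' ');
                }
            }
        }
    }
    out
}
```

For example, the items `Text("Lorem")`, `Adjust(-250.0)`, `Text("ipsum")`, `Adjust(-20.0)`, `Text("dolor")` with a threshold of 200 would give "Lorem ipsumdolor": only the large gap becomes a space, while the small kerning adjustment is ignored. Gaps created by separate positioning operators between text operators would need position tracking beyond this sketch.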

Roba1993 commented 1 month ago

Ah yeah, I was just blind and did not see that the function is public. It works like a charm now. Thank you once again for the support and implementation. Looking forward to having this PR merged.

J-F-Liu / lopdf

Add function for text chunks extraction. #342