Closed dkaluza closed 1 month ago
Hey, thanks for the effort!
3 out of 4 of my test .pdf files are now working perfectly.
The attached jspdf still produces the error "ToUnicode CMap error: Could not parse ToUnicodeCMap: Error!" pdf2.pdf
Works quite OK for me. Are you using the proposed extract_text_chunks?
Output I received:
[Err(ToUnicodeCMap(Parse(Error))), Ok("Features: \n"), Ok("- different "), Ok("font "), Ok("styling "), Ok("options \n"), Ok("-Images(JPEGs,otherPDFs) \n"), Ok("-Tables(fixedlayout,headerrow) \n"), Ok("-AFMfontsand \n"), Ok("\n"), Ok("-AddexistingPDFs(mergethemoraddthemaspagetemplates) \n"), Ok("Formoreinformationvisitthe Documentation \n"), Ok("2 \n"), Ok("pc. \n"), Ok("ArticleA "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. \n"), Ok("500.00€ \n"), Ok("1000.00€ \n"), Ok("1 \n"), Ok("pc. \n"), Ok("ArticleB "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. \n"), Ok("250.00€ \n"), Ok("250.00€ \n"), Ok("2 \n"), Ok("pc. \n"), Ok("ArticleC "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. \n"), Ok("330.00€ \n"), Ok("660.00€ \n"), Ok("3 \n"), Ok("pc. \n"), Ok("ArticleD "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. \n"), Ok("1220.00 € \n"), Ok("3660.00€ \n"), Ok("2 \n"), Ok("pc. \n"), Ok("ArticleE "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. 
\n"), Ok("120.00€ \n"), Ok("240.00€ \n"), Ok("250.00€ \n"), Ok("50.00€ \n"), Ok("pc. \n"), Ok("5 \n"), Ok("ArticleF "), Ok("Loremipsumdolorsitamet,consecteturadipiscingelit.Cumid \n"), Ok("fugiunt,reeademquaePeripatetici,verba.Tenesneigitur,inquam, HieronymusRhodiusquiddicatessesummumbonum,quoputet omniareferrioportere?Quianechonestoquicquamhonestiusnec turpiturpius. \n")]
The Err(ToUnicodeCMap(Parse(Error))) is still there because the CMap does not follow the spec, but the text that was parsed properly is available in the other vector entries.
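For anyone consuming output shaped like the vector above, here is a minimal sketch of separating the recovered text from the errors. It only assumes the result is a vector of per-chunk `Result` values; `ChunkError` and `split_chunks` are hypothetical names made up for illustration and are not part of the library.

```rust
/// Hypothetical stand-in for the library's real error type (an assumption).
#[derive(Debug, Clone, PartialEq)]
enum ChunkError {
    ToUnicodeCMap(String),
}

/// Concatenates every successfully extracted fragment and collects the
/// errors separately, so one bad chunk does not hide the rest of the text.
fn split_chunks(chunks: Vec<Result<String, ChunkError>>) -> (String, Vec<ChunkError>) {
    let mut text = String::new();
    let mut errors = Vec::new();
    for chunk in chunks {
        match chunk {
            Ok(fragment) => text.push_str(&fragment),
            Err(err) => errors.push(err),
        }
    }
    (text, errors)
}

fn main() {
    // Sample data mirroring the start of the output shown above.
    let chunks = vec![
        Err(ChunkError::ToUnicodeCMap("Parse(Error)".to_string())),
        Ok("Features: \n".to_string()),
        Ok("- different ".to_string()),
        Ok("font ".to_string()),
    ];
    let (text, errors) = split_chunks(chunks);
    println!("{text}");
    eprintln!("{} chunk(s) failed: {:?}", errors.len(), errors);
}
```

Whether to log, surface, or ignore the collected errors is up to the caller; the sketch just keeps both sides available.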
There is also an issue with spaces in this parsed PDF. I have heard that some PDF writers use position offsets between text operators instead of regular space characters, which might be the case here. It looks like an entirely separate issue to me, one that would require additional work on determining text chunk positions. The same problem was there when I manually fixed the CMap to follow the spec and parsed the text with extract_text from main.
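To illustrate the kind of additional work meant here, below is a rough sketch of one common heuristic, assuming the gaps are encoded as numeric adjustments inside a `TJ` array (that has not been confirmed for this file). In the PDF spec those numbers are in thousandths of a text-space unit and are subtracted from the current position, so a large negative value widens the gap. `TjItem`, `join_with_gaps`, and the threshold of 200 are all hypothetical choices for the example, not anything from this PR.

```rust
/// One element of a PDF `TJ` array: a string to show, or a numeric position
/// adjustment in thousandths of a text-space unit. Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
enum TjItem {
    Text(String),
    Adjust(f32),
}

/// Joins the text items, inserting a space wherever an adjustment moves the
/// next glyph to the right by more than `threshold`.
fn join_with_gaps(items: &[TjItem], threshold: f32) -> String {
    let mut out = String::new();
    for item in items {
        match item {
            TjItem::Text(s) => out.push_str(s),
            // A negative adjustment widens the gap; treat a wide gap as a space.
            TjItem::Adjust(n) if -*n > threshold && !out.is_empty() && !out.ends_with(' ') => {
                out.push(' ')
            }
            // Small adjustments are treated as kerning and ignored.
            TjItem::Adjust(_) => {}
        }
    }
    out
}

fn main() {
    let items = vec![
        TjItem::Text("Lo".to_string()),
        TjItem::Adjust(15.0), // kerning-sized tweak
        TjItem::Text("rem".to_string()),
        TjItem::Adjust(-250.0), // word-sized gap
        TjItem::Text("ipsum".to_string()),
    ];
    println!("{}", join_with_gaps(&items, 200.0)); // prints "Lorem ipsum"
}
```

A real fix would probably also need to look at moves made by other positioning operators and scale the threshold by font metrics, which is why it seems better handled separately from this PR.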
Ah yeah, I was just blind not to see that the function is public. It works like a charm now. Thank you once again for the support and the implementation. Looking forward to having this PR merged.
As mentioned in the issue, the current extract_text implementation returns Err even in situations where multiple text fragments can be properly extracted. This PR aims to improve on that by implementing a more robust extract_text_chunks that returns text fragments even if some errors occurred. Additionally, some basic text extraction tests have been added.
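The difference between the two behaviours can be sketched in a few lines. This is not the code from the PR: `decode`, `extract_all`, and `extract_chunks` are hypothetical stand-ins, and the `!` prefix is just a made-up way to simulate a fragment whose CMap cannot be parsed.

```rust
/// Hypothetical decoder: fragments starting with '!' simulate a broken CMap.
fn decode(raw: &str) -> Result<String, String> {
    if raw.starts_with('!') {
        Err(format!("could not decode {raw:?}"))
    } else {
        Ok(raw.to_string())
    }
}

/// Fail-fast: collecting into a single `Result` stops at the first error,
/// so every fragment that could have been decoded is lost as well.
fn extract_all(fragments: &[&str]) -> Result<String, String> {
    fragments.iter().map(|f| decode(f)).collect()
}

/// Per-chunk: each fragment keeps its own `Result`, so successful fragments
/// survive next to the failures.
fn extract_chunks(fragments: &[&str]) -> Vec<Result<String, String>> {
    fragments.iter().map(|f| decode(f)).collect()
}

fn main() {
    let fragments = ["!bad-cmap", "Features: ", "font "];
    println!("{:?}", extract_all(&fragments));
    println!("{:?}", extract_chunks(&fragments));
}
```

With the sample input, the first call yields only an error while the second still returns the two readable fragments, which matches the motivation described above.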