coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Weeeeiiiiirrrrdddd repetition in the title field #2

Closed avorio closed 2 years ago

avorio commented 3 years ago

Is there a way to spot these errors and remediate?

A School-to-Success Story - Prepared for TThhee UUAAWW-GGMM CCeenntteerr ffoorr HHuummaann RReessoouurrcceess - The Lansing Area Manufacturing Partnership

https://policycommons.net/artifacts/1485884/a-school-to-success-story/
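One possible way to spot and repair titles like the one above, as a minimal sketch. It assumes the damage is always letter-by-letter doubling (as in "TThhee"), which may not hold for other documents. The names `DoubledTitleFixer`, `isDoubled` and `fix` are hypothetical and not part of Nutch, Tika or CoherenceBot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: detects words made entirely of doubled letters
// (e.g. "TThhee") and collapses them back to single letters ("The").
class DoubledTitleFixer {
    // A whole word consisting of at least two doubled-letter pairs.
    private static final Pattern DOUBLED = Pattern.compile("(?:(\\p{L})\\1){2,}");
    // Any run of letters; hyphens, spaces and digits act as separators.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    static boolean isDoubled(String word) {
        return DOUBLED.matcher(word).matches();
    }

    static String fix(String title) {
        Matcher m = WORD.matcher(title);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            // Only rewrite words that are doubled throughout, so ordinary
            // double letters ("School", "Success") are left alone.
            String fixed = isDoubled(word)
                    ? word.replaceAll("(\\p{L})\\1", "$1")
                    : word;
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling `DoubledTitleFixer.fix(...)` on the title above should leave the undamaged words untouched and rewrite only the doubled ones. This is a cleanup applied after extraction, so it treats the symptom rather than the cause.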

PeterCiuffetti commented 3 years ago

Not sure what's happening here. I downloaded the doc; it was prepared by Acrobat Distiller from Microsoft Word 8 and dates to 2002.

If I copy the text on the page with my mouse and paste it somewhere, I get exactly the text shown on the page.

The Lansing Area Manufacturing Partnership A School-to-Success Story  Prepared for The UAW-GM Center for Human Resources

So maybe the PDFBox font parsing is somehow confused by this particular font? Anyway, I'll mark it as medium, given the unknown origin of the error and the uncertainty over whether it is solvable.

PeterCiuffetti commented 3 years ago

I explored this issue a bit this weekend. I am using a test document located at https://coherent-webarchive.s3.us-east-2.amazonaws.com/aed.org/democracy/publications/FinalReport.pdf

This produces a title "oatiaroatiaroatiaroatiaroatia NGONGONGONGONGO DevelopmentDevelopmentDevelopmentDevelopmentDevelopment PPPPProgramrogramrogramrogramrogram" ...via PDF Font selection on Page 1.

The actual text from page one:

image

The full text extracted by Tika is also affected. So the problem seems to be earlier in the pipeline than anything strictly isolated to heading creation. Page 1's full text looks like this on output:

September, 2001 ACADEMY FOR EDUCATIONAL DEVELOPMENT
Final Report
Contract No. EEU-C-00-98-00022-00 Submitted to the U.S. Agency for International Development by The Academy for Educational Development
roatiaroatiaroatiaroatiaroatia NGONGONGONGONGO DevelopmentDevelopmentDevelopmentDevelopmentDevelopment PPPPProgramrogramrogramrogramrogram
C
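The repetition pattern in this output (the same fragment three or more times back to back) can also be caught after extraction. The following is a rough sketch of that idea only; the class name `RepeatCollapser` is hypothetical, and the threshold of three consecutive copies is an assumption chosen so that ordinary words with one repeated syllable are not touched.

```java
import java.util.regex.Pattern;

// Hypothetical helper: collapses a letter sequence that appears three or
// more times in a row ("NGONGONGONGONGO") down to one copy ("NGO").
class RepeatCollapser {
    // Group 1 is the shortest letter sequence that is immediately
    // followed by at least two more copies of itself.
    private static final Pattern REPEATED = Pattern.compile("(\\p{L}+?)\\1{2,}");

    static boolean hasRepetition(String text) {
        return REPEATED.matcher(text).find();
    }

    static String collapse(String text) {
        return REPEATED.matcher(text).replaceAll("$1");
    }
}
```

`hasRepetition` could be used to flag suspect titles for review, and `collapse` to attempt a repair. As with any regex heuristic, it would need checking against real titles before being trusted.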

Tika uses PdfBox to do the actual text extraction. The PdfBox jar comes with classes you can run from the command line. I tried both the latest version of PdfBox (3.0.0) and the one that Nutch imports via Apache Tika (2.0.21). When I run PdfBox to extract the text:

java -jar ./pdfbox-app-3.0.jar export:text -console -i FinalReport.pdf
or
java -jar ./pdfbox-app.2.0.21.jar ExtractText FinalReport.pdf

... I get this for page one.

September, 2001
ACADEMY FOR EDUCATIONAL DEVELOPMENT
Final Report
Contract No. EEU-C-00-98-00022-00
Submitted to the
U.S. Agency for International Development
by
The Academy for Educational Development
roatia
NGO
Development
Program
C

So at least as output by this tool under these circumstances, there is no repetition of text.

So the really large and offset 'C' is not getting selected. Not sure why, but this is a separate issue.

As a refresher: to create this heading selector, I hacked a version of Nutch's PDFParser for Tika so that I could override the text extraction and insert font information into the output. I limited the processing to the first page of the document. After the parsing is done, CoherenceBot hunts through the font clues to select text (see method PDF2XHTML.java#writeString). So whatever is causing the extra text is affecting the text extraction for both the regular TikaParser and CoherenceBot's custom heading parser.

It's somewhat comforting to discover that PdfBox is probably not the culprit. This suggests the problem is in Tika or in Nutch. These two possibilities increase the likelihood that the problem is correctable.

PeterCiuffetti commented 3 years ago

One damaging clue is that the only text that appears to be affected by this multiplying effect is the text that was selected for the heading. Looking through the PDF, there are no exact instances of this same size and weight font being used on page 1, but there are other large fonts on pages 2 and above, and none of these are experiencing the multiplying effect. This suggests the fiend is yours truly.

The fact that this affects both the full text extraction and the heading extraction suggests perhaps a classloader issue. My fiendish overridden classes might be getting loaded by Tika when doing full text selection.

Or the heading selector is modifying the full text after extraction by Tika.

Time to try a run with the heading selector turned off to see which of these is the case.

PeterCiuffetti commented 3 years ago

After a bit of time in the debugger watching Tika and PdfBox do their thing, it's become clear that the PDF has multiple copies of the strings used to display large fonts. It appears to be (in this case) outputting 5 copies of each heading text sequence, moving each one slightly. So for example it outputs "roatia" at position 342.54, 436.74. Then it shifts by -0.03 (x), 0 (y) and outputs it again. On the 3rd display it shifts by 0.03 (x), 0.03 (y) and displays it again. And two more times with similar small adjustments to x and y. The result is that it's displaying a bolder and larger font than is achievable with a single output.

I first discovered this in the debugger and then confirmed this using a PDF stream display tool at https://pdfux.com/inspect-pdf/

I have captured a section of the PDF stream and annotated it. The structure is a stack of args followed by the commands.

image

You can see the 5 copies of "roatia" in the rows that say 'az'. A normal text output to the page would be achieved with a single 'Tj' command (show text).

As mentioned above, PdfBox's command line tool does not exhibit this problem. The Tika parser has a local PDFTextStripper class that is doing this text extraction (by calling methods in PdfBox) and it does not appear to handle this situation, letting each copy of "roatia" out to the saved text.
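The kind of filtering that seems to be missing can be sketched as follows: drop a text run when an identical string has already been emitted at nearly the same position. This is an illustration under assumptions, not the code in Tika or PdfBox. The `OverlapDeduper` and `Run` names and the tolerance value are hypothetical. PdfBox's PDFTextStripper does have a setSuppressDuplicateOverlappingText(boolean) option aimed at this sort of overprinted text, though whether that setting explains the difference between the command line tool and Tika's stripper is not established here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: suppress text runs that repeat an earlier run
// at (almost) the same page position, as happens with faux-bold text.
class OverlapDeduper {
    // One piece of shown text and the position it was drawn at.
    static final class Run {
        final String text;
        final double x;
        final double y;

        Run(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps the first occurrence of each run; later runs with the same
    // text within `tolerance` units on both axes are treated as copies.
    static List<Run> dedupe(List<Run> runs, double tolerance) {
        List<Run> kept = new ArrayList<>();
        for (Run candidate : runs) {
            boolean duplicate = false;
            for (Run earlier : kept) {
                if (earlier.text.equals(candidate.text)
                        && Math.abs(earlier.x - candidate.x) <= tolerance
                        && Math.abs(earlier.y - candidate.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

With offsets of around 0.03 as seen in the debugger, a tolerance well under one text unit would be enough to merge the copies while leaving genuinely separate occurrences of the same word alone. The comparison is quadratic in the number of runs, which matters little for a single page but could be slow for whole documents.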

So solutions to explore:

These classes are unfortunately thousands of lines of code, with deeply nested function calls across Nutch, Tika, PdfBox and FontBox, so it could take a bit of effort. But this is ultimately a solvable problem.

PeterCiuffetti commented 3 years ago

This has been fixed by refactoring the PDFHeading extractor to use PdfBox directly rather than Tika. With that change I could override methods of PdfBox's PDFTextStripper directly.

By overriding PDFTextStripper methods directly, the new solution has some other improvements that came with this greater control over the process.

The main negative implication is that PdfBox is PDF-only while Tika was multi-format aware. The previous heading extractor, though, only overrode the PDF parsing part of Tika, so it wasn't really handling any other formats anyway.

I will need to do some testing of Arabic or Hebrew documents to make sure those are not getting mangled by the token sorting. But if I run into a problem here, it should be easy to fix because we will know the language of the document at the time of parsing.