Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Concatenated first line with certain PDFs #12

OkkeKlein closed this issue 9 years ago

OkkeKlein commented 9 years ago

This issue derailed the OOM discussion, so a new issue was created.

I added some logging to the ReplaceTransformer and found that, for certain PDFs, the first line arrives at the transformer as a concatenated string (the same line repeated 7 times). After that, the content is normal.

When I extract the text from the command line (pdfbox-app), the first line appears only once, as expected.

I added the file to Dropbox.

OkkeKlein commented 9 years ago

Were you able to reproduce this?

essiembre commented 9 years ago

I re-downloaded the file and was able to reproduce the problem this time. Not sure why, but the copy I downloaded earlier was twice the size and corrupted, yet it did not show this issue. I'll investigate.

essiembre commented 9 years ago

I found the cause. Apparently some PDFs draw the same text in the same area multiple times to make that text appear bold (see PDFBOX-956 and PDFBOX-1155 for more details on this behavior). The extractor then picks up every copy, which is why the line shows up several times in a row.
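To make the idea of suppressing duplicate overlapping text concrete, here is a minimal, self-contained sketch. It is only an illustration and not PDFBox's actual implementation: the `Fragment` type, the `extract` method, and the `tolerance` parameter are hypothetical names made up for this example. It assumes each piece of text comes with the page coordinates where it was drawn, and it skips a piece when the same text was already seen at (nearly) the same position.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: all names here are hypothetical, not part of PDFBox, Tika, or Importer.
final class OverlapDedupSketch {

    /** A piece of text together with the page position where it was drawn. */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Joins fragments in drawing order. When suppressDuplicates is true, a fragment is
     * skipped if an identical text was already kept within `tolerance` of its position.
     */
    static String extract(List<Fragment> fragments, float tolerance, boolean suppressDuplicates) {
        List<Fragment> kept = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            if (suppressDuplicates && isDuplicate(f, kept, tolerance)) {
                continue; // same text, same spot: a "fake bold" redraw
            }
            kept.add(f);
            out.append(f.text);
        }
        return out.toString();
    }

    private static boolean isDuplicate(Fragment candidate, List<Fragment> kept, float tolerance) {
        for (Fragment k : kept) {
            if (k.text.equals(candidate.text)
                    && Math.abs(k.x - candidate.x) <= tolerance
                    && Math.abs(k.y - candidate.y) <= tolerance) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        // A heading drawn three times with tiny offsets to simulate bold.
        page.add(new Fragment("Annual Report", 72.0f, 700.0f));
        page.add(new Fragment("Annual Report", 72.2f, 700.0f));
        page.add(new Fragment("Annual Report", 72.4f, 700.1f));
        page.add(new Fragment(" Body text.", 72.0f, 680.0f));

        System.out.println(extract(page, 0.5f, false)); // heading repeated three times
        System.out.println(extract(page, 0.5f, true));  // heading appears once
    }
}
```

Comparing on both text and position matters: identical text that legitimately appears elsewhere on the page (a repeated word, a table value) is kept, and only redraws at the same spot are dropped.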

Luckily there is a fix, but unfortunately there is no simple configuration flag for it. No worries, the following solution has been tested and works.

You can tell PDFBox to suppress such duplicate text using Java. To do so we'll use an approach similar to the one described in https://github.com/Norconex/importer/issues/9#issuecomment-97955719. Create a new class called CustomDocumentParserFactory, with this code in it:

import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;

import com.norconex.importer.parser.GenericDocumentParserFactory;
import com.norconex.importer.parser.IDocumentParser;
import com.norconex.importer.parser.impl.FallbackParser;

public class CustomDocumentParserFactory extends GenericDocumentParserFactory {
    // Replaces the default fallback parser with one that adjusts the PDF settings.
    @Override
    protected IDocumentParser createFallbackParser() {
        FallbackParser parser = new FallbackParser() {
            @Override
            protected void modifyParseContext(ParseContext context) {
                PDFParserConfig config = context.get(PDFParserConfig.class);
                if (config == null) {
                    config = new PDFParserConfig();
                }
                // The actual fix: ask PDFBox to drop text drawn repeatedly in the same spot.
                config.setSuppressDuplicateOverlappingText(true);
                context.set(PDFParserConfig.class, config);
            }
        };
        parser.setSplitEmbedded(isSplitEmbedded());
        parser.setOCRConfig(getOCRConfig());
        return parser;
    }
}

Once this class is added to your classpath, make sure to reference it in your configuration like this:

<importer>
    ...
    <!-- prefix the class name with your package name if you declared one -->
    <documentParserFactory class="CustomDocumentParserFactory" />
    ...
</importer>

That will get rid of this issue. We plan to make it easier to configure modifiable parsers in a future release.

essiembre commented 9 years ago

The default behavior is now to always suppress duplicate overlapping text. This change is in the new Importer 2.2.0 stable release.