Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem into various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

WebDriverHttpFetcher causes TikaException on PDF files #776

Closed · sylvainroussy closed this issue 2 years ago

sylvainroussy commented 2 years ago

Hi!

Using WebDriverHttpFetcher, the importer crashes when parsing PDF files (it works fine with GenericHttpFetcher) due to a Tika exception:

Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@10a6d7
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:287)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:481)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:158)

URL to try it: https://www.basf.com/global/documents/en/news-and-media/publications/reports/2021/BASF_in_India_Factsheet_2020.pdf

Base URL needing WebDriverHttpFetcher: https://www.basf.com/fr/fr/media/publications.html

A bit of my Java code:

// Configure the web driver fetcher (Chrome)
final WebDriverHttpFetcherConfig webDriverConfig = new WebDriverHttpFetcherConfig();
webDriverConfig.setThreadWait(1000L);
webDriverConfig.setWaitForElementTimeout(1000L);
webDriverConfig.setBrowser(Browser.CHROME);

// Use an HTTP sniffer to set the user agent on proxied requests
final HttpSnifferConfig snifferConfig = new HttpSnifferConfig();
snifferConfig.setUserAgent(oxwayFetchConfiguration.getUserAgent());
webDriverConfig.setHttpSnifferConfig(snifferConfig);

webDriverConfig.setDriverPath(Paths.get("/data1/tools/chromedriver"));
final WebDriverHttpFetcher webDriver = new WebDriverHttpFetcher(webDriverConfig);
essiembre commented 2 years ago

WebDriverHttpFetcher is meant to be used to grab HTML pages. Web drivers are used by the crawler to access a generated page DOM model when important parts of it are generated on the client-side. PDFs are not considered "web pages" and have no such DOM model returned by the browser. What you would get instead (if anything) is not a PDF binary and would likely be the cause of the exception you are getting (Tika trying to parse a PDF that is not really a PDF).

Web driver implementations have different levels of support for "downloads". Through additional web driver settings and/or possibly some code injection to simulate a user click, people have had varying levels of success saving files to a specific local file system folder.
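
For illustration, here is a minimal plain-Selenium sketch of that download-folder approach, assuming Chrome. The folder path is a placeholder, the preference names should be verified against your Chrome version, and wiring such options into WebDriverHttpFetcher is not shown:

import java.util.HashMap;
import java.util.Map;

import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class ChromeDownloadSketch {
    public static void main(String[] args) {
        // Placeholder local folder where Chrome should save files.
        String downloadDir = "/data1/downloads";

        // Chrome preferences that make the browser save files without prompting.
        Map<String, Object> prefs = new HashMap<>();
        prefs.put("download.default_directory", downloadDir);
        prefs.put("download.prompt_for_download", false);
        // Download PDFs instead of rendering them in the built-in viewer.
        prefs.put("plugins.always_open_pdf_externally", true);

        ChromeOptions options = new ChromeOptions();
        options.setExperimentalOption("prefs", prefs);

        ChromeDriver driver = new ChromeDriver(options);
        try {
            // The PDF (if downloaded) ends up in downloadDir, not in the DOM.
            driver.get("https://www.example.com/some.pdf");
        } finally {
            driver.quit();
        }
    }
}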

Before you go that route, I would recommend limiting the use of WebDriverHttpFetcher to the specific HTML pages that need it and using the GenericHttpFetcher for the rest. With version 3.x you can define multiple fetchers and configure rules to select which one is used. If you have many pages that are not JavaScript-generated, you will also make your overall crawling significantly faster that way. The following config snippet shows how you can have multiple fetchers:

<!-- You can optionally retry a failing fetcher (e.g., request timeout) -->
<httpFetchers maxRetries="1" retryDelay="5 seconds">

  <!-- Example fetcher used for all but PDFs -->

  <fetcher class="WebDriverHttpFetcher">
    <referenceFilters>
      <filter class="ReferenceFilter" onMatch="include">
        <valueMatcher method="regex">
          ^https://www\.example\.com/.*
        </valueMatcher>
      </filter>
      <filter class="ReferenceFilter" onMatch="exclude">
        <valueMatcher method="regex">
          .*\.pdf$
        </valueMatcher>
      </filter>
    </referenceFilters>
    <!-- ... more fetcher config here ... -->
  </fetcher>

  <!-- Only invoked if the previous one was not accepted/successful -->

  <fetcher class="GenericHttpFetcher">
    <referenceFilters>
      <filter class="ReferenceFilter" onMatch="include">
        <valueMatcher method="regex">
          ^https://www\.example\.com/some/pattern.*\.html$
        </valueMatcher>
      </filter>
    </referenceFilters>
    <!-- ... more fetcher config here ... -->
  </fetcher>

</httpFetchers>

The main issue with such an approach is if the non-HTML files you are targeting with the GenericHttpFetcher (e.g., PDFs) require authentication or another state previously established by the web driver fetcher. For instance, if you use the WebDriverHttpFetcher to log in, switching to GenericHttpFetcher typically won't work for protected content. You could have the GenericHttpFetcher log in again, but if that's not possible, you are out of luck and will have to rely on your coding/scripting skills for a workaround.
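
As a rough sketch of such a workaround (assuming Selenium and Apache HttpClient 4.x; the URL and target path are placeholders), you could log in with the web driver and replay its session cookies on a plain HTTP client to download the protected binaries:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.openqa.selenium.WebDriver;

public final class CookieHandoff {

    // Copy the browser session cookies onto a plain HTTP client so it
    // can fetch content protected by the WebDriver-established session.
    static CloseableHttpClient clientFromDriver(WebDriver driver) {
        BasicCookieStore store = new BasicCookieStore();
        for (org.openqa.selenium.Cookie c : driver.manage().getCookies()) {
            BasicClientCookie copy = new BasicClientCookie(c.getName(), c.getValue());
            copy.setDomain(c.getDomain());
            copy.setPath(c.getPath());
            store.addCookie(copy);
        }
        return HttpClients.custom().setDefaultCookieStore(store).build();
    }

    static void downloadPdf(WebDriver loggedInDriver) throws Exception {
        try (CloseableHttpClient client = clientFromDriver(loggedInDriver);
                InputStream in = client.execute(
                        new HttpGet("https://www.example.com/protected.pdf"))
                        .getEntity().getContent()) {
            Files.copy(in, Paths.get("/tmp/protected.pdf"),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }
}

Whether this works depends on the site tying its session to cookies rather than to other browser state.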

Does that help?

sylvainroussy commented 2 years ago

Thank you Pascal, I wrote a workaround based on your response and it works.