Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Links not being extracted from PDF documents #428

Closed douglas-andrew-harley closed 4 years ago

douglas-andrew-harley commented 6 years ago

Hello,

The Norconex stack is by far the best crawler+importer technology around; many thanks for your excellent work!

Over the past couple of days I have been developing a tool to generate a broken-links report for a website. After overcoming a few missteps (e.g., needing to use PhantomJS to extract dynamically generated links), things seem to be working really well...but I have just discovered that links are not being extracted from PDF files for some reason. I can see that the PDFs are being retrieved and processed, but the set of extracted links is always empty. Here are the relevant lines from the logging output, and my complete config is below, at the bottom:

ACCEPTED document reference. Reference=<PDF URL>
Queued for processing: <PDF URL>
URLS_EXTRACTED: <PDF URL> (Subject: [])

Any idea what might be going on, or how I can get these links extracted also?

Thanks in advance, Douglas Harley

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="${projectName} HTTP Collector">

  <progressDir>${workDir}/progress</progressDir>
  <logsDir>${workDir}/logs</logsDir>

  <crawlers>

    <crawler id="${projectName} crawler">

      <crawlerListeners>
        <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
          <statusCodes>100-599</statusCodes>
          <outputDir>URLStatusCrawlerEventListener_output</outputDir>
          <fileNamePrefix>brokenLinks</fileNamePrefix>
        </listener>
      </crawlerListeners>

      <startURLs stayOnDomain="false" stayOnPort="true" stayOnProtocol="false">
        <url>${url}</url>
      </startURLs>

      <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>removeDotSegments,removeFragment</normalizations>
      </urlNormalizer>

      <orphansStrategy>DELETE</orphansStrategy>

      <userAgent>${projectName} crawler agent</userAgent>

      <numThreads>${numThreads}</numThreads>

      <workDir>${workDir}</workDir>

      <maxDepth>${maxDepth}</maxDepth>

      <keepDownloads>${keepDownloads}</keepDownloads>

      <robotsTxt ignore="true"/>

      <robotsMeta ignore="true"/>

      <sitemap ignore="true"/>

      <sitemapResolverFactory ignore="true"/>

      <delay default="${delayMs}"/>

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">jpg,gif,png,ico,css,js</filter>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">.*${url}.*</filter>
      </referenceFilters>

      <importer>
        <parseErrorsSaveDir>${workDir}/parse-errors</parseErrorsSaveDir>
      </importer>

      <httpClientFactory>
        <connectionTimeout>${connectionTimeout}</connectionTimeout>
        <maxConnections>${maxConnections}</maxConnections>
      </httpClientFactory>

      <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
        <exePath>${phantomjsExecutablePath}</exePath>
      </documentFetcher>

    </crawler>

  </crawlers>

</httpcollector>
essiembre commented 6 years ago

Thanks for sharing your appreciation! Let's spread the word! :-)

Your question is a challenging one. URLs are currently extracted from a document before it is imported/parsed. There are a few reasons for this, one of which is to be able to reject a document after URLs were extracted and before importing/parsing occurs (to save that processing).

For this reason, the focus has been on text-based link extractors for now. If you know your Java, nothing prevents you from writing your own ILinkExtractor that extracts links from PDFs.
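
As an illustration of that suggestion, a minimal custom extractor could look roughly like the sketch below. It assumes the 2.x ILinkExtractor interface (extractLinks and accepts) and uses Apache PDFBox to pull the text out of the PDF and scan it for absolute URLs; the exact method signatures and the Link constructor should be verified against the installed version before relying on this.

// Hypothetical sketch of a PDF link extractor for Norconex HTTP Collector 2.x.
// Assumes Apache PDFBox 2.x is on the classpath; verify the ILinkExtractor
// signatures and the Link constructor against your installed version.
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import com.norconex.collector.http.url.ILinkExtractor;
import com.norconex.collector.http.url.Link;
import com.norconex.commons.lang.file.ContentType;

public class PdfLinkExtractor implements ILinkExtractor {

    // Naive absolute-URL pattern; good enough to illustrate the idea.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s\"'<>)]+");

    @Override
    public Set<Link> extractLinks(InputStream input, String reference,
            ContentType contentType) throws IOException {
        Set<Link> links = new HashSet<>();
        try (PDDocument doc = PDDocument.load(input)) {
            // Extract the plain text of the PDF and scan it for URLs.
            String text = new PDFTextStripper().getText(doc);
            Matcher m = URL_PATTERN.matcher(text);
            while (m.find()) {
                links.add(new Link(m.group()));
            }
        }
        return links;
    }

    @Override
    public boolean accepts(String url, ContentType contentType) {
        // Only handle PDFs; other content types keep using existing extractors.
        return contentType != null
                && "application/pdf".equals(contentType.toString());
    }
}

The class would then be registered in the crawler configuration alongside (or in place of) the default link extractor, so HTML pages keep using the GenericLinkExtractor while PDFs go through this one.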

I have been thinking for some time about offering the ability to define link extractors that run after importing/parsing has occurred.

It would rely on matching URL patterns only at that point, since markup will be lost. As such, we may have trouble identifying relative URLs after parsing, but that should only affect HTML pages, and those are already covered properly (by the GenericLinkExtractor). Maybe there should even be a new post-import link extractor that is always set by default so that URLs from any content type are caught.

I'll mark this as a feature request.

I'd like to hear your thoughts on this.

douglas-andrew-harley commented 6 years ago

Hi Pascal,

Thanks for responding so promptly. Yes, I am spreading the word indeed! :)

Aha, that makes sense...I appreciate the explanation, and I will develop a custom ILinkExtractor implementation for PDFs and other docs (we also need doc/x, ppt/x, and xls/x files to get their links extracted).

It seems that link extraction could be done in two different modes: the text-based mode as now implemented, and a text+binary mode where parsing is also done during the crawl on binary files. Then, when the import process starts (if defined in the project), it can skip the parsing of the binary files, because those should already have their metadata extracted; only the text-based files would still need parsing, so double-parsing is avoided and total processing time should be about the same for users who do want to import. Also, you could parameterize the link extraction so users who don't need the binary extraction can just use the current system and avoid the parsing if they aren't importing. Just some thoughts off the top-o'-me-head...

Cheers, Doug

essiembre commented 4 years ago

Version 3.0.0 adds a new postImportLinks configuration option that works in addition to existing link extractors. It allows specifying any metadata field as containing URLs to be crawled once importing (document parsing) has been done. This means you can use the Importer module capabilities to extract/create URLs to be crawled.

To help with that, a simple URLExtractorTagger has been added to the Importer module to extract URLs from any plain text. This is useful as a post-parse handler for extracting links from binary documents such as PDFs.
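
As an illustration, the two pieces might be wired together in the crawler configuration roughly as follows. The element and attribute names used here (postParseHandlers, toField, fieldMatcher) and the tagger's package are assumptions to verify against the 3.x documentation, and the extracted-urls field name is just an example.

<!-- Rough sketch only: a post-parse handler that collects URLs from the
     parsed text into a field, plus the new postImportLinks option pointing
     at that field. Verify exact element and attribute names in the 3.x docs. -->
<importer>
  <postParseHandlers>
    <handler class="com.norconex.importer.handler.tagger.impl.URLExtractorTagger"
        toField="extracted-urls"/>
  </postParseHandlers>
</importer>

<postImportLinks keep="true">
  <fieldMatcher>extracted-urls</fieldMatcher>
</postImportLinks>

With this in place, URLs found in the parsed text of PDFs (and other binary documents) are queued for crawling in addition to the links found by the regular pre-parse link extractors.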