ignoreExternalLinks="true" processes ExternalLinks

csaezl commented 9 years ago

I have a collector with <extractor ignoreExternalLinks="true" /> that processes the URL http://www.fexb.es/. At some point it processes the page https://www.facebook.com/r.php?locale=es_ES and others from the Facebook site.

Another collector is running at the same time against different URLs.

essiembre commented 9 years ago

If your configs point to different paths for all path-related settings, there is no reason why the other collector should cause the ignoreExternalLinks feature to fail. Can you share your config to help reproduce? And roughly after how many documents does it start crawling Facebook?

essiembre commented 9 years ago

Just a thought: extractors need to be wrapped in a linkExtractors tag. In other words, do you have it this way?

<linkExtractors>
    <extractor ignoreExternalLinks="true" />
</linkExtractors>

csaezl commented 9 years ago

The first page from Facebook, which is actually DOCUMENT_FETCHED: https://www.facebook.com/fexbfederacionextremenabaloncesto, appears after the "4% completed (38 processed/796 total)" message plus 4 more DOCUMENT_FETCHED events.

Here is the config file:

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="MC (collector)">

  #set($filterRegexRef  = "com.norconex.collector.core.filter.impl.RegexReferenceFilter")
  #set($workdir = "D:/CRAWLER-MC/collectors/MC/work/")

  <progressDir>$workdir/progress</progressDir>
  <logsDir>$workdir/log</logsDir>

  <crawlerDefaults>
      <delay default="100" />
      <numThreads>3</numThreads>
      <maxDepth>-1</maxDepth>
      <maxDocuments>-1</maxDocuments>
      <keepDownloads>false</keepDownloads>
      <orphansStrategy>IGNORE</orphansStrategy>
      <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>removeDotSegments</normalizations>
      </urlNormalizer>    
      <canonicalLinkDetector ignore="true"> </canonicalLinkDetector>
  </crawlerDefaults>

  <crawlers>
    <crawler id="MC (crawler)">
      <robotsTxt ignore="true" />
      <sitemap ignore="true" />
      <httpClientFactory>
            <trustAllSSLCertificates>true</trustAllSSLCertificates>
      </httpClientFactory>
      <workDir>$workdir</workDir>
      <linkExtractors>
        <extractor ignoreExternalLinks="true" />
      </linkExtractors>   
      <startURLs>
        <url>http://www.fexb.es/</url>      
      </startURLs>

      <importer>
        <!-- max memory used for a single file. 10MB by default -->
        <maxFileCacheSize>10000000</maxFileCacheSize>
        <!-- max memory for the sum of all files.  100MB by default -->
        <maxFilePoolCacheSize>100000000</maxFilePoolCacheSize>  
      </importer>

      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/MC</solrURL>
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>content</targetContentField>
        <commitBatchSize>10</commitBatchSize>
        <queueDir>$workdir/queue</queueDir>
        <queueSize>100</queueSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
        <commitDisabled>true</commitDisabled>

        <solrUpdateURLParams>
          <param name="update.chain">langid</param>
        </solrUpdateURLParams>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>

essiembre commented 9 years ago

I was able to reproduce the problem with your config, and I have nailed down the culprit: http://www.fexb.es/index.php/component/banners/click/15

That URL produces a redirect to this Facebook page: https://www.facebook.com/fexbfederacionextremenabaloncesto

External URLs are excluded after they are extracted from a page, but before they are fetched. The banner URL is not excluded because it is valid (on your own site) at that time. It is only once it is fetched that the redirect takes place and the URL changes to a Facebook one.

We need to revisit how this feature is implemented. Since the "intent" of the feature is not respected, I will mark this as a bug.

It may take some time to find a clean solution, so in the meantime I recommend you add a RegexMetadataFilter on the field document.reference to stick to your domain (this filter is triggered after a document is downloaded). It is best to add it under preParseHandlers.
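As a rough sketch of that workaround, the filter could be declared inside the crawler's existing <importer> section along these lines. The class name and the onMatch/field attributes are written from memory of the Importer 2.x documentation and may differ in your version, and the regular expression is only an assumption about what "your domain" should cover for this site:

```xml
<importer>
  <preParseHandlers>
    <!-- Assumed syntax: keep only documents whose reference, as seen after
         download (and therefore after any redirect), is still on www.fexb.es. -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
            onMatch="include" field="document.reference">
      <regex>https?://www\.fexb\.es/.*</regex>
    </filter>
  </preParseHandlers>
</importer>
```

Because this runs as a pre-parse handler, the redirected document is still downloaded, but it should be rejected before parsing and committing.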

csaezl commented 9 years ago

Thank you for your support.

I've tried your advice but, although the filtered URLs are not committed, they are almost fully processed, which takes a lot of time.

So, until you get the right solution, I'll use <referenceFilters> again.

essiembre commented 9 years ago

<referenceFilters> won't do it in this case unless you filter out the .../component/banners/click/15 URL specifically.

The <referenceFilters> step takes place before a page gets downloaded (to save bandwidth and speed things up). At that time the URL is still valid (it starts with http://www.fexb.es/), so it will be downloaded. It is when the download occurs that the URL changes because of the redirect. So using <referenceFilters> will not reject the Facebook page.

My recommendation filters it after download, but before parsing, so you save the parsing. Yes, the document is still downloaded, but this approach guarantees it does not get committed.
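For illustration only, filtering out the redirecting URL itself with <referenceFilters> might look like the sketch below. The filter class is the one already referenced in the posted config; the pattern is an assumption and would reject every banner click URL on the site, not just click/15:

```xml
<referenceFilters>
  <!-- Reject banner click URLs before they are downloaded, so the redirect
       to Facebook is never triggered. The pattern is hypothetical. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="exclude">.*/component/banners/click/.*</filter>
</referenceFilters>
```

This only helps for redirecting URLs you already know about; any other internal URL that redirects off-site would still get through.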

If you know your site does not do internal redirects (redirect to a page within the same domain), a simpler workaround for now would be to tell the crawler to not follow redirects by setting maxRedirects to 0 in your <httpClientFactory>, like this:

<httpClientFactory>
    <maxRedirects>0</maxRedirects>
    <trustAllSSLCertificates>true</trustAllSSLCertificates>
</httpClientFactory>

csaezl commented 9 years ago

What I meant about using <referenceFilters> again is that it is faster.

You get one unwanted commit for https://www.facebook.com/fexbfederacionextremenabaloncesto, but the rest of the references to Facebook are REJECTED_FILTER.

essiembre commented 9 years ago

I have added a new configurable flag to GenericHttpClientFactory called ignoreExternalRedirects. You can set it up like this:

<httpClientFactory>
    <ignoreExternalRedirects>true</ignoreExternalRedirects>
    ...
</httpClientFactory>

With this flag, it will not proceed with a redirect if the target scheme, host name, or port is different. In such cases, it will log an INFO message about it being ignored (and the HTTP status being unsupported).

This approach ensures the redirect target never gets downloaded if not on the same site.

So I recommend you use this in combination with <extractor ignoreExternalLinks="true" /> that you are already using and your issue with Facebook being crawled should be gone.
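Put together, the relevant parts of the crawler config might look roughly like this. It is only a sketch that combines the two settings discussed in this thread with the values from the config posted above; everything else in the crawler is left out:

```xml
<crawler id="MC (crawler)">
  <httpClientFactory>
    <!-- Do not follow a redirect whose target scheme, host name, or port differs. -->
    <ignoreExternalRedirects>true</ignoreExternalRedirects>
    <trustAllSSLCertificates>true</trustAllSSLCertificates>
  </httpClientFactory>
  <linkExtractors>
    <!-- Do not queue links that point outside the site of the page they were found on. -->
    <extractor ignoreExternalLinks="true" />
  </linkExtractors>
  <!-- other crawler settings unchanged -->
</crawler>
```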

You'll need the latest snapshot release.

Please try and confirm.

csaezl commented 9 years ago

I've tried the new snapshot and got:

URL Redirect: http://www.fexb.es/index.php/component/banners/click/15 -> https://www.facebook.com/fexbfederacionextremenabaloncesto
...
MC (crawler): 2015-09-07 12:01:37 INFO -         DOCUMENT_IMPORTED: https://www.facebook.com/fexbfederacionextremenabaloncesto
MC (crawler): 2015-09-07 12:01:37 INFO -    DOCUMENT_COMMITTED_ADD: https://www.facebook.com/fexbfederacionextremenabaloncesto

I expected https://www.facebook.com/fexbfederacionextremenabaloncesto not to be committed.

csaezl commented 9 years ago

Sorry, ignore my previous post. I hadn't modified the config file properly.

csaezl commented 9 years ago

It works fine:

MC (crawler): 2015-09-07 15:40:12 INFO - Ignoring external redirect: http://www.fexb.es/index.php/component/banners/click/15 -> https://www.facebook.com/fexbfederacionextremenabaloncesto
MC (crawler): 2015-09-07 15:40:12 INFO -       REJECTED_BAD_STATUS: http://www.fexb.es/index.php/component/banners/click/15

essiembre commented 9 years ago

Great. Thanks for confirming.

Norconex / crawlers

ignoreExternalLinks="true" processes ExternalLinks #138