If your config points to different paths for all path-related settings, there is no reason why it should cause the ignoreExternalLinks
feature to fail. Can you share your config to help reproduce? And after roughly how many documents does it start crawling Facebook?
Just a thought: extractors need to be wrapped in a <linkExtractors>
tag. In other words, do you have it this way?
<linkExtractors>
  <extractor ignoreExternalLinks="true" />
</linkExtractors>
The first page from Facebook, which is actually:
DOCUMENT_FETCHED: https://www.facebook.com/fexbfederacionextremenabaloncesto
appears right after the "4% completed (38 processed/796 total)" message, followed by 4 more DOCUMENT_FETCHED events.
Here is the config file:
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="MC (collector)">
  #set($filterRegexRef = "com.norconex.collector.core.filter.impl.RegexReferenceFilter")
  #set($workdir = "D:/CRAWLER-MC/collectors/MC/work/")
  <progressDir>$workdir/progress</progressDir>
  <logsDir>$workdir/log</logsDir>
  <crawlerDefaults>
    <delay default="100" />
    <numThreads>3</numThreads>
    <maxDepth>-1</maxDepth>
    <maxDocuments>-1</maxDocuments>
    <keepDownloads>false</keepDownloads>
    <orphansStrategy>IGNORE</orphansStrategy>
    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
      <normalizations>removeDotSegments</normalizations>
    </urlNormalizer>
    <canonicalLinkDetector ignore="true" />
  </crawlerDefaults>
  <crawlers>
    <crawler id="MC (crawler)">
      <robotsTxt ignore="true" />
      <sitemap ignore="true" />
      <httpClientFactory>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
      </httpClientFactory>
      <workDir>$workdir</workDir>
      <linkExtractors>
        <extractor ignoreExternalLinks="true" />
      </linkExtractors>
      <startURLs>
        <url>http://www.fexb.es/</url>
      </startURLs>
      <importer>
        <!-- Max memory used for a single file. 10 MB by default. -->
        <maxFileCacheSize>10000000</maxFileCacheSize>
        <!-- Max memory for the sum of all files. 100 MB by default. -->
        <maxFilePoolCacheSize>100000000</maxFilePoolCacheSize>
      </importer>
      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/MC</solrURL>
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>content</targetContentField>
        <commitBatchSize>10</commitBatchSize>
        <queueDir>$workdir/queue</queueDir>
        <queueSize>100</queueSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
        <commitDisabled>true</commitDisabled>
        <solrUpdateURLParams>
          <param name="update.chain">langid</param>
        </solrUpdateURLParams>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
I was able to reproduce the problem with your config, and I have nailed down the culprit: http://www.fexb.es/index.php/component/banners/click/15
That URL produces a redirect to this Facebook page: https://www.facebook.com/fexbfederacionextremenabaloncesto
This happens because external URLs are excluded after they are extracted from a page, but before they are fetched. That URL is not excluded because it is still valid (on the original domain) at that time. It is only once it is fetched that the redirect takes place and the URL changes to a Facebook one.
We need to revisit how this feature is implemented. Since the "intent" of the feature is not respected, I will mark this as a bug.
It may take some time to find a clean solution, so in the meantime I recommend you add a RegexMetadataFilter on the document.reference
field to stick to your domain (this filter is triggered after a document is downloaded). It is best that you add it as a preParseHandler.
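For example, something along these lines should work. This is a sketch: I am assuming the regular expression goes in the filter tag body, and the regex itself is just an example for your site:

<importer>
  <preParseHandlers>
    <!-- Keep only documents whose final (post-redirect) URL is on the
         fexb.es domain; onMatch="include" rejects everything else. -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="include" field="document.reference">
      https?://www\.fexb\.es/.*
    </filter>
  </preParseHandlers>
  ...
</importer>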
Thank you for your support.
I've tried your advice but, although the filtered URLs are not committed, they are still almost fully processed, which wastes a lot of time.
So, until you get the right solution, I'll use <referenceFilters> again.
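For example, something like this sketch, reusing the $filterRegexRef variable already declared at the top of my config (the regex is just an example for this site):

<referenceFilters>
  <!-- Keep only references on the fexb.es domain; reject everything else. -->
  <filter class="$filterRegexRef" onMatch="include">
    https?://www\.fexb\.es/.*
  </filter>
</referenceFilters>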
<referenceFilters> won't do it in this case unless you filter out the .../component/banners/click/15 URL specifically.
The <referenceFilters> filtering takes place before a page gets downloaded (to save bandwidth and speed things up). At that time the URL is still valid (starting with http://www.fexb.es/), so it will be downloaded. It is when the download occurs that the URL changes because of the redirect, so using <referenceFilters> will not reject the Facebook page.
My recommendation filters it after download but before parsing, so you at least save the parsing. Yes, the document is still downloaded, but this approach guarantees it does not get committed.
If you know your site does not do internal redirects (redirects to pages within the same domain), a simpler workaround for now would be to tell the crawler not to follow redirects at all, by setting maxRedirects to 0 in your <httpClientFactory>, like this:
<httpClientFactory>
  <maxRedirects>0</maxRedirects>
  <trustAllSSLCertificates>true</trustAllSSLCertificates>
</httpClientFactory>
What I meant about using <referenceFilters> again is that it is faster. You get one unwanted commit for https://www.facebook.com/fexbfederacionextremenabaloncesto, but the rest of the Facebook references are REJECTED_FILTER.
I have added a new configurable flag to GenericHttpClientFactory called ignoreExternalRedirects. You can set it up like this:
<httpClientFactory>
  <ignoreExternalRedirects>true</ignoreExternalRedirects>
  ...
</httpClientFactory>
With this flag, it will not proceed with a redirect if the target scheme, host name, or port is different. In such cases, it will log an INFO message about it being ignored (and the HTTP status being unsupported).
This approach ensures the redirect target never gets downloaded if not on the same site.
So I recommend you use this in combination with the <extractor ignoreExternalLinks="true" /> setting you are already using, and your issue with Facebook being crawled should be gone.
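Put together, the relevant crawler settings would look like this sketch (the other elements stay as in the config you posted above):

<httpClientFactory>
  <ignoreExternalRedirects>true</ignoreExternalRedirects>
  <trustAllSSLCertificates>true</trustAllSSLCertificates>
</httpClientFactory>
<linkExtractors>
  <extractor ignoreExternalLinks="true" />
</linkExtractors>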
You'll need the latest snapshot release.
Please try and confirm.
I've tried the new snapshot and got:
URL Redirect: http://www.fexb.es/index.php/component/banners/click/15 -> https://www.facebook.com/fexbfederacionextremenabaloncesto
...
MC (crawler): 2015-09-07 12:01:37 INFO - DOCUMENT_IMPORTED: https://www.facebook.com/fexbfederacionextremenabaloncesto
MC (crawler): 2015-09-07 12:01:37 INFO - DOCUMENT_COMMITTED_ADD: https://www.facebook.com/fexbfederacionextremenabaloncesto
I expected https://www.facebook.com/fexbfederacionextremenabaloncesto not to have been committed.
Sorry, ignore my previous post. I hadn't modified the config file properly.
It works fine:
MC (crawler): 2015-09-07 15:40:12 INFO - Ignoring external redirect: http://www.fexb.es/index.php/component/banners/click/15 -> https://www.facebook.com/fexbfederacionextremenabaloncesto
MC (crawler): 2015-09-07 15:40:12 INFO - REJECTED_BAD_STATUS: http://www.fexb.es/index.php/component/banners/click/15
Great. Thanks for confirming.
I have a collector with
<extractor ignoreExternalLinks="true" />
that processes the URL http://www.fexb.es/. At some point it processes the https://www.facebook.com/r.php?locale=es_ES web page and others from the Facebook site. There is another collector running at the same time targeting different URLs.