csaezl closed this issue 9 years ago
They should be committed at most once per run. Can you provide enough details to reproduce?
I got this result running a configuration file twice with Norconex HTTP Collector 2.2.0, with a slight difference between runs.
The first run with these parameters:
<numThreads>3</numThreads>
<url>http://www.agcex.org/</url>
<filter class="$filterRegexRef" onMatch="include">http://www\.agcex\.org/.*</filter>
I ran the configuration file with start and noticed that a lot of references with the URL http://agcex.org/ were rejected. I wanted these references to be processed, so I stopped the run and changed the filter to:
<filter class="$filterRegexRef" onMatch="include">http://.*agcex\.org/.*</filter>
I know it is not the most accurate expression for matching only subdomains or www., but it served the purpose. So I ran the configuration file again, with start.
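To illustrate why the second expression is looser than intended, here is a minimal Python sketch comparing the two filters from this thread with a tighter alternative. The TIGHTER pattern and the test URLs are my own assumptions, not something from the Norconex documentation, and the sketch assumes the filter has to match the whole reference.

```python
import re

# Filter from the first run: only the www host is accepted.
FIRST = re.compile(r"http://www\.agcex\.org/.*")
# Filter from the second run: the leading ".*" accepts anything before "agcex.org/".
SECOND = re.compile(r"http://.*agcex\.org/.*")
# Hypothetical tighter alternative: the bare domain or any of its subdomains.
TIGHTER = re.compile(r"http://([a-z0-9-]+\.)*agcex\.org/.*")


def accepted(pattern, url):
    """Return True when the whole URL matches the pattern (assumed filter semantics)."""
    return pattern.fullmatch(url) is not None


if __name__ == "__main__":
    for url in (
        "http://www.agcex.org/",
        "http://agcex.org/_sites/teleformacion/",
        "http://notagcex.org/",  # made-up host, for illustration only
    ):
        print(url, accepted(FIRST, url), accepted(SECOND, url), accepted(TIGHTER, url))
```

The second filter also accepts the made-up host notagcex.org, while the tighter variant accepts only agcex.org and its subdomains.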
The new references (without www.) were not rejected this time but I noticed a new issue, that is, a lot of references were duplicated. They were duplicated 2, 3 and more times for the events I mentioned in my previous post.
Some samples of references with duplicates:
DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=month&cal_d=1&cal_m=1&cal_y=1057
DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/help.php?module=assignment&file=allowdeleting.html&forcelang=en_utf8
DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming&course=1
DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
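For anyone who wants to check their own log for the same symptom, a small Python sketch like the following could count how often each reference appears in a DOCUMENT_COMMITTED_ADD event. It assumes the log line layout shown in this thread (event name, a colon, then the URL); the function name and the file path in the usage note are hypothetical.

```python
import re
from collections import Counter

# Assumes log lines shaped like: "... INFO - DOCUMENT_COMMITTED_ADD: <url>"
COMMIT = re.compile(r"DOCUMENT_COMMITTED_ADD: (\S+)")


def duplicate_commits(lines):
    """Map each reference committed more than once to its commit count."""
    counts = Counter()
    for line in lines:
        match = COMMIT.search(line)
        if match:
            counts[match.group(1)] += 1
    return {url: n for url, n in counts.items() if n > 1}
```

Usage would be along the lines of opening the crawler log (for example `open("crawler.log", encoding="utf-8")`, path assumed) and passing the file object to `duplicate_commits`.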
So you see this sample line appear multiple times exactly as is in the same log file?
DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
Changing the config and rerunning over a crawl database created with a different config can produce unexpected results. Whenever you change the config, it is always best to clear your working directory (crawl db) and start fresh. Can you do this and see if it happens again?
This is an excerpt from the second run log for http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO - DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO - CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO - REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-07 12:54:25 INFO - REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-07 12:54:25 INFO - URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO - DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO - DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO - DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO - CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO - REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-07 12:54:25 INFO - REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-07 12:54:25 INFO - URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO - DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO - DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
...
MC (crawler): 2015-08-08 00:01:19 INFO - DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:19 INFO - CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:19 INFO - REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-08 00:01:19 INFO - REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-08 00:01:19 INFO - URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:19 INFO - DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:19 INFO - DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:19 INFO - DOCUMENT_FETCHED: http://www.agcex.org/informacion-sobre-cultura/tag/Proyecto%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252
52525252525252525252525252525252520Atalaya.html
MC (crawler): 2015-08-08 00:01:19 INFO - CREATED_ROBOTS_META: http://www.agcex.org/informacion-sobre-cultura/tag/Proyecto%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525
25252525252525252525252525252525252520Atalaya.html
MC (crawler): 2015-08-08 00:01:20 INFO - REJECTED_FILTER: https://www.facebook.com/AGCEX
MC (crawler): 2015-08-08 00:01:20 INFO - REJECTED_FILTER: http://www.cyxmedia.com/funkyjoomla
MC (crawler): 2015-08-08 00:01:20 INFO - REJECTED_FILTER: https://twitter.com/AgcexOrg
MC (crawler): 2015-08-08 00:01:20 INFO - REJECTED_FILTER: https://plus.google.com/u/0/+JavierMendozacyxmedia
MC (crawler): 2015-08-08 00:01:20 INFO - REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-08 00:01:20 INFO - URLS_EXTRACTED: http://www.agcex.org/informacion-sobre-cultura/tag/Proyecto%25252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252
525252525252525252525252525252520Atalaya.html
MC (crawler): 2015-08-08 00:01:20 INFO - DOCUMENT_IMPORTED: http://www.agcex.org/informacion-sobre-cultura/tag/Proyecto%25252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525
252525252525252525252525252525252520Atalaya.html
MC (crawler): 2015-08-08 00:01:20 INFO - DOCUMENT_COMMITTED_ADD: http://www.agcex.org/informacion-sobre-cultura/tag/Proyecto%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252
52525252525252525252525252525252525252520Atalaya.html
MC (crawler): 2015-08-08 00:01:20 INFO - DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:20 INFO - CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:20 INFO - REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-08 00:01:20 INFO - REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-08 00:01:20 INFO - URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:20 INFO - DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:20 INFO - DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
...
MC (crawler): 2015-08-09 00:00:19 INFO - DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO - CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO - REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-09 00:00:19 INFO - REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-09 00:00:19 INFO - URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO - DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO - DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO - DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO - CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO - REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-09 00:00:19 INFO - REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-09 00:00:19 INFO - URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO - DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO - DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
It's hard to understand that just changing the filter for the second run (with start) can produce unexpected results. Is this really the way HTTP Collector behaves?
Having cleared the working directory, I get the same result, that is, URLs processed several times. I have a small log produced with 2.3.0-SNAPSHOT, in case you want to have a look at it.
Unless my system has become crazy, you should get the same result with the URL and filter I've used for testing.
Please provide your full config. I could not reproduce with 2.3.0-SNAPSHOT with a maxDocuments of 100.
I followed your steps. Started with your first filter and stopped it after some time. Then I modified to have your second filter instead and started it again with the action "start" (as opposed to "resume").
I found no duplicate DOCUMENT_COMMITTED_ADD events in either the first log or the second.
Does it happen after a lot of documents in your case? How many?
In my case, I've seen the duplicates for http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming in the log on line 32xx (I've run the test twice). So, to be sure, you should run the crawler until you get 4000 lines in the log.
Here is my config file:
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="MC (collector)">
#set($filterRegexRef = "com.norconex.collector.core.filter.impl.RegexReferenceFilter")
#set($workdir = "C:/CRAWLER/collectors/MC/work/")
<progressDir>$workdir/progress</progressDir>
<logsDir>$workdir/log</logsDir>
<crawlerDefaults>
<delay default="100" />
<numThreads>3</numThreads>
<maxDepth>-1</maxDepth>
<maxDocuments>-1</maxDocuments>
<keepDownloads>false</keepDownloads>
<orphansStrategy>IGNORE</orphansStrategy>
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
<normalizations>removeDotSegments</normalizations>
</urlNormalizer>
<canonicalLinkDetector
ignore="false">
</canonicalLinkDetector>
</crawlerDefaults>
<crawlers>
<crawler id="MC (crawler)">
<robotsTxt ignore="true" />
<sitemap ignore="true" />
<httpClientFactory>
<trustAllSSLCertificates>true</trustAllSSLCertificates>
</httpClientFactory>
<workDir>$workdir</workDir>
<startURLs>
<url>http://www.agcex.org/</url>
</startURLs>
<referenceFilters>
<!-- First run -->
<!-- <filter class="$filterRegexRef" onMatch="include">http://www\.agcex\.org/.*</filter> -->
<!-- Second run -->
<!-- <filter class="$filterRegexRef" onMatch="include">http://.*agcex\.org/.*</filter> -->
</referenceFilters>
<importer>
<!-- max memory used for a single file. 10MB by default -->
<maxFileCacheSize>10000000</maxFileCacheSize>
<!-- max memory for the sum of all files. 100MB by default -->
<maxFilePoolCacheSize>100000000</maxFilePoolCacheSize>
</importer>
<committer class="com.norconex.committer.solr.SolrCommitter">
<solrURL>http://localhost:8983/solr/MC</solrURL>
<sourceReferenceField keep="false">document.reference</sourceReferenceField>
<targetReferenceField>id</targetReferenceField>
<targetContentField>content</targetContentField>
<commitBatchSize>10</commitBatchSize>
<queueDir>$workdir/queue</queueDir>
<queueSize>100</queueSize>
<maxRetries>2</maxRetries>
<maxRetryWait>5000</maxRetryWait>
<solrUpdateURLParams>
<param name="update.chain">langid</param>
</solrUpdateURLParams>
</committer>
</crawler>
</crawlers>
</httpcollector>
Does it only happen when you stop, change your rules, and start again? Or does it happen on a "normal" run as well? In other words, do the duplicates always appear for you no matter what?
Yes, that's the point. It's not necessary to do the first run. You get the duplicates with the second run alone, that is, with <filter class="$filterRegexRef" onMatch="include">http://.*agcex\.org/.*</filter>.
I've tried with 1 thread but it also produces duplicates. So, I presume the fault is in the filter. Is it possible?
I was able to reproduce and found the cause. It is a case of different URLs redirecting to the same target URL. Since they are initially different, the crawler thinks they are different documents. I actually thought this case was already handled. I will investigate.
Two different URLs example (showglobal vs showcourses):
They both redirect to: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
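To make the cause concrete, here is a rough Python sketch of one way a crawler could avoid committing the same redirect target twice. This is not the actual Norconex fix; the function, the redirect map standing in for real HTTP fetches, and the example.com URLs are all hypothetical, and only single-hop redirects are considered.

```python
def crawl(queue, redirects):
    """Return the references committed, each at most once.

    `redirects` is a hypothetical map of URL -> redirect target that
    stands in for real HTTP fetches (single-hop redirects only).
    """
    processed = set()
    committed = []
    for url in queue:
        if url in processed:
            continue
        processed.add(url)
        target = redirects.get(url, url)
        if target != url:
            # The source URLs differ, but the target may already be handled.
            if target in processed:
                continue
            processed.add(target)
        committed.append(target)
    return committed


if __name__ == "__main__":
    # Made-up URLs: two different sources redirecting to one target.
    redirects = {
        "http://example.com/set.php?var=showglobal": "http://example.com/view.php",
        "http://example.com/set.php?var=showcourses": "http://example.com/view.php",
    }
    print(crawl(list(redirects), redirects))
```

Without the check on the redirect target, both source URLs would lead to a commit of the same document, which matches the duplicates seen in the log.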
Thank you for your support. I'm glad to read you were able to reproduce the issue.
I just created a new 2.3.0 snapshot release. It fixes the issue of many URLs redirecting to the same URL and causing duplicate commits.
Please try and let me know.
It seems to work. I've searched for duplicates for several URLs that duplicated on the previous version, and now they are committed just once.
Great. Thanks for confirming.
After running a crawler with
<numThreads>3</numThreads>
and just one URL, I have analysed the log and noticed that several URLs are processed several times via the events: DOCUMENT_FETCHED, CREATED_ROBOTS_META, URLS_EXTRACTED, DOCUMENT_IMPORTED, DOCUMENT_COMMITTED_ADD. That is, they are committed several times in the same run. Is there anything I'm doing wrong? Is there a way of preventing it and committing each URL just one time?