Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

One URL COMMITTED several times in a crawler run #135

Closed (csaezl closed 9 years ago)

csaezl commented 9 years ago

After running a crawler with <numThreads>3</numThreads> and just one start URL, I analysed the log and noticed that several URLs are processed more than once, going through the events DOCUMENT_FETCHED, CREATED_ROBOTS_META, URLS_EXTRACTED, DOCUMENT_IMPORTED, DOCUMENT_COMMITTED_ADD each time. That is, they are committed several times in the same run.

Is there anything I'm doing wrong? Is there a way to prevent this and commit each URL just once?

essiembre commented 9 years ago

They should be Committed just one time per run, max. Can you provide enough details to reproduce?

csaezl commented 9 years ago

I got this result by running a configuration file twice with Norconex HTTP Collector 2.2.0, with a slight difference between runs.

The first run with these parameters:

<numThreads>3</numThreads>
<url>http://www.agcex.org/</url>
<filter class="$filterRegexRef" onMatch="include">http://www\.agcex\.org/.*</filter>

I ran the configuration file with start and noticed that many references with the URL http://agcex.org/ were rejected. I wanted these references to be processed, so I stopped the run and changed the filter to:

<filter class="$filterRegexRef" onMatch="include">http://.*agcex\.org/.*</filter>

I know it is not the most accurate filter expression for matching only subdomains or www., but it served the purpose. So I ran the configuration file again, with start.
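A side note on the filter above: the following is a minimal Python sketch of why `http://.*agcex\.org/.*` is looser than intended, together with one possible stricter alternative. The `STRICT` pattern, the `accepted` helper, and the two non-agcex sample URLs are illustrative assumptions, not part of the original report. `fullmatch` assumes the filter has to match the entire reference, and Python's `re` is used only to demonstrate the pattern behaviour, so the result should be checked against the Java regex engine the collector uses.

```python
import re

# Loose pattern from the second run: the ".*" before "agcex" matches anything,
# including unrelated hosts or paths that merely contain "agcex.org/".
LOOSE = re.compile(r"http://.*agcex\.org/.*")

# Stricter alternative (an assumption, not taken from the thread):
# zero or more dot-terminated subdomain labels, then the exact domain.
STRICT = re.compile(r"http://([A-Za-z0-9-]+\.)*agcex\.org/.*")


def accepted(pattern, url):
    """Return True when the pattern matches the whole URL."""
    return pattern.fullmatch(url) is not None


SAMPLES = [
    "http://www.agcex.org/",
    "http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming",
    "http://notagcex.org/",                # hypothetical unrelated host
    "http://example.com/?ref=agcex.org/",  # hypothetical unrelated host
]

for url in SAMPLES:
    print(url, "loose:", accepted(LOOSE, url), "strict:", accepted(STRICT, url))
```

Both patterns accept the two agcex.org URLs; only the loose one also accepts the two hypothetical unrelated URLs.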

The new references (without www.) were not rejected this time but I noticed a new issue, that is, a lot of references were duplicated. They were duplicated 2, 3 and more times for the events I mentioned in my previous post.

Some samples of references with duplicates:

DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=month&cal_d=1&cal_m=1&cal_y=1057
DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/help.php?module=assignment&file=allowdeleting.html&forcelang=en_utf8
DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming&course=1
DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming

essiembre commented 9 years ago

So you see this sample line appear multiple times exactly as is in the same log file?

DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming

Changing the config and rerunning over a crawl database created with a different config can produce unexpected results. Whenever you change the config, it is always best to clear your working directory (crawl db) and start fresh. Can you do this and see if it happens again?

csaezl commented 9 years ago

This is an excerpt from the second run log for http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming

MC (crawler): 2015-08-07 12:54:25 INFO -          DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO -       CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO -           REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-07 12:54:25 INFO -           REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-07 12:54:25 INFO -            URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO -         DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO -    DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO -          DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO -       CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO -           REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-07 12:54:25 INFO -           REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-07 12:54:25 INFO -            URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO -         DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-07 12:54:25 INFO -    DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
...
MC (crawler): 2015-08-08 00:01:19 INFO -          DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:19 INFO -       CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:19 INFO -           REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-08 00:01:19 INFO -           REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-08 00:01:19 INFO -            URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:19 INFO -         DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:19 INFO -    DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:19 INFO -          DOCUMENT_FETCHED: http://www.agcex.org/informacion-sobre-cultura/tag/Proyecto%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525
25252525252525252525252525252525252525252520Atalaya.html
MC (crawler): 2015-08-08 00:01:19 INFO -       CREATED_ROBOTS_META: http://www.agcex.org/informacion-sobre-cultura/tag/Proyecto%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525
25252525252525252525252525252525252525252520Atalaya.html
MC (crawler): 2015-08-08 00:01:20 INFO -           REJECTED_FILTER: https://www.facebook.com/AGCEX
MC (crawler): 2015-08-08 00:01:20 INFO -           REJECTED_FILTER: http://www.cyxmedia.com/funkyjoomla
MC (crawler): 2015-08-08 00:01:20 INFO -           REJECTED_FILTER: https://twitter.com/AgcexOrg
MC (crawler): 2015-08-08 00:01:20 INFO -           REJECTED_FILTER: https://plus.google.com/u/0/+JavierMendozacyxmedia
MC (crawler): 2015-08-08 00:01:20 INFO -           REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-08 00:01:20 INFO -            URLS_EXTRACTED: http://www.agcex.org/informacion-sobre-cultura/tag/Proyecto%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525
25252525252525252525252525252525252525252520Atalaya.html
MC (crawler): 2015-08-08 00:01:20 INFO -         DOCUMENT_IMPORTED: http://www.agcex.org/informacion-sobre-cultura/tag/Proyecto%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525
25252525252525252525252525252525252525252520Atalaya.html
MC (crawler): 2015-08-08 00:01:20 INFO -    DOCUMENT_COMMITTED_ADD: http://www.agcex.org/informacion-sobre-cultura/tag/Proyecto%252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525252525
25252525252525252525252525252525252525252520Atalaya.html
MC (crawler): 2015-08-08 00:01:20 INFO -          DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:20 INFO -       CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:20 INFO -           REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-08 00:01:20 INFO -           REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-08 00:01:20 INFO -            URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:20 INFO -         DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-08 00:01:20 INFO -    DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
...
MC (crawler): 2015-08-09 00:00:19 INFO -          DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO -       CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO -           REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-09 00:00:19 INFO -           REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-09 00:00:19 INFO -            URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO -         DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO -    DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO -          DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO -       CREATED_ROBOTS_META: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO -           REJECTED_FILTER: http://moodle.org
MC (crawler): 2015-08-09 00:00:19 INFO -           REJECTED_FILTER: http://www.agcex.org
MC (crawler): 2015-08-09 00:00:19 INFO -            URLS_EXTRACTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO -         DOCUMENT_IMPORTED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
MC (crawler): 2015-08-09 00:00:19 INFO -    DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming

It's hard to understand how merely changing the filter before the second run (with start) could produce unexpected results. Is this really the way HTTP Collector behaves?

After clearing the working directory, I get the same result, that is, URLs processed several times. I have a small log produced with 2.3.0-SNAPSHOT, in case you want to have a look at it.

Unless something is wrong with my system, you should get the same result with the URL and filter I've used for testing.
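For anyone checking their own logs for the same symptom, here is a minimal Python sketch that counts DOCUMENT_COMMITTED_ADD events per reference. It assumes log lines shaped like the excerpt above (`... INFO -   EVENT_NAME: reference`); the `duplicate_commits` helper is a hypothetical name, not something the collector provides.

```python
import re
from collections import Counter

# Captures the event name and the reference from lines shaped like the excerpt.
EVENT_LINE = re.compile(r"INFO -\s+(?P<event>[A-Z_]+): (?P<ref>\S+)")


def duplicate_commits(lines):
    """Return {reference: count} for references committed more than once."""
    counts = Counter()
    for line in lines:
        match = EVENT_LINE.search(line)
        if match and match.group("event") == "DOCUMENT_COMMITTED_ADD":
            counts[match.group("ref")] += 1
    return {ref: n for ref, n in counts.items() if n > 1}


# Two lines copied from the excerpt, plus one non-commit event for contrast.
sample = [
    "MC (crawler): 2015-08-07 12:54:25 INFO -    DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming",
    "MC (crawler): 2015-08-07 12:54:25 INFO -          DOCUMENT_FETCHED: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming",
    "MC (crawler): 2015-08-07 12:54:25 INFO -    DOCUMENT_COMMITTED_ADD: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming",
]

for ref, n in duplicate_commits(sample).items():
    print(n, ref)
```

The function accepts any iterable of lines, so an open log file can be passed in place of `sample`.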

essiembre commented 9 years ago

Please provide your full config. I could not reproduce this with 2.3.0-SNAPSHOT and a maxDocuments of 100.

I followed your steps. Started with your first filter and stopped it after some time. Then I modified to have your second filter instead and started it again with the action "start" (as opposed to "resume").

I found no duplicate DOCUMENT_COMMITTED_ADD entries in either the first log or the second.

Does it happen after a lot of documents in your case? How many?

csaezl commented 9 years ago

In my case, I've seen the duplicates for http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming in the log on line 32xx (I've run the test twice). So, to be sure, you should run the crawler until you get 4000 lines in the log.

Here is my config file:

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="MC (collector)">
  #set($filterRegexRef  = "com.norconex.collector.core.filter.impl.RegexReferenceFilter")
  #set($workdir = "C:/CRAWLER/collectors/MC/work/")

  <progressDir>$workdir/progress</progressDir>
  <logsDir>$workdir/log</logsDir>

  <crawlerDefaults>
      <delay default="100" />
      <numThreads>3</numThreads>
      <maxDepth>-1</maxDepth>
      <maxDocuments>-1</maxDocuments>
      <keepDownloads>false</keepDownloads>
      <orphansStrategy>IGNORE</orphansStrategy>
      <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>removeDotSegments</normalizations>
      </urlNormalizer>    
      <canonicalLinkDetector 
              ignore="false">
      </canonicalLinkDetector>
  </crawlerDefaults>

  <crawlers>
    <crawler id="MC (crawler)">
      <robotsTxt ignore="true" />
      <sitemap ignore="true" />
      <httpClientFactory>
            <trustAllSSLCertificates>true</trustAllSSLCertificates>
      </httpClientFactory>
      <workDir>$workdir</workDir>
      <startURLs>
        <url>http://www.agcex.org/</url>      
      </startURLs>
      <referenceFilters>
        <!-- First run -->
<!--            <filter class="$filterRegexRef" onMatch="include">http://www\.agcex\.org/.*</filter>       -->
        <!-- Second run -->
<!--            <filter class="$filterRegexRef" onMatch="include">http://.*agcex\.org/.*</filter>   -->
      </referenceFilters>

      <importer>
        <!-- max memory used for a single file. 10MB by default -->
        <maxFileCacheSize>10000000</maxFileCacheSize>
        <!-- max memory for the sum of all files.  100MB by default -->
        <maxFilePoolCacheSize>100000000</maxFilePoolCacheSize>  
      </importer>

      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/MC</solrURL>
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>content</targetContentField>
        <commitBatchSize>10</commitBatchSize>
        <queueDir>$workdir/queue</queueDir>
        <queueSize>100</queueSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
        <solrUpdateURLParams>
          <param name="update.chain">langid</param>
        </solrUpdateURLParams>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>

essiembre commented 9 years ago

Does it only happen when you stop, change your rules, and start again? Or does it happen on a "normal" run as well? In other words, do the duplicates always appear for you no matter what?

csaezl commented 9 years ago

Yes, that's the case. The first run isn't necessary: you get the duplicates with the second run alone, that is, with <filter class="$filterRegexRef" onMatch="include">http://.*agcex\.org/.*</filter>.

I've tried with 1 thread, but it also produces duplicates. So I presume the fault is in the filter. Is that possible?

essiembre commented 9 years ago

I was able to reproduce and found the cause. It is a case of different URLs redirecting to the same target URL. Since they are initially different, the crawler thinks they are different documents. I actually thought this case was already handled. I will investigate.

Example of two different URLs (showglobal vs. showcourses):

They both redirect to: http://agcex.org/_sites/teleformacion/calendar/view.php?view=upcoming
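The cause described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the fix that went into the collector: the `redirects` dictionary stands in for real HTTP redirects, the example.com URLs are hypothetical, and `resolve`, `commit_by_source`, and `commit_by_target` are made-up helper names.

```python
def resolve(url, redirects, max_hops=10):
    """Follow the simulated redirect chain and return the final URL."""
    seen = {url}
    for _ in range(max_hops):
        target = redirects.get(url)
        if target is None or target in seen:  # no redirect, or a loop
            return url
        seen.add(target)
        url = target
    return url


def commit_by_source(urls, redirects):
    """De-duplicate on the requested URL only: a shared target is committed twice."""
    seen, committed = set(), []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        committed.append(resolve(url, redirects))
    return committed


def commit_by_target(urls, redirects):
    """De-duplicate on the final URL: each target is committed at most once."""
    seen, committed = set(), []
    for url in urls:
        target = resolve(url, redirects)
        if target in seen:
            continue
        seen.add(target)
        committed.append(target)
    return committed


# Hypothetical URLs: two distinct sources redirecting to one target.
TARGET = "http://example.com/calendar?view=upcoming"
REDIRECTS = {
    "http://example.com/calendar?view=a": TARGET,
    "http://example.com/calendar?view=b": TARGET,
}
URLS = list(REDIRECTS)

print(commit_by_source(URLS, REDIRECTS))
print(commit_by_target(URLS, REDIRECTS))
```

Tracking the redirect target, and not only the URL that was requested, is what keeps the second source from producing a second commit in this sketch.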

csaezl commented 9 years ago

Thank you for your support. I'm glad to read you were able to reproduce the issue.

essiembre commented 9 years ago

I just created a new 2.3.0 snapshot release. It fixes the issue where several URLs redirecting to the same URL caused duplicate commits.

Please try and let me know.

csaezl commented 9 years ago

It seems to work. I've searched for duplicates of several URLs that were duplicated in the previous version, and now they are committed just once.

essiembre commented 9 years ago

Great. Thanks for confirming.