Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

duplicate urls being processed in the same run #533

Closed dtcyad1 closed 5 years ago

dtcyad1 commented 5 years ago

Hi Pascal,

I am seeing a lot of duplicates being processed. The number after the colon shows how many times I see each URL committed. This is all happening in the same run. How can I prevent this from happening? Is there a setting that I missed?

https://www.example.com/es_ES/ser:4
https://www.example.com/de_DE/ser-rp:6
https://www.example.com/es_MX/hou:3
https://www.example.com/es_ES/abo:5
https://www.example.com/about/con:3
https://www.example.com/de_DE/abt:7

This is part of my config file:

https://www.example.com
  <!-- === Recommendations: ============================================ -->

  <!-- Specify a crawler default directory where to generate files. -->
  <workDir>./examples-output/minimum</workDir>

  <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
  <maxDepth>5</maxDepth>
  <!-- <maxDocuments>9</maxDocuments> -->
  <canonicalLinkDetector ignore="false" />

  <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
  <sitemapResolverFactory ignore="false" />

  <metadataChecksummer disabled="true" keep="false" targetField="collector.checksum-metadata" class="com.norconex.collector.http.checksum.impl.LastModifiedMetadataChecksummer" />

  <orphansStrategy>DELETE</orphansStrategy>

  <!-- Be as nice as you can to sites you crawl. -->
  <delay default="0" />

  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">jpg,gif,png,ico,css,js,svg</filter>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">de_DE</filter>
  </referenceFilters>

Thanks

essiembre commented 5 years ago

Where are the duplicates found? In the logs or in your committer target (e.g. search engine)?

If the latter, it could be that you need to make the URL your primary key (or unique ID) somehow. If a file changes or is recrawled as part of a new crawl, it will be committed again. This is expected behavior. What would be wrong is if it gets committed multiple times within the same crawling session. If so, do you have multiple collectors running at once, all sharing the same committer queueDir? This could be a cause as well. Please confirm.
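The primary-key point can be illustrated with a minimal sketch that is not tied to any particular search engine: when the target store is keyed by URL, committing the same document twice overwrites the first copy instead of creating a duplicate. The `commit` helper and the dict-based `index` are hypothetical stand-ins for a real committer target.

```python
# Hypothetical in-memory "index" standing in for a search engine
# or any other committer target.
index = {}

def commit(index, url, content):
    """Upsert a document, using its URL as the unique ID."""
    index[url] = content

commit(index, "https://www.example.com/about/con", "first version")
commit(index, "https://www.example.com/about/con", "second version")
commit(index, "https://www.example.com/es_ES/ser", "other page")

# Two distinct URLs were committed, so only two entries exist.
print(len(index))
```

A target that appends every commit under a generated ID would hold three entries here instead of two.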

dtcyad1 commented 5 years ago

Hi Pascal, yes, the commits are happening in the same crawling session. I am using the default file committer for debugging purposes, and I see the same URL in multiple ref files under the crawledFiles folder. Another issue I am seeing: as you mentioned for testing, I set the depth to 0 and it committed about 170 URLs. But when I changed the depth to -1, or even 1, 5, or 10, the count dropped to only about 50 URLs (although it should increase to more than 170).

Here is the public site that I am trying to crawl: https://www.qad.com/

Can you please try and crawl it and let me know if you can see what the issue is?

The only unusual part is this set of exclusion filters; everything else is pretty standard:

jpg,gif,png,ico,css,js,svg
https://www.qad.com/portal/site
https://www.qad.com/erp
.*\/es_MX\/.*
.*\/th_TH\/.*
.*\/zh_CN\/.*
.*\/ja_JP\/.*
.*\/de_DE\/.*
.*\/fr_FR\/.*
.*\/it_IT\/.*
.*\/en_IN\/.*
.*\/in_ID\/.*
.*\/pt_BR\/.*
.*\/pl_PL\/.*
.*\/es_ES\/.*
.*\/nl_NL\/.*
https://www.qad.com/terms-privacy/
.*\/legal\/.*
https://www.qad.com/about/news
https://www.qad.com/documents/3488095/3499669/qad-licensing.pdf
https://www.qad.com/documents/3488095/3499669/qad-licensing.pdf/*
https://www.qad.com/documents/3488095/3499669/
ttps://www.qad.com/documents/3488095/3499615/qad-erp-solutions-guide.pdf
^((http[s]?):\/)?\/?(www.qad.com)((\/)documents\/data-sheets)[\/]?.*$
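To sanity-check exclusion patterns like these outside the crawler, a small sketch along the following lines may help. It uses Python's `re` module, not the collector's own filter classes, and it assumes each pattern must match the whole URL; whether the collector's regex filter matches the whole reference or only part of it is an assumption to verify against its documentation. The `is_excluded` helper is hypothetical.

```python
import re

# A few of the exclusion patterns from the list above, copied as-is.
EXCLUDE_PATTERNS = [
    r".*\/de_DE\/.*",
    r".*\/legal\/.*",
    r"^((http[s]?):\/)?\/?(www.qad.com)((\/)documents\/data-sheets)[\/]?.*$",
]

def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if any pattern matches the entire URL (assumed semantics)."""
    return any(re.fullmatch(p, url) for p in patterns)

for url in [
    "https://www.qad.com/de_DE/abt",
    "https://www.qad.com/de_DE",
    "https://www.qad.com/documents/data-sheets/foo.pdf",
    "https://www.qad.com/about",
]:
    print(url, is_excluded(url))
```

Under the whole-URL assumption, `https://www.qad.com/de_DE` is not excluded, because the pattern requires a slash after `de_DE`.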

Thanks

essiembre commented 5 years ago

Can you check your logs to find out why you have fewer? Do you see several REJECTED_UNMODIFIED entries by any chance? If so, that is normal: on subsequent crawls, only documents that have changed (or are new or deleted) get committed. You can disable the document checksummer if you always want all docs reprocessed.
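One quick way to check for this is to tally the event names in the crawler log. The sketch below assumes log lines in the format of the excerpts posted in this thread (`INFO - EVENT_NAME: url`); the `REJECTED_UNMODIFIED` sample line is made up in that same format for illustration, and `tally_events` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches lines such as "... INFO - DOCUMENT_FETCHED: http://..."
EVENT_RE = re.compile(r"INFO\s+-\s+([A-Z_]+):\s+(\S+)")

def tally_events(lines):
    """Count how many times each INFO-level event name appears."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "website_test: 2018-11-19 16:15:45 DEBUG - Fetching HTTP headers: http://test/medicine/consultation",
    "website_test: 2018-11-19 16:15:47 INFO - DOCUMENT_METADATA_FETCHED: http://test/medicine/consultation",
    "website_test: 2018-11-19 16:15:49 INFO -          DOCUMENT_FETCHED: http://test/medicine/consultation",
    # Made-up line, assuming the event is logged like the others:
    "website_test: 2018-11-19 16:15:50 INFO - REJECTED_UNMODIFIED: http://test/medicine/other",
]

for event, count in sorted(tally_events(sample).items()):
    print(event, count)
```

A large `REJECTED_UNMODIFIED` count on a repeat crawl would account for a lower number of committed documents.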

If you suspect another issue, please attach your log.

dtcyad1 commented 5 years ago

Hi, I am seeing this behavior on other websites too.

website_test: 2018-11-19 16:15:11 DEBUG - ACCEPTED document reference. Reference=http://test/medicine/consultation Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test: 2018-11-19 16:15:11 DEBUG - Queued for processing: http://test/medicine/consultation
website_test: 2018-11-19 16:15:15 DEBUG - ACCEPTED document reference. Reference=http://test/medicine/consultation Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test: 2018-11-19 16:15:15 DEBUG - Already queued: http://test/medicine/consultation
website_test: 2018-11-19 16:15:18 DEBUG - ACCEPTED document reference. Reference=http://test/medicine/consultation Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test: 2018-11-19 16:15:18 DEBUG - Already queued: http://test/medicine/consultation
...

website_test: 2018-11-19 16:15:45 DEBUG - Fetching HTTP headers: http://test/medicine/consultation
website_test: 2018-11-19 16:15:45 DEBUG - Encoded URI: http://test/medicine/consultation
website_test: 2018-11-19 16:15:47 INFO - DOCUMENT_METADATA_FETCHED: http://test/medicine/consultation
website_test: 2018-11-19 16:15:47 DEBUG - Canonical URL detected is the same as document URL. Process normally. URL: http://test/medicine/consultation
website_test: 2018-11-19 16:15:47 DEBUG - ACCEPTED metadata checkum (new): Reference=http://test/medicine/consultation
website_test: 2018-11-19 16:15:47 DEBUG - Fetching document: http://test/medicine/consultation
website_test: 2018-11-19 16:15:47 DEBUG - Encoded URI: http://test/medicine/consultation
website_test: 2018-11-19 16:15:49 INFO -          DOCUMENT_FETCHED: http://test/medicine/consultation
...

and I see this pattern repeated multiple times - all in the same run and running just one thread.

As per your previous reply, it should not consider the same url again in the same run - but it is. I even have the url normaliser set to the values provided to see if that makes a difference - but it does not.
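For context on what fragment-removal normalization is meant to achieve, here is a minimal sketch using Python's standard library rather than the collector's own normalizer. The fragment URLs are made-up variants of the URL in the log above, and `strip_fragment` is a hypothetical helper.

```python
from urllib.parse import urldefrag

def strip_fragment(url):
    """Drop the '#...' part so variants of one page share a single reference."""
    return urldefrag(url).url

variants = [
    "http://test/medicine/consultation",
    "http://test/medicine/consultation#main-content",
    "http://test/medicine/consultation#",
]

# All three variants collapse to the same reference.
print({strip_fragment(u) for u in variants})
```

If normalization behaves this way and a URL is still fetched more than once, the cause is likely elsewhere.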

Thanks

dtcyad1 commented 5 years ago

Hi Pascal,

I noticed a couple of things that I thought might be responsible, like a # in the URL and some kind of node param in the URL (you can see it below). I have added filters and the URL normalizer to remove them, but I don't see what else is causing the URL to be re-fetched. I am attaching the text from the log file pertaining to this particular URL. I have sanitized it here, but I think it looks clean enough to debug. As you can see, this URL is fetched twice, in the same run and with 1 thread.

website_test.com: 2018-11-19 22:26:32 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:26:32 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:26:32 DEBUG - Queued for processing: http://test.com/content/request
website_test.com: 2018-11-19 22:26:36 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:26:36 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:26:36 DEBUG - Already queued: http://test.com/content/request
website_test.com: 2018-11-19 22:26:40 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:26:40 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:26:40 DEBUG - Already queued: http://test.com/content/request
website_test.com: 2018-11-19 22:26:43 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:26:43 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:26:43 DEBUG - Already queued: http://test.com/content/request
website_test.com: 2018-11-19 22:26:47 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:26:47 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:26:47 DEBUG - Already queued: http://test.com/content/request
website_test.com: 2018-11-19 22:26:50 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:26:50 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:26:50 DEBUG - Already queued: http://test.com/content/request
website_test.com: 2018-11-19 22:26:54 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:26:54 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:26:54 DEBUG - Already queued: http://test.com/content/request
website_test.com: 2018-11-19 22:26:57 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:26:57 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:26:57 DEBUG - Already queued: http://test.com/content/request
website_test.com: 2018-11-19 22:32:43 DEBUG - website_test.com: Processing reference: http://test.com/content/request
website_test.com: 2018-11-19 22:32:43 DEBUG - Fetching HTTP headers: http://test.com/content/request
website_test.com: 2018-11-19 22:32:43 DEBUG - Encoded URI: http://test.com/content/request
website_test.com: 2018-11-19 22:32:45 INFO - DOCUMENT_METADATA_FETCHED: http://test.com/content/request
website_test.com: 2018-11-19 22:32:45 DEBUG - Canonical URL detected is the same as document URL. Process normally. URL: http://test.com/content/request
website_test.com: 2018-11-19 22:32:45 DEBUG - ACCEPTED metadata checkum (new): Reference=http://test.com/content/request
website_test.com: 2018-11-19 22:32:45 DEBUG - Fetching document: http://test.com/content/request
website_test.com: 2018-11-19 22:32:45 DEBUG - Encoded URI: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 INFO - DOCUMENT_FETCHED: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - Canonical URL detected is the same as document URL. Process normally. URL: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - Canonical URL detected is the same as document URL. Process normally. URL: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - No meta robots found for: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 INFO - CREATED_ROBOTS_META: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - DOCUMENT URL ----> http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request#main-content Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:32:46 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request#main-content Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:32:46 DEBUG - Already being processed: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - URL modified from "http://test.com/content/request#main-content" to "http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request# Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:32:46 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request# Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:32:46 DEBUG - Already being processed: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - URL modified from "http://test.com/content/request#" to "http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 INFO - URLS_EXTRACTED: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - ACCEPTED metadata checkum (new): Reference=http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - Parser "com.norconex.importer.parser.impl.FallbackParser" about to parse "http://test.com/content/request".
website_test.com: 2018-11-19 22:32:46 INFO - DOCUMENT_IMPORTED: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - ACCEPTED document checkum (new): Reference=http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 INFO - DOCUMENT_COMMITTED_ADD: http://test.com/content/request
website_test.com: 2018-11-19 22:32:46 DEBUG - website_test.com: 00:00:03.084 to process: http://test.com/content/request
website_test.com: 2018-11-19 22:33:31 DEBUG - URL redirect: http://test.com/add/42912345?destination=node%2F3185&token=8F9qfxp1v3ex -> http://test.com/content/request
website_test.com: 2018-11-19 22:33:31 DEBUG - Redirect URL encountered a second time, re-queue it again (once) in case it came from a circular reference: http://test.com/content/request
website_test.com: 2018-11-19 22:33:31 DEBUG - website_test.com: Processing reference: http://test.com/content/request
website_test.com: 2018-11-19 22:33:31 DEBUG - Fetching HTTP headers: http://test.com/content/request
website_test.com: 2018-11-19 22:33:31 DEBUG - Encoded URI: http://test.com/content/request
website_test.com: 2018-11-19 22:33:33 INFO - DOCUMENT_METADATA_FETCHED: http://test.com/content/request
website_test.com: 2018-11-19 22:33:33 DEBUG - Canonical URL detected is the same as document URL. Process normally. URL: http://test.com/content/request
website_test.com: 2018-11-19 22:33:33 DEBUG - ACCEPTED metadata checkum (new): Reference=http://test.com/content/request
website_test.com: 2018-11-19 22:33:33 DEBUG - Fetching document: http://test.com/content/request
website_test.com: 2018-11-19 22:33:33 DEBUG - Encoded URI: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 INFO - DOCUMENT_FETCHED: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - Canonical URL detected is the same as document URL. Process normally. URL: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - Canonical URL detected is the same as document URL. Process normally. URL: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - No meta robots found for: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 INFO - CREATED_ROBOTS_META: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - DOCUMENT URL ----> http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request#main-content Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:33:34 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request#main-content Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:33:34 DEBUG - Already being processed: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - URL modified from "http://test.com/content/request#main-content" to "http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request# Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:33:34 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request# Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:33:34 DEBUG - Already being processed: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - URL modified from "http://test.com/content/request#" to "http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 INFO - URLS_EXTRACTED: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - ACCEPTED metadata checkum (new): Reference=http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - Parser "com.norconex.importer.parser.impl.FallbackParser" about to parse "http://test.com/content/request".
website_test.com: 2018-11-19 22:33:34 INFO - DOCUMENT_IMPORTED: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - ACCEPTED document checkum (new): Reference=http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 INFO - DOCUMENT_COMMITTED_ADD: http://test.com/content/request
website_test.com: 2018-11-19 22:33:34 DEBUG - website_test.com: 00:00:02.945 to process: http://test.com/content/request
website_test.com: 2018-11-19 22:36:48 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:36:48 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:36:48 DEBUG - Already processed: http://test.com/content/request
website_test.com: 2018-11-19 22:36:52 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:36:52 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:36:52 DEBUG - Already processed: http://test.com/content/request
website_test.com: 2018-11-19 22:36:55 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:36:55 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:36:55 DEBUG - Already processed: http://test.com/content/request
website_test.com: 2018-11-19 22:36:59 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:36:59 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:36:59 DEBUG - Already processed: http://test.com/content/request
website_test.com: 2018-11-19 22:37:03 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:37:03 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:37:03 DEBUG - Already processed: http://test.com/content/request
website_test.com: 2018-11-19 22:37:06 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:37:06 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:37:06 DEBUG - Already processed: http://test.com/content/request
website_test.com: 2018-11-19 22:37:10 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,css,js,svg,caseSensitive=false]
website_test.com: 2018-11-19 22:37:10 DEBUG - ACCEPTED document reference. Reference=http://test.com/content/request Filter=RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=destination=node]
website_test.com: 2018-11-19 22:37:10 DEBUG - Already processed: http://test.com/content/request
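The two things worth pulling out of a log like this are which URLs were committed more than once and which redirects point at them. A rough sketch for extracting both, assuming one log entry per line in the format shown above; `summarize_commits` is a hypothetical helper, and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter, defaultdict

COMMIT_RE = re.compile(r"DOCUMENT_COMMITTED_ADD:\s+(\S+)")
REDIRECT_RE = re.compile(r"URL redirect:\s+(\S+)\s+->\s+(\S+)")

def summarize_commits(lines):
    """Return (URLs committed more than once, redirect sources per target)."""
    commits = Counter()
    redirects = defaultdict(set)
    for line in lines:
        match = COMMIT_RE.search(line)
        if match:
            commits[match.group(1)] += 1
        match = REDIRECT_RE.search(line)
        if match:
            source, target = match.groups()
            redirects[target].add(source)
    duplicates = {url: n for url, n in commits.items() if n > 1}
    return duplicates, dict(redirects)

sample = [
    "website_test.com: 2018-11-19 22:32:46 INFO - DOCUMENT_COMMITTED_ADD: http://test.com/content/request",
    "website_test.com: 2018-11-19 22:33:31 DEBUG - URL redirect: http://test.com/add/42912345?destination=node%2F3185&token=8F9qfxp1v3ex -> http://test.com/content/request",
    "website_test.com: 2018-11-19 22:33:34 INFO - DOCUMENT_COMMITTED_ADD: http://test.com/content/request",
]

duplicates, redirects = summarize_commits(sample)
print(duplicates)
print(redirects)
```

In this excerpt the second commit follows a redirect to a URL that had already been processed.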

Thanks -yogesh

essiembre commented 5 years ago

That is very odd. Can you share your config? If sensitive, you can always send it by email and reference this ticket.

dtcyad1 commented 5 years ago

Hi Pascal,

I really appreciate your help on this. What is the email I can use to send you the details?

Thanks -yogesh

essiembre commented 5 years ago

My email is in my Github user profile. :-)

essiembre commented 5 years ago

I was able to reproduce thanks to your config. It turns out the issue occurs when more than one page redirects to the same URL. The latest snapshot (https://www.norconex.com/collectors/collector-http/download) has a fix for that.

The fix is not 100% fool-proof as it memory-caches up to 10,000 redirected URLs that were successfully committed. The next major release will likely have a more robust approach, but until then, the current fix should address 99.99% of cases. For massive crawls with a huge number of redirects, you may still get a few documents committed twice (which is not a problem for most).
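To make the trade-off concrete, here is a minimal sketch of a size-bounded "already seen" cache. It is not the collector's actual code: the class name `BoundedSeenSet` and the oldest-first eviction policy are assumptions for illustration. It shows why a URL can slip through again once more distinct redirect targets have been seen than the cache can hold.

```python
from collections import OrderedDict

class BoundedSeenSet:
    """Remembers up to max_size items, evicting the oldest first (assumed policy)."""

    def __init__(self, max_size=10_000):
        self.max_size = max_size
        self._items = OrderedDict()

    def add(self, item):
        """Record item; return True if it was new, False if still remembered."""
        if item in self._items:
            self._items.move_to_end(item)  # refresh its position
            return False
        self._items[item] = None
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the oldest entry
        return True

# Tiny capacity so the eviction is easy to see.
seen = BoundedSeenSet(max_size=2)
results = [
    seen.add("http://test.com/content/request"),  # new
    seen.add("http://test.com/content/request"),  # duplicate, skipped
    seen.add("http://test.com/a"),                # new
    seen.add("http://test.com/b"),                # new, evicts the first URL
    seen.add("http://test.com/content/request"),  # forgotten, treated as new
]
print(results)
```

With a capacity of 10,000 the last case only arises on very large crawls with many distinct redirect targets, which matches the caveat above.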

You may know already, but your site happens to have lots of redirects, mainly from http to https. You could reduce the number of redirects by ensuring all your links are https.

Please confirm the new snapshot works for you.

dtcyad1 commented 5 years ago

Hi Pascal,

I will test this out and let you know.

Thanks!!

dtcyad1 commented 5 years ago

Hi Pascal,

Thanks for the fix. The test site results work as expected: I did not see any duplicate URLs in the same run. I will test this on a couple of other sites, but you can go ahead and close this!

Really appreciate your fast turnaround on this.

Thanks

essiembre commented 5 years ago

Thanks for confirming!