Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or filesystems and store it in various data repositories, such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

MongoDB issue #827

Closed: sipgyanendumishra closed this issue 1 year ago

sipgyanendumishra commented 1 year ago

Hi team,

In the Norconex HTTP Collector, if I use the default MVStore data store, duplicate data is not crawled. But with the same configuration using MongoDB as the data store engine, I get duplicates. How do I fix this, and what is the cause?

(The attached XML configuration lost its markup during extraction; only the text content survived. The recoverable settings are:)

- XML declaration: <?xml version="1.0" encoding="UTF-8"?>
- Work directory: ./work
- Start URLs:
  - https://bengali.abplive.com/district/west-bengal-weather-updates-cyclone-mocha-storm-rain-or-temperature-increase-976034
  - https://bengali.abplive.com/district/ahead-of-the-panchayat-polls-isf-workers-join-tmc-976050
  - https://bengali.abplive.com/district/who-is-samaresh-majumdar-author-and-sahitya-akademi-award-winner-know-all-details-976395
- URL normalizations: removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters
- Other values: 1, 0, -1, IGNORE
- Data store connection string: mongodb://127.0.0.1:27017
- Reference filters: .*976034.*, .*976050.*, .*976395.*
- Extracted metadata: og:title, og:description, article:published_time, og:locale
- Field lists: headlineMeta, descriptionFromMeta, dateTimePostedFromMeta, LanguageMeta (and contentNormal)

This is my configuration file. What do I have to do to use MongoDB as the data store engine?
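(Editor's note: for reference, a minimal sketch of what the data store engine section might look like in an HTTP Collector v3 crawler config. The MongoDataStoreEngine class path and connectionString element follow the collector-core documentation but should be verified against your version; the crawler id "crawlBang" is taken from the logs below, and everything else is a placeholder.)

<!-- Sketch only: switch the crawler from the default MVStore engine to MongoDB. -->
<crawler id="crawlBang">
  <dataStoreEngine class="com.norconex.collector.core.store.impl.mongodb.MongoDataStoreEngine">
    <connectionString>mongodb://127.0.0.1:27017</connectionString>
  </dataStoreEngine>
  <!-- start URLs, importer, committer, etc. as in your existing config -->
</crawler>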

essiembre commented 1 year ago

Do you have a single crawler or multiple instances of the same crawler? Do you use sharded Mongo clusters?

I am unsure if it is related, but we have seen this happen with concurrent access to Mongo. It can be because MongoDB does not commit transactions right away, so changes are not reflected immediately in the crawler (like whether a document has been processed already).

Where do you store your data in your committer for production use? If it is a search engine or database, you typically won't get duplicates, since your committer target will ensure that. Unfortunately, I am unsure what MongoDB setup you could do on your end to improve this, or whether we can improve it in the code (suggestions or PRs are welcome).

sipgyanendumishra commented 1 year ago

Hi, I am using a single crawler instance with MongoDB as the data store engine. I also faced an issue when trying to use a MySQL DB as the data store engine: I checked the core code and saw that the query is not written in a generic way; it is written for Oracle DB.

Now, in the case of MongoDB, when I execute the crawler a second time, I get counts like this:

2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:92 - ---@V---No of data before cleaning table name =queued count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:93 - ---@V---No of data before cleaning table name =active count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:94 - ---@V---No of data before cleaning table name =processed count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:95 - ---@V---No of data before cleaning table name =cached count--3
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:135 - ---@V--- cached -> swap
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:140 - ---@V--- processed -> cached
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:145 - ---@V--- swap -> processed
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:148 - ---@V---No of data after swap table name =queued count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:149 - ---@V---No of data after swap table name =active count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:150 - ---@V---No of data after swap table name =processed count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:151 - ---@V---No of data after swap table name =cached count--0

In the case of the embedded H2 DB, if I execute the crawler a second time, the counts are:

2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:92 - ---@V---No of data before cleaning table name =queued count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:93 - ---@V---No of data before cleaning table name =active count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:94 - ---@V---No of data before cleaning table name =processed count--3
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:95 - ---@V---No of data before cleaning table name =cached count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:135 - ---@V--- cached -> swap
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:140 - ---@V--- processed -> cached
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:145 - ---@V--- swap -> processed
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:148 - ---@V---No of data after swap table name =queued count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:149 - ---@V---No of data after swap table name =active count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:150 - ---@V---No of data after swap table name =processed count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:151 - ---@V---No of data after swap table name =cached count--3

sipgyanendumishra commented 1 year ago

(Quotes @essiembre's reply above.)

Please check the next comment.

sipgyanendumishra commented 1 year ago

Hi @essiembre, I have seen a bug in the Norconex HTTP Collector: with the default data store engine, the table rename works fine, but when you use the rename with the MongoDB data store engine, it does not work properly. The rename is triggered in package com.norconex.collector.core.doc, class CrawlDocInfoService (public class CrawlDocInfoService implements Closeable).

UtsavVanodiya7 commented 1 year ago

Hi @sipgyanendumishra, have you tried adding a delay of 2-3 seconds? As Pascal mentioned, changes to MongoDB are not committed right away.
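(Editor's note: if this refers to the crawler's download delay, a minimal sketch of raising it in the XML config, assuming the stock GenericDelayResolver; in HTTP Collector v3 the value accepts durations such as "3 seconds", while older 2.x versions expect milliseconds, e.g. 3000.)

<!-- Sketch only: slows crawling so MongoDB writes have time to become
     visible before the next document is processed. -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
       default="3 seconds" />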

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.