Closed: sipgyanendumishra closed this issue 1 year ago.
Do you have a single crawler or multiple instances of the same crawler? Do you use sharded Mongo clusters?
I am unsure if it is related, but we have seen this happen with concurrent access to Mongo. It can be because MongoDB does not commit transactions right away, so changes are not reflected immediately in the crawler (like whether a document has been processed already).
Where do you store your data in your committer for production use? If in a search engine or database, you typically won't get duplicates, since your committer target will ensure that. Unfortunately, I am unsure what MongoDB setup you can do on your end to improve this, or whether we can improve it in the code (suggestions or PRs are welcome).
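If it turns out to be a visibility issue, one thing worth experimenting with is stricter write/read concerns on the Mongo connection, so that acknowledged writes are visible to subsequent reads. A minimal sketch with the official MongoDB Java driver; the connection string and database name are placeholders, not taken from this project:

```java
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.ReadConcern;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;

public class MongoConcernExample {
    public static void main(String[] args) {
        // Majority write/read concerns make acknowledged writes visible to
        // subsequent majority reads, at the cost of some extra latency.
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(
                        new ConnectionString("mongodb://localhost:27017"))
                .writeConcern(WriteConcern.MAJORITY.withJournal(true))
                .readConcern(ReadConcern.MAJORITY)
                .build();

        try (MongoClient client = MongoClients.create(settings)) {
            MongoDatabase db = client.getDatabase("crawler"); // placeholder
            System.out.println("Connected to " + db.getName()
                    + " with majority write/read concerns");
        }
    }
}
```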
Hi, I am using a single crawler instance with MongoDB as the data store engine. One more issue I faced: when I tried to use MySQL as the data store engine, I also ran into problems. I checked the core code and saw that the query is not written in a generic way; it is written for Oracle DB. An illustration of why that breaks is sketched below.
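To illustrate the portability point: table-rename syntax differs across databases, so a query written for Oracle will not run on MySQL. A hypothetical dialect-aware helper, purely as an illustration and not the actual collector code:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public final class TableRenamer {

    // Hypothetical helper: picks the rename statement by JDBC product name.
    //   Oracle:            RENAME old TO new
    //   MySQL:             RENAME TABLE old TO new
    //   H2/PostgreSQL/...: ALTER TABLE old RENAME TO new
    // Table identifiers are assumed trusted (no user input).
    public static void rename(Connection conn, String from, String to)
            throws SQLException {
        String product = conn.getMetaData()
                .getDatabaseProductName().toLowerCase();
        final String sql;
        if (product.contains("oracle")) {
            sql = "RENAME " + from + " TO " + to;
        } else if (product.contains("mysql")) {
            sql = "RENAME TABLE " + from + " TO " + to;
        } else {
            sql = "ALTER TABLE " + from + " RENAME TO " + to;
        }
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(sql);
        }
    }

    private TableRenamer() { }
}
```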
Now, in the case of MongoDB, when I execute the crawler a second time, I get these counts:
```
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:92 - ---@V---No of data before cleaning table name =queued count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:93 - ---@V---No of data before cleaning table name =active count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:94 - ---@V---No of data before cleaning table name =processed count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:95 - ---@V---No of data before cleaning table name =cached count--3
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:135 - ---@V--- cached -> swap
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:140 - ---@V--- processed -> cached
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:145 - ---@V--- swap -> processed
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:148 - ---@V---No of data after swap table name =queued count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:149 - ---@V---No of data after swap table name =active count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:150 - ---@V---No of data after swap table name =processed count--0
2023-06-05 18:29:05 INFO [crawlBang] CrawlDocInfoService.open:151 - ---@V---No of data after swap table name =cached count--0
```
In the case of the embedded H2 DB, if I execute the crawler a second time, the counts are:
```
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:92 - ---@V---No of data before cleaning table name =queued count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:93 - ---@V---No of data before cleaning table name =active count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:94 - ---@V---No of data before cleaning table name =processed count--3
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:95 - ---@V---No of data before cleaning table name =cached count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:135 - ---@V--- cached -> swap
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:140 - ---@V--- processed -> cached
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:145 - ---@V--- swap -> processed
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:148 - ---@V---No of data after swap table name =queued count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:149 - ---@V---No of data after swap table name =active count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:150 - ---@V---No of data after swap table name =processed count--0
2023-06-05 18:31:23 INFO [crawlBang] CrawlDocInfoService.open:151 - ---@V---No of data after swap table name =cached count--3
```
Please check the next comment.
Hi @essiembre, I have seen a bug in the Norconex HTTP Collector: when you use the default data store engine, the table rename works fine, but when you use MongoDB as the data store engine, the rename does not work properly. Please look at where the rename is triggered, in package com.norconex.collector.core.doc, class CrawlDocInfoService (public class CrawlDocInfoService implements Closeable).
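To make the failure mode concrete: the swap shown in the logs above (cached -> swap, processed -> cached, swap -> processed) corresponds to collection renames on MongoDB. A rough sketch of that rotation with the MongoDB Java driver, purely as an illustration of the mechanism and not the collector's actual code; database and collection names are taken from the log output:

```java
import com.mongodb.MongoNamespace;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.RenameCollectionOptions;

public class CollectionSwapExample {
    public static void main(String[] args) {
        try (MongoClient client =
                MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("crawler"); // placeholder

            // Three-way rotation mirroring the log messages:
            // cached -> swap, processed -> cached, swap -> processed.
            // Each renameCollection call is atomic on its own, but the
            // sequence as a whole is not, which is one place stale reads
            // or a partial swap could leave the crawler state inconsistent.
            rename(db, "cached", "swap");
            rename(db, "processed", "cached");
            rename(db, "swap", "processed");
        }
    }

    private static void rename(MongoDatabase db, String from, String to) {
        // Real code would need to guard against a missing source
        // collection, which makes renameCollection throw.
        db.getCollection(from).renameCollection(
                new MongoNamespace(db.getName(), to),
                new RenameCollectionOptions().dropTarget(true));
    }
}
```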
Hi @sipgyanendumishra, have you tried adding a delay of 2-3 seconds? As Pascal mentioned, changes to MongoDB are not committed right away.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi team, in the Norconex HTTP Collector, when I use the default MVStore data store, it does not crawl duplicate data. But with the same configuration and MongoDB as the data store, I get duplicates. How do I fix this, and what is the cause?
<?xml version="1.0" encoding="UTF-8"?>
This is my configuration file. What do I have to do to use MongoDB as the data store engine?
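For reference, the data store engine is set in the crawler XML configuration. A sketch of what that element might look like, assuming HTTP Collector 3.x; the class name and options here are assumptions and should be verified against your version's documentation:

```xml
<!-- Assumed element/class names; verify against your collector version. -->
<dataStoreEngine
    class="com.norconex.collector.core.store.impl.mongodb.MongoDataStoreEngine">
  <connectionString>mongodb://localhost:27017/crawlBang</connectionString>
</dataStoreEngine>
```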