Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Multiple index entries for the exact same URL #642

Closed · joettt closed this issue 3 years ago

joettt commented 5 years ago

Hello Pascal,

I am opening this again as a new issue (please refer to issue #629). I have no idea what is causing this ongoing problem or how to fix it. As you can see below, when I run the Norconex crawler, the same entry is added twice in a row, at "timestamp":"2019-09-20T01:04:42.704Z" and "timestamp":"2019-09-20T01:04:42.724Z":

 "response":{"numFound":2,"start":0,"docs":[
      {
        "_language":"en",

        "_platform":"gccollab",

        "id":"https://gccollab.ca/groups/profile/2183003",

        "_title":["GCcollab"],

        "_type":"Group",

        "url":"https://gccollab.ca/groups/profile/2183003",

        "content":["GCMobility"],

        "_version_":1645154384167829504,

        "_root_":["https://gccollab.ca/groups/profile/2183003"],

        "timestamp":"2019-09-20T01:04:42.704Z"},

      {
        "_language":"en",

        "_platform":"gccollab",

        "id":"https://gccollab.ca/groups/profile/2183003",

        "_title":["GCcollab"],

        "_type":"Group",

        "url":"https://gccollab.ca/groups/profile/2183003",

        "content":[""],

        "_version_":1645154384188801024,

        "_root_":["https://gccollab.ca/groups/profile/2183003"],

        "timestamp":"2019-09-20T01:04:42.724Z"}]

  }}
joettt commented 5 years ago

I have even tried implementing de-duplication. During the commit, duplication does get prevented; however, the whole batch updating the Solr index fails once a duplicate is detected.

essiembre commented 5 years ago

I had yet another look at your config file in #629, and I thought maybe you were not sending the ID properly, but from your examples it looks like you are. Just in case, I wonder if adding one or both of the following could make a difference:

If that makes no difference, I am afraid it is really a Solr-specific challenge. Unless I am missing something, all the evidence you have shared so far (#629) suggests the collector sends the information as it should to your Solr instance. I am not sure what causes this on the Solr side.

Please confirm whether you find a reproducible issue with the Collector itself. Otherwise, I suggest you inquire with the Solr community at http://lucene.apache.org/solr/community.html. If you need hands-on assistance with your Solr installation, you can also contact Norconex.

joettt commented 5 years ago

Thank you for your response. I have tried both of your recommendations above, but the index is still being updated with additional records that have the same id.

The only way I am able to prevent multiple entries is by implementing de-duplication as recommended here: https://lucene.apache.org/solr/guide/6_6/de-duplication.html. The problem with it is that the entire commit batch then fails with the error message: "Document contains multiple values for unique key field id."
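
For reference, the setup described in that guide amounts to adding a signature update processor chain to solrconfig.xml. The sketch below is only illustrative; the fields hashed into the signature are placeholders, not my exact configuration:

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <!-- Field that receives the computed signature (the guide's example uses the unique key field). -->
        <str name="signatureField">id</str>
        <!-- Replace older documents that share the same signature. -->
        <bool name="overwriteDupes">true</bool>
        <!-- Fields hashed into the signature (illustrative). -->
        <str name="fields">url,content</str>
        <str name="signatureClass">solr.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    <requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">dedupe</str>
      </lst>
    </requestHandler>

Since the crawler already supplies id (the document URL), writing the signature into that same field seems a likely source of the "Document contains multiple values for unique key field id" error quoted above, though I have not confirmed this.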

I wonder if you would consider this a bug fix and make a change so that the committer skips the commit for duplicate documents but still updates the index with the "legitimate" entries?

joettt commented 5 years ago

I had also posted the question to the Solr community three weeks ago but have not received any replies.

essiembre commented 5 years ago

Right now, documents that could not be committed are tried again on the next run. Would that mean losing this ability?

I will mark this as a feature request to ignore/drop/log documents that could not be committed, but that could get tricky. What if the engine is down? Should they all be ignored then? What if a batch fails... do we then commit documents one by one instead? It raises quite a few questions, but as long as it ends up being configurable, we should find multiple ways to improve this.

joettt commented 5 years ago

Thank you. Any documents that could not be committed should be committed in the next run if the engine goes down or if the batch fails.

This feature request will help in other areas where a batch fails as well. For example, in my schema file, the title field was set to multiValued=false, but a few documents on our site had multiple title values, which caused the entire batch to fail.
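
To illustrate, the field definition in question was along these lines in the schema (the field type name here is just a placeholder):

    <!-- Single-valued title: a document arriving with two or more title
         values is rejected, and with batched commits the whole batch failed. -->
    <field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>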

essiembre commented 4 years ago

I do not know if this is related, but it just occurred to me: have you defined multiple crawlers in the same collector configuration? If so, make sure you give each committer a different "queueDir". If you have not set them explicitly, they share the same default, which means crawled documents from all crawlers are queued in the same directory. When that happens, if different crawlers index the same document, there is indeed a chance the same document (same id) is sent twice. Solr overwrites existing documents with the same ID, but it may have issues if the duplicates appear in the same batch. Worth a try if that is your case.
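
As a rough sketch of what that looks like (the crawler IDs, Solr URL, and paths below are placeholders, not taken from your setup), each crawler gets its own queue directory:

    <crawlers>
      <crawler id="crawler-a">
        <!-- ... start URLs, filters, importer settings ... -->
        <committer class="com.norconex.committer.solr.SolrCommitter">
          <solrURL>http://localhost:8983/solr/mycollection</solrURL>
          <!-- Distinct queue directory so this crawler's documents are
               never batched together with another crawler's. -->
          <queueDir>/opt/collector/queue/crawler-a</queueDir>
        </committer>
      </crawler>
      <crawler id="crawler-b">
        <!-- ... -->
        <committer class="com.norconex.committer.solr.SolrCommitter">
          <solrURL>http://localhost:8983/solr/mycollection</solrURL>
          <queueDir>/opt/collector/queue/crawler-b</queueDir>
        </committer>
      </crawler>
    </crawlers>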

essiembre commented 3 years ago

This has been implemented in the version 3 stack. See #647.

Failing batches can automatically be reattempted any number of times in full, or in smaller subsets. Ultimately, failing entries are stored separately so you can troubleshoot them, and even put them back in the queue if you want to process them again at a later time.