USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

De-Duplicate documents in CrawlDB (Solr) #72

Closed karanjeets closed 7 years ago

karanjeets commented 7 years ago

Now that we have a more sophisticated definition of the id field (with the timestamp included), we have to think about de-duplication of documents.

I am opening a discussion thread here to define de-duplication. Some of the suggestions are:

We can refer here for the implementation.

thammegowda commented 7 years ago

@karanjeets We have to de-dup by crawl_id + url; if we can hash it, that would be great.

Please resolve this TODO: https://github.com/USCDataScience/sparkler/blob/master/sparkler-app/src/main/scala/edu/usc/irds/sparkler/solr/SolrUpsert.scala#L40 once you're done
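
Something along these lines would do for the hash (a minimal sketch; the object and helper names are illustrative, not the actual SolrUpsert code):

import java.security.MessageDigest

object DedupeId {
  // Hypothetical helper: hash crawl_id and url into a fixed-length hex id.
  def of(crawlId: String, url: String): String = {
    val digest = MessageDigest.getInstance("MD5")
    digest.digest(s"$crawlId-$url".getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
  }
}

// Example: DedupeId.of("crawl-2017-01", "http://example.com/page") yields a 32-char hex id.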

karanjeets commented 7 years ago

@thammegowda This TODO has to wait a little. Solr doesn't have an update handler that provides upsert functionality. We can write our own and contribute it back to the community. I took an alternate path to implement de-duplication. Please have a look at PR #73.

thammegowda commented 7 years ago

@karanjeets

Solr doesn't have an update handler that provides upsert functionality.

Did you try this (https://cwiki.apache.org/confluence/display/solr/De-Duplication) and are you telling me that it doesn't suit our use case?

thammegowda commented 7 years ago
1. Let's define a dedupe_id field in managed-schema or schema.xml:

   <field name="dedupe_id" type="string" stored="true" indexed="true" multiValued="false" />

2. Enable the dedupe update processor chain in solrconfig.xml:
<updateRequestProcessorChain name="dedupe">
   <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">dedupe_id</str>
      <bool name="overwriteDupes">false</bool>
      <str name="fields">crawl_id,url</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
3. Attach the dedupe chain to the /update request handler in solrconfig.xml:
   <requestHandler name="/update" class="solr.UpdateRequestHandler">
     <lst name="defaults">
       <str name="update.chain">dedupe</str>
     </lst>
     ...
   </requestHandler>
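
For reference, a client add through SolrJ would then get its dedupe_id computed on the server by the chain above. A minimal sketch (assuming SolrJ 6+, a core named crawldb at localhost:8983, and illustrative field values):

import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument

object DedupeIndexExample extends App {
  val client = new HttpSolrClient.Builder("http://localhost:8983/solr/crawldb").build()

  val doc = new SolrInputDocument()
  doc.addField("id", "crawl-2017-01-http://example.com/page") // hypothetical id scheme
  doc.addField("crawl_id", "crawl-2017-01")
  doc.addField("url", "http://example.com/page")

  client.add(doc)  // SignatureUpdateProcessorFactory fills in dedupe_id server-side
  client.commit()
  client.close()
}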

Let me know if you have tried this and/or faced any issues with it.

sujen1412 commented 7 years ago

@thammegowda, are you suggesting that dedupe_id = crawl_id + url? And then letting Solr dedup on that id?

thammegowda commented 7 years ago

Yes, dedupe_id = crawl_id + url, as suggested.

What I am saying (and what the TODO here is about) is that instead of doing it on the client side, we should let Solr handle it on the server side. That will be the efficient and right way of doing it.

sujen1412 commented 7 years ago

I think crawl_id + url is not sufficient for de-duping large web crawls. If you have a crawl running over 30 (or n) days with the same crawl_id, you will miss the new content being published at those URLs. For example, a product page on a website keeps getting more reviews, but the URL of that product might not change. So with this technique we would not store the new page, correct?

Another scenario could be tracking activity on a website based on how frequently its content changes.

In my opinion, we could use dedupe_id = hash(page content). We might as well use the TextProfileSignature mentioned in the link you posted (https://cwiki.apache.org/confluence/display/solr/De-Duplication).
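
Roughly, a content-based signature on the client side would look like this (a plain MD5 over normalized text, just to illustrate the idea; it is not Solr's TextProfileSignature and the names are made up):

import java.security.MessageDigest

object ContentSignature {
  def of(extractedText: String): String = {
    // Normalize whitespace and case so trivial formatting changes don't alter the hash.
    val normalized = extractedText.toLowerCase.split("\\s+").mkString(" ")
    MessageDigest.getInstance("MD5")
      .digest(normalized.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
  }
}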

karanjeets commented 7 years ago

@thammegowda

Did you try this (https://cwiki.apache.org/confluence/display/solr/De-Duplication) and are you telling me that it doesn't suit our use case?

Yes, I tried this. Either de-duplication is not supported with atomic updates, or there is a bug in Solr. Let me elaborate: when a new document is added, Solr creates the signature from the combination of fields you specify; however, it overwrites the signature with a series of zeros when you later do an atomic update on that document. Therefore, it doesn't help with de-duplication.

If it is a bug, the issue most likely lies in how the update chain is applied in the updateRequestHandler: it resets the signature to zeros when the signature fields are not passed, which is exactly what happens during an atomic update.
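
For concreteness, the atomic updates in question look roughly like this in SolrJ (the core URL, id scheme, and status field are just examples): the partial update re-sends only the id plus a set operation, so crawl_id and url never reach the signature processor.

import java.util.Collections
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument

object AtomicUpdateExample extends App {
  val client = new HttpSolrClient.Builder("http://localhost:8983/solr/crawldb").build()

  val doc = new SolrInputDocument()
  doc.addField("id", "crawl-2017-01-http://example.com/page")
  doc.addField("status", Collections.singletonMap("set", "FETCHED")) // atomic "set" update

  client.add(doc)  // with the dedupe chain as the /update default, this zeroed dedupe_id
  client.commit()
  client.close()
}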

P.S. If you look closely at the first comment, I referred to the same link.

karanjeets commented 7 years ago

@sujen1412

If you have a crawl running over 30 (or n) days with the same crawl_id, you will miss the new content being published at those URLs.

In my opinion, the crawl_id + url combination is better because the concern you are pointing at will be taken care of during crawl refresh through the retry_interval_seconds field.

I was thinking along the lines of adding the fields that contribute to de-duplication to the Sparkler configuration, but that won't be a good choice looking at the future prospects. We can have de-duplication plugins if required.

sujen1412 commented 7 years ago

In my opinion, the crawl_id + url combination is better because the concern you are pointing at will be taken care of during crawl refresh through the retry_interval_seconds field.

The crawl_id + url combination is better for what?

I was thinking along the lines of adding the fields that contribute to de-duplication to the Sparkler configuration, but that won't be a good choice looking at the future prospects. We can have de-duplication plugins if required.

I didn't get what you are trying to say here. What fields contribute to de-duplication? And what are you trying to deduplicate: url, content, or something else?

retry_interval_seconds only tells you when to crawl a page again; it does not tell you whether the page was modified. Relying only on retry_interval_seconds does not allow us to implement a smart fetch schedule that adapts to the dynamic nature of a domain. By dynamic I mean how frequently or infrequently the content changes, which in turn makes the crawler increase or decrease retry_interval_seconds automatically. For more details, have a look at what is implemented in Nutch: https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
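
Just to sketch the idea (this is not Nutch's AdaptiveFetchSchedule; the factors and bounds are placeholders):

object AdaptiveRetrySketch {
  val MinIntervalSecs: Long = 60L * 60            // assumed lower bound: 1 hour
  val MaxIntervalSecs: Long = 60L * 60 * 24 * 30  // assumed upper bound: 30 days

  // Shrink the interval when the content changed since the last fetch, grow it when it did not.
  def nextInterval(currentSecs: Long, contentChanged: Boolean): Long = {
    val next = if (contentChanged) (currentSecs * 0.8).toLong else (currentSecs * 1.5).toLong
    math.min(MaxIntervalSecs, math.max(MinIntervalSecs, next))
  }
}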

karanjeets commented 7 years ago

The crawl_id + url combination is better for what?

Better for the dedupe_id.

Let me elaborate on my point. There are two types of de-duplication: de-duplication of outlinks and de-duplication of content.

I didn't get what you are trying to say here. What fields contribute to de-duplication? And what are you trying to deduplicate: url, content, or something else?

I was thinking along the lines of generalization and giving control to the user, i.e. letting them define which combination of schema fields makes up the de-duplication id. Let's push this back because it was just a random thought and isn't helping the issue.

sujen1412 commented 7 years ago

De-duplication of Outlinks:

Agreed, you can dedup by crawl_id + url.

De-duplication of Content:

We need to flesh out more details on this and how it will be implemented. I am open to starting a discussion on this if it is on the timeline right now; otherwise we can defer it until it comes under development.

thammegowda commented 7 years ago

Okay, let's first complete the work on dedupe of outlinks and defer content-based dedupe to later weeks.

Getting back to the question - deduping outlinks on the server side:

Let me elaborate: when a new document is added, Solr creates the signature from the combination of fields you specify; however, it overwrites the signature with a series of zeros when you later do an atomic update on that document. Therefore, it doesn't help with de-duplication.

Thanks @karanjeets. Solr seems to have many bugs with atomic updates. We should file an issue for this bug and let them know about it once we are sure. Fixing that bug will take time, so we shall revert to our old way of handling it on the client side.

Going to merge #73 now

karanjeets commented 7 years ago

@thammegowda So, I have investigated the Solr dedupe issue further. The atomic update problem can be solved if we pass update.chain as a request parameter instead of setting it as a default in the update request handler. If we don't pass update.chain while updating a document, the value of dedupe_id is preserved.
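
A minimal SolrJ sketch of that approach (core URL and field values are just examples): invoke the dedupe chain explicitly on the initial add, and let later atomic updates omit update.chain so dedupe_id is left untouched.

import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.client.solrj.request.UpdateRequest
import org.apache.solr.common.SolrInputDocument

object ExplicitChainExample extends App {
  val client = new HttpSolrClient.Builder("http://localhost:8983/solr/crawldb").build()

  val doc = new SolrInputDocument()
  doc.addField("crawl_id", "crawl-2017-01")
  doc.addField("url", "http://example.com/page")

  val req = new UpdateRequest()
  req.add(doc)
  req.setParam("update.chain", "dedupe") // request-time parameter instead of a handler default
  req.process(client)

  client.commit()
  client.close()
}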

However, the utility doesn't seem to work as expected. Although this allows the dedupe_id to be created on the server side, it doesn't prevent documents with the same dedupe_id from entering the index. We only have the option to either overwrite the previous document or add a new document with the same dedupe_id.

To take this to the server side, we have to:

Let's merge #73 while I work on the above plan to move this to the server side.

karanjeets commented 7 years ago

Thanks, @thammegowda :+1: Shall we close this or keep it open until we have a better solution?

thammegowda commented 7 years ago

@karanjeets Yes