Closed: karanjeets closed this issue 7 years ago
@karanjeets
We have to de-dup by crawl_id + url; if we can hash it, that would be great.
Please resolve this TODO once you're done:
https://github.com/USCDataScience/sparkler/blob/master/sparkler-app/src/main/scala/edu/usc/irds/sparkler/solr/SolrUpsert.scala#L40
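A minimal sketch of what hashing crawl_id + url into a single dedupe key could look like on the client side, using plain JDK MD5. The class, method, and separator here are illustrative assumptions, not Sparkler's actual API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DedupeKey {

    // Hash crawl_id + url into a fixed-length hex key usable as a Solr id.
    public static String dedupeId(String crawlId, String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest((crawlId + "|" + url).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // The same crawl_id + url always yields the same key, so a
        // re-discovered outlink collapses onto the same document id.
        System.out.println(DedupeKey.dedupeId("crawl-1", "http://example.com/"));
    }
}
```

Because the key is deterministic, indexing the same (crawl_id, url) pair twice hits the same Solr id instead of creating a second document.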
@thammegowda This TODO has to wait a little. Solr doesn't have an update handler that provides upsert functionality. We can write our own and contribute it back to the community. In the meantime, I took an alternate path to implement de-duplication. Please have a look at PR #73.
@karanjeets
> Solr doesn't have the update handler which provides the upsert functionality.

Did you try this - https://cwiki.apache.org/confluence/display/solr/De-Duplication - and are you telling me that it doesn't suit our use case?
Add a `dedupe_id` field to the managed schema or schema.xml:

```xml
<field name="dedupe_id" type="string" stored="true" indexed="true" multiValued="false" />
```

Then define the dedupe update chain in solrconfig.xml:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">dedupe_id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">crawl_id,url</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```
Also in solrconfig.xml, attach the chain to the update request handler:

```xml
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
  ...
</requestHandler>
```
Let me know if you have tried this and/or faced any issues with it.
@thammegowda, are you suggesting that dedup_id = crawl_id + url? And then you let Solr dedup on that id?
Yes, dedup_id = crawl_id + url as suggested.
What I am saying is (the TODO here is): instead of doing it on the client side, we should let Solr handle it on the server side. That will be the efficient and right way of doing it.
I think crawl_id + url is not sufficient for de-duping large web crawls. If you have a crawl running over 30 (or n) days with the same crawl_id, you will miss the new content being uploaded at those urls. For example, a product page on a website keeps getting more reviews, but the url of that product might not change. So with this technique we would not store the new page, correct?
Another scenario could be tracking activity on a website based on how frequently the content changes.
In my opinion, we could set dedup_id = hash(page content). We might as well use the TextProfileSignature mentioned in the link you posted (https://cwiki.apache.org/confluence/display/solr/De-Duplication).
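A rough sketch of the hash(page content) idea. This is a simplified normalize-then-hash scheme of my own, not Solr's actual TextProfileSignature algorithm: lower-case the text and collapse whitespace before hashing, so cosmetic changes keep the signature while real content changes produce a new one:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentSignature {

    // Normalize then hash page text; formatting-only edits keep the signature.
    public static String sign(String pageText) {
        String normalized = pageText.toLowerCase().replaceAll("\\s+", " ").trim();
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String v1 = ContentSignature.sign("Great product!  5 stars.");
        String v2 = ContentSignature.sign("great product! 5 stars.");  // formatting-only change
        String v3 = ContentSignature.sign("Great product! 5 stars. Broke after a week.");
        System.out.println(v1.equals(v2)); // true: same signature
        System.out.println(v1.equals(v3)); // false: new content, new signature
    }
}
```

Solr's TextProfileSignature goes further (term-frequency profiling with a quantization rate) so that near-duplicates also collide; this sketch only handles exact content after normalization.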
@thammegowda
> Did you try this - https://cwiki.apache.org/confluence/display/solr/De-Duplication - and are you telling me that it doesn't suit our use case?
Yes, I tried this. De-duplication is not supported with atomic updates, or else there is a bug in Solr. Let me elaborate: at the time of adding a new document, Solr creates the signature from the combination of fields you specify; however, it resets the signature to a series of zeros when you perform an atomic update on that document. Therefore, it doesn't help with de-duplication.
If it is a bug, most likely the issue is with how the update chain is applied in the updateRequestHandler: during an atomic update, if we don't pass the fields, it updates the signature to zeros.
P.S. - If you look closely at the first comment, I referred to the same link.
@sujen1412
> If you have a crawl running over 30(or n) days having the same crawl_id, you will miss the new content that is being uploaded on the urls.

In my opinion the crawl_id-url combination is better, because the concern you are pointing at will be taken care of during crawl refresh through the field retry_interval_seconds.
I was thinking along the lines of adding the fields that contribute to de-duplication to the Sparkler configuration, but that won't be a good choice looking at the future prospects. We can have de-duplication plugins if required.
> In my opinion the crawl_id-url combination is better because the concern you are pointing at will be taken care while crawl refresh through field retry_interval_seconds.

The crawl_id-url combination is better for what?

> I was thinking on the line of adding the fields, that contribute to De-duplication, in Sparkler configuration but that won't be a good choice looking at the future prospects. We can have de-duplication plugins if required.

I didn't get what you are trying to say here. What fields contribute to de-duplication? And what are you trying to deduplicate: url, content, or something else?
retry_interval_seconds only tells you when to crawl that page again; it does not tell you whether the page was modified or not. Relying only on retry_interval_seconds does not allow us to implement a smart fetch schedule that adapts to the dynamic nature of a domain. By dynamic I mean how frequently or infrequently the content changes, which in turn makes the crawler increase or decrease retry_interval_seconds automatically. For more details, have a look at what is implemented in Nutch: https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
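The core of an adaptive schedule in the style of Nutch's AdaptiveFetchSchedule can be sketched in a few lines. The rate constants and bounds below are illustrative placeholders, not Nutch's or Sparkler's actual configuration:

```java
public class AdaptiveSchedule {

    static final double INC_RATE = 0.4;            // back off when content is unchanged
    static final double DEC_RATE = 0.2;            // revisit sooner when content changed
    static final long MIN_INTERVAL = 60;           // seconds
    static final long MAX_INTERVAL = 30L * 24 * 3600;

    // Adjust retry_interval_seconds based on whether the last fetch saw a change.
    public static long nextInterval(long currentInterval, boolean contentChanged) {
        double next = contentChanged
                ? currentInterval * (1.0 - DEC_RATE)
                : currentInterval * (1.0 + INC_RATE);
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, Math.round(next)));
    }

    public static void main(String[] args) {
        long interval = 86400;                            // start at one day
        interval = AdaptiveSchedule.nextInterval(interval, false); // unchanged -> grow
        System.out.println(interval);                     // 120960
        interval = AdaptiveSchedule.nextInterval(interval, true);  // changed -> shrink
        System.out.println(interval);                     // 96768
    }
}
```

Deciding "contentChanged" is exactly where a content signature (rather than crawl_id + url alone) would plug in: compare the new fetch's signature against the stored one.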
> crawl_id-url combination is better for what?

Better for the dedupe_id. Let me elaborate on my point. There are two types of de-duplication:

1. De-duplication of Outlinks: This is what we are discussing here. The objective is to de-duplicate outlinks so that we don't end up crawling the same page again and again. For example, if every page of a website points back to its home page, we would like to remove the home-page URL from the outlinks so that Sparkler doesn't fetch it again.
2. De-duplication of Content: This is, I think, what you are talking about. It applies when you are refreshing the crawl, or when we want the crawler to fetch the page again based on the property retry_interval_seconds. This is not implemented yet; when it is, we will add the newly fetched document into our index and it will have the same dedupe_id. We can handle this with different Solr handlers.
> I didnt get what you are trying to say here. What fields contribute to de-duplication ? And what are you trying to deduplicate, url or content or something else ?

I was thinking along the lines of generalization and giving control to the user, i.e. letting them define which combination of schema fields makes up the de-duplication id. Let's push this back, since it was just a random thought and isn't helping the issue.
> De-duplication of Outlinks:

Agreed, you can dedup by crawl_id + url.

> De-duplication of Content:

We need to flesh out more details on this and how it will be implemented. I am open to starting a discussion on this if it is on the timeline right now; otherwise we can defer it until it comes under development.
Okay, let's first complete the work on dedupe of outlinks and defer dedupe based on content to later weeks.
Getting back to the question of deduping outlinks on the server side:

> Let me elaborate; at the time of adding a new document in Solr, it creates the signature with the combination of fields you specify however it updates the signature to a series of zeros when you do an atomic update on that. Therefore, it doesn't help in the de-duplication.
Thanks @karanjeets. Solr seems to have many bugs around atomic updates. If we are sure about this one, we need to file an issue and let them know. Fixing that bug will take time, so we shall revert to our old way of handling this on the client side.
Going to merge #73 now
@thammegowda So, I have investigated further into the Solr dedupe issue. The atomic-update problem can be solved if we pass update.chain as a request parameter instead of setting it as a default in the update request handler. If we don't invoke update.chain while updating the document, the value of dedupe_id is preserved.
However, the utility doesn't seem to work as expected. Although this allows the dedupe_id to be created on the server side, it doesn't prevent documents with the same dedupe_id from entering the index. Our only options are to overwrite the previous document OR add the new document with the same dedupe_id.
To take this to the server side, we have to:

1. Write a custom UpdateRequestProcessor which prevents documents with the same dedupe_id from entering the index.
2. Override the addBeans method and customize the UpdateRequest handler.

Let's merge #73 while I work on the above plan to take this to the server side.
Thanks, @thammegowda :+1: Shall we close this or keep it open until we have a better solution?
@karanjeets Yes
Since we now have a more sophisticated definition of the id field (with timestamp included), we have to think about de-duplication of the documents. I am opening a discussion channel here to define de-duplication. Some of the suggestions are:

1. De-duplicate on the signature field (but this will enforce fetching of the duplicate document even though we are not storing it).
2. De-duplicate on the url field.

We can refer here for the implementation.