USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
411 stars 143 forks

Deduplication of Solr Documents #73

Closed karanjeets closed 7 years ago

karanjeets commented 7 years ago

Closes #71 and #72

Solr doesn't provide an upsert mechanism out of the box, so I created another field in Solr, dedupe_id, which stores SHA256(crawl_id-url).
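For illustration only, here is a minimal Python sketch of how such a dedupe_id could be computed and used to drop repeats. It is not the code from this PR: the function names dedupe_id and drop_duplicates are hypothetical, and the hyphen separator, UTF-8 encoding, and hex-encoded digest are assumptions read off the SHA256(crawl_id-url) notation above.

```python
import hashlib


def dedupe_id(crawl_id: str, url: str) -> str:
    """Hex SHA-256 of "<crawl_id>-<url>" (separator and encoding are assumptions)."""
    key = f"{crawl_id}-{url}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def drop_duplicates(docs):
    """Keep the first document seen for each dedupe_id, preserving input order.

    Each document is assumed to be a dict with "crawl_id" and "url" keys.
    """
    seen = set()
    unique = []
    for doc in docs:
        key = dedupe_id(doc["crawl_id"], doc["url"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Because the crawl id is part of the hashed key, the same URL fetched in two different crawls gets two different dedupe_id values, while a URL repeated within one crawl collapses to a single value.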

Also, commented out overrides: id in solr-schema-map.yaml so that id is included in the field formatting for Solr. Please note that this id field comes from Tika metadata.

karanjeets commented 7 years ago

@thammegowda Please review.

thammegowda commented 7 years ago

👍 Thanks @karanjeets