commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Create documents from DMOZ/Wikidata when they are missing in CC #27

Closed sylvinus closed 8 years ago

sylvinus commented 8 years ago

As mentioned in commonsearch/cosr-results#2, some big domains are missing from Common Crawl for various reasons that we will try to fix, but we should have a fallback with "fake" documents created from DMOZ and/or Wikidata items, in order to avoid any large gaping holes in the short term.

The main question is how this would fit into our current pipeline. The simplest way would probably be to iterate over entries from DMOZ and Wikidata (either with a range query on URLServer or straight from the dumps) and send op_type=create requests to Elasticsearch, so that documents already indexed from Common Crawl are not overwritten: https://www.elastic.co/guide/en/elasticsearch/guide/current/create-doc.html
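A minimal sketch of what the create-only requests could look like, assuming the entries are sent through the Elasticsearch `_bulk` endpoint as newline-delimited JSON. The index name `docs`, the `url`/`title` fields, the use of the URL as document ID, and the helper name `build_create_actions` are all hypothetical and not taken from cosr-back; depending on the Elasticsearch version, the action line may also need a `_type`.

```python
import json


def build_create_actions(fallback_docs, index="docs"):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint.

    Each document becomes a "create" action, which Elasticsearch rejects
    for that item if a document with the same _id already exists, so
    documents indexed earlier are left untouched.
    The index name and the "url" field used as _id are assumptions.
    """
    lines = []
    for doc in fallback_docs:
        # Action line: create-only, keyed by the document URL (hypothetical ID scheme).
        action = {"create": {"_index": index, "_id": doc["url"]}}
        lines.append(json.dumps(action, sort_keys=True))
        # Source line: the "fake" document built from DMOZ or Wikidata.
        lines.append(json.dumps(doc, sort_keys=True))
    if not lines:
        return ""
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = build_create_actions(
        [{"url": "http://example.com/", "title": "Example Domain"}]
    )
    print(body, end="")
```

The resulting string would be POSTed to `_bulk`; items that collide with an existing document come back as per-item conflicts rather than failing the whole request, so the caller can simply ignore them.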

This method would only work after a clean reindex from Common Crawl, but this shouldn't be a big issue short-term. Open to other ideas though!

sylvinus commented 8 years ago

The current solution is not ideal, but it works with our current requirement of reindexing from scratch each time: we index Wikidata first, then Common Crawl, which overwrites any matching entries.

Test for this: https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_sources.py#L21
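The ordering described above can be illustrated with a small in-memory sketch. This is not the cosr-back implementation: a plain dict stands in for the Elasticsearch index, and the `url` key and the function name `index_in_order` are hypothetical.

```python
def index_in_order(wikidata_docs, commoncrawl_docs):
    """Simulate indexing Wikidata first, then Common Crawl.

    A dict keyed by URL stands in for the index (an assumption for
    illustration only). Later writes replace earlier ones, so a
    Common Crawl document wins whenever both sources cover a URL.
    """
    index = {}
    for doc in wikidata_docs:
        index[doc["url"]] = doc  # fallback documents go in first
    for doc in commoncrawl_docs:
        index[doc["url"]] = doc  # crawled documents overwrite fallbacks
    return index


if __name__ == "__main__":
    wikidata = [
        {"url": "http://a.example/", "source": "wikidata"},
        {"url": "http://b.example/", "source": "wikidata"},
    ]
    commoncrawl = [{"url": "http://a.example/", "source": "commoncrawl"}]
    result = index_in_order(wikidata, commoncrawl)
    for url in sorted(result):
        print(url, result[url]["source"])
```

URLs missing from Common Crawl keep their Wikidata fallback, while URLs present in both end up with the crawled document, which is why this only holds when the index is rebuilt from scratch in that order.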