commonsearch / cosr-results

Common Search sub-project for improving the quality & relevance of search results
https://about.commonsearch.org/developer/result-quality

facebook.com & other popular domains are missing #2

Open sylvinus opened 8 years ago

sylvinus commented 8 years ago

URL of the results:

https://uidemo.commonsearch.org/?g=en&q=facebook

Describe the issue precisely:

Not sure why facebook.com is missing. Other homepages don't appear because they redirect / to something else (like en.wikipedia.org), but facebook.com doesn't. Supporting site:facebook.com queries would be very helpful here, to check that the homepage is not simply buried in the results with a bad score.

sylvinus commented 8 years ago

Seems to be missing from Common Crawl: http://index.commoncrawl.org/CC-MAIN-2015-48-index?url=facebook.com&output=json
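For checking other domains the same way, here is a minimal sketch that builds the index URL used above and parses a response body. It assumes the index returns one JSON object per line when `output=json` is set; `index_query_url` and `parse_index_response` are hypothetical helper names, and fetching the body is left to whatever HTTP client is at hand.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "http://index.commoncrawl.org"


def index_query_url(collection, url):
    """Build a query URL for one crawl collection, e.g. "CC-MAIN-2015-48"."""
    query = urlencode({"url": url, "output": "json"})
    return f"{INDEX_HOST}/{collection}-index?{query}"


def parse_index_response(body):
    """Parse a response body, assuming one JSON record per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def is_captured(body):
    """A domain counts as present if the index returned at least one record."""
    return len(parse_index_response(body)) > 0
```

An empty body parses to an empty list, so `is_captured("")` is false, which is what the facebook.com lookup appears to amount to.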

sylvinus commented 8 years ago

This is probably because of their robots.txt policies, which whitelist specific crawlers like Googlebot; CCBot is not on those lists. The same applies to Twitter, Netflix, ...

To be discussed with Common Crawl folks!
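A rough way to confirm the whitelist theory for a given domain is to run its robots.txt through Python's standard `urllib.robotparser`. The sample rules below are made up to illustrate a whitelist-style policy; they are not facebook.com's actual robots.txt, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Illustrative whitelist-style policy (an assumption, not a real site's file):
# one named crawler is allowed, everyone else is blocked.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With rules like these, `is_allowed(SAMPLE_ROBOTS_TXT, "CCBot", "https://example.com/")` is false while the same call for `"Googlebot"` is true, which would explain a homepage being absent from the crawl.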

sylvinus commented 8 years ago

A temporary workaround could be to index those missing domains (popular or not) from DMOZ or Wikipedia dumps (commonsearch/cosr-back#11), so we at least return a homepage result.
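As a sketch of that workaround, assuming the domains have already been extracted from a DMOZ or Wikipedia dump into a plain list: compare them against the domains already indexed and emit a placeholder homepage record for each missing one. The record shape and the `missing_homepages` name are hypothetical and not taken from cosr-back.

```python
def missing_homepages(directory_domains, indexed_domains):
    """Return placeholder homepage records for domains absent from the index.

    Domains are compared case-insensitively; the output is sorted by domain.
    """
    indexed = {d.lower() for d in indexed_domains}
    candidates = sorted({d.lower() for d in directory_domains})
    return [
        {"domain": d, "url": f"http://{d}/"}  # scheme is a guess; adjust per domain
        for d in candidates
        if d not in indexed
    ]
```

Calling `missing_homepages(["facebook.com", "en.wikipedia.org"], ["en.wikipedia.org"])` yields a single record for facebook.com.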