Open sylvinus opened 8 years ago
facebook.com seems to be missing from Common Crawl: http://index.commoncrawl.org/CC-MAIN-2015-48-index?url=facebook.com&output=json
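For anyone who wants to repeat this check on other domains, here is a minimal sketch. It assumes the index server returns one JSON object per line when `output=json` is set. The helper names and the sample response body are made up for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "http://index.commoncrawl.org"


def build_index_url(collection, url):
    """Build a lookup URL of the same shape as the one linked above."""
    query = urlencode({"url": url, "output": "json"})
    return f"{INDEX_HOST}/{collection}-index?{query}"


def parse_index_response(body):
    """Parse a response body, assuming one JSON record per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


# Hypothetical response body, not real index output.
sample_body = (
    '{"url": "http://example.com/", "status": "200"}\n'
    '{"url": "http://example.com/about", "status": "200"}\n'
)

print(build_index_url("CC-MAIN-2015-48", "facebook.com"))
print(len(parse_index_response(sample_body)), "records in the sample body")
```

An empty list from `parse_index_response` would suggest that the domain has no captures in that collection, though the server may also signal this with an error response instead of an empty body.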
This is probably because of their robots.txt policy, which whitelists specific crawlers like Googlebot and does not include CCBot. The same seems to be the case for Twitter, Netflix, ...
To be discussed with Common Crawl folks!
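To make the robots.txt hypothesis concrete, here is a small sketch using Python's standard `urllib.robotparser`. The sample robots.txt is hypothetical, written in the whitelist style described above. It is not Facebook's actual file, and `is_allowed` is just an illustrative helper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt: a named crawler gets access,
# every other user agent is disallowed everywhere.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("Googlebot", "CCBot"):
    print(agent, is_allowed(SAMPLE_ROBOTS_TXT, agent, "https://example.com/"))
```

With a file like this, CCBot falls through to the `*` rule and is blocked from the homepage, which would explain an empty index result.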
A temporary workaround could be to index those missing domains (popular or not) from DMOZ or Wikipedia dumps (commonsearch/cosr-back#11), so we at least return a homepage result.
URL of the results:
https://uidemo.commonsearch.org/?g=en&q=facebook
Describe the issue precisely:
Not sure why. Other homepages don't appear because they redirect / to something else (like en.wikipedia.org), but facebook.com does not. Having site:facebook.com queries would be very helpful here, to make sure the page is not just buried in the results with a bad score.
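As a rough illustration of what handling a site: operator could involve on the query side, here is a sketch that separates the domain filter from the remaining search terms. The function name and behavior are assumptions for discussion, not how Common Search actually parses queries.

```python
def split_site_filter(query):
    """Split a raw query into (site, remaining_terms).

    Hypothetical helper: a token of the form site:<domain> becomes the
    domain filter, and every other token stays in the text query.
    """
    site = None
    terms = []
    for token in query.split():
        if token.lower().startswith("site:") and len(token) > len("site:"):
            site = token[len("site:"):].lower()
        else:
            terms.append(token)
    return site, " ".join(terms)


print(split_site_filter("site:facebook.com login"))
print(split_site_filter("facebook"))
```

The returned domain could then be applied as a filter on the indexed URL's host, so a low-scoring homepage would still surface when it exists in the index.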