commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Improve host-level PageRanks #52

Open sylvinus opened 8 years ago

sylvinus commented 8 years ago

As explained in our blog post, our host-level PageRank is very experimental and still highly vulnerable to spam.

Here is a list of our current ideas to improve it, feel free to contribute yours!

Going to URL-level PageRanks would obviously help a lot, but it is out of scope for this issue.

sylvinus commented 8 years ago

Sebastian from Common Crawl just did a very interesting first pass on spam in the dumps: https://gist.github.com/sebastian-nagel/beb244bf1f7092a06a1479335a5e268b

This script detects a few webspam clusters based on the similarity of their domain names and PageRank values.
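To make the idea concrete, here is a rough sketch of that kind of detection. It is not Sebastian's script, only an illustration of the general approach: sort hosts by PageRank, then group neighbours whose scores are nearly identical and whose names look alike. The function names, the thresholds and the use of `difflib` for name similarity are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a ratio in [0, 1] of how alike two host names are (difflib heuristic)."""
    return SequenceMatcher(None, a, b).ratio()


def find_suspicious_clusters(ranks, rank_tolerance=1e-9,
                             min_name_similarity=0.8, min_cluster_size=3):
    """Group hosts with near-identical PageRank and similar names.

    ranks: dict mapping host name -> PageRank score.
    All thresholds are hypothetical defaults, not tuned values.
    Returns a list of clusters, each a sorted list of host names.
    """
    # Sort by (rank, name) so hosts with the same score end up adjacent.
    ordered = sorted(ranks.items(), key=lambda kv: (kv[1], kv[0]))

    clusters = []
    current = []
    for host, rank in ordered:
        if current:
            prev_host, prev_rank = current[-1]
            same_rank = abs(rank - prev_rank) <= rank_tolerance
            similar = name_similarity(host, prev_host) >= min_name_similarity
            if not (same_rank and similar):
                # The run is broken: keep it only if it is large enough.
                if len(current) >= min_cluster_size:
                    clusters.append(sorted(h for h, _ in current))
                current = []
        current.append((host, rank))

    # Flush the last run.
    if len(current) >= min_cluster_size:
        clusters.append(sorted(h for h, _ in current))
    return clusters


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {
        "cheap-pills-1.example": 0.002,
        "cheap-pills-2.example": 0.002,
        "cheap-pills-3.example": 0.002,
        "commonsearch.org": 0.01,
        "wikipedia.org": 0.05,
    }
    for cluster in find_suspicious_clusters(sample):
        print(cluster)
```

Comparing only adjacent hosts keeps this cheap after the sort, but it will miss clusters whose names sort far apart; a real pass would probably need a better grouping key, such as the registered domain.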