commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Import DNS metadata #15

Open sylvinus opened 8 years ago

sylvinus commented 8 years ago

There are a few ranking signals we could extract from data related to the domain name registration and records:

Getting complete whois/zonefile dumps doesn't seem easy at the moment. Any ideas?

A couple interesting links:

sylvinus commented 8 years ago

Greg mentioned that Internet Archive does "survey crawls" every year with some root zones, but they are pretty spammy, as expected.

thepieuvre commented 8 years ago

Why not start by indexing only the domains of web pages that are already indexed, and update them when the pages are updated? It is not complete, but it is a start.

sylvinus commented 8 years ago

@thepieuvre currently we reindex everything each month, so we can avoid doing updates.

However, making sure we have the root domain indexed for each page is indeed a good start.
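
As a rough illustration of that first step, here is a minimal sketch of deriving the root domain from each indexed page URL. The names `root_domain`, `domains_to_index` and `MULTI_LABEL_SUFFIXES` are hypothetical and not part of cosr-back, and the suffix set is a tiny stand-in for the full Public Suffix List that a real pipeline would need.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny sample of multi-label public suffixes.
# Assumption: a real pipeline would load the full Public Suffix List instead.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def root_domain(url):
    """Return a naive registered ("root") domain for a URL, or None.

    IP addresses and unusual hosts are not handled specially in this sketch.
    """
    host = urlsplit(url).hostname
    if not host:
        return None
    labels = host.rstrip(".").split(".")
    if len(labels) < 2:
        return host
    # Keep three labels when the last two form a known multi-label suffix.
    last_two = ".".join(labels[-2:])
    keep = 3 if last_two in MULTI_LABEL_SUFFIXES and len(labels) >= 3 else 2
    return ".".join(labels[-keep:])


def domains_to_index(page_urls):
    """Collect the set of root domains covering a batch of indexed page URLs."""
    return {d for d in map(root_domain, page_urls) if d}
```

Running `domains_to_index` over the URLs from a monthly reindex would give the list of domains for which DNS or whois metadata is worth fetching, without needing a complete zonefile dump.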

indolering commented 8 years ago

Those spammy URLs are just as valuable as non-spammy URLs.

sylvinus commented 8 years ago

@indolering yes, they are valuable as a dataset, but we don't want them in the search results. The index will be large enough as it is!

indolering commented 8 years ago

But wouldn't it help to know which pages those spammy domains link to?