commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0
122 stars 24 forks source link

Fix caching and hash collisions in _fast_make_domain_id #74

Open sebastian-nagel opened 7 years ago

sebastian-nagel commented 7 years ago
  1. do not overwrite local variable "domain" used as key for cache
  2. avoid hash collisions for pure domains (host without subdomain) by using a 64-bit hash value on domain.suffix
  3. avoid hash collisions inside large domains (aka. public suffixes, e.g., deviantart.com, wordpress.org): replace two-part hash value (32-bit subdomain + 32-bit domain.suffix) by 64-bit hash on subdomain.domain.suffix
  4. add notice that _fast_make_domain_id is not compatible with make_domain_id

In the 2016 host-level webgraph 501,252 host nodes collide with one or more (up to 5) other nodes, i.e., the share the same ID caused by a hash collision. The collisions are counted via

   zcat vertices.txt.gz | cut -d' ' -f1 | sort | uniq -d

Most of the collisions affect hosts without subdomain hashed only with 32-bit: e.g., "myfunfan.com" and "adobe.com" both with id "1831416544"). By using a 64-bit hash for these, the collisions are significantly reduced (now about 1000).

However, if a composed hash is used (32-bit per subdomain and domain.suffix) there are still many collisions for domains also known as public suffixes often with millions of subdomains. The number of collisions are:

   1083 deviantart.com
    566 wordpress.com
    184 tumblr.com
    124 xbjiaju.com
     94 polyvore.com
     82 livejournal.com

This can be fixed by using a 64-bit hash on the whole string subdomain.domain.suffix. For the 2016 host lists, I did not see any collisions.