commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Simple improvements to URL normalization #29

Open sylvinus opened 8 years ago

sylvinus commented 8 years ago

There are many simple things we could do to improve our normalized URLs and avoid duplicates:

Some good ideas there:
https://github.com/iipc/webarchive-commons/tree/master/src/main/java/org/archive/url
https://github.com/rajbot/surt/tree/master/surt

The code for this should go in https://github.com/commonsearch/cosr-back/blob/master/cosrlib/url.py. Exhaustive unit tests would be great!
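
To make the discussion concrete, here is a minimal sketch of a few commonly used normalization steps: lowercasing the scheme and host, dropping default ports, dropping fragments, defaulting an empty path to `/`, and sorting query parameters. These steps are assumptions chosen for illustration, not the list proposed in this issue, and the function name `normalize_url` and the `DEFAULT_PORTS` table are hypothetical rather than part of `cosrlib/url.py`. It uses only the Python 3 standard library.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical table of ports that can be omitted for a given scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a normalized form of `url` (illustrative sketch only)."""
    parts = urlsplit(url.strip())

    # Scheme and host are case-insensitive, so lowercase both.
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower().rstrip(".")

    # Keep the port only when it differs from the scheme's default.
    # Note: this sketch drops any userinfo (user:password@) and does
    # not handle bracketed IPv6 hosts.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = "%s:%d" % (host, parts.port)

    # An empty path and "/" point at the same resource.
    path = parts.path or "/"

    # Sort query parameters so that ordering does not create duplicates.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(pairs)

    # The fragment is never sent to the server, so drop it.
    return urlunsplit((scheme, netloc, path, query, ""))


if __name__ == "__main__":
    print(normalize_url("HTTP://Example.COM:80/a?b=2&a=1#frag"))
```

Whether each of these steps is safe depends on the site: sorting query parameters, for instance, can in rare cases change which page a server returns, so any real implementation would want the exhaustive unit tests mentioned above.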