commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Improve filtering of EU cookie notices #35

Open sylvinus opened 8 years ago

sylvinus commented 8 years ago

Cookie notices are more of an annoyance than regular boilerplate because they usually appear at the top of the page and may pollute the snippets.

Right now we have very basic code to filter some of them, but we could use some of the lists at https://filterlists.com/ to filter more of them.

One big issue is the format of these lists, though: they use CSS selectors, sometimes as complex as cofunds.co.uk###idrMasthead > .idrPageRow[style*='z-index:1']. We don't have a CSS selector engine at the moment and it's unclear if we could add one without a massive performance hit.
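To make the format concrete, here is a minimal sketch of splitting one element-hiding rule into its domain part and its selector part. It assumes the Adblock Plus-style `domains##selector` syntax that the example above follows; the function name `split_hiding_rule` is hypothetical and not part of cosr-back.

```python
def split_hiding_rule(line):
    """Split 'domains##selector' into (domains, selector).

    Returns None for anything that is not a plain element-hiding rule.
    Hypothetical helper: assumes Adblock Plus-style list syntax.
    """
    line = line.strip()
    # Skip blanks, comments ("!"), the "[Adblock Plus ...]" header line,
    # and exception rules ("#@#"), which un-hide elements.
    if not line or line.startswith(("!", "[")) or "#@#" in line:
        return None
    # Everything before the first "##" is a comma-separated domain list
    # (empty for generic rules); everything after it is the CSS selector.
    domains, sep, selector = line.partition("##")
    if not sep or not selector:
        return None
    return [d for d in domains.split(",") if d], selector


rule = "cofunds.co.uk###idrMasthead > .idrPageRow[style*='z-index:1']"
domains, selector = split_hiding_rule(rule)
```

Note that the third `#` in the example belongs to the selector (`#idrMasthead`), not to the separator, which is why the split happens at the first `##` only.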

We may want to start by only using definitions by IDs and classes, which should take care of most cases.
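A rough sketch of what "IDs and classes only" could look like, without any CSS selector engine: keep the selectors that are a single `#id` or `.class`, drop everything else, and test elements with set lookups. The names `build_simple_sets` and `is_cookie_notice`, and the dict-of-attributes element representation, are assumptions for illustration rather than existing cosr-back code.

```python
import re

# A selector made of exactly one "#id" or ".class" token, nothing else.
SIMPLE_SELECTOR = re.compile(r"([#.])(-?[A-Za-z_][\w-]*)")


def build_simple_sets(selectors):
    """Collect the IDs and classes named by simple selectors."""
    ids, classes = set(), set()
    for sel in selectors:
        m = SIMPLE_SELECTOR.fullmatch(sel.strip())
        if m is None:
            continue  # combinators, attributes, tag names: skipped for now
        (ids if m.group(1) == "#" else classes).add(m.group(2))
    return ids, classes


def is_cookie_notice(attrs, ids, classes):
    """Check an element's attributes (a plain dict) against the sets."""
    if attrs.get("id") in ids:
        return True
    return any(c in classes for c in attrs.get("class", "").split())


ids, classes = build_simple_sets([
    "#cookie-banner",
    ".cc_banner",
    "#idrMasthead > .idrPageRow[style*='z-index:1']",  # too complex, dropped
])
```

Each element then costs one set lookup for its ID plus one per class, so the per-element overhead should stay small regardless of how many rules the lists contain. Domain-specific rules would still need to be keyed by domain, which this sketch leaves out.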

Rough todo list:

indolering commented 7 years ago

Shouldn't we be filtering ads entirely? Ads can be (and are) abused to manipulate search ranking.

> We don't have a CSS selector engine at the moment and it's unclear if we could add one without a massive performance hit.

In terms of performance issues, well, how are you planning on handling one-page-apps and other sites that require JS?

> Decide which lists to use depending on license, maintenance and coverage

I checked all of the major lists and most of the regional ones. Most of them are under a CC or similar OSS license. A handful prohibit commercial use (which is fine since we are a non-profit). Many of them don't mention licenses or usage restrictions, but our usage should fall under "fair use". I've asked the FilterLists maintainer to add a license attribute to the machine-readable list so that we can keep tabs on it.

The FilterLists about page mentions something about checking for updates. I've requested more information about this feature.