commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Index presence of ads, trackers #34

Open mlinksva opened 8 years ago

mlinksva commented 8 years ago

https://filterlists.com/ could help determine this.

Allow users to filter results based on the indexed data, and/or boost results lacking ads and trackers.

Looking at https://about.commonsearch.org/values it seems such filters would be mainstream (more so than license filters) and possibly aligned with privacy, though as stated the value is only about what Common Search does with user data. But Common Search's independence could allow it to take stronger (or at least different) measures to protect searchers than Google does.

I'd love to be able to search the web sans ad-laden sites. Not to avoid the ads (for that I use an ad blocker) but to avoid the junk content. Searching for info on many consumer products on Google, one has to wade through ad/affiliate-driven reviews and stores to find neutral information or even information provided by the manufacturer. Filtering out stores would be harder, so I didn't put that in the title of this issue.
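
For illustration, something along these lines at indexing time is what I'm imagining; the `has_trackers` field, the helper, and the domain set below are all made up, not existing cosr-back code:

```python
from urllib.parse import urlparse

# Made-up set of ad/tracker domains, e.g. extracted from one of the lists on filterlists.com
TRACKER_DOMAINS = {"google-analytics.com", "doubleclick.net", "scorecardresearch.com"}

def has_trackers(external_urls):
    """Return True if any external resource URL belongs to a known tracker domain."""
    for url in external_urls:
        host = urlparse(url).netloc.lower()
        # match the domain itself or any of its subdomains
        if any(host == d or host.endswith("." + d) for d in TRACKER_DOMAINS):
            return True
    return False

# At indexing time this could become a boolean field on the document,
# so the frontend can filter on it or use it as a ranking signal.
document = {
    "url": "http://example.com/",
    "has_trackers": has_trackers(["https://www.google-analytics.com/analytics.js"]),
}
```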

sylvinus commented 8 years ago

Wow I didn't know about filterlists, looks very useful, thanks!

We could use some of those lists for better parsing, for instance to better remove the cookie notices that usually pollute the top of pages. => #35 :-)
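
Rough sketch of what I mean (not real cosr-back code; the rules and the lxml/cssselect choice are just placeholders):

```python
import lxml.html

# Made-up cosmetic rules in the common "##css-selector" syntax used by these lists
COSMETIC_RULES = ["##.cookie-notice", "##div#cookie-banner"]

def strip_cosmetic_matches(html):
    """Drop elements matched by the CSS selector part of each cosmetic rule."""
    tree = lxml.html.fromstring(html)
    for rule in COSMETIC_RULES:
        selector = rule.split("##", 1)[1]  # keep only the CSS selector
        for element in tree.cssselect(selector):
            element.drop_tree()
    return lxml.html.tostring(tree, encoding="unicode")
```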

I definitely agree that junk content should be a negative ranking signal for websites. The question is where to draw the line (or which weights to give to each category). I'm pretty sure we want to outright drop websites containing malware, but what about the rest?
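
To make the weights question concrete, something like this (category names and numbers invented purely for illustration):

```python
# Invented per-category policy: None means "drop the page from the index entirely",
# a float is a multiplicative penalty on the page's ranking score.
CATEGORY_WEIGHTS = {
    "malware": None,
    "ads": 0.9,
    "trackers": 0.95,
}

def adjust_score(score, detected_categories):
    """Apply the configured penalty for each category detected on a page."""
    for category in detected_categories:
        if category not in CATEGORY_WEIGHTS:
            continue
        weight = CATEGORY_WEIGHTS[category]
        if weight is None:
            return None  # drop the document
        score *= weight
    return score
```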

Are there lists that differentiate between common trackers like Google Analytics and "less acceptable" ones?

There is also a greater discussion to have on the number of options we want to provide users with in a future "advanced search" feature. There is a balance to find between the additional stress these searches could cause on the infrastructure (because they wouldn't be part of the "mainstream" caches) and the number of users/power users they could interest.

indolering commented 7 years ago

Copying information over from (dupe) #59: FilterLists is working on a 2.0 version and I've requested that they include a machine readable format we could parse.

indolering commented 7 years ago

> I'm pretty sure we want to outright drop websites containing malware, but what about the rest?

I think we should just warn users; these issues are typically transitory. All things being equal, a result that doesn't have tracking should be promoted above one that does.
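
Minimal illustration of the "all things being equal" part, assuming results already carry a relevance score and a hypothetical `has_trackers` flag:

```python
# Sort by relevance first; among equally relevant results, tracker-free pages win.
results = [
    {"url": "http://a.example/", "score": 1.0, "has_trackers": True},
    {"url": "http://b.example/", "score": 1.0, "has_trackers": False},
]
ranked = sorted(results, key=lambda r: (-r["score"], r["has_trackers"]))
```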

collinbarrett commented 7 years ago

hey, maintainer of FilterLists here. just discovered commonsearch via @indolering. looks like a great project! no promises on timely completion of a machine-readable format (non-monetized side project), but it is on my radar to work on. will check back here with updates.

collinbarrett commented 7 years ago

I just launched v2 of FilterLists, and the data is now in JSON format on GitHub over here. Feel free to use it.
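
If you want to consume it, something roughly like this should work; double-check the actual repository path and field names, the ones below are only guesses:

```python
import json
import urllib.request

# Guessed location and schema of the FilterLists JSON data -- check the
# actual repository for the real file path and field names before relying on it.
FILTERLISTS_JSON_URL = "https://raw.githubusercontent.com/collinbarrett/FilterLists/master/data/lists.json"

with urllib.request.urlopen(FILTERLISTS_JSON_URL) as response:
    lists = json.load(response)

# Collect the download URLs of the lists so they can be fetched and parsed later.
list_urls = [entry["url"] for entry in lists if "url" in entry]
```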