Ultimate-Hosts-Blacklist / dev-center

The place to talk about our infrastructure or everything related to the Ultimate Hosts Blacklist project.
MIT License

Tracking issues in justdomains's list #14

Closed · dnmTX closed this issue 5 years ago

dnmTX commented 5 years ago

@funilrys :point_right: LINK :point_left:

1. The last clean.list update shows that it was five days ago (the last filtering finished a day ago).
2. Currently /ACTIVE/hosts contains more domains compared to clean.list, but I don't want to speculate whether they are duplicates or just a consequence of clean.list not being updated.
dnmTX commented 5 years ago

On another note: @funilrys, as much as I appreciate the idea and initiative behind whitelisted.list, I checked to see what was removed, for example from @lightswitch05's list, and this is the result:

cdn.mplxtms.com
cdn.optimizely.com
ebayinc.demdex.net
edge.quantserve.com
google-analytics.com
googlesyndication.com
px.spiceworks.com
secure-cdn.mplxtms.com
ssl.google-analytics.com
syndication.twitter.com
www.btstatic.com

I'm not familiar with all of them (maybe @lightswitch05 wants to pitch in just to give more info, as the whitelisted.list is purely optional), but I'm for sure familiar with some, and I can tell you that in this state of the whitelist, I am for sure not going to use it. I even caught AVAST sending (or at least trying to send) data to ssl.google-analytics.com on a daily basis, on top of it being embedded in every possible website out there.

lightswitch05 commented 5 years ago

I only recently started putting the source of my blocks in the commit messages, so I do not have the source cause for the majority of these. I'm going to comment on these hosts since you asked me to, but people can whitelist whatever they want - it doesn't make any difference to me. There is a domain in my own list that I have to visit regularly for work. I simply whitelist it on my work computer - no big deal. I firmly believe everyone has the right to accept or reject whatever tracking services they choose. The 'opt-outs' that the services provide are a joke and lead to even more tracking, which is why I created and maintain my list. If there is something in my list that you find necessary or acceptable, then please whitelist it; that is your own prerogative. I do not care to be on any more tickets about whitelisting, or to be referenced as @lightswitch05 anywhere other than in my own repo about actual false positives or major broken functionality. Anyway, here are the reasons I can find for why I added them to my list; whitelist as you see fit.

secure-cdn.mplxtms.com

cdn.mplxtms.com

Unfortunately I don't have any information on why I added these mplxtms hosts. But I believe they might be owned by Mixpanel, which is:

User behavior analytics for product, marketing, and data teams.

cdn.optimizely.com

From their privacy policy:

[...] some of our advertising partners, and other third-party services and tools on our Site and/or the Optimizely Service, may use standard technologies, such as cookies, pixel tags, and web beacons, to collect information about your internet activities across websites.

ebayinc.demdex.net

DemDex provides "Audience Management solutions for powering dynamic, multi-channel data strategies online" - which is just fancy words for an online tracker. It's owned by Adobe and is geared towards tracking people across websites and devices to build a comprehensive profile on them.

edge.quantserve.com

From Wikipedia:

Quantcast is an American technology company, founded in 2006, that specializes in AI-driven real-time advertising, audience insights & measurement.

ssl.google-analytics.com

google-analytics.com

googlesyndication.com

These speak for themselves.

px.spiceworks.com

I added this one just the other day:

Added px.spiceworks.com from https://www.ubnt.com/products - full url: https://px.spiceworks.com/px.js

syndication.twitter.com

This is related to twitter ads. I've seen people say it breaks things, but I regularly visit twitter and do not have issues myself. Also, I do not use my twitter account, I just look at a handful of public accounts without ever logging in - so I'm not the average user.

www.btstatic.com

btstatic.com was owned by BrightTag - which has been bought by Signal, an ad and tracking company:

Our mission is to connect the world’s brands to their customers at scale. In a world of fragmented devices and attention, Signal provides an open identity foundation for brands, data owners, and their marketing partners to immediately address customers in real time across any device and any marketing channel.

dnmTX commented 5 years ago

@lightswitch05 thank you for all the info provided. I hope it can help (or not) convince @funilrys that the whitelist will need serious evaluation. Now.... about your comment before the domain info: no one here is disputing how you manage your repo. I trust your judgement completely; I know first-hand that you do extensive testing before anything is added and, most importantly, that you stand by your principles. That said, let's get to the point. This is a filtering platform and nothing else; no corporate bastards or crybabies are allowed here. I'm just trying to give a hand to @funilrys to improve things here, as it is still very much buggy, and I would encourage everyone who has his repo in rotation to do the same (which they don't), especially the members (you're one of them, man). @funilrys is on his own here and could really use our help. Once the platform becomes stable it will benefit everyone, so..... LET'S CONTRIBUTE EVERYONE !!!! :hugs:

smed79 commented 5 years ago

I only recently started putting the source of my blocks in the commit messages

I suggest you follow this particular commit-prefix syntax:

- A: (as Addition): means that a host blocking an advertisement or tracker has been added
- P: (as Problem): means that a problem has been fixed / in case of a false positive
- M: (as Moved/Modified): means that a host has been moved or modified (URL not required here)

That is: the commit initial A/P/M, followed by a parenthetical comment detailing the reason for the change (not required but recommended), then the URL of the page that requires the addition of a filter.

example ==> https://github.com/lightswitch05/hosts/pull/61
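
To make the suggested format concrete, here is a small, purely illustrative Python check of that prefix convention. The example commit messages and URLs below are made up, and the regex is only an approximation of the rule described above, not anything the project actually enforces.

```python
# Illustrative only: rough check of the A/P/M commit-prefix convention.
import re

COMMIT_PATTERN = re.compile(
    r"^(?P<initial>[APM]): "        # A = Addition, P = Problem, M = Moved/Modified
    r"(?:\((?P<reason>[^)]*)\) )?"  # optional parenthetical reason
    r"(?P<rest>.+)$"                # usually the URL that motivated the change
)

# Hypothetical example messages following the suggested syntax.
examples = [
    "A: (tracking pixel) https://example.com/page-with-the-tracker",
    "P: (false positive) https://example.com/page-that-broke",
    "M: moved example.com to another section",
]

for message in examples:
    match = COMMIT_PATTERN.match(message)
    print(message, "->", "ok" if match else "does not match")
```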

funilrys commented 5 years ago

Thanks for your input @smed79, @dnmTX, @lightswitch05. Let's review the whitelist :) I will update the entries mentioned here when I have a bit of spare time today or tomorrow :+1:

Keep up the good work :tada:

funilrys commented 5 years ago

Closing. I found out that the issue is related to the cron job from Travis CI. I will write up the workaround or the fix in the coming days.

I may keep @smed79's notation in the future.

Also @dnmTX, from now on clean.list and whitelisted.list generate the www.xyz.zyx version of xyz.zyx and vice versa.

dnmTX commented 5 years ago

Also @dnmTX, from now on clean.list and whitelisted.list generate the www.xyz.zyx version of xyz.zyx and vice versa.

That sounds good, but are those domains checked to see whether they respond when www. is inserted, or did you just duplicate the existing entries, plus/minus the www. part? I'm asking because not every domain needs it; probably 50% will not respond to any lookups, i.e. more invalid domains in the bunch.

P.S. Is that related in any way to my COMMENT?

funilrys commented 5 years ago

@dnmTX Not related to that comment 😅 It was planned in our internal roadmap since we started designing this project.

I will add the extra layer check in my next session.

dnmTX commented 5 years ago

@funilrys why don't you make a test.list for now, to see how it's working and which domains remain after the extra filtering, so we can track whether any errors occur? If you merge them directly into clean.list etc., it will be impossible to monitor, and as you can see there are always bugs present that need fixing.

P.S. So far, from my observations, subdomains don't really need the extra www. as they never change. Domains that are associated with straight websites, like google.com -> www.google.com or yahoo.com -> www.yahoo.com, are the ones that are missed in most of the lists provided by the curators. Maybe you can filter it this way by avoiding the subdomains.

dnmTX commented 5 years ago

Closing. I found out that the issue is related to the cron job from Travis CI. I will write up the workaround or the fix in the coming days.

⏰ REMINDER ⏰

dnmTX commented 5 years ago

@funilrys I don't mean to rush you on the ☝️, but none of the repos that I'm monitoring are updating their lists, especially the ones on the front page, i.e. clean.list, volatile.list, etc. As I'm not sure whether you are aware of it or already working on it, I'll place another ⏰ REMINDER ⏰ just in case.

funilrys commented 5 years ago

I'm aware, I just have some exams :smile_cat: so I will have some free time next week :wink:

funilrys commented 5 years ago

@dnmTX. So the workaround was implemented last week and normally everything (except https://github.com/Ultimate-Hosts-Blacklist/someonewhocares.org) is running again.

From now on, and because it takes extra time to generate the www. and vice-versa entries correctly, I generate them systematically for top-level domains only (not subdomains).

I'll do my best to restart everything tonight so I can take my Sunday, or part of next week, to monitor it all :+1:.

Thanks again for helping monitor the system.

Cheers, Nissar

funilrys commented 5 years ago

@dnmTX By the way, for info, all commits which start with [MON] come from the monitoring system I'm building :wink:

dnmTX commented 5 years ago

@dnmTX. So the workaround was implemented last week and normally everything (except https://github.com/Ultimate-Hosts-Blacklist/someonewhocares.org) is running again.

Great, thanks! It already looks better. I even updated all my lists in case of another.... outage ⚡️ 🔌 😄

From now on, and because it takes extra time to generate the www. and vice-versa entries correctly, I generate them systematically for top-level domains only (not subdomains).

Yeah, subdomains really aren't needed. I wish you had done it so I could monitor it too; that way we could've weeded out even more unneeded ones. There are subdomains that look like domains, but I guess it's no biggie. At least keep monitoring it, because it's all on you there.

Thanks again for helping monitor the system.

My PLEASURE 👍

@dnmTX By the way, for info, all commits which start with [MON] come from the monitoring system I'm building 😉

Great idea. I'd say, considering how the whole thing is getting more and more complex, such a monitoring system is long overdue, but better late than never.

dnmTX commented 5 years ago

❗️ ❗️ ❗️ @funilrys please check quidsup_notrack_trackers. It's one of the lists I'm monitoring, and what I've noticed is that in clean.list, volatile.list, etc. the domains are doubled in number. I extracted the ones that were added (with the www. method) and did random lookups, and many were invalid. What I'm thinking is that in that repo the www. was just added to each domain without going through any filtering.

dnmTX commented 5 years ago

So the workaround was implemented last week and normally everything (except https://github.com/Ultimate-Hosts-Blacklist/someonewhocares.org) is running again.

Well... it looks like the workaround doesn't get the job done, as none of the repos that I'm monitoring are updating any of the lists on the front page. Some of them have already done 2 to 3 cycles.

funilrys commented 5 years ago

@dnmTX. An update from my side on the last few days. I did some redesign of the way we work and generate all those www. and vice-versa entries, in order to allow the central repository to breathe and all other processes to be a bit more efficient and less approximate.

From now on, we construct all www. and vice-versa entries while generating domains.list and leave PyFunceble to test them all. It is a better implementation than the previous one, which generated the www. and vice-versa entries systematically whenever the domain was not a subdomain.

Actually, we now do less approximation, as PyFunceble takes the responsibility of checking (in the right way) all the www. and vice-versa entries.

Which means:

- domains.list may be slightly different than the upstream, but it ensures that we get covered with our mission to have www. and vice-versa blocked.
- volatile.list remains a copy of clean.list + all domains in output/domains/INACTIVE/list which match the SPECIAL rules of PyFunceble (see the sketch below).

Regarding the repositories which are stuck in cycles, I'll force restart everything around midnight Berlin time. That will allow me to follow everything which doesn't go well when you report in the coming days.
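
As a purely illustrative aside (not the project's actual tooling), the volatile.list composition described in the list above could be sketched roughly as follows. The SPECIAL rules live inside PyFunceble, so they are represented here only as a caller-supplied placeholder predicate.

```python
# Minimal sketch: volatile.list = clean.list + the INACTIVE domains that the
# SPECIAL rules re-qualify. The predicate is a stand-in, not the real rules.
from typing import Callable, Set


def read_domains(path: str) -> Set[str]:
    """Read one domain per line, ignoring blanks and '#' comments."""
    with open(path, encoding="utf-8") as handle:
        return {
            line.strip()
            for line in handle
            if line.strip() and not line.lstrip().startswith("#")
        }


def build_volatile(
    clean_path: str,
    inactive_path: str,
    is_special: Callable[[str], bool],
) -> Set[str]:
    """Return clean.list plus every INACTIVE domain matching the SPECIAL rules."""
    volatile = read_domains(clean_path)
    volatile |= {d for d in read_domains(inactive_path) if is_special(d)}
    return volatile


# Hypothetical usage; the lambda below is a dummy stand-in for the real rules.
# volatile = build_volatile(
#     "clean.list",
#     "output/domains/INACTIVE/list",
#     is_special=lambda domain: False,
# )
```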

I'm really glad to work with such engaged people as you @dnmTX and @mitchellkrogza (he will be back one day). I hope that we will make this whole infrastructure work better and without any issues in the coming future.

Cheers, Nissar

P.S. (for those who join us): www. and vice-versa means that if, for example, www.github.com is listed, we generate github.com. If github.com is listed, we generate www.github.com.
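
For readers following along, here is a minimal sketch of the www. and vice-versa rule described above, assuming a naive subdomain check; the real pipeline would rely on proper suffix handling (e.g. the Public Suffix List) and on PyFunceble for the actual testing.

```python
# Minimal illustration (not the project's actual code) of the www./vice-versa
# rule: for a bare registered domain we also emit the www. variant, and for a
# www. domain we also emit the bare variant. Subdomains are skipped.

def is_subdomain(domain: str) -> bool:
    # Naive assumption: more than two labels (ignoring a leading "www.") means
    # subdomain. Does not handle multi-part suffixes such as ".co.uk".
    labels = domain.split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    return len(labels) > 2


def with_www_variants(domains):
    result = set(domains)
    for domain in domains:
        if is_subdomain(domain):
            continue
        if domain.startswith("www."):
            result.add(domain[len("www."):])
        else:
            result.add("www." + domain)
    return sorted(result)


print(with_www_variants(["github.com", "www.example.org", "tracker.ads.example.net"]))
# ['example.org', 'github.com', 'tracker.ads.example.net',
#  'www.example.org', 'www.github.com']
```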

dnmTX commented 5 years ago

Ok... let me give you my thoughts:

From now on, we construct all www. and vice-versa entries while generating domains.list and leave PyFunceble to test them all. It is a better implementation than the previous one, which generated the www. and vice-versa entries systematically whenever the domain was not a subdomain.

Just confirm that even with the new implementation subdomains will be skipped. That's for my peace of mind.

domains.list may be slightly different than the upstream, but it ensures that we get covered with our mission to have www. and vice-versa blocked.

Well... the downside for me is that I was using domains.list as a source when I test/compare and look for any problems. I guess from now on I'll just use the original lists/sources.

volatile.list remains a copy of clean.list + all domains in output/domains/INACTIVE/list which match the SPECIAL rules of PyFunceble.

🎵 Music to my ears 🎵

Regarding the repositories which are stuck in cycles, I'll force restart everything around midnight Berlin time. That will allow me to follow everything which doesn't go well when you report in the coming days.

I hope you'll understand. I'm monitoring four repos (closely), all the time. Not all of them, but so far what I've noticed is that if something goes wrong on any of them it's widespread, so.... you'll be hearing from me for sure.

I'm really glad to work with such engaged people as you @dnmTX and @mitchellkrogza (he will be back one day). I hope that we will make this whole infrastructure work better and without any issues in the coming future.

I APPRECIATE that you APPRECIATE 😀

funilrys commented 5 years ago

Just confirm that even with the new implementation subdomains will be skipped. That's for my peace of mind.

Yes subdomains are skipped :smile:

I hope you'll understand. I'm monitoring four repos (closely), all the time. Not all of them, but so far what I've noticed is that if something goes wrong on any of them it's widespread, so.... you'll be hearing from me for sure.

Every repo looks the same, so if one is not correct, everything else may not be correct either. Only the input sources are different; otherwise everything is the same.