duckduckgo / tracker-radar

Data set of top third party web domains with rich metadata about them

New update for https://github.com/duckduckgo/tracker-radar/tree/main/entities #130

Open pooneh-nb opened 2 years ago

pooneh-nb commented 2 years ago

I was working on a project to identify tracking/advertising domains on the Alexa Echo device. I used https://github.com/duckduckgo/tracker-radar/tree/main/entities to find the parent companies behind each domain name. Thanks for sharing such a great dataset! I found that several domain names were not available in your dataset, so I manually looked them up using ICANN, crunchbase.com, or their websites. Since some are tracking/advertising websites, I think it would be good to update your database. Here is the update:

{
'acsechocaptiveportal.com': 'Amazon Technologies, Inc.',
'amazon-dss.com': 'Amazon Technologies, Inc.',
'amazonalexa.com': 'Amazon Technologies, Inc.',
'amcs-tachyon.com': 'Amazon Technologies, Inc.',
'fireoscaptiveportal.com': 'Amazon Technologies, Inc.',
'chtbl.com': 'Chartable Holding Inc',
'chrt.fm': 'Chartable Holding Inc',
'dillilabs.com': 'Dilli Labs LLC',
'megaphone.fm': 'Spotify AB',
'omny.fm': 'Triton Digital, Inc.',
'podtrac.com': 'Podtrac Inc',
'voiceapps.com': 'Voice Apps LLC',
'mittendorf.net': 'individual',
'doctorpooch.com': 'Dilli Labs LLC',
'kwimer.com': 'Highwinds Network Group, Inc'
}
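
For anyone repeating this kind of lookup, here is a minimal Python sketch of how a domain-to-owner mapping like the one above could be checked against a local clone of the entities directory. It assumes each entity file is a JSON object with a "name" string and a "properties" list of domains; that schema is an assumption here and should be confirmed against the repository. The function names `build_domain_index` and `missing_domains` are hypothetical helpers, not part of Tracker Radar.

```python
import json
from pathlib import Path


def build_domain_index(entities_dir):
    """Map each domain to the name of the entity that lists it.

    Assumption: every *.json file in entities_dir is an object with a
    "name" string and a "properties" list of domain names.
    """
    index = {}
    for path in sorted(Path(entities_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            entity = json.load(fh)
        # Fall back to the file name if the entity has no "name" field.
        owner = entity.get("name", path.stem)
        for domain in entity.get("properties", []):
            index[domain.lower()] = owner
    return index


def missing_domains(index, proposed):
    """Return the proposed domain -> owner pairs that the index lacks."""
    return {
        domain: owner
        for domain, owner in proposed.items()
        if domain.lower() not in index
    }
```

Calling `missing_domains(build_domain_index("tracker-radar/entities"), proposed)` with the mapping above as `proposed` would list only the entries that are still absent from the local clone; the directory path is a placeholder for wherever the repository is checked out.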

I'm gonna cite this dataset in our paper. Can I ask where is the source of this dataset?

kdzwinel commented 2 years ago

Hey Pouneh, thanks a lot for sharing your findings, we really appreciate it!

> I'm gonna cite this dataset in our paper. Can I ask where is the source of this dataset?

Not sure if I understand your question, but this repo is the source. You can reference it like this:

"DuckDuckGo Tracker Radar", [online] Available: https://github.com/duckduckgo/tracker-radar, Retrieved: March 2022.

pooneh-nb commented 2 years ago

Hey Konrad, thanks for your reply. My question was: what is the source of the data in this dataset? For example, did you query crunchbase.com or WHOIS to find the company behind each domain name?

kdzwinel commented 2 years ago

Ah, sorry for the misunderstanding. We use public WHOIS data and SSL cert data, and we do manual investigation (e.g. by reviewing privacy policies). We also do semi-automatic cleanup. A small portion of the data comes from outside contributors. LMK if that helps!
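
As an illustration of the SSL cert signal mentioned above, here is a minimal Python sketch that reads the organization name from a server certificate using only the standard library. This is not DuckDuckGo's actual pipeline; `fetch_certificate` and `certificate_organization` are hypothetical helper names. Note that domain-validated certificates carry no organization field, so this signal is often absent and would need to be combined with WHOIS data or manual review.

```python
import socket
import ssl


def fetch_certificate(hostname, port=443, timeout=5.0):
    """Fetch the validated peer certificate of hostname as a dict."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()


def certificate_organization(cert):
    """Return the subject organizationName from a certificate dict, or None.

    cert is expected in the shape returned by SSLSocket.getpeercert():
    "subject" is a tuple of RDNs, each a tuple of (key, value) pairs.
    """
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "organizationName":
                return value
    return None
```

For a given domain, `certificate_organization(fetch_certificate(domain))` returns the organization string when the certificate includes one and `None` otherwise.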

pooneh-nb commented 2 years ago

I see, that makes sense. Thank you!