StevenBlack / hosts

🔒 Consolidating and extending hosts files from several well-curated sources. Optionally pick extensions for porn, social media, and other categories.
MIT License
26.6k stars 2.21k forks

Merge unique pornography domains that do not exist in the current list #1671

Closed ghost closed 2 years ago

ghost commented 3 years ago

This is a list of pornography domains and subdomains that I have maintained for at least the past year in my own private repo. I have just made it public to share it with you:

https://github.com/elyoas/hosts/blob/main/Custom_17_Pornography

Check the makefile to see how to build it and extract unique domains. All of the domains in it are unique: none of them currently exist in Steven's lists.

StevenBlack commented 3 years ago

Hi Lyos @elyoas, thank you for this.

We're mostly an aggregator of hosts files from active and reputable curators of hosts files, and we package these amalgamated hosts files in various ways.

I appreciate this offer, but this list doesn't have a sustained history and track record of active curation. This list has been online for one day.

So thank you for the offer, but I'm going to decline. Can we revisit this in six months or a year, once this list demonstrates its commitment over time?

Perhaps @clefspeare13 or @sinfonietta, two of our present porn curators, would carry this list instead?

Closing.

ghost commented 3 years ago

Of course, not a problem.

ghost commented 2 years ago

@StevenBlack Would you mind reviewing this issue again please?

I have improved the list a lot. It contains thousands of pornographic domains which do not exist in the current lists.

My project on GitHub is relatively new, but I have been building this list by hand in my private GitLab repo for more than a year now.

I hope I can contribute something to help protect others from pornography.

StevenBlack commented 2 years ago

@elyoas the link above ⬆️ (https://github.com/elyoas/hosts/blob/main/Custom_17_Pornography) does not resolve.

dnmTX commented 2 years ago

Steve @StevenBlack it's https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography. There are some questionable domains in it though (look at the end of the file).

ghost commented 2 years ago

Is there any hope this will be merged, please, @StevenBlack?

dnmTX commented 2 years ago

🙄 🙄 🙄 [attached images: Pornography, Pornography3, Pornography4, Pornography2]

StevenBlack commented 2 years ago

Notes to self

The actual repo is here: https://github.com/elyoas/hosts/ (nice to have a link to the repo proper).

Using ghosts:

$ ghosts -m https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
Domains: 12,245
Bytes: 302 kB

The file has 12,278 non-comment lines, so there are 12,278 - 12,245 = 33 questionable domains, at least according to ghosts.

Comparing against our pre-existing malware + porn file, I see an intersection of only 162 domains, which is implausibly low. 162 / 134,338 = 0.00121, about one tenth of one percent intersection.

Our base list is presently 93,909 domains, so our porn component is 134,338 - 93,909 = 40,429 domains. If we generously allocate all 162 intersecting domains to those 40,429, I get an intersection factor of 162 / 40,429 = 0.00400702, or 0.4%, which is implausible.
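The arithmetic in these notes can be re-derived with a few lines of Python. This is only a sketch that plugs in the figures quoted in this comment and in the ghosts output; the variable names are illustrative and nothing here queries the lists themselves.

```python
# Figures quoted in this comment and in the ghosts summaries.
non_comment_lines = 12_278   # non-comment lines in the proposed file
proposed_domains = 12_245    # domains ghosts counted in the proposed file
combined_domains = 134_338   # existing malware + porn hosts file
base_domains = 93_909        # existing base list without the porn extension
intersection = 162           # domains present in both files

# Lines that ghosts did not count as domains.
questionable = non_comment_lines - proposed_domains

# Size of the porn component alone.
porn_component = combined_domains - base_domains

# Intersection as a share of each reference list.
share_of_combined = intersection / combined_domains
share_of_porn = intersection / porn_component

print(f"questionable lines: {questionable}")
print(f"porn component:     {porn_component:,}")
print(f"share of combined:  {share_of_combined:.2%}")
print(f"share of porn only: {share_of_porn:.2%}")
```

Both ratios are taken relative to the existing lists, as in the notes above, not relative to the 12,245 proposed domains.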

$ ghosts -m p -c https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/StevenBlack/hosts/master/alternates/porn/hosts
Domains: 134,338
Bytes: 4.0 MB
----------------------------------------
Compared hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
Domains: 12,245
Bytes: 302 kB
Intersection: 162 domains

StevenBlack commented 2 years ago

The TLD breakdown:

$ ghosts --tld -m https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
Domains: 12,245
Bytes: 302 kB
TLD tally:  (150 unique TLD)
   com: 7,352
   net: 919
   ru: 536
   org: 401
   me: 246
   xyz: 242
   tv: 215
   top: 209
   cc: 193
   info: 178
   pro: 166
   eu: 108
   xxx: 74
   club: 68
   icu: 67
   nl: 67
   biz: 55
   site: 51
   to: 48
   online: 48
   co: 46
   uk: 44
   mobi: 39
   cn: 34
   video: 34
   pw: 33
   su: 31
   vip: 30
   asia: 28
   ws: 27
   jp: 23
   live: 22
   cz: 21
   monster: 20
   best: 20
   us: 19
   space: 19
   name: 19
   be: 18
   in: 18
   buzz: 15
   cyou: 15
   world: 14
   one: 12
   fun: 12
   io: 11
   lv: 11
   fr: 10
   tube: 10
   ch: 9
   life: 8
   surf: 8
   win: 8
   za: 8
   de: 8
   ml: 8
   la: 8
   link: 8
   moe: 7
   se: 7
   ly: 7
   porn: 7
   pl: 6
   sk: 6
   host: 6
   sex: 5
   casa: 5
   im: 5
   wtf: 5
   ca: 5
   by: 5
   au: 5
   cf: 4
   bar: 4
   cloud: 4
   news: 4
   city: 4
   run: 4
   guru: 4
   ink: 4
   it: 4
   click: 4
   mx: 4
   email: 4
   ie: 4
   center: 3
   plus: 3
   nu: 3
   group: 3
   lol: 3
   hk: 3
   so: 3
   gq: 3
   is: 3
   red: 3
   fan: 3
   tk: 3
   li: 3
   cool: 3
   br: 3
   es: 3
   lt: 2
   cam: 2
   media: 2
   website: 2
   download: 2
   works: 2
   today: 2
   kr: 2
   social: 2
   kim: 2
   blue: 2
   ga: 2
   desi: 2
   games: 2
   work: 2
   gold: 2
   party: 2
   my: 2
   ooo: 2
   directory: 2
   gif: 2
   il: 2
   stream: 2
   pictures: 2
   ph: 2
   chat: 2
   watch: 2
   wales: 2
   zone: 2
   fm: 2
   ltd: 2
   network: 2
   bot: 2
   si: 2
   tel: 2
   gl: 2
   bz: 2
   hu: 2
   id: 2
   men: 2
   love: 2
   pub: 2
   at: 2
   ai: 2
   global: 1
   exchange: 1
   wiki: 1
   ebunga: 1
   ms: 1
----------------------------------------
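For readers without ghosts installed, a tally like the one above can be approximated in Python. This is a rough sketch, not how ghosts itself works: it assumes hosts-style lines (`0.0.0.0 example.com`) or bare domain lines, treats the last dot-separated label as the TLD, and uses made-up sample entries. The function names are hypothetical.

```python
from collections import Counter


def parse_domains(lines):
    """Yield the domain from each hosts-style line, skipping comments and blanks."""
    for line in lines:
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # Accept both "0.0.0.0 example.com" and bare "example.com" lines.
        domain = fields[1] if len(fields) > 1 else fields[0]
        yield domain.lower().rstrip(".")


def tld_tally(domains):
    """Count unique domains per TLD (the last dot-separated label)."""
    return Counter(d.rsplit(".", 1)[-1] for d in set(domains))


# Hypothetical sample input; the real list is not reproduced here.
sample = [
    "# sample hosts file",
    "0.0.0.0 example.com",
    "0.0.0.0 www.example.com",
    "0.0.0.0 example.net",
    "example.org  # bare domain line",
    "",
    "0.0.0.0 example.com",
]

tally = tld_tally(parse_domains(sample))
for tld, count in tally.most_common():
    print(f"{tld}: {count}")
```

Taking the last label means names under `co.uk` are counted under `uk`, which appears to match the single-label entries in the tally above.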

StevenBlack commented 2 years ago

Dan @dnmTX raises an interesting point about Type I errors in this list.

Looking at recent commits, just 10 days ago Mozilla was listed in the proposed list.

I agree that mistakes happen. But how does Mozilla get on a blocklist in the first place?

¯\_(ツ)_/¯

ghost commented 2 years ago

1: The list contains very few matches with the existing StevenBlack pornography list because I have been removing the websites common to both using this command:

grep -Fvxf removeThisList FromThisList > result

Source: https://stackoverflow.com/questions/4366533/how-to-remove-the-lines-which-appear-on-file-b-from-another-file-a

2: The file Custom_17_Pornography contains file sharing websites which host pornography, including the ones noted by @dnmTX in https://github.com/StevenBlack/hosts/issues/1671#issuecomment-969379982. I will separate those specific ones in the next commit, and I will separate the rest soon.

3: How Mozilla got into the list is also a mystery to me. I will track down the commit which brought it in. I will post updates here soon.
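The `grep -Fvxf` command in point 1 keeps only the lines of one file that do not exactly match any line of another (`-F` fixed strings, `-x` whole-line match, `-v` invert, `-f` read patterns from a file). A minimal Python sketch of the same filter, using hypothetical in-memory lists in place of the two files:

```python
def remove_listed_lines(candidates, already_listed):
    """Return the candidate lines that do not appear verbatim in already_listed.

    Mirrors `grep -Fvxf already_listed candidates`: exact whole-line
    comparison, original order of the candidates preserved.
    """
    listed = set(already_listed)
    return [line for line in candidates if line not in listed]


# Hypothetical stand-ins for the files removeThisList and FromThisList.
remove_this_list = ["example.com", "example.net"]
from_this_list = ["example.com", "sub.example.com", "example.org", "example.net"]

result = remove_listed_lines(from_this_list, remove_this_list)
print(result)
```

Because the comparison is on whole lines, `sub.example.com` survives even though `example.com` is removed. A list filtered this way is expected to have a small intersection with the list it was filtered against.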

StevenBlack commented 2 years ago

Thank you for those clarifications Lyos @elyoas.

ghost commented 2 years ago

How the introduction of the Mozilla domains (6 websites) happened:

mozilla.org addons.mozilla.org
www.mozilla.org www.addons.mozilla.org
cdn.mozilla.org cdn.addons.mozilla.org

Explanation:

1: It happened in the following commit: https://github.com/elyoas/hosts/commit/337ca8ceaf2826516d79ca9652f89b3f593b5069

2: I found a website with a list of pornography categories, each one being a different website basically leading to content from the same company.

I wrote a script (actually a simple search and replace) to extract all of the websites from the page. All of them were formatted as: websitename.abc

I duplicated the previous list twice, once adding www. subdomains, and once adding cdn.; resulting in two new lists of the formats: www.websitename.abc and cdn.websitename.abc.

This explains the large number of domains added in a single commit.

3: I did this because the HTML code in most of them was the same, and all of them retrieved the video content from a cdn subdomain. You can verify this yourself by visiting a sample of the added websites. They all had the same parent.

4: In the process of automatic extraction, the two domains mozilla.org and addons.mozilla.org must have been collected by the script which I wrote. Both domains received www. and cdn. during step 2 above.

5: The original website from which I got this list must be in the same commit (if not in the previous couple of commits on the same date).
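The steps above can be sketched in Python to show how two stray domains turn into six list entries. The expansion function follows the description in step 2; the allowlist check is a hypothetical safeguard added for illustration and is not something described in this thread. The scraped input is a made-up sample.

```python
PREFIXES = ("www.", "cdn.")


def expand_with_prefixes(domains, prefixes=PREFIXES):
    """Return each domain followed by its prefixed variants (step 2 above)."""
    expanded = []
    for domain in domains:
        expanded.append(domain)
        expanded.extend(prefix + domain for prefix in prefixes)
    return expanded


def split_by_allowlist(domains, allowlist):
    """Split domains into (kept, flagged).

    A domain is flagged if it equals, or is a subdomain of, an allowlisted
    name. Hypothetical guard, not part of the original workflow.
    """
    def is_allowed(domain):
        return any(domain == a or domain.endswith("." + a) for a in allowlist)

    kept = [d for d in domains if not is_allowed(d)]
    flagged = [d for d in domains if is_allowed(d)]
    return kept, flagged


# Made-up scrape result: one intended site plus the two stray Mozilla links.
scraped = ["example.com", "mozilla.org", "addons.mozilla.org"]

expanded = expand_with_prefixes(scraped)
kept, flagged = split_by_allowlist(expanded, allowlist=["mozilla.org"])

print(len(expanded), "entries after expansion")
print("flagged for review:", flagged)
```

Running a check of this kind before committing would surface all six Mozilla entries at once, since the expansion multiplies every scraped name by three.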

ghost commented 2 years ago

Note: I have already removed a few image hosting websites from the list: https://github.com/elyoas/hosts/blob/main/Custom_18_Pornography

And I put them in this list: https://github.com/elyoas/hosts/blob/main/Custom_19_ImageHosting

The list Custom_18_Pornography probably still contains less popular image hosting services which host pornography. For now, I have done everything I could.

ghost commented 2 years ago

Update:

I have moved the file to another project. Please find the file at this link:

https://github.com/elyoas/ultimate-firewall/blob/main/hosts/03_manuallycollected/04_Pornography

I will stop pressing you to merge it because I know it contains image sharing websites and forums which are used for pornography.

If you are happy to receive a few complaints, then you can merge it, otherwise you can pick and choose.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.