Closed ghost closed 2 years ago
Hi Lyos @elyoas thank you for this.
We're mostly an aggregator of hosts files from active and reputable curators of hosts files, and we package these amalgamated hosts files in various ways.
I appreciate this offer but this list doesn't have a sustained history and track-record of active curation. This list has been online for one day.
So thank you for the offer but I'm going to decline. Can we revisit this in six months or a year, once this list demonstrates its commitment over time?
Perhaps @clefspeare13 or @sinfonietta, two of our present porn curators, would carry this list instead?
Closing.
Of course not a problem.
@StevenBlack Would you mind reviewing this issue again please?
I have improved the list a lot. It contains thousands of pornographic domains which do not exist in the current lists.
My project on GitHub is relatively new, but in my private repo on GitLab I have been building this list by hand for more than a year now.
I hope I can contribute something to help protect others from pornography.
@elyoas the link above ⬆️ (https://github.com/elyoas/hosts/blob/main/Custom_17_Pornography) does not resolve.
Steve @StevenBlack it's https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography There are some domains in question though (look at the end of the file).
Any hope this will be merged please @StevenBlack ?
🙄 🙄 🙄
The actual repo is here: https://github.com/elyoas/hosts/ (nice to have a link to the repo proper).
Using ghosts:
$ ghosts -m https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
Domains: 12,245
Bytes: 302 kB
The file has 12,278 non-comment lines, so there are 12278 - 12245 = 33 questionable domains, at least according to ghosts.
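For anyone wanting to reproduce a gap like this without ghosts, here is a minimal sketch that counts non-comment lines and unique, syntactically plausible domains separately. It is not how ghosts decides what is questionable: the `DOMAIN_RE` pattern and the `summarize` helper are assumptions made purely for illustration.

```python
import re

# Loose hostname check (an assumption, not ghosts' actual rule):
# dot-separated labels of letters, digits and hyphens, ending in an alphabetic TLD.
DOMAIN_RE = re.compile(r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$")

def summarize(text):
    """Return (non_comment_lines, unique_plausible_domains) for hosts-file text."""
    non_comment = 0
    domains = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments, including trailing ones
        if not line:
            continue
        non_comment += 1
        # Accept both "0.0.0.0 example.com" and bare "example.com" lines
        # by looking only at the last whitespace-separated field.
        candidate = line.split()[-1].lower()
        if DOMAIN_RE.match(candidate):
            domains.add(candidate)
    return non_comment, len(domains)

sample = """# comment
0.0.0.0 example.com
0.0.0.0 example.com
0.0.0.0 bad_domain
0.0.0.0 www.example.org
"""
lines, unique = summarize(sample)
print(lines, unique, lines - unique)  # 4 2 2
```

In this toy sample the gap of 2 comes from one duplicate and one malformed entry; the 33 lines in the real file may have other causes.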
Comparing to our pre-existing malware + porn file, I'm seeing an intersection of only 162 domains, which is implausibly low: 162 / 134338 = 0.00121, about one tenth of one percent.
Our base list is presently 93,909 domains, so our porn component is 134338 - 93909 = 40,429 domains. If we generously allocate all 162 intersecting domains to the 40,429, I get an intersection factor of 162 / 40429 = 0.00400702, or 0.4%, which is implausible.
$ ghosts -m p -c https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/StevenBlack/hosts/master/alternates/porn/hosts
Domains: 134,338
Bytes: 4.0 MB
----------------------------------------
Compared hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
Domains: 12,245
Bytes: 302 kB
Intersection: 162 domains
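The ratios quoted in this comment can be rechecked with a few lines of arithmetic. This is only a sketch using the counts reported in the thread; the variable names are made up for illustration.

```python
# Counts taken from the ghosts output and the comment above.
combined = 134_338   # domains in the existing malware + porn file
base = 93_909        # domains in the base (malware-only) list
shared = 162         # domains common to both files

porn_only = combined - base  # size of the porn component alone

print(porn_only)                    # 40429
print(f"{shared / combined:.3%}")   # 0.121%
print(f"{shared / porn_only:.3%}")  # 0.401%
```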
The TLD breakdown:
$ ghosts --tld -m https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/elyoas/hosts/main/Custom_18_Pornography
Domains: 12,245
Bytes: 302 kB
TLD tally: (150 unique TLD)
com: 7,352
net: 919
ru: 536
org: 401
me: 246
xyz: 242
tv: 215
top: 209
cc: 193
info: 178
pro: 166
eu: 108
xxx: 74
club: 68
icu: 67
nl: 67
biz: 55
site: 51
to: 48
online: 48
co: 46
uk: 44
mobi: 39
cn: 34
video: 34
pw: 33
su: 31
vip: 30
asia: 28
ws: 27
jp: 23
live: 22
cz: 21
monster: 20
best: 20
us: 19
space: 19
name: 19
be: 18
in: 18
buzz: 15
cyou: 15
world: 14
one: 12
fun: 12
io: 11
lv: 11
fr: 10
tube: 10
ch: 9
life: 8
surf: 8
win: 8
za: 8
de: 8
ml: 8
la: 8
link: 8
moe: 7
se: 7
ly: 7
porn: 7
pl: 6
sk: 6
host: 6
sex: 5
casa: 5
im: 5
wtf: 5
ca: 5
by: 5
au: 5
cf: 4
bar: 4
cloud: 4
news: 4
city: 4
run: 4
guru: 4
ink: 4
it: 4
click: 4
mx: 4
email: 4
ie: 4
center: 3
plus: 3
nu: 3
group: 3
lol: 3
hk: 3
so: 3
gq: 3
is: 3
red: 3
fan: 3
tk: 3
li: 3
cool: 3
br: 3
es: 3
lt: 2
cam: 2
media: 2
website: 2
download: 2
works: 2
today: 2
kr: 2
social: 2
kim: 2
blue: 2
ga: 2
desi: 2
games: 2
work: 2
gold: 2
party: 2
my: 2
ooo: 2
directory: 2
gif: 2
il: 2
stream: 2
pictures: 2
ph: 2
chat: 2
watch: 2
wales: 2
zone: 2
fm: 2
ltd: 2
network: 2
bot: 2
si: 2
tel: 2
gl: 2
bz: 2
hu: 2
id: 2
men: 2
love: 2
pub: 2
at: 2
ai: 2
global: 1
exchange: 1
wiki: 1
ebunga: 1
ms: 1
----------------------------------------
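A tally like the one above can be approximated in a few lines. The sketch below assumes the TLD is simply the last dot-separated label (the `uk: 44` row suggests that is what is being counted, but that is an inference, not something confirmed here); `tld_tally` is a hypothetical helper, not part of ghosts.

```python
from collections import Counter

def tld_tally(domains):
    """Count domains by their last dot-separated label."""
    return Counter(
        d.strip().lower().rsplit(".", 1)[-1]
        for d in domains
        if d.strip()  # skip blank entries
    )

sample = ["example.com", "www.example.com", "example.co.uk", "example.ru"]
tally = tld_tally(sample)
for tld, count in tally.most_common():
    print(f"{tld}: {count:,}")
```

For the sample this prints `com: 2`, then `uk: 1` and `ru: 1`.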
Dan @dnmTX raises an interesting point about Type I errors in this list.
Looking at recent commits, just 10 days ago Mozilla was listed in the proposed list.
I agree that mistakes happen. But how does Mozilla get on a blocklist in the first place?
¯\_(ツ)_/¯
1: The list contains very few matches with the existing StevenBlack pornography list because I had been removing websites common to both lists using this command:
grep -Fvxf removeThisList FromThisList > result
2: The file Custom_17_Pornography contains file sharing websites which host pornography, including the ones noted by @dnmTx https://github.com/StevenBlack/hosts/issues/1671#issuecomment-969379982. I will separate those specific ones in the next commit, and I will separate the rest soon.
3: How Mozilla got onto the list is also a mystery to me. I will track down the commit which brought it in. I will post updates here soon.
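For readers less familiar with the grep flags in point 1: `-F` treats patterns as fixed strings, `-x` requires a whole-line match, `-v` inverts the match, and `-f` reads the patterns from a file. A rough Python equivalent is sketched below; `remove_known` is a hypothetical name and the sample lines are invented.

```python
def remove_known(candidate_lines, known_lines):
    """Keep candidate lines that do not exactly equal any known line.

    Mirrors `grep -Fvxf known candidates`: order and duplicates in the
    candidates are preserved, and only whole-line matches are dropped.
    """
    known = {line.rstrip("\n") for line in known_lines}
    return [line for line in candidate_lines if line.rstrip("\n") not in known]

candidates = ["a.example", "b.example", "c.example"]
already_listed = ["b.example"]
print(remove_known(candidates, already_listed))  # ['a.example', 'c.example']
```

Because the comparison is on whole lines, an entry written as `0.0.0.0 b.example` in one file and `b.example` in the other would not be treated as a match.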
Thank you for those clarifications Lyos @elyoas.
How the introduction of the Mozilla domains (6 websites) happened:
mozilla.org
addons.mozilla.org
www.mozilla.org
www.addons.mozilla.org
cdn.mozilla.org
cdn.addons.mozilla.org
Explanation:
1: It happened in the following commit: https://github.com/elyoas/hosts/commit/337ca8ceaf2826516d79ca9652f89b3f593b5069
2: I found a website with a list of pornography categories, each one being a different website basically leading to content from the same company.
I wrote a script (actually a simple search and replace) to extract all of the websites from the page. All of them were formatted as: websitename.abc
I duplicated that list twice, once adding www. subdomains and once adding cdn. subdomains, resulting in two new lists of the formats www.websitename.abc and cdn.websitename.abc.
This explains the large number of domains added in a single commit.
3: The reason I did this was that the HTML code in most of them was the same: all of them were retrieving the video content from a cdn subdomain. You can verify this yourself by visiting a sample of the added websites. They all had the same parent.
4: In the process of automatic extraction, the two domains mozilla.org and addons.mozilla.org must have been collected by the script which I wrote. Both domains received the www. and cdn. prefixes during step 2 above.
5: The original website from which I got this list must be in the same commit (if not in the previous couple of commits on the same date).
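To make the failure mode concrete, here is a minimal sketch of the prefix expansion described in step 2, plus one possible guard. It is not the script that was actually used (that was a search and replace); `expand_with_prefixes`, `is_allowlisted` and the `ALLOWLIST` contents are hypothetical.

```python
def expand_with_prefixes(domains, prefixes=("www.", "cdn.")):
    """Return each domain followed by its prefixed variants."""
    out = []
    for d in domains:
        out.append(d)
        out.extend(p + d for p in prefixes)
    return out

# Hypothetical allowlist of domains that must never end up on a blocklist.
ALLOWLIST = {"mozilla.org"}

def is_allowlisted(domain, allowlist=ALLOWLIST):
    """True if the domain equals an allowlisted domain or is a subdomain of one."""
    return any(domain == a or domain.endswith("." + a) for a in allowlist)

# Two stray links scraped alongside a genuine entry.
scraped = ["websitename.abc", "mozilla.org", "addons.mozilla.org"]
expanded = expand_with_prefixes(scraped)
kept = [d for d in expanded if not is_allowlisted(d)]

print(len(expanded))  # 9
print(kept)           # ['websitename.abc', 'www.websitename.abc', 'cdn.websitename.abc']
```

Two stray links become the six Mozilla entries listed above once the prefixes are applied, and a check against an allowlist before committing would have dropped all six.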
Note: I have already removed a few image hosting websites from the list: https://github.com/elyoas/hosts/blob/main/Custom_18_Pornography
And I put them in this list: https://github.com/elyoas/hosts/blob/main/Custom_19_ImageHosting
The list Custom_18_Pornography probably contains more unpopular image hosting services which host pornography. For now I did everything I could.
Update:
I have moved the file to another project. Please find the file at this link:
https://github.com/elyoas/ultimate-firewall/blob/main/hosts/03_manuallycollected/04_Pornography
I will stop pressing you to merge it because I know it contains image sharing websites and forums which are used for pornography.
If you are happy to receive a few complaints, then you can merge it, otherwise you can pick and choose.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
This is a pornography domains/subdomains list which I have maintained for at least the last year (in my own private repo). But I just made it public to share it with you:
https://github.com/elyoas/hosts/blob/main/Custom_17_Pornography
Check the make file to see how to build it and extract unique domains. All of the domains inside it are unique; they do not currently exist in Steven's lists.