funilrys / PyFunceble

The tool to check the availability or syntax of domain, IP or URL.
https://pyfunceble.github.io
Apache License 2.0
297 stars 44 forks

Special rule: www.example.org & m.example.org redirects to example.org #185

Closed spirillen closed 1 year ago

spirillen commented 3 years ago

Is your feature request related to a problem? Please describe. There is no such thing as ^(www|m)\..*\.tumblr\.com$

Describe the solution you'd like We should add a 302 rule

curl -I 'http://www.sensual-kiss.tumblr.com' 'http://m.sensual-kiss.tumblr.com'
HTTP/1.1 302 Found
Server: openresty
Date: Sun, 10 Jan 2021 17:52:39 GMT
Content-Type: text/html; charset=UTF-8
X-Rid: 853406b998b4af2af248c41f442e2565
P3p: CP="Tumblr's privacy policy is available here: https://www.tumblr.com/policy/en/privacy"
X-Frame-Options: deny
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=15552001
Location: https://sensual-kiss.tumblr.com/#_=_
X-UA-Compatible: IE=Edge,chrome=1
X-Cache: MISS from firewall.matrix.lan
X-Cache-Lookup: MISS from firewall.matrix.lan:3128
Via: 1.1 firewall.matrix.lan (squid)
Connection: keep-alive

HTTP/1.1 302 Found
Server: openresty
Date: Sun, 10 Jan 2021 17:52:40 GMT
Content-Type: text/html; charset=UTF-8
X-Rid: 406103b229eb27730826e4000e1c2063
P3p: CP="Tumblr's privacy policy is available here: https://www.tumblr.com/policy/en/privacy"
X-Frame-Options: deny
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=15552001
Location: https://sensual-kiss.tumblr.com/#_=_
X-UA-Compatible: IE=Edge,chrome=1
X-Cache: MISS from firewall.matrix.lan
X-Cache-Lookup: MISS from firewall.matrix.lan:3128
Via: 1.1 firewall.matrix.lan (squid)
Connection: keep-alive

Describe alternatives you've considered Even better would be

# Mark www./m. prefixed tumblr subjects as INVALID.
re='^(www|m)\..*\.tumblr\.com($|/.*)'
if [[ "$dest" =~ $re ]]
then
    status="INVALID"
fi

Additional context Add any other context or screenshots about the feature request here.

funilrys commented 3 years ago

This one is odd ... Who does that ?

spirillen commented 3 years ago

Users who don't know better: https://github.com/Clefspeare13/pornhosts/issues/60

I have actually seen that in other places as well, which is why I suggested it on a global scale. Other times I simply suspect some are using a script to blindly prepend m. and www. just to make their lists grow.

funilrys commented 2 years ago

I was thinking about this, and I'm not sure if it's really in the scope of the SPECIAL rule ...

When I created the SPECIAL rule, it was really just to take things UP and DOWN if things are really away or back. It's an extra layer of test.

302 Found is not something I considered as criteria for taking something DOWN ...

What do you think of that ?

spirillen commented 2 years ago

I tend to think an HTTP 302 means down, in most cases actually, unless it is part of HSTS (HTTP Strict Transport Security), as the specified target has obviously moved.

Then the HUGE exception... redirecting spyware like t.co, bit.ly, etc. They are all redirecting (I didn't check their response codes, as they are blocked here).

That's why I suggested this as a special rule: check for a fourth-level domain and, if there is one, mark it INVALID. That way we can circumvent the 302 question. We can't use --complements for this, as that is purely for the www or non-www variant.

On the other hand, it might be a bigger piece of work... and then again, thinking back on the exact domain, I've seen that the same "rule" could be applied elsewhere.

IF domain-level >= 4
then
    rule is INVALID
fi

The question might then become, is this a module we would like to be able to make special rules based on domain level?
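
A minimal sketch of what such a domain-level rule could look like, written in Python since that is PyFunceble's language. The function names and the threshold parameter are hypothetical and not part of PyFunceble. The label count is naive: it does not consult the Public Suffix List, so a name under a multi-label suffix such as co.uk would be counted one level too deep.

```python
from typing import Optional


def domain_level(host: str) -> int:
    """Count the dot-separated labels of a hostname (naive, no Public Suffix List)."""
    return len([label for label in host.strip(".").split(".") if label])


def level_rule(host: str, threshold: int = 4) -> Optional[str]:
    """Return "INVALID" when the host has `threshold` or more labels, else None."""
    if domain_level(host) >= threshold:
        return "INVALID"
    return None


if __name__ == "__main__":
    for subject in ("www.sensual-kiss.tumblr.com", "sensual-kiss.tumblr.com"):
        print(subject, domain_level(subject), level_rule(subject))
```

A fixed threshold like this would also flag legitimate deep subdomains, which is one reason a per-domain rule may be the safer design.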

spirillen commented 2 years ago

NB: as a reply to the 302-specific question: 302 + 308 clearly say, don't come back here, there is nothing to see, you need to go to xyz to see anything, while 301 + 307 mean temporarily moved.
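
For reference, the registered meaning of these four codes can be checked against Python's standard http.HTTPStatus enum. As far as the HTTP specification goes, 301 and 308 are the permanent redirects while 302 and 307 are the temporary ones, so the grouping in the comment above may deserve a second look. The helper name redirect_kind is only illustrative.

```python
from http import HTTPStatus

# Grouping as defined by the HTTP specification.
PERMANENT = {HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.PERMANENT_REDIRECT}  # 301, 308
TEMPORARY = {HTTPStatus.FOUND, HTTPStatus.TEMPORARY_REDIRECT}              # 302, 307


def redirect_kind(code: int) -> str:
    """Classify a status code as a permanent redirect, a temporary redirect, or other."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    if status in PERMANENT:
        return "permanent"
    if status in TEMPORARY:
        return "temporary"
    return "other"


if __name__ == "__main__":
    for code in (301, 302, 307, 308):
        print(code, HTTPStatus(code).phrase, redirect_kind(code))
```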

funilrys commented 2 years ago

I understand, but this will have some consequences. It's actually not INVALID ... It actually redirects to the right domain ... At least that is what the Location header is saying.

The browser can't follow it for some obscure reason but it is actually working as-it-should:

$ curl -IL 'http://www.sensual-kiss.tumblr.com' 
HTTP/1.1 302 Found
Server: openresty
Date: Sun, 10 Oct 2021 10:35:37 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
X-Rid: 0ea88a24be0398a789080c4690f3d87a
P3p: CP="Tumblr's privacy policy is available here: https://www.tumblr.com/policy/en/privacy"
X-Frame-Options: deny
X-Xss-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=15552001
Location: https://sensual-kiss.tumblr.com/#_=_
X-UA-Compatible: IE=Edge,chrome=1

HTTP/2 200 
server: openresty
date: Sun, 10 Oct 2021 10:35:37 GMT
content-type: text/html; charset=UTF-8
vary: Accept-Encoding
vary: Accept-Encoding
x-rid: f0347074f015230059675a339d117709
p3p: CP="Tumblr's privacy policy is available here: https://www.tumblr.com/policy/en/privacy"
x-xss-protection: 1; mode=block
x-content-type-options: nosniff
strict-transport-security: max-age=15552001
x-tumblr-user: sensual-kiss
x-tumblr-pixel-0: https://px.srvcs.tumblr.com/impixu?T=1633862137&J=eyJ0eXBlIjoidXJsIiwidXJsIjoiaHR0cDovL3NlbnN1YWwta2lzcy50dW1ibHIuY29tLyIsInJlcXR5cGUiOjAsInJvdXRlIjoiLyJ9&U=KHHNEPCHIJ&K=3c17a9ae01752bbd52f5c333effe64d0e0ba0b7996b712c6147438227d16a98b--https://px.srvcs.tumblr.com/impixu?T=1633862137&J=eyJ0eXBlIjoicG9zdCIsInVybCI6Imh0dHA6Ly9zZW5zdWFsLWtpc3MudHVtYmxyLmNvbS8iLCJyZXF0eXBlIjowLCJyb3V0ZSI6Ii8iLCJwb3N0cyI6W3sicG9zdGlkIjoiNjUyNTk5NDY3ODg4MDE3NDA4IiwiYmxvZ2lkIjo1MTgxMDYzNTEsInNvdXJjZSI6MzN9LHsi
x-tumblr-pixel-1: cG9zdGlkIjoiNjQ2MTgzNjQ2MTE2NjkxOTY4IiwiYmxvZ2lkIjo1MTgxMDYzNTEsInNvdXJjZSI6MzN9LHsicG9zdGlkIjoiNjQ1NTExODY3MzI1OTIzMzI5IiwiYmxvZ2lkIjo1MTgxMDYzNTEsInNvdXJjZSI6MzN9LHsicG9zdGlkIjoiNjQ0NTIwNzgzMzk2MzcyNDgwIiwiYmxvZ2lkIjo1MTgxMDYzNTEsInNvdXJjZSI6MzN9XX0=&U=JNBPKHMDFF&K=7bdd23e4b63545c561b864f08fb2ef49cc3394bc9338b16d83272730f79d06e6
x-tumblr-pixel: 2
link: <https://64.media.tumblr.com/c734fc3e754e30ec2711f1e34829e448/e35d615ef95041c4-89/s128x128u_c1/5eeae975e3ba6d53334dca994719fbc8a57d7537.png>; rel=icon
x-ua-compatible: IE=Edge,chrome=1

This is another level of SPECIAL rule ...

spirillen commented 2 years ago

This is another level of SPECIAL rule ...

It is, and you should consider whether it is worth the effort, or we might end up in a rule-management hell that is better addressed with other scripts/programs.

It's actually not INVALID ... It actually redirects to the right domain ... At least that is what the Location header is saying.

True, my outcome should have been INACTIVE. Both maintaining a source and generating the output of those extremely large hosts files would benefit from removing the 302 + 308 records, while either obtaining or keeping the Location target in their sources.

This actually opens a whole new debate about how to handle redirects. We have touched on the topic in the past; maybe it's time to open a new issue/talk on the subject.

The browser can't follow it for some obscure reason

That is because the SSL certificate does not cover fourth-level domains, so you are redirected to an insecure zone, where the browser stops loading the site with a warning.
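
A small sketch of why a fourth-level name can fall outside a certificate: in certificate name matching, a wildcard in the leftmost label stands for exactly one label. That Tumblr serves a certificate for *.tumblr.com is an assumption made for illustration and was not verified here; the function name is hypothetical.

```python
def wildcard_covers(cert_name: str, host: str) -> bool:
    """Return True if cert_name (e.g. '*.tumblr.com') covers host.

    A leading '*' matches exactly one label, never several.
    """
    cert_labels = cert_name.lower().split(".")
    host_labels = host.lower().split(".")
    if len(cert_labels) != len(host_labels):
        return False
    if cert_labels[0] == "*":
        return cert_labels[1:] == host_labels[1:]
    return cert_labels == host_labels


if __name__ == "__main__":
    # Assumed certificate name, for illustration only.
    for host in ("sensual-kiss.tumblr.com", "www.sensual-kiss.tumblr.com"):
        print(host, wildcard_covers("*.tumblr.com", host))
```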

spirillen commented 2 years ago

This one is odd ... Who does that ?

Let me take a very fresh example...

I duplicated the previous list twice, once adding www. subdomains, and once adding cdn.; resulting in two new lists of the formats: www.websitename.abc and cdn.websitename.abc. Source: https://github.com/StevenBlack/hosts/issues/1671#issuecomment-970165062 (§2)

spirillen commented 2 years ago

Just found this from another import....

https://mypdns.org/my-privacy-dns/porn-records/-/blob/4c5d12ff4f4d72a03e217b05862d3a2f333fc109/submit_here/imported/pornhosts.import-external-sources#L186-308

And compared to the test result, well, the numbers just don't add up.

https://mypdns.org/my-privacy-dns/porn-records/blob/f16754525807bca2ef5f44ade66774b3d47f7b28/active_domains/output/pornhosts.import-external-sources/domains/INACTIVE/list

funilrys commented 1 year ago

Note to self: The idea is not bad. We should implement this. But subjects should be switched as INACTIVE not INVALID.

funilrys commented 1 year ago

Side notes on the implementation itself:

  1. Follow all redirects.
  2. Compare the start domain with the end-domain and switch status accordingly. Example:
    • m.example.com -> example.com | Outcome: m.example.com as INACTIVE.
    • m.example.com -> example.org | Outcome: NO Status switch.
    • m.example.com -> a.example.com -> example.com | Outcome: m.example.com as INACTIVE.

This should only apply if the status code is one of the 3xx (redirection) codes.
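
The comparison described above can be sketched as a pure function over an already collected redirect chain. This is only an illustration of the rule, not PyFunceble's implementation; redirect_outcome, strip_prefix and the "NO SWITCH" label are hypothetical names, and collecting the chain (following the 3xx responses) is left out.

```python
from typing import Sequence

PREFIXES = ("www.", "m.")


def strip_prefix(host: str) -> str:
    """Drop a single leading 'www.' or 'm.' label, if present."""
    for prefix in PREFIXES:
        if host.startswith(prefix):
            return host[len(prefix):]
    return host


def redirect_outcome(chain: Sequence[str]) -> str:
    """chain holds the hostnames visited: tested subject first, final destination last."""
    start, end = chain[0].lower(), chain[-1].lower()
    if len(chain) > 1 and start != end and strip_prefix(start) == end:
        return "INACTIVE"
    return "NO SWITCH"


if __name__ == "__main__":
    print(redirect_outcome(["m.example.com", "example.com"]))
    print(redirect_outcome(["m.example.com", "example.org"]))
    print(redirect_outcome(["m.example.com", "a.example.com", "example.com"]))
```

Only the first and last entries are compared, so intermediate hops such as a.example.com do not influence the outcome.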

funilrys commented 1 year ago

Side notes on the implementation itself when URLs are tested:

  1. Follow all redirects.
  2. Compare the start domain and path with the end domain and path, and switch status accordingly. Example:
    • m.example.com/hello/world -> example.com/hello/world | Outcome: m.example.com/hello/world as INACTIVE.
    • m.example.com/hello/world -> example.com/world/hello | Outcome: NO Status switch.
    • m.example.com/hello/world -> example.org/hello/world | Outcome: NO Status switch.
    • m.example.com/hello/world -> a.example.com/world/hello -> example.com/hello/world | Outcome: m.example.com/hello/world as INACTIVE.
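
For URLs, the same idea can be sketched with the standard urllib.parse module, comparing both the host and the path of the first and last hop. The function names and the "NO SWITCH" label are hypothetical, and the redirect chain is assumed to have been collected already.

```python
from typing import Sequence, Tuple
from urllib.parse import urlsplit

PREFIXES = ("www.", "m.")


def host_and_path(url: str) -> Tuple[str, str]:
    """Split a URL into (hostname, path); scheme-less input such as 'm.example.com/x' is accepted."""
    # urlsplit only recognises the host when a scheme or a leading '//' is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    return (parts.hostname or "").lower(), parts.path or "/"


def url_redirect_outcome(chain: Sequence[str]) -> str:
    """chain holds the URLs visited: tested subject first, final destination last."""
    start_host, start_path = host_and_path(chain[0])
    end_host, end_path = host_and_path(chain[-1])
    for prefix in PREFIXES:
        stripped_matches = (
            start_host.startswith(prefix) and start_host[len(prefix):] == end_host
        )
        if stripped_matches and start_path == end_path:
            return "INACTIVE"
    return "NO SWITCH"


if __name__ == "__main__":
    print(url_redirect_outcome(["m.example.com/hello/world", "example.com/hello/world"]))
    print(url_redirect_outcome(["m.example.com/hello/world", "example.com/world/hello"]))
```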

spirillen commented 1 year ago

To continue https://matrix.to/#/!frMIeLrTTlrGiRMLBM:matrix.org/$gHxSAP8rlCIFF40wKybFK6VceyKc7NPmjj_RxLN7kFQ?via=matrix.org&via=anontier.nl

This is a special rule, but it should be a global one: as it follows the requests to the final destination, all "middlemen" are marked as potentially dead.

https://github.com/funilrys/PyFunceble/issues/185#issuecomment-877784789 ^(www|m)\..*\.tumblr\.com$

We will remove any useless (m.|www.).domain.ccTLD records and only leave potentially ACTIVE records in our ACTIVE/list.

You can call this --complements on steroids as it removes any middlemen from the finished result ACTIVE/list

IF domain-level >= 4
then
    rule is INVALID
fi

The question might then become, is this a module we would like to be able to make special rules based on domain level?

The rewrite for this would be:

IF the domain is in some internal DB file of domains, then we know that any records matching ^(www|m)\..*\.domain\.ccTLD$ are INVALID; we strip the prefixes and test the records that are left.

An example of such a regex-compliant file could be:

tumblr.com = ^(www|m)\..*\. | !^([0-9a-z]{0,255}[.])?
bit.ly = !^bit\.ly
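
One way to read the proposal above is as a lookup table from a registered domain to a pattern of subjects known to be INVALID, plus a prefix strip that yields the record to test instead. The sketch below assumes that reading; the table, its single entry, and the function name are hypothetical and do not describe an existing PyFunceble file format.

```python
import re
from typing import Optional, Tuple

# Hypothetical rule table: registered domain -> subjects known to be INVALID.
INVALID_PATTERNS = {
    "tumblr.com": re.compile(r"^(www|m)\..*\.tumblr\.com$"),
}
PREFIX = re.compile(r"^(www|m)\.")


def apply_rules(host: str) -> Tuple[Optional[str], str]:
    """Return (status, subject_to_test).

    status is "INVALID" when a rule matches, otherwise None.
    subject_to_test is the host with its www./m. prefix stripped on a match.
    """
    host = host.lower()
    for domain, pattern in INVALID_PATTERNS.items():
        if host.endswith("." + domain) and pattern.match(host):
            return "INVALID", PREFIX.sub("", host, count=1)
    return None, host


if __name__ == "__main__":
    print(apply_rules("www.sensual-kiss.tumblr.com"))
    print(apply_rules("sensual-kiss.tumblr.com"))
```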

https://github.com/funilrys/PyFunceble/issues/185#issuecomment-939477917

It's actually not INVALID ... It actually redirects to the right domain ... At least that is what the Location header is saying.

Yes, but the record is of no use in any output list, as it is redirecting. You would need the destination instead, as that helps keep the final lists as small and accurate as possible.

https://github.com/funilrys/PyFunceble/issues/185#issuecomment-1290400728

The idea is not bad. We should implement this. But subjects should be switched as INACTIVE not INVALID.

That would depend on the domain... for bit.ly and tumblr.com, INVALID is the correct result, while other redirecting devils might be, by default, INACTIVE.

funilrys commented 1 year ago

IF the domain is in some internal DB file of domains, then we know that any records matching ^(www|m)\..*\.domain\.ccTLD$ are INVALID; we strip the prefixes and test the records that are left.

That is actually another improvement for the mining mechanism...

Here we are only talking about subjects that redirect to their 2ndLD. Example: m.example.org -> example.org and www.example.org -> example.org. And this SPECIAL rule will only be triggered if the given subject starts with "www." or "m.".

URL shorteners never trigger this feature, because the tested domain won't match the expected domain.

For example:

Also note: The path will be compared. If it doesn't match, nothing changes.

funilrys commented 1 year ago

There is a drawback with flagging a subject as INVALID ... A lot of users just drop and permanently delete INVALID subjects and leave PyFunceble to retest all INACTIVE ones ... That's also something we have to keep in mind ...

We are only a few people in the issues section but we are a lot more users than we think 😰...

spirillen commented 1 year ago

NOTE:

Stumbled on this special domain case

  1. www.subdomain.skyblog.com = NOT supported (https://mypdns.org/my-privacy-dns/matrix/-/issues/980951)
  2. subdomain.skyblog.com = IS supported
  3. *.skyblog.com = Redirects straight to skyrock.com (https://mypdns.org/my-privacy-dns/matrix/-/issues/980951) = we now know skyblog.com is invalid

💭 🤔 Maybe a new result list? That could also help with your comment in https://github.com/funilrys/PyFunceble/issues/185#issuecomment-1290906472 about INVALID, as these are de facto invalid cases and should be attended to by the list owner.

spirillen commented 1 year ago

UPDATE: About the special rule for tumblr: they have made a change which I have NOT investigated, ONLY observed.

teen-make-selfies.tumblr.com
thesweetelite.tumblr.com

These URLs are empty and redirect to the default homepage. I have found about 30 of these today, and they were marked ACTIVE, against expectation.

Any chance you (@funilrys) could spend a few minutes on this?

note to self (@spirillen) `tumblr.com`: https://mypdns.org/my-privacy-dns/matrix/-/issues/1774

funilrys commented 1 year ago

@spirillen they actually don't redirect to the home page per se. It's all JavaScript. Therefore, the rule should be about the 404 status code.