(Expanding on issue #451)
Some examples of common ruleset errors we might be able to test for:
Rules that do nothing:
<rule from="^https://ib24\.csob\.sk/" to="https://ib24.csob.sk/" />
Duplicate hosts with different file names (previously UBS.com.xml vs ubs.xml). These show up as warnings in the validate scripts, but we ignore them because there are so many (!).
We could set a limit on the number of duplicate hosts we allow to prevent this from growing.
e.g. "Failure: Currently 105 rules contain duplicate hosts, hard limit: 100. "
Lots of stylistic things could be standardized.
name=".*"
vs name=".+"
(www\.)?
(capturing) and (?:www\.)?
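One way to start standardizing would be a lint pass that just counts how often each spelling occurs across the tree; which form is canonical is a project decision, so the sketch below only reports. The rules/ directory and the variant list are assumptions:

    import re
    from collections import Counter
    from pathlib import Path

    VARIANTS = {
        'name=".*"':   r'name="\.\*"',
        'name=".+"':   r'name="\.\+"',
        '(www\\.)?':   r'\(www\\\.\)\?',
        '(?:www\\.)?': r'\(\?:www\\\.\)\?',
    }

    counts = Counter()
    for path in Path("rules").glob("*.xml"):
        text = path.read_text()
        for label, pattern in VARIANTS.items():
            counts[label] += len(re.findall(pattern, text))
    print(counts)  # pick the majority form, then flag the minority one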
For duplicate hosts, there was a notion of a dups whitelist (--ignoredups) in trivial-validate that I broke in my refactoring, in part because it was inadequate and not kept up to date.
I think the right fix here is to make dups a hard fail, but have a file containing a whitelist of allowed duplicate hostnames. In order to pass the tests, ruleset maintainers would have to add to the whitelist; whitelist modifications would be subject to extra scrutiny.
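A sketch of that whitelist mechanism, reusing the same duplicate scan as the hard-limit sketch above; the whitelist file name is made up:

    import sys
    import xml.etree.ElementTree as ET
    from collections import defaultdict
    from pathlib import Path

    whitelist = set(Path("duplicate-hosts-whitelist.txt").read_text().split())

    files_by_host = defaultdict(set)
    for path in Path("rules").glob("*.xml"):
        for target in ET.parse(path).getroot().iter("target"):
            files_by_host[target.get("host")].add(path.name)

    offenders = sorted(h for h, files in files_by_host.items()
                       if len(files) > 1 and h not in whitelist)
    if offenders:
        sys.exit("Duplicate target hosts not in whitelist: " + ", ".join(offenders))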
From my own commit, negative exclusions that work better as actual rules:
<exclusion pattern="^http://www\.expedia\.com/(?!pub/|p/|.*Checkout)" />
vs
<rule from="^https?://(?:www\.)?expedia\.com/(?=pub/|p/|.*Checkout)".....
Perhaps we should warn on (?!) inside exclusions.
Agreed on the ?! exclusions thing, but I would say our tendency should be: only hard fails, no warnings. Otherwise the warnings pile up to the point of uselessness, as we've seen.
Also a note on your example rule: no reason for the ?=, you can just do a ?:.
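A sketch of that hard fail on negative lookahead inside exclusion patterns (again assuming a rules/ directory):

    import sys
    import xml.etree.ElementTree as ET
    from pathlib import Path

    offenders = []
    for path in Path("rules").glob("*.xml"):
        for excl in ET.parse(path).getroot().iter("exclusion"):
            pattern = excl.get("pattern", "")
            if "(?!" in pattern:
                offenders.append("%s: %s" % (path.name, pattern))
    if offenders:
        sys.exit("Negative lookahead in exclusion pattern:\n" + "\n".join(offenders))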
Target doesn't match rules.
Current CATO ruleset:
<target host="cato.com" />
(the rules are all for cato.org)
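A rough sketch of a check for this: build a URL from each target host and require at least one rule's "from" regex to match it; the cato.com target above would be flagged. Wildcard targets are skipped for brevity, and the patterns are treated as Python regexes, which is close enough for a lint:

    import re
    import xml.etree.ElementTree as ET
    from pathlib import Path

    for path in Path("rules").glob("*.xml"):
        root = ET.parse(path).getroot()
        rule_regexes = [re.compile(r.get("from")) for r in root.iter("rule")]
        for target in root.iter("target"):
            host = target.get("host")
            if "*" in host:
                continue  # wildcard targets need their own handling
            if not any(rx.match("http://%s/" % host) for rx in rule_regexes):
                print("%s: target %s is matched by no rule" % (path.name, host))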
@semenko
rules alternate between (www\.)? (capturing) and (?:www\.)? (non-capturing)
Perhaps test whether the capture group is actually used? I know some sites have bad SSL support on the apex and need (?:www\.) -> www., but others do support it, so (www\.) -> $1 would work.
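A narrow sketch of that test: flag rules whose "from" contains a capturing (www\.) group while the "to" never references $1; those could use the non-capturing form, or be rewritten to preserve the subdomain with $1:

    import xml.etree.ElementTree as ET
    from pathlib import Path

    for path in Path("rules").glob("*.xml"):
        for rule in ET.parse(path).getroot().iter("rule"):
            frm, to = rule.get("from", ""), rule.get("to", "")
            # "(www\." indicates a capturing group; "(?:www\." does not
            # contain this substring, so it is not flagged.
            if "(www\\." in frm and "$1" not in to:
                print("%s: capturing www group unused in %s -> %s"
                      % (path.name, frm, to))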
Another one:
HTTPS-to-HTTP redirects, e.g. https://www.ahm.com.au will redirect to http://www.ahm.com.au, causing an infinite loop.
To clarify, this can only be detected by actually making a request to https://www.ahm.com.au (see the probe sketch below).
If a rule such as
<rule from="^http://(?:www\.)?ahm\.com\.au/"
to="https://www.ahm.com.au/" />
existed, it would cause an infinite loop due to the 301 redirect to http://www.ahm.com.au.
<rule from="^http://(?:www\.)\.example\.com/"
to="https://www.example.com/" />
Edit: just noticed a bug in my own example:
from="^http://(?:www\.)\.example\.com/"
should be from="^http://(?:www\.)?\.example\.com/"
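As noted above, this class of error can only be caught by making a real request to the site; a minimal sketch, assuming the requests library is acceptable in the test suite:

    import requests

    def redirects_back_to_http(https_url):
        # Fetch without following redirects and see whether the site
        # immediately sends us back to plain HTTP.
        resp = requests.head(https_url, allow_redirects=False, timeout=10)
        location = resp.headers.get("Location", "")
        return resp.status_code in (301, 302) and location.startswith("http://")

    if redirects_back_to_http("https://www.ahm.com.au/"):
        print("rewriting to this host would loop: it redirects HTTPS back to HTTP")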
Adding to "target doesn't match rules":
<rule from="^http://((?:bugs|downloads|...[snip]...|www)\.)?php\.net/"
to="https://$1hp.net/" />
.php.net -> .hp.net :-1:
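A rewrite sanity check could catch this class: apply each rule to a URL built from its target hosts and verify the rewritten host is still one of the ruleset's targets, which would flag the $1hp.net typo above. A rough sketch ($1-style backreferences are translated to Python's \1 form; paths are illustrative):

    import re
    import xml.etree.ElementTree as ET
    from pathlib import Path
    from urllib.parse import urlparse

    for path in Path("rules").glob("*.xml"):
        root = ET.parse(path).getroot()
        targets = {t.get("host") for t in root.iter("target")}
        for rule in root.iter("rule"):
            frm = re.compile(rule.get("from"))
            # Convert JS-style $1 backreferences to Python's \1; on
            # Python >= 3.5 unmatched optional groups become "".
            to = re.sub(r"\$(\d)", r"\\\1", rule.get("to"))
            for host in targets:
                url = "http://%s/" % host
                if not frm.match(url):
                    continue
                new_host = urlparse(frm.sub(to, url)).hostname
                if new_host not in targets:
                    print("%s: %s rewrites to unlisted host %s"
                          % (path.name, host, new_host))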
Update: Per #984, we now have ruleset coverage testing that should catch a lot of these "rule doesn't match target host" problems.