awstats 7.7 treats /foo and /%66%6f%6f as two different URIs. RFC 3986, section 2.3, states that URIs which differ only in whether an unreserved character is written literally or as its percent-encoded equivalent should be treated as identical. Apache, for example, serves the same resource in both cases. In practice, this mostly seems to affect URIs that include the tilde (~) character: a lot of user agents in the wild appear to percent-encode it, even though that is not strictly necessary.
This pull request makes awstats percent-decode the "unreserved" character range early in the process. This way, requests for "/foo" and "/%66%6f%6f" appear as one row in the "Pages-URL" statistics.
Note that this change only affects the unreserved characters from the ASCII range (ALPHA / DIGIT / "-" / "." / "_" / "~"). It doesn't do any kind of UTF-8 decoding (as discussed here, for example).
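
To illustrate the intended behaviour, here is a minimal sketch in Python. It is not the Perl code from this pull request; the function name `decode_unreserved` and the constant names are hypothetical and chosen only for this example. It decodes a percent-encoded octet only when the result is an unreserved character, and leaves everything else (reserved characters, UTF-8 byte sequences) untouched.

```python
import re

# Unreserved characters as listed in RFC 3986, section 2.3:
# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# Matches one percent-encoded octet, e.g. "%7E" or "%7e".
PERCENT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode percent-encoded octets that map to unreserved characters.

    Hypothetical helper for illustration only. Any other encoded octet
    (reserved characters, non-ASCII bytes) is kept exactly as written.
    """
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)

    # A single left-to-right pass, so "%2566" is not decoded twice.
    return PERCENT_OCTET.sub(replace, uri)


if __name__ == "__main__":
    for uri in ("/%66%6f%6f", "/%7Euser/", "/a%2Fb", "/caf%C3%A9"):
        print(uri, "->", decode_unreserved(uri))
```

With this sketch, "/%66%6f%6f" and "/%7Euser/" collapse to "/foo" and "/~user/", while "/a%2Fb" and "/caf%C3%A9" stay as they are, which matches the scope described above.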