eldy / AWStats

AWStats Log Analyzer project (official sources)
https://www.awstats.org

Decode RFC 3986 "unreserved chars" in URLs. #92

Closed avian2 closed 5 years ago

avian2 commented 6 years ago

awstats 7.7 treats /foo and /%66%6f%6f as two different URIs. RFC 3986, section 2.3, states that URIs that differ only in whether an unreserved character is percent-encoded are equivalent. Apache, for example, serves the same resource in both cases. In practice, this mostly seems to affect URIs that include the tilde (~) character: a lot of user agents in the wild appear to percent-encode it, even though that is not necessary.

This pull request makes awstats percent-decode the "unreserved" character range early in the process. This way, requests for "/foo" and "/%66%6f%6f" appear as one row in the "Pages-URL" statistics.

Note that this change only affects the unreserved characters from the ASCII range (ALPHA / DIGIT / "-" / "." / "_" / "~"). It doesn't do any kind of UTF-8 decoding (as discussed here, for example).
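
To illustrate the intended behaviour, here is a minimal sketch in Python. It is not the Perl code from this pull request, and the name `decode_unreserved` is hypothetical. It decodes a percent-escape only when the resulting character is in the RFC 3986 unreserved set, and leaves every other escape untouched.

```python
import re

# RFC 3986 section 2.3 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# A percent-escape is "%" followed by exactly two hex digits.
_PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(url: str) -> str:
    """Decode only the escapes that map to unreserved ASCII characters.

    Hypothetical helper for illustration; not part of awstats.
    """
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Keep the original escape for anything outside the unreserved set
        # (reserved characters, control bytes, UTF-8 bytes).
        return char if char in UNRESERVED else match.group(0)

    # A single pass, so "%2566" is not decoded twice into "f".
    return _PCT_ESCAPE.sub(replace, url)


print(decode_unreserved("/%66%6f%6f"))   # /foo
print(decode_unreserved("/%7Euser/"))    # /~user/
print(decode_unreserved("/a%2Fb"))       # /a%2Fb
```

Escapes such as `%2F` and multi-byte UTF-8 sequences such as `%C3%A9` are left as they are, which matches the scope described above.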