Vertispan / gwtproject.org


Collect simple, anonymous, constructive analytics #4

Open niloc132 opened 1 year ago

niloc132 commented 1 year ago

When configured as documented, logs are collected by nginx-proxy, with the standard NCSA log details plus the hostname of the vhost (real log line from production, with some specifics redacted):

nginx.1    | www.gwtproject.org 1.2.3.4 - - [04/Mar/2023:14:00:00 +0000] "GET /javadoc/latest/com/google/gwt/core/ext/Linker.html HTTP/1.1" 200 25683 "-" "User agent string logged here"

To better serve the GWT project itself and respect the privacy of users, we don't need all of this information, but we should still collect at least some parts of it. In the interest of transparency, we probably want to publish at least coarse details of the project, so that we know what resources are being used, the volume of requests, where we're seeing 404s or redirects, etc. Optimizing for traffic isn't the goal; these should be used for spotting bugs, ensuring resources are not abused, curiosity, etc.

Breaking down the example log line from above, and considering what would be helpful and how:

The first step is probably to replace the NCSA log directive currently in use with something more specific (removing fields we don't want to use), then put some filtering/batching/bucketing downstream, then publish the results.
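
As a rough illustration of the downstream filtering step, here is a minimal Python sketch that parses lines in the format shown above and keeps only a few coarse fields. Which fields to keep is an assumption on my part (vhost, method, path, status); the client IP, referer, and user agent are matched but discarded. The names `LOG_PATTERN`, `reduce_line`, and `count_by_status` are hypothetical and not part of this repo.

```python
import re
from collections import Counter

# Matches the nginx-proxy line format from the example above:
# optional container prefix, vhost, client IP, then the NCSA fields.
LOG_PATTERN = re.compile(
    r'^(?:\S+\s+\|\s+)?'                      # optional "nginx.1    | " prefix
    r'(?P<vhost>\S+) '                        # vhost hostname
    r'\S+ \S+ \S+ '                           # client IP, ident, user (discarded)
    r'\[[^\]]+\] '                            # timestamp (discarded in this sketch)
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" ' # request line
    r'(?P<status>\d{3}) (?:\d+|-) '           # status, response size (discarded)
    r'"[^"]*" "[^"]*"$'                       # referer, user agent (discarded)
)


def reduce_line(line):
    """Return only the coarse fields of a log line, or None if it doesn't parse."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        return None
    return {
        "vhost": m.group("vhost"),
        "method": m.group("method"),
        "path": m.group("path"),
        "status": int(m.group("status")),
    }


def count_by_status(lines):
    """Count requests per (status, path), e.g. to spot 404s or redirects."""
    counts = Counter()
    for line in lines:
        record = reduce_line(line)
        if record is not None:
            counts[(record["status"], record["path"])] += 1
    return counts
```

Doing the reduction in the nginx log format itself would be preferable, since the unwanted fields would then never be written; the sketch only shows what the reduced data could look like.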

tbroyer commented 1 year ago

Wrt the user agent, maybe you could log the Sec-CH-UA and Sec-CH-UA-Platform headers if present, and fall back to User-Agent otherwise. (I have no idea what one can do with nginx; I see there are modules/plugins that let you extract values from the UA, so you could possibly log only parts of the UA to make the logs easier to process later, e.g. with the nth column being the Sec-CH-UA or equivalent, or directly Googlebot or Yandex for bots, Java, etc.)
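
To make the fallback idea concrete, here is a small Python sketch of how a downstream step might reduce the three header values to one coarse label. It is only an assumption about how this could be done, not something nginx provides: `coarse_client`, `KNOWN_CLIENTS`, and the heuristic for skipping the placeholder "Not ... Brand" entry in Sec-CH-UA are all hypothetical.

```python
import re

# Sec-CH-UA is a list of entries like "Google Chrome";v="110"
BRAND_PATTERN = re.compile(r'"([^"]+)";\s*v="(\d+)"')

# Substring markers for clients we want to label directly (illustrative list).
KNOWN_CLIENTS = {"Googlebot": "Googlebot", "Yandex": "Yandex", "Java/": "Java"}


def coarse_client(sec_ch_ua, sec_ch_ua_platform, user_agent):
    """Prefer the client hint headers; fall back to a coarse User-Agent label."""
    if sec_ch_ua:
        brands = [
            (name, version)
            for name, version in BRAND_PATTERN.findall(sec_ch_ua)
            # Heuristic: skip the deliberately fake "Not A Brand"-style entry.
            if not ("Not" in name and "Brand" in name)
        ]
        if brands:
            # Prefer a specific brand over the generic "Chromium" entry.
            specific = [b for b in brands if b[0] != "Chromium"] or brands
            name, version = specific[0]
            platform = (sec_ch_ua_platform or "").strip('"') or "unknown"
            return f"{name} {version} / {platform}"

    user_agent = user_agent or ""
    for marker, label in KNOWN_CLIENTS.items():
        if marker in user_agent:
            return label
    # Keep only browser family and major version from the full string.
    m = re.search(r"(Firefox|Chrome)/(\d+)", user_agent)
    if m:
        return f"{m.group(1)} {m.group(2)}"
    return "other"
```

For example, `coarse_client(None, None, "Java/17.0.2")` yields `"Java"`, while a request carrying both client hint headers is labeled from those and its User-Agent string is ignored.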

Wrt the timestamp, bucketing by hour should be enough. If you want to detect spikes in traffic (possible DDoS or whatever), sure, go do it (bucket by the minute or in 5-minute intervals), but public analytics don't need to be more precise than by the hour IMO.
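
A minimal Python sketch of that bucketing, assuming timestamps have already been parsed into `datetime` objects; `bucket_start` and `count_by_bucket` are hypothetical helper names. The same function covers the hourly case (the default) and the 1-minute or 5-minute case for spike detection.

```python
from collections import Counter
from datetime import datetime


def bucket_start(ts, minutes=60):
    """Round a timestamp down to the start of its bucket.

    The width must divide an hour evenly (1, 5, 15, 60, ...).
    """
    if minutes < 1 or 60 % minutes != 0:
        raise ValueError("bucket width must divide 60 minutes")
    floored_minute = ts.minute - ts.minute % minutes
    return ts.replace(minute=floored_minute, second=0, microsecond=0)


def count_by_bucket(timestamps, minutes=60):
    """Count how many timestamps fall into each bucket."""
    return Counter(bucket_start(ts, minutes) for ts in timestamps)
```

Publishing only the hourly counts while keeping finer buckets internal would match the split suggested above.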

niloc132 commented 1 year ago

Thanks, I had momentarily forgotten about the user agent change. Using the new user agent headers seems like the kind of thing nginx itself will support Real Soon Now, and in the meantime, if I'm not mistaken, with the frozen string we merely lose the ability to see which newer version of Chrome is running?

This explains why I am seeing

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36

as the single most popular user agent string, more so than the next two entries combined.

Timestamps: I'm relatively unconcerned about using analytics for any real-time purpose; we already have other uptime monitoring, and will soon have backup options for hosting. Worst case, this repo makes it very easy to change to another deployment option, at least once DNS is moved.