Raku / doc-website

Tooling to build/run the documentation website
Artistic License 2.0

Developer access to web logs #85

Open coke opened 1 year ago

coke commented 1 year ago

We could definitely use access to the web logs of whatever is serving the content, so we can (at least) track 404 requests, which probably indicate a rename or a gap not covered by the .htaccess mappings (or their equivalent).

See also #104 #164 #181

dontlaugh commented 1 year ago

The current deployment artifact is a container with nginx. So let's assume that is the parsing target.

The deployment environment is Portainer, which allows volumes to be shared between containers. So one way to achieve this, with minimal changes to our current deployment artifact, would be to write the access logs to a shared volume that a second container (or anyone with shell access) can read.
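A rough sketch of that shared-volume approach, assuming a podman-style CLI. The volume, container, and image names here (doc-logs, doc-website, docs-image) are hypothetical, not our actual deployment:

```shell
# Hypothetical sketch: mount nginx's log directory on a named volume
# so a second container can read the access logs.
podman volume create doc-logs
podman run -d --name doc-website -v doc-logs:/var/log/nginx docs-image

# Inspect the logs from a throwaway container mounting the same volume:
podman run --rm -v doc-logs:/logs:ro alpine tail -n 50 /logs/access.log
```

The same idea maps onto Portainer's UI by attaching the named volume to both containers.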

dontlaugh commented 1 year ago

Look into .htaccess support for existing mappings

dontlaugh commented 1 year ago

I've fetched logs from the past 24 hours of production

journalctl -u raku-doc-website --since '24 hours ago' --no-pager > logs.txt

I will parse out the 404s. I'd paste them here, but I don't want to reveal any potential PII.

UPDATE: the logs are very truncated. I think this server's default journalctl configuration aggressively limits log size, or it might be a podman thing. I'll keep looking.
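If we do want to share excerpts here, one option is to scrub client addresses first. A rough sketch, assuming the access logs are JSON lines with a `remote_ip` field; the sample line below is fabricated so the snippet is self-contained (in practice it would read logs.txt):

```shell
# Blank out remote_ip values before pasting log excerpts publicly.
# The sample line is fabricated for illustration.
sample='{"request":{"remote_ip":"203.0.113.7","uri":"/foo"},"status":404}'
echo "$sample" | sed -E 's/"remote_ip":"[^"]*"/"remote_ip":"REDACTED"/g'
```

This only handles the IP field; other fields (hostnames, referrers) may also need scrubbing before sharing.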

dontlaugh commented 1 year ago

#184 gave us additional access logging, so after a day I have pulled down some aggregate info:

counts.txt

Some of it is the usual randomness from the public internet, but there are legitimate clues to some missing stuff, too.

dontlaugh commented 1 year ago

@finanalyst See the file I've linked in my previous comment for some counts of 404s per uri from production.

Our Caddy access logs give us json of the following form:

{
  "level": "info",
  "ts": 1677922396.2839906,
  "logger": "http.log.access.log0",
  "msg": "handled request",
  "request": {
    "remote_ip": "REDACTED",
    "remote_port": "8850",
    "proto": "HTTP/1.1",
    "method": "GET",
    "host": "REDACTED",
    "uri": "/",
    "headers": {
      "Content-Length": ["0"],
      "Connection": ["close"],
      "User-Agent": ["HCLB-HealthCheck"]
    }
  },
  "user_id": "",
  "duration": 0.000351099,
  "size": 18097,
  "status": 200,
  "resp_headers": {
    "Server": ["Caddy"],
    "Etag": ["\"rqxhwadyp\""],
    "Content-Type": ["text/html; charset=utf-8"],
    "Last-Modified": ["Fri, 03 Mar 2023 05:00:10 GMT"],
    "Accept-Ranges": ["bytes"],
    "Content-Length": ["18097"]
  }
}

We can ask journalctl for just that JSON (omitting other journal metadata with --output cat):

journalctl --output cat -u raku-doc-website > logs.txt

Then, with jq and awk, we can do the counting:

jq -r '"\(.status)\t\(.request.uri)"' logs.txt | \
  awk '
    /^404/ {hist[$2]++}
    END {
        for (item in hist)
            printf "%s\t-> %s\n", hist[item], item
    }
  ' > counts.txt
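For a quick sanity check of the counting stage, the awk program can be fed fake status/uri pairs directly, no jq or real logs needed (the URIs below are made up):

```shell
# Feed the awk histogram fake "status<TAB>uri" lines to verify the counts.
printf '404\t/old/page\n404\t/old/page\n200\t/\n404\t/gone\n' | awk '
  /^404/ {hist[$2]++}
  END {
      for (item in hist)
          printf "%s\t-> %s\n", hist[item], item
  }
'
# /old/page should count 2 and /gone 1; awk's for-in order is unspecified.
```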