Raku / doc-website

Tooling to build/run the documentation website
Artistic License 2.0

Developer access to web logs #85

Open coke opened 1 year ago

coke commented 1 year ago

We could definitely use access to the web logs of whatever is serving the content, so we can (at least) track 404 requests, which probably indicate a rename or a gap not covered by the .htaccess mappings (or their equivalent).

See also #104 #164 #181

dontlaugh commented 1 year ago

The current deployment artifact is a container with nginx. So let's assume that is the parsing target.

The deployment environment is Portainer, which allows volumes to be shared between containers. So one way to achieve this, with minimal changes to our current deployment artifact, would be to write the access logs to a shared volume that a second container (or anyone with shell access) can read.
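A rough sketch of that shared-volume approach, assuming a podman-style CLI. The volume, container, and image names here (doc-logs, doc-website, docs-image) are hypothetical, not our actual deployment:

```shell
# Hypothetical sketch: mount nginx's log directory on a named volume
# so a second container can read the access logs.
podman volume create doc-logs
podman run -d --name doc-website -v doc-logs:/var/log/nginx docs-image

# Inspect the logs from a throwaway container mounting the same volume:
podman run --rm -v doc-logs:/logs:ro alpine tail -n 50 /logs/access.log
```

The same idea maps onto Portainer's UI by attaching the named volume to both containers.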

dontlaugh commented 1 year ago

Look into .htaccess support for existing mappings

dontlaugh commented 1 year ago

I've fetched logs from the past 24 hours of production

journalctl -u raku-doc-website --since '24 hours ago' --no-pager > logs.txt

I will parse out the 404s. I'd paste them here, but I don't want to reveal any potential PII.

UPDATE: the logs are very truncated. I think this server's default journalctl configuration aggressively limits log size, or it might be a podman thing. I'll keep looking.
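If we do want to share excerpts here, one option is to scrub client addresses first. A rough sketch, assuming the access logs are JSON lines with a `remote_ip` field; the sample line below is fabricated so the snippet is self-contained (in practice it would read logs.txt):

```shell
# Blank out remote_ip values before pasting log excerpts publicly.
# The sample line is fabricated for illustration.
sample='{"request":{"remote_ip":"203.0.113.7","uri":"/foo"},"status":404}'
echo "$sample" | sed -E 's/"remote_ip":"[^"]*"/"remote_ip":"REDACTED"/g'
```

This only handles the IP field; other fields (hostnames, referrers) may also need scrubbing before sharing.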

dontlaugh commented 1 year ago

#184 gave us additional access logging, so after a day I have pulled down some aggregate info:

counts.txt

Some of it is the usual randomness from the public internet, but there are legitimate clues to some missing stuff, too.

dontlaugh commented 1 year ago

@finanalyst See the file I've linked in my previous comment for some counts of 404s per uri from production.

Our Caddy access logs give us json of the following form:

{
  "level": "info",
  "ts": 1677922396.2839906,
  "logger": "http.log.access.log0",
  "msg": "handled request",
  "request": {
    "remote_ip": "REDACTED",
    "remote_port": "8850",
    "proto": "HTTP/1.1",
    "method": "GET",
    "host": "REDACTED",
    "uri": "/",
    "headers": {
      "Content-Length": ["0"],
      "Connection": ["close"],
      "User-Agent": ["HCLB-HealthCheck"]
    }
  },
  "user_id": "",
  "duration": 0.000351099,
  "size": 18097,
  "status": 200,
  "resp_headers": {
    "Server": ["Caddy"],
    "Etag": ["\"rqxhwadyp\""],
    "Content-Type": ["text/html; charset=utf-8"],
    "Last-Modified": ["Fri, 03 Mar 2023 05:00:10 GMT"],
    "Accept-Ranges": ["bytes"],
    "Content-Length": ["18097"]
  }
}

We can ask journalctl for just that JSON (omitting other journal metadata with --output cat):

journalctl --output cat -u raku-doc-website > logs.txt

Then, with jq and awk, we can do the counting:

jq -r '"\(.status)\t\(.request.uri)"' logs.txt | \
  awk '
    /^404/ {hist[$2]++}
    END {
        for (item in hist)
            printf "%s\t-> %s\n", hist[item], item
    }
  ' > counts.txt
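For a quick sanity check of the counting stage, the awk program can be fed fake status/uri pairs directly, no jq or real logs needed (the URIs below are made up):

```shell
# Feed the awk histogram fake "status<TAB>uri" lines to verify the counts.
printf '404\t/old/page\n404\t/old/page\n200\t/\n404\t/gone\n' | awk '
  /^404/ {hist[$2]++}
  END {
      for (item in hist)
          printf "%s\t-> %s\n", hist[item], item
  }
'
# /old/page should count 2 and /gone 1; awk's for-in order is unspecified.
```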