arp242 / goatcounter

Easy web analytics. No tracking of personal data.
https://www.goatcounter.com

Import from JSON files #446

Open arp242 opened 3 years ago

arp242 commented 3 years ago

Turns out Caddy can only log to JSON files, but importing from (arbitrary) JSON files is useful in general I think.

The biggest issue is actually accessing the JSON entries: we need to know what is what; for example, Caddy's log entries look like:

{
    "level": "info",
    "ts": 1585597114.7687502,
    "logger": "http.log.access",
    "msg": "handled request",
    "request": {
        "method": "GET",
        "uri": "/",
        "proto": "HTTP/2.0",
        "remote_addr": "127.0.0.1:50876",
        "host": "example.com",
        "headers": {
            "User-Agent": [
                "curl/7.64.1"
            ],
            "Accept": [
                "*/*"
            ]
        },
        "tls": {
            "resumed": false,
            "version": 771,
            "ciphersuite": 49196,
            "proto": "h2",
            "proto_mutual": true,
            "server_name": "example.com"
        }
    },
    "latency": 0.000014711,
    "size": 2326,
    "status": 200,
    "resp_headers": {
        "Server": [
            "Caddy"
        ],
        "Content-Type": ["text/html"]
    }
}

So it's not as simple as mapping keys; you need a "proper" query language like jq has. Perhaps there is a library for Go for this?

Also, when using -follow how do we know we have a complete log entry? For regular plain-text logs this is easy: one line is one entry. For JSON this is a bit trickier.

Related issue for goaccess: https://github.com/allinurl/goaccess/issues/1768

ForestJohnson commented 3 years ago

Also, when using -follow how do we know we have a complete log entry? For regular plain-text logs this is easy: one line is one entry. For JSON this is a bit trickier.

I think just about every system I've ever seen that logs JSON object streams logs one object per line.

I think Go has a built-in JSON object stream slurper... yeah, according to this StackOverflow post, the json.Decoder in the stdlib will read multiple JSON values from a stream.
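
Something like this (rough sketch, stdlib only, not goatcounter code):

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "strings"
)

func main() {
    // Two newline-delimited log entries, like Caddy writes them.
    input := strings.NewReader(`{"status": 200, "size": 2326}
{"status": 404, "size": 512}`)

    // json.Decoder keeps reading values until the stream runs out,
    // so one-object-per-line logs "just work".
    dec := json.NewDecoder(input)
    for {
        var entry map[string]interface{}
        if err := dec.Decode(&entry); err == io.EOF {
            break
        } else if err != nil {
            panic(err)
        }
        fmt.Println(entry["status"], entry["size"])
    }
}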

ForestJohnson commented 3 years ago

Regarding jq-style syntax as a Go package, I found this, which looks fairly simple and straightforward. I haven't tested it, but I bet it would work for 99% of use cases. https://github.com/elgs/gojq/blob/master/gojq.go

Looking at how it parses the query, I don't think it will support JSON properties where the property name contains a single quote, double quote, [, or ]. But who in their right mind would put that in a JSON property name anyway :grinning:

The -format argument could look something like -format json:host=request.host,datetime=ts,referrer=request.headers.Referer[0],path=request.uri, etc.

This will also require new time format options like -datetime millisecondsSinceUnixEpoch, -datetime secondsSinceUnixEpoch, etc.
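
Something like this, I imagine (illustrative only; the option names above are just suggestions), for converting Caddy's "ts" field:

package main

import (
    "fmt"
    "time"
)

func main() {
    // Caddy's "ts" is fractional seconds since the Unix epoch, e.g. 1585597114.7687502.
    ts := 1585597114.7687502
    sec := int64(ts)
    nsec := int64((ts - float64(sec)) * 1e9)
    fmt.Println(time.Unix(sec, nsec).UTC()) // secondsSinceUnixEpoch

    // An integer millisecond timestamp would instead be:
    fmt.Println(time.UnixMilli(1585597114768).UTC()) // millisecondsSinceUnixEpoch
}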

Oh, hehe I just noticed that referer is misspelled

The same person also has a library that supports some expression stuff inside the query: https://github.com/elgs/jsonql/ All of their examples are for boolean expressions; I'm not sure if it would work for formatting strings and whatnot.

Thinking about it, probably the only expressions you would need would be concatenation & string replacement, maybe with regex replacement? So for example, if I wanted to add the hostname onto the path, maybe I could do -format json:path=request.host+request.uri? And if I wanted to shorten the hostname to make the resulting paths easier to read, I could do something like this?

-format 'json:path=replace(request.host, "sequentialread.com", "sqr")+request.uri'

Or omit the query from the uri:

-format 'json:path=replace(request.uri, regex("\?.*"), "")'
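
In Go terms that last one is just a regexp replace; a tiny illustration (not goatcounter code, just what the expression would have to do):

package main

import (
    "fmt"
    "regexp"
)

func main() {
    uri := "/watch/video.mp4?t=42&quality=high"
    // Drop the query string, same as the regex-replace example above.
    path := regexp.MustCompile(`\?.*`).ReplaceAllString(uri, "")
    fmt.Println(path) // /watch/video.mp4
}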

Since this shell command is starting to get hairy it might also be prudent to include a -format file:my_format.txt option where the value of the format argument can be placed inside a separate file. I guess parsing & evaluating something like that could get a little involved, but I think it would be beneficial for flexibility.

arp242 commented 3 years ago

I think just about every system I've ever seen that logs JSON object streams logs one object per line.

Yeah, probably; but this is more than just logfiles: you could also use it to migrate things from other systems for example, or write some script to send data to GoatCounter. I'd like to at least see if it can be made more flexible, but it's probably not worth it if it proves to be very hard or error-prone.

For regular goatcounter import it's not too hard, but goatcounter import -follow might be a bit trickier.

Making the assumption that the data looks like:

[
    {
        [.. request 1..]
    },
    {
        [.. request 2..]
    }
]

is probably fine.

As for the syntax, the only thing that's really needed is traversing the JSON to map them to the properties; things like regexp replacements seem a bit too much. It's not supported for plain-text logfiles either, and if you really need that kind of stuff then you can preprocess the data somewhere else (e.g. in a shell pipeline with jq, or a different script).

That gojq thing seems nice, and if there are bugs with keys containing a [ and such then this can be fixed with a PR to that project or, if the maintainer doesn't respond, using a fork.
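
To illustrate what I mean by "just traversing": a rough stdlib-only sketch of the dotted-key lookup (names made up; it doesn't handle array indexes like headers.Referer[0], which is where gojq would help):

package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// lookup walks a decoded JSON document along a dotted key path like
// "request.host" and returns the value at the end, if any.
func lookup(doc map[string]interface{}, path string) (interface{}, bool) {
    var cur interface{} = doc
    for _, key := range strings.Split(path, ".") {
        m, ok := cur.(map[string]interface{})
        if !ok {
            return nil, false
        }
        if cur, ok = m[key]; !ok {
            return nil, false
        }
    }
    return cur, true
}

func main() {
    line := `{"ts": 1585597114.76, "request": {"uri": "/", "host": "example.com"}}`

    var doc map[string]interface{}
    if err := json.Unmarshal([]byte(line), &doc); err != nil {
        panic(err)
    }

    host, _ := lookup(doc, "request.host")
    ts, _ := lookup(doc, "ts")
    fmt.Printf("host=%v ts=%v\n", host, ts)
}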

ForestJohnson commented 3 years ago

Edit: never mind

ForestJohnson commented 3 years ago

Making the assumption that the data looks like:

    [
        {
            [.. request 1..]
        },
        {
            [.. request 2..]
        }
    ]

is probably fine.

Hmm, I don't like that because it doesn't work for streams. I think there is a pseudo-standard for JSON streams (prevalent in every JSON-based logging thingy / JSON-over-TCP protocol I've ever seen) where they look like this:

{"asd": 1}
{"asd": 2}
{"asd": 3}
{"asd": 4}
{"asd": 5}

Maybe if it's using -follow it assumes it's a stream ("slurp" mode); otherwise, if it's not -follow, it looks for an array at the top level like

[
  {"asd": 1},
  {"asd": 2},
  {"asd": 3}
]

Or maybe someone wants to import a pre-existing stream-styled log file all in one go; then maybe they pass a -slurp flag to tell it to use the stream style?

I guess they could probably just pipe the static file through jq or something too, but it would require extra thought and effort from the user :shrug:
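
A rough sketch of what accepting both forms could look like with just the stdlib decoder (untested illustration, names made up):

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "io"
    "strings"
)

// decodeEntries reads log entries from r, accepting either a single
// top-level JSON array or a stream of newline-delimited JSON objects.
func decodeEntries(r io.Reader) ([]map[string]interface{}, error) {
    br := bufio.NewReader(r)

    // Peek past leading whitespace to see whether this starts with '['.
    first, err := peekFirstByte(br)
    if err != nil {
        return nil, err
    }

    dec := json.NewDecoder(br)
    var entries []map[string]interface{}

    if first == '[' {
        // Top-level array: [ {...}, {...} ]
        if _, err := dec.Token(); err != nil { // consume the opening '['
            return nil, err
        }
        for dec.More() {
            var e map[string]interface{}
            if err := dec.Decode(&e); err != nil {
                return nil, err
            }
            entries = append(entries, e)
        }
        return entries, nil
    }

    // Stream mode: one object after another, typically one per line.
    for {
        var e map[string]interface{}
        if err := dec.Decode(&e); err == io.EOF {
            return entries, nil
        } else if err != nil {
            return nil, err
        }
        entries = append(entries, e)
    }
}

// peekFirstByte returns the first non-whitespace byte without consuming it.
func peekFirstByte(br *bufio.Reader) (byte, error) {
    for {
        b, err := br.ReadByte()
        if err != nil {
            return 0, err
        }
        switch b {
        case ' ', '\t', '\r', '\n':
            continue
        }
        return b, br.UnreadByte()
    }
}

func main() {
    stream := "{\"asd\": 1}\n{\"asd\": 2}\n{\"asd\": 3}"
    array := `[{"asd": 1}, {"asd": 2}, {"asd": 3}]`

    a, _ := decodeEntries(strings.NewReader(stream))
    b, _ := decodeEntries(strings.NewReader(array))
    fmt.Println(len(a), len(b)) // 3 3
}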

arp242 commented 3 years ago

Oh right, I had assumed that JSON logs were well-formed documents as a whole, but to be honest I've never really worked with them. We can probably just assume that format with -follow for now; if someone needs something more then they'll report it and we'll cross that bridge when we come to it (which may be never).

When not using -follow it should probably accept both somehow, because like I said it's more than just logfiles. Maybe something like:

-format json:selector to get array of pageviews path:...

I don't know; personally I'd start by just implementing a prototype, some testcases, and writing the docs, and then go from there.

noelforte commented 3 years ago

Stumbled across this issue while trying to get GoatCounter configured to consume Caddy logfiles; did there end up being a resolution for this feature, or is passing a custom format still the preferred way to work with JSON logs?

(Caddy also has a non-standard "formatter" module that I could plug in in a pinch to format the output as GoatCounter's CSV, which it can then consume.)

ForestJohnson commented 3 years ago

I never got around to trying to contribute my custom setup to goatcounter as a real bona-fide feature.

the only thing that's really needed is traversing the JSON to map them to the properties; things like regexp replacements seem a bit too much

I disagreed with this; I wanted this feature, so I built a custom log adapter to do it. I know it might not be very helpful as a general-case solution, but in case you do want to use it, here's the rundown:

It starts with this Caddy config JSON file, which defines a logger that writes HTTP access-log JSON to the file /var/log/caddy-goatcounter.log.

I generate the rest of the Caddy config dynamically, but I make sure to include this log stanza in the HTTP server definition so it will send its access logs to that logger.

{
  "http": {
    "servers": {
      "srv0": {
        ....blahblahblahblah....
        "logs": {
          "logger_names": {
            "*": "goatcounter"
          }
        }
      }
    }
  }
}

Next step: I run my custom goatcounter-caddy-log-adapter app to process the log file via the command line:

tail -F /var/log/caddy-goatcounter.log | ./goatcounter-caddy-log-adapter | ./goatcounter import -site http://goatcounter.sequentialread.com:8080 -format combined-vhost -- -

Back when I implemented this, goatcounter import didn't really filter out spurious hits (things like web crawlers, access from my own home IP address, etc.), so I implemented that kind of filtering within my log adapter. I don't know if goatcounter import can filter those things out now; I haven't updated it since my initial implementation.

I have been using this for about half a year and have fixed some issues along the way. I still sometimes get spurious request spam & have to track it down and figure out why it wasn't filtered out. A recent example: multiple byte-range requests from scrubbing through a video file were all being counted as individual hits.