canonical / pebble

Pebble is a lightweight Linux service manager with layered configuration and an HTTP API.
https://canonical-pebble.readthedocs-hosted.com/
GNU General Public License v3.0

Strip timestamp from service logs to avoid double-timestamp issue #91

Closed. benhoyt closed this issue 2 years ago.

benhoyt commented 2 years ago

Currently when a service outputs timestamps itself we get double timestamps in the log, because Pebble also adds a timestamp-and-service-name prefix to every log line. We want to trim/strip the service's timestamp to avoid the double-up.
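
For illustration (a made-up example, not taken from a real log), a `pebble logs` line for a service that writes its own timestamps ends up looking something like:

2022-06-01T12:00:01.000Z [mysql] 2022-06-01T12:00:01.123456Z 0 [Note] InnoDB: Buffer pool(s) load completed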

Some notes per discussion with Harry and Gustavo:

rwcarlsen commented 2 years ago

I haven't been able to dig up any good examples of smartly handling parse-prefix style time parsing; all the libs on the net seem to operate on an exact/complete string. The naive thing to do is to try parsing a timestamp out of consecutively larger prefixes, starting at some minimum length, until you get a successful parse, and to buffer the entire bytestring passed to a write call if it doesn't (yet) contain an entire timestamp. But this would certainly be a bit slower.
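
For concreteness, the naive approach would be roughly this (a sketch only; the function name and signature are made up, and it assumes a known layout):

import "time"

// trimTimestampPrefix tries successively longer prefixes of line against
// layout, remembering the longest one that time.Parse accepts. Worst case
// this is one Parse call per byte of the line, which is the cost concern.
func trimTimestampPrefix(layout, line string) (ts time.Time, rest string, ok bool) {
	end := -1
	for i := 1; i <= len(line); i++ {
		if t, err := time.Parse(layout, line[:i]); err == nil {
			ts, end = t, i
		}
	}
	if end < 0 {
		return time.Time{}, line, false
	}
	return ts, line[end:], true
}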

The stdlib's time.Parse is pretty good at catching "extra text" errors. If we want to avoid the brute-force checking of all prefix lengths against a parse function, we could only trim times with an explicitly user-specified format string in combination with either:

Thoughts?

benhoyt commented 2 years ago

Hmm, yeah, it'd be nice if there were a time.ParsePrefix, or if time.Parse returned a more structured ExtraText error.

I definitely don't like the idea of parsing consecutively larger prefixes in a loop -- that seems like a very inefficient operation to do on each log line, and I'm already a bit concerned about efficiency here because this will happen on every log line.

However, after looking at the time.Parse code, I don't hate the idea of matching on "extra text" as much as I thought I might. For one thing, there's already a custom time.ParseError type with a ValueElem field that stores the (unquoted) extra text. We'd still have to detect it, though, so the code for that would look something like this:

t, err := time.Parse(layout, line)
if e, ok := err.(*time.ParseError); ok && strings.HasPrefix(e.Message, ": extra text: ") {
    prefix := line[:len(line)-len(e.ValueElem)]
    t, _ = time.Parse(layout, prefix) // parsing just the prefix should succeed
    err = nil
}

The Go authors are very unlikely to change the "extra text" message, so this doesn't seem horrible and is maybe a pragmatic approach. We'd want good tests that would break if Go 1.25 or whatever did change it. That said, it also doesn't seem horrible to copy a version of time.Parse into our tree and modify it.
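
For example, a test along these lines (a rough sketch; the test name is invented) would break if a future Go release changed the message or the ValueElem behaviour we'd rely on:

import (
	"strings"
	"testing"
	"time"
)

// Guards the assumptions the trimming code makes about time.ParseError:
// the Message prefix, and that ValueElem holds the unparsed suffix verbatim.
func TestExtraTextParseError(t *testing.T) {
	line := "2022-06-01T12:00:00Z the rest of the log line"
	_, err := time.Parse(time.RFC3339, line)
	perr, ok := err.(*time.ParseError)
	if !ok {
		t.Fatalf("expected *time.ParseError, got %T (%v)", err, err)
	}
	if !strings.HasPrefix(perr.Message, ": extra text: ") {
		t.Fatalf("message %q no longer starts with %q", perr.Message, ": extra text: ")
	}
	prefix := line[:len(line)-len(perr.ValueElem)]
	if _, err := time.Parse(time.RFC3339, prefix); err != nil {
		t.Fatalf("trimmed prefix %q did not parse: %v", prefix, err)
	}
}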

There's another alternative: have a list of regexen, and try to match the prefix using a regex first, and only call time.Parse if it matches. However, if we supported custom time formats, we'd need something that converted a Go time layout into a regex -- that's probably not too hard, but may be more work than it's worth. (With the default "auto-recognize timestamp" we could hard code the regexes.)
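
To make the regex-first idea concrete, hard-coded recognition could look something like this (the two patterns and layouts are just examples, not a proposed list):

import (
	"regexp"
	"strings"
	"time"
)

// Illustrative only: each auto-recognized format pairs an anchored regex for
// the line prefix with the Go layout used to parse whatever the regex matched.
var knownFormats = []struct {
	re     *regexp.Regexp
	layout string
}{
	{regexp.MustCompile(`^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})`), time.RFC3339},
	{regexp.MustCompile(`^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}`), "2006/01/02 15:04:05"},
}

// stripKnownTimestamp removes a recognized leading timestamp; time.Parse is
// only called for lines whose prefix already matched one of the regexes.
func stripKnownTimestamp(line string) (time.Time, string, bool) {
	for _, f := range knownFormats {
		if m := f.re.FindString(line); m != "" {
			if t, err := time.Parse(f.layout, m); err == nil {
				return t, strings.TrimLeft(line[len(m):], " "), true
			}
		}
	}
	return time.Time{}, line, false
}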

Overall, I'd probably lean towards the "extra text" hack to start with, and we can easily switch to a copied version of the time.Parse code later, or sooner if there's no appetite for that approach.

benhoyt commented 2 years ago

I've also asked about this on golang-nuts -- we'll see if there's any response.

rwcarlsen commented 2 years ago

I agree with shelving a pre-parse regex step for now. And I'd also be inclined to prefer the error wrangling over vendoring time.Parse into the codebase - and only switch to the heavier approach if/when we run into problems.

rwcarlsen commented 2 years ago

Another option: we could just have the user give us a regex for each service that we use to discard the matched portion of the log line. This has the benefit that we can discard more than just timestamps - e.g. maybe there is some sort of redundant MYSQL: prefix content in the log line as well that we could skip over since pebble also puts a service name in its log lines.

benhoyt commented 2 years ago

> Another option: we could just have the user give us a regex for each service that we use to discard the matched portion of the log line. This has the benefit that we can discard more than just timestamps - e.g. maybe there is some sort of redundant MYSQL: prefix content in the log line as well that we could skip over since pebble also puts a service name in its log lines.

Interesting ... yeah, that's a good idea. Explicit and simple to implement. Go's regexp package is relatively slow (guaranteed linear time, but slowish compared to other libraries) but should be good enough for this use case. I think we've previously discussed that if someone is pumping massive volumes of logs through this something's probably not configured right anyway.

Yeah, I like this approach.
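
Sketching the shape of it (everything below is hypothetical, including the idea that the pattern comes from a per-service field in the plan):

import "regexp"

// Hypothetical per-service setting: a user-supplied pattern whose match is
// discarded from the start of every log line before Pebble adds its own prefix.
type logPrefixStripper struct {
	re *regexp.Regexp // compiled once when the plan is loaded
}

func newLogPrefixStripper(pattern string) (*logPrefixStripper, error) {
	re, err := regexp.Compile("^(?:" + pattern + ")") // anchor to the line start
	if err != nil {
		return nil, err
	}
	return &logPrefixStripper{re: re}, nil
}

// strip removes the matched prefix, if any; otherwise the line is unchanged.
func (s *logPrefixStripper) strip(line string) string {
	if loc := s.re.FindStringIndex(line); loc != nil {
		return line[loc[1]:]
	}
	return line
}

For instance, with a pattern of `\S+ MYSQL: ` a line like "2022-06-01T12:00:00Z MYSQL: payload" would come out as just "payload".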

benhoyt commented 2 years ago

Notes from our meeting just now:

rwcarlsen commented 2 years ago

Here is the latest discussion from mattermost:

rwcarlsen Here is my k8s sidecar charm log/timestamp collection to date:

benhoyt [3:22 PM]

@rwcarlsen That's very interesting data -- I wouldn't have thought it! So I see it's mostly structured logging, either in key=value format or JSON (in the case of Mongo). I'm not sure what the 18:S prefix on the redis logs refers to, but unless that's part of the timestamp, that makes it only 6 out of 10 that don't have the time as a prefix.

I'm beginning to think time-trim isn't a good idea. I mean, even if we get more clever with " and = delimiters and can find the timestamp, do we then strip it out? That would leave the resulting log in a weird state, like (for the first example):

level=info ts= caller=table_manager.go:169 msg="uploading tables"

rwcarlsen 10:34 AM Yeah - the redis one is actually some sort of [PID]:[role] apparently - so not part of the timestamp :confused:

I agree that stripping the timestamp out of a structured log format seems wrong. Who would have thought that "unstructured" log formats were old-fashioned? It certainly doesn't feel like this can be the big kind of win we originally hoped for. I'm certainly okay shelving it (forever?).

benhoyt 2:52 PM @rwcarlsen Yeah, doesn't seem like we're winning given the above.

benhoyt commented 2 years ago

Given the research @rwcarlsen has done on this, and the discussion above, closing this for now. Seems like we're unlikely to solve this by just "stripping timestamps". But we can always re-open or open a new issue if we think of a better way.