cloudfoundry / eirini-release

Helm release for Project Eirini
Apache License 2.0

App logs are missing sub-second precision #73

Closed: jspawar closed this issue 5 years ago

jspawar commented 5 years ago

Description

We recently noticed that, after enabling app logs via https://github.com/cloudfoundry-incubator/eirini-release/blob/master/helm/eirini/templates/daemonset-fluentd.yaml, we do successfully receive all logs from our apps; however, the log timestamps are missing sub-second precision.

When tailing/following logs this is not an issue for message order, but when looking back at recent logs via cf logs --recent we see that order is no longer preserved, which we suspect is a result of this lack of timestamp precision.

Steps to reproduce

  1. Deploy CF+Eirini
  2. Apply the aforementioned daemonset (and other needed things) to Eirini's target cluster
  3. Push an app
  4. Run cf logs my-app --recent (see the command sketch below)
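
For reference, a rough command-line sketch of the steps above. The manifest filename and the assumption that the Helm template has already been rendered are illustrative, not part of the original report; the app name matches the example above:

    # Apply a rendered copy of the fluentd daemonset (plus the other needed RBAC/config)
    # to Eirini's target cluster; "daemonset-fluentd.rendered.yaml" is a placeholder name.
    kubectl apply -f daemonset-fluentd.rendered.yaml

    # Push a test app, then fetch its recent logs
    cf push my-app
    cf logs my-app --recent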

What was expected to happen

2019-05-08T10:09:29.20-0700 [APP/PROC/WEB/0] OUT 2019-05-08T17:09:29.203203397+00:00 line 0
2019-05-08T10:09:29.20-0700 [APP/PROC/WEB/0] OUT 2019-05-08T17:09:29.203256377+00:00 line 1

NOTE: The second set of timestamps on each of the above lines (which have correct precision) are logged by the app itself from its STDOUT

What actually happened

2019-05-08T10:09:29.00-0700 [APP/PROC/WEB/0] OUT 2019-05-08T17:09:29.203203397+00:00 line 0
2019-05-08T10:09:29.00-0700 [APP/PROC/WEB/0] OUT 2019-05-08T17:09:29.203256377+00:00 line 1

NOTE: The second set of timestamps on each of the above lines (which have correct precision) are logged by the app itself from its STDOUT

Additional information (optional)

We are deploying CF using cf-deployment and eirini-bosh-release with BOSH.

We then manually apply the fluentd daemonset to our k8s cluster.

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/165891137

The labels on this github issue will be updated when the story is started.

alex-slynko commented 5 years ago

@jspawar We think you copy-pasted the wrong log into "What actually happened" and it should be:

2019-05-08T10:09:29.00-0700 [APP/PROC/WEB/0] OUT 2019-05-08T17:09:29.203256377+00:00 line 1
2019-05-08T10:09:29.00-0700 [APP/PROC/WEB/0] OUT 2019-05-08T17:09:29.203203397+00:00 line 0

alex-slynko commented 5 years ago

We can't reproduce it in our environment. Please check whether this is related to your setup.

jenspinney commented 5 years ago

Hey @alex-slynko,

Thanks for looking into this. I think there are two symptoms we've seen related to the logging not being precise enough:

  1. The log lines we see from Eirini don't have sub-second precision (e.g., instead of 2019-05-08T10:09:29.20, we see 2019-05-08T10:09:29.00)
  2. The logs from within a single second are sometimes in reverse order or randomly ordered within that second. We think this is a side effect of the above symptom.

When you reproduce this, do you see sub-second precision in your log lines? I assume the no-repro was because you didn't see the mis-ordering of logs? The log lines from the dora app have two timestamps in them: one from the app itself and one from loggregator. When we were seeing the mis-ordering, only the loggregator-generated timestamp was problematic. (If any of this is unclear, happy to chat on Slack or jump on a Zoom call.)

It's also possible this is due to something in our configuration.

gdankov commented 5 years ago

Hey @jenspinney & @jspawar,

We were able to reproduce the sub-second precision problem you see. We found that it was caused by fluentd: we were using v0.12, which doesn't return sub-second precision. After bumping fluentd to v1.4.2 we now see sub-second precision in the logs.
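
As a quick sanity check, the fluentd version actually running in a daemonset pod can be printed like this (a sketch only; the namespace and label selector below are assumptions, adjust them to your deployment):

    # Pick one fluentd pod from the daemonset; namespace and label selector are assumed values
    POD=$(kubectl -n eirini get pods -l name=fluentd -o jsonpath='{.items[0].metadata.name}')
    # Print the fluentd version inside the container; it should report 1.4.2 or later after the bump
    kubectl -n eirini exec "$POD" -- fluentd --version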

We were not actually able to reproduce the ordering problem you see -- for us it always shows the logs in the right order.

We are currently waiting for the pipeline to pass before the change gets onto master; then we'll let you know so you can verify whether it works in your environment.

Best regards, Georgi && @akshaymankar

jenspinney commented 5 years ago

Hi @gdankov and @akshaymankar,

Thanks for looking into this! It looks like the commit to bump fluentd is now on master, and there's a new "latest" tag of eirini/loggregator-fluentd. I was wondering if there will be a specific tag like 0.4.0 that would include these changes. If we reference the "latest" tag on the docker image, it could bring some surprises for us down the line, and we'd rather try it out on a tag we can pin to.

Regarding not seeing the misordering: it's just random behavior that you may or may not see. We talked to the loggregator team when digging into this, and they said that if the logs the loggregator system receives don't have sub-second precision, loggregator itself won't know how to order them. They may be ordered correctly sometimes, but it's not deterministic.

Thanks again, Jen

gdankov commented 5 years ago

Hi @jenspinney,

We recently decided to switch from using specific tags like x.y.z to using image digests because of their immutability and predictability. Currently, we use a digest only for our loggregator-fluentd image, but the plan is to eventually use it for all of our images.

You can find the loggregator-fluentd digest here.
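
For anyone who wants to resolve the digest themselves, one way to do it with plain docker (shown as a sketch; the digest you get back will simply be whatever the current tag points at, and none of this is specific to the Eirini tooling):

    # Pull the tag and print the pinnable digest reference for the image
    docker pull eirini/loggregator-fluentd:latest
    docker inspect --format='{{index .RepoDigests 0}}' eirini/loggregator-fluentd:latest
    # prints something like eirini/loggregator-fluentd@sha256:<digest>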

Would this solution work for you?

Thanks, Georgi

alex-slynko commented 5 years ago

Thank you again for reporting it, @jenspinney & @jspawar. The change is now in the master branch.

jenspinney commented 5 years ago

FYI, we just had a chance to validate this, and it fixes the bug for us!

And yes, @gdankov, the image digest approach is great. I just didn't realize you had moved to that. Thanks!