cloudfoundry / log-cache-cli

Better interactions with Cloudfoundry's Log Cache
Apache License 2.0

`cf tail` loses messages #32

Closed keymon closed 1 year ago

keymon commented 6 years ago

Summary

cf tail drops messages, especially within multiline messages, likely due to the implementation of the logcache.Walk() function.

That function polls the log-cache API, advancing the start time on each iteration, but if messages are still being written to log-cache within the last few seconds, cf tail will miss them. (see below for more details)

Context

We are deploying log-cache and using this plugin (cf tail) with cf-acceptance-tests to test the service, as enabled here: https://github.com/cloudfoundry/cf-acceptance-tests/commit/923a0fe9335bf0a1e75286801e952f248491cd38

But we found that the following test is flaky:

[Fail] [apps] Default Environment Variables Default staging environment variables [It] applies default environment variables while running apps and tasks 

/tmp/build/0d76c07a/src/github.com/cloudfoundry/cf-acceptance-tests/apps/default_environment_variables.go:167

The failures were caused by different lines missing from the log output.

How to reproduce

  1. Deploy an application that prints ~10 lines in one go to STDOUT (a minimal example app is sketched after this list)
  2. Run cf tail <app> --follow
  3. Observe that some lines are eventually "dropped"
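
A minimal sketch of such an app (assumed for illustration, not the app from the original report): a plain Go HTTP server that writes a burst of ten lines to STDOUT on every request.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Emit ~10 lines in one go; these travel STDOUT -> Loggregator -> Log Cache.
		for i := 1; i <= 10; i++ {
			fmt.Printf("burst line %d of 10\n", i)
		}
		fmt.Fprintln(w, "logged 10 lines")
	})

	port := os.Getenv("PORT") // CF injects the port the app should listen on
	if port == "" {
		port = "8080"
	}
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
```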

Possible cause

I think the problem is in the code of the logcache.Walk() function: https://github.com/cloudfoundry/go-log-cache/blob/98cf35cfc1eb9cdb0c4d2362553776593eed4a5a/walk.go#L27-L113

That function runs an infinite loop that polls the log-cache API. If, at the moment the client is reading, messages are still being written to log-cache, the next iteration of the loop will skip and miss them.

One potential solution could be to add an option that sets the delay for messages considered "too new", by passing logcache.WithWalkDelay() when calling logcache.Walk():

```diff
diff --git a/pkg/command/cf/tail.go b/pkg/command/cf/tail.go
index 2ef47f2..fac8f3b 100644
--- a/pkg/command/cf/tail.go
+++ b/pkg/command/cf/tail.go
@@ -196,6 +196,7 @@ func Tail(
                        logcache.WithWalkStartTime(time.Unix(0, walkStartTime)),
                        logcache.WithWalkEnvelopeTypes(o.envelopeType),
                        logcache.WithWalkBackoff(logcache.NewAlwaysRetryBackoff(250*time.Millisecond)),
+                       logcache.WithWalkDelay(5*time.Second),
                )

                return
```
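
For reference, a minimal standalone sketch of how the suggested option would be used against go-log-cache directly (the address, app GUID, and visitor below are placeholders, and import paths may differ between library versions; the plugin's real code is the Tail function patched above):

```go
package main

import (
	"context"
	"fmt"
	"time"

	logcache "code.cloudfoundry.org/go-log-cache"
	"code.cloudfoundry.org/go-loggregator/rpc/loggregator_v2"
)

func main() {
	client := logcache.NewClient("http://log-cache.example.com")

	logcache.Walk(
		context.Background(),
		"app-guid",
		func(envelopes []*loggregator_v2.Envelope) bool {
			for _, e := range envelopes {
				fmt.Println(string(e.GetLog().GetPayload()))
			}
			return true // keep walking
		},
		client.Read,
		logcache.WithWalkBackoff(logcache.NewAlwaysRetryBackoff(250*time.Millisecond)),
		// Treat envelopes newer than 5 seconds as "too new", so messages that
		// are still being written to Log Cache are picked up on a later pass.
		logcache.WithWalkDelay(5*time.Second),
	)
}
```

The trade-off is the one discussed further down in this thread: every envelope is delivered at least the delay late, in exchange for giving slow envelopes time to land in Log Cache.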
cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/157995685

The labels on this github issue will be updated when the story is started.

jasonkeene commented 6 years ago

Interesting bug! @apoydence, would it be possible to do like etcd and return a header which is the last point the client read from? Then the client can pass that back to the server to continue reads from that point? I would prefer something like that over adding artificial latency.
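
Purely as an illustration of that idea (a hypothetical sketch, not an existing Log Cache or go-log-cache API), an etcd-style continuation would mean the server hands back a cursor in arrival order and the client resumes from it:

```go
// Hypothetical sketch only: neither Log Cache nor go-log-cache exposes such a
// cursor today. The point is that the client resumes from a server-assigned
// position in arrival order instead of advancing a wall-clock timestamp.
package sketch

type envelope struct{ payload []byte }

type readResponse struct {
	envelopes []envelope
	cursor    uint64 // last position handed out, in the order envelopes arrived at the server
}

// follow resumes each read from the server-supplied cursor, so an envelope
// that arrives late (with an "old" timestamp) is still delivered.
func follow(read func(cursor uint64) readResponse, visit func(envelope)) {
	var cursor uint64
	for {
		resp := read(cursor)
		for _, e := range resp.envelopes {
			visit(e)
		}
		cursor = resp.cursor
	}
}
```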

keymon commented 6 years ago

would it be possible to do like etcd and return a header which is the last point the client read from?

That would be a great solution if the API allows that :+1: ! :)

poy commented 6 years ago

@jasonkeene Not quite... The user is doing a flavor of this now, by setting the timestamp and moving it forward via the Walk algorithm. However, if data arrives "late" (which happens on larger systems as clocks start to drift), then the Walk algorithm gets ahead of data that took longer to get to Log Cache.
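
To make that concrete, a purely illustrative sketch with made-up numbers (not Log Cache internals):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	readAt := time.Now() // the walk performs a read now, covering envelope timestamps up to readAt

	envTimestamp := readAt.Add(-2 * time.Second)     // an envelope stamped 2s before that boundary...
	envArrival := readAt.Add(500 * time.Millisecond) // ...only lands in Log Cache half a second later

	// The next read only asks for timestamps >= readAt, so the late envelope
	// is never handed to the visitor.
	if envTimestamp.Before(readAt) && envArrival.After(readAt) {
		fmt.Printf("missed: envelope is %v behind the walk window\n", readAt.Sub(envTimestamp))
	}
}
```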

Log Cache does not keep track of any extra state for an envelope (e.g., when it arrived vs the timestamp on the envelope). To do what you are suggesting, we would need extra information.

I agree that adding latency isn't great. However, I also wonder: in most cases when you are trying to do something with the data (e.g., egress it to Stackdriver), would you prefer the data fast and out of order, or a few seconds latent but in order? I would want more information before making a larger decision here.

In the meantime, adding delay in the CLI seems reasonable. We could also add a flag, but I fear that it would be a knob we would later regret.

johannaratliff commented 6 years ago

@apoydence How would adding cli delay be different than "baking" (i.e. giving log cache more time for data to come through before walking it) a little longer, which we recently tried?

keymon commented 6 years ago

I added another issue that is also related to cf tail.

Maybe the issue is more related to log-cache itself. One solution would be for log-cache to keep some metric counter or similar, or to add the reception timestamp in milliseconds, and to allow querying based on that counter/timestamp.

toddboom commented 5 years ago

@keymon It doesn't look like we'll be able to solve this problem directly, but we have added a new feature that should make this less painful.

There is now a --name-filter option in the CLI that will allow you to do a cf tail against a subset of metrics. The benefit of doing this is that you can isolate a stream of metrics that might be delayed or have stale timestamps and would otherwise be elided by the underlying walk algorithm. It does require Log Cache v2.1.0 or higher, but we would love some feedback once you've had a chance to try it out.
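
For example (hypothetical metric name; the exact matching semantics depend on the plugin version), something like `cf tail my-app --name-filter "request_count" --follow` restricts the walk to envelopes whose name matches the given filter.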

Since I think this is the most effective action we could take on this issue, I'm going to close it out now that the name filtering feature is delivered. Thanks for all your input!

vanschelven commented 2 years ago

As a (late) visitor to this issue, I find that this doesn't inspire much confidence. What I'm reading is basically:

  1. cf tail loses messages. (this is in "you had one job" territory for me)
  2. fixing it is hard
  3. a workaround is (perhaps) available

Bonus points for turning off the test that (sometimes) proves this problem.

Shouldn't this issue be reopened to signify that a proper solution still hasn't been found, and that the logger is unreliable?

chombium commented 2 years ago

Hi,

I've experienced something like this in the past as well. I would say this problem could happen in CF up until cf-deployment v18.0.0, when the dopplers and the log-cache processes were split onto separate VM instances. In the past, log-cache was attached to the firehose via a nozzle, and the log envelopes traveled from the diego-cells through dopplers and RLPs to log-cache. Depending on which dopplers and RLPs an envelope traveled through on its way to log-cache, and on the load on log-cache, it could happen that some envelopes arrived late. The consumer, i.e. the log-cache client, has to take care of ordering the messages and making sure that it doesn't miss any. Two ways to fix this would be to play around with the age of the messages or to adjust the size of the reading window, but then how old is old enough, and how big should the window be to collect all of the messages? Even if we make them configurable, what would we set as default values? :thinking:

On the other hand, if we take into account that starting from cf-deployment v18.0.0 log-cache consumes the log envelopes directly from the diego-cell, we could mark this issue as obsolete. The only problems that can happen now are on the diego-cell itself and in the agents. If the diego-cell is under-scaled and cannot handle the load, i.e. the CPU load is around 75-80%, the forwarder, loggregator and syslog agents fight for CPU cycles and it can happen that some of them have ingress drops. If the diego-cell is properly scaled but the log envelope load is too big, the forwarder agent starts to drop at around 8-9k log envelopes/s. So the forwarder agent, the UDP forwarder and the prom scraper are the only common components on the diego-cell, and those are the components where problems might occur. The only problem for a particular app would be that it's unlucky and runs on a diego-cell together with some chatty app. As CF is a multitenant system, this happens regularly. Therefore a CF operator should configure the application log rate limit, so that the stability and reliability of the whole Loggregator stays within an acceptable range.

At the other end, we have the log-cache VMs where the log-cache-syslog-server and log-cache are running. I haven't seen any problems with the syslog server, and log-cache could make its API inaccessible to the other nodes if it's too busy doing something (e.g. sending messages to other nodes due to the source-id affinity), but whatever is written can be read through the log-cache API.

IMO, we can mark this as obsolete and advise CF operators to set application log rate limits.
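
Depending on CF CLI and CAPI versions, that limit can typically be set through org or space quotas, e.g. the `-l` (log volume per second) option on `cf update-org-quota` or `cf update-space-quota`; flag names and availability vary by version.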

@ctlong, @fhambrec What do you think about this?

chombium commented 2 years ago

I see that in my previous comment I forgot to mention that log-cache works properly if the VMs are scaled properly for the load they should handle: horizontally, max 10k log envelopes/s, and vertically, based on the size of the log envelopes, keeping the CPU load of the whole VM under 50-60%. The whole information about the scaling can be found in this Slack thread.

It would be interesting to see if @keymon and @vanschelven are experiencing the same issue if they are running CF deployed with cf-deployment v18.0.0 or newer...

ctlong commented 1 year ago

I tend to agree with @chombium.

Additionally, the tests using the log cache cf CLI plugin in CATs were re-written to use the log cache client directly, so this plugin shouldn't be responsible for any flakiness in tests.

I'm going to close this as resolved, but folks should feel free to reopen it if they find there's more to do.