cloudfoundry / cli

The official command line client for Cloud Foundry
https://docs.cloudfoundry.org/cf-cli
Apache License 2.0

CF CLI v7+ log streaming dropping considerable amount of logs #2213

Open vtumuluri opened 2 years ago

vtumuluri commented 2 years ago

Please fill out the issue checklist below and provide ALL the requested information.

Describe the bug and the command you saw an issue with
When I tail the logs for an app with cf logs <app> using any CLI version after 6.43, a considerable share of the log messages is dropped (roughly 40% to 50%). The same test with CF CLI v6.43 does not drop any messages. I have an app that logs each incoming request body; I tail its logs with cf logs <app> while sending sample requests like this:

for i in {0..100}; do curl -H "Accept: application/json" -H "Content-Type: application/json" http://app/9LKOAC9IAFD6PCDE -d '{"EventId": "'$i'"}' & done

What happened With CF CLI v6.50+, I see only 40% to 50% of the event IDs from the above example. If I use CF CLI v6.43, I see all the event IDs.

Expected behavior I expect that no log messages are dropped. When I enable verbose output, I see that CF CLI v6.50+ polls an HTTPS endpoint for the logs and does not establish a WebSocket connection. With CF CLI v6.43, it streams the logs over a WebSocket connection.

Exact Steps To Reproduce:

  1. Push an app that logs request bodies
  2. Run cf logs <app>
  3. Send some range-bound identifiable sample requests to the app
  4. Verify that you see all the request bodies in the logs
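
A consolidated sketch of the steps above (the app name, route, and file names are placeholders; the endpoint path and request loop are taken from the description):

# terminal 1: stream the app logs and keep a copy
cf logs my-app | tee streamed.log

# terminal 2: send 101 identifiable requests in parallel
for i in {0..100}; do
  curl -s -H "Accept: application/json" -H "Content-Type: application/json" \
    https://my-app.example.com/9LKOAC9IAFD6PCDE -d '{"EventId": "'$i'"}' &
done
wait

# after the requests finish: count the events that made it into the stream
# (should print 101 if nothing was dropped)
grep -c '"EventId"' streamed.log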


crsimmons commented 2 years ago

I'm not completely sure what your issue is but it could be related to the scale of the foundation you're testing on.

In the background cf logs calls the log-cache API (at least in v7 - I'm not sure what v6 did). We recently did some load testing of log-cache using cf logs and found that you can generally get up to about 500 logs/s per app instance before starting to see significant log loss. Full results: https://cloudfoundry.slack.com/archives/CUW93AF3M/p1631692658010100

Log-cache (and Loggregator in general) starts dropping a lot of logs if the foundation is improperly scaled. We also did a load test of foundation-wide Loggregator a few months ago (https://cloudfoundry.slack.com/archives/CUW93AF3M/p1626876673001000) and found that you need at least (total logs/s from all apps and components) / 10,000 Doppler VMs (where log-cache is running) to not have significant log loss.
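
To put a rough (made-up) number on that rule of thumb: a foundation whose apps and platform components emit about 50,000 logs/s in total would need at least 50,000 / 10,000 = 5 Doppler VMs to avoid significant log loss.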

loafoe commented 2 years ago

@crsimmons what we are actually seeing is that most versions of the CF CLI do not open a WebSocket connection but instead poll the API for logs from a certain timestamp. Example:

cf version 6.43.0+815ea2f3d.2019-02-20

WEBSOCKET REQUEST: [2021-10-04T09:33:44Z]
GET /apps/0dbfe534-de9f-4484-8ab1-77b84f0362ea/stream HTTP/1.1
Host: wss://doppler.[redacted].com:4443
...
WEBSOCKET RESPONSE: [2021-10-04T09:38:44Z]
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
...

Through the WebSocket connection, zero log loss is observed!

However, a CLI compiled from the v7 or v8 branches makes the following repeated (polling) calls:

REQUEST: [2021-10-04T11:36:36+02:00]
GET /api/v1/read/0dbfe534-de9f-4484-8ab1-77b84f0362ea?envelope_types=LOG&start_time=1633340149351457080 HTTP/1.1
Host: log-cache.redacted.com
Authorization: [PRIVATE DATA HIDDEN]
User-Agent: cf/0.0.0-unknown-version (go1.17; arm64 darwin)

The result is huge gaps in the logs.

The mystery question is why one version of the CLI can use WebSockets while others go into this kind of polling mode.

chombium commented 2 years ago

I've seen similar behavior when trying to tail logs with cf logs <APP_NAME>. If the app logs at a higher rate, either no logs are shown at all or not all of them are shown. I'm working on Ubuntu GNU/Linux 18.04. Whenever I tried cf logs --recent <APP_NAME>, I could see all of the logs I had sent. The command with --recent returns the last 1000 app logs; by the way, 1000 is the maximum number of log envelopes that can be returned by the log-cache read API call and is controlled by the limit parameter.

The problem with the missing logs originates from the way Loggregator (the CF logging stack) works and the way the logs are collected (streamed) by the cf logs command.

The app logs collected on the Diego cells travel across different paths until they are written to log-cache. A single loggregator-agent running on a Diego cell balances the load across multiple Dopplers; see the Loggregator Architecture. On top of that, the log-cache nodes have source-id affinity, which means all of the logs from a single app are stored on the same log-cache node. Log-cache, which is deployed on the Doppler VMs, is a cluster of nodes that communicate with each other and distribute the log envelopes based on the source-id (app GUID). This means that the logs don't arrive in log-cache in the order they were produced (Loggregator does not guarantee ordered delivery), so it might take some time (not much) until all of the logs are in the cache.

When cf logs sends GET requests to log-cache, it sets the start time; since the number of logs to retrieve is not specified, the default of 100 logs (the limit parameter of the log-cache read API call) is used. So, starting from a given start time, 100 logs are read. The setup of the log-cache client for "walking" the cache is done here. The start time is calculated as the timestamp of the last log message read minus one second.
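
To see what such a read looks like, you can query the log-cache read endpoint yourself. A sketch (the host and app GUID are copied from the verbose output above, the start time and limit values are placeholders, and token handling is simplified):

# cf oauth-token prints a "bearer ..." value that can be used directly as the header
TOKEN="$(cf oauth-token)"

# read up to 1000 LOG envelopes starting from a given nanosecond timestamp
curl -H "Authorization: ${TOKEN}" \
  "https://log-cache.example.com/api/v1/read/0dbfe534-de9f-4484-8ab1-77b84f0362ea?envelope_types=LOG&start_time=1633340149351457080&limit=1000"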

As @crsimmons mentioned, you might have log loss due to improper scaling of the Loggregator components, but even if everything is scaled properly, under a high log load some logs may be written to the cache only after cf logs has already sent its read request; they are missed by the "read window" and are therefore missing from the displayed log stream.

You can try another quick test with --recent, or check the log-cache state with the cf tail command from the log-cache CLI plugin, and compare the sent vs. received logs. If you're more into programming, you can check the Log Cache Client, but you will also have to take care of renewing the user's oauth-token; you could get some ideas from go-cfclient.
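
For example, something along these lines (a sketch; the app name and file name are placeholders, and the exact cf tail flags may differ between plugin versions):

# install the log-cache CLI plugin from the community repository
cf install-plugin -r CF-Community "log-cache"

# stream logs via log-cache, keep a copy, stop with Ctrl-C when the test is done
cf tail --follow my-app | tee tailed.log

# count the test events that were received
grep -c '"EventId"' tailed.log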

The wss://doppler... WebSocket endpoint used by the v6 cf CLI relies on the V1 Loggregator API, which is deprecated and discouraged. That's why v7 and v8 of the cf CLI collect the logs from log-cache, which is fed by the newer V2 Loggregator API.

@vtumuluri, @loafoe I don't know what your intentions are or what you are trying to use the app logs for. Generally speaking, cf logs is intended only for debugging. If you need a more reliable way to get the app logs, take a look at syslog drains or the log-cache API.
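
For the syslog drain route, a minimal sketch (the drain name, app name, and endpoint URL are placeholders):

# create a user-provided service that forwards the app logs to an external syslog endpoint
cf create-user-provided-service my-log-drain -l syslog-tls://logs.example.com:6514

# bind it to the app; logs start flowing to the drain shortly after binding
cf bind-service my-app my-log-drain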

HTH.

Best Regards, Jovan

loafoe commented 2 years ago

@chombium thanks for your detailed explanation. We have indeed positioned cf logs as a debugging-only tool for the developers on our platform. However, it has become extremely unreliable because as much as half of the messages are lost with the v7 and v8 versions of the CLI. The WebSocket-based v6 version still works flawlessly. We are now nudging developers to use our central logging system (based on syslog drains), but cf logs was one of the conveniences developers enjoyed for quickly debugging an app. Those days seem to be gone now?

chombium commented 2 years ago

If it's only for debugging, then depending on the app log rate the developers can still use cf logs for low log rates and cf logs --recent for higher log rates. In cases where cf logs is not useful, you can use cf tail from the log-cache cf CLI plugin, which is more reliable and, depending on the use case, lets users specify better filters for the time range and the type of envelopes. The plugin can also output the logs in JSON format. Please note that the start and end times in the cf tail command are in nanoseconds.
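
For example (a sketch; the app name is a placeholder and the flag names are as I remember them from the plugin, so they may differ between versions):

# nanosecond timestamp for "5 minutes ago" (GNU date, as on Ubuntu 18.04)
START=$(date -d '5 minutes ago' +%s%N)

# read LOG envelopes for the app from that point on, as JSON
cf tail --envelope-type log --start-time "${START}" --json my-app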

I hope the new situation does not complicate your development process.