cloudfoundry / docs-loggregator


CF Loggregator / Observability Architecture Needs a Big Picture #74

Closed FWinkler79 closed 6 months ago

FWinkler79 commented 1 year ago

Hi there,

I recently tried to understand the CF Loggregator architecture and how applications, as well as the CF platform components themselves, are generally monitored. I found the documentation sometimes misleading and fuzzy. In particular, I felt that an all-in-one picture showing all the components described in Loggregator Architecture and how they relate would help greatly.

So I reverse-engineered (by reading the docs and trying to make sense of them) the following picture, which I'd be happy to donate to the CF documentation if someone on your side can confirm that it reflects the architecture properly.

Cloud Foundry Observability Architecture drawio

Drawio File: Cloud Foundry Observability Architecture.drawio.zip

In particular, I also tried to properly distinguish the signal types, i.e. application logs vs. platform logs, etc. I have already confirmed much of the picture with colleagues on our side, but you are the experts, of course.

So if this helps, I would be happy to see it added to the CF Loggregator documentation. I feel it would help readers understand the architecture much better. It might also make discussions about evolutions of the architecture (e.g. extending it for native OpenTelemetry support) easier.

Note: This shows the Firehose architecture, not the "Shared Nothing" architecture.

cf-gitbot commented 1 year ago

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

Benjamintf1 commented 1 year ago

The things that stick out at me immediately:

  1. I don't believe the interface for RLP to Doppler is "push", to my knowledge; at least, not any more than Traffic Controller's is.
  2. We might not need to document the RLP path to Log Cache at this point on docs.cloudfoundry.org?
  3. Some components that aren't rep could still send directly to the Forwarder Agent.
  4. I'm not sure why "short term storage" is marked as a component in log-cache. It's all in memory.

EDIT: Actually, it seems that for 1) you've got a mix of representing data flow and representing connection flow in that diagram. Perhaps you should choose one. My inclination would be to represent connections and not data flow.

FWinkler79 commented 1 year ago

Hi @Benjamintf1! Thanks a lot for the feedback. It may well be that there are still some minor errors in the picture. After all, it was an attempt to reverse-engineer the picture from the current documentation (from several locations of it, actually). Let me try to answer your points:

  1. I don't believe the interface for RLP to Doppler is "push", to my knowledge; at least, not any more than Traffic Controller's is.

Here is what the CF documentation says: "Reverse Log Proxy (V2 Firehose): Reverse Log Proxies (RLPs) collect logs and metrics from Dopplers and forward them to Log Cache and other consumers over gRPC. Operators can scale up the number of RLPs based on overall log volume."

  2. We might not need to document the RLP path to Log Cache at this point on docs.cloudfoundry.org?

I think it helps to understand where you would integrate your monitoring tools. As I understand it, RLP is the natural extension point for everyone who wants to bring their own monitoring backends. In fact, we use this to hook in metric and logging backends. If we are looking for a way forward to support OpenTelemetry in CF, that might also be a spot where an OTel collector could be placed. Also, RLP is already a part of the documentation, so having it depicted in a full architecture picture would help, I think.

  3. Some components that aren't rep could still send directly to the Forwarder Agent.

Not sure I understood you. Are you referring to the fact that the components above rep (i.e. health checker, route emitter) don't have any arrows to the forwarder agent? I tried to depict in the Component VMs box that there are other components that push to the forwarder agent (and that some of them may expose StatsD metrics, while others may expose Prometheus). Again, all of that was reverse-engineered from the current documentation, which is not always clear, unfortunately.

  4. I'm not sure why "short term storage" is marked as a component in log-cache. It's all in memory.

The documentation says: "Log Cache provides short-term storage for logs and metrics where the cf CLI and web UIs can access them." I was not aware that this is only in memory. But if it's a sort of buffer that's used to provide a certain short-term history, I would also say it should be on the picture. We could give it a UML stereotype of <<in-memory>> or something like that, if it helps.

Let me know if you still want me to change the picture. I also attached the drawio file, so you can even play around with it yourself.

Benjamintf1 commented 1 year ago

The data flow from RLP is towards the consumer, but the connection, much like with Traffic Controller, is initiated from the client. Doppler more or less only receives connections; RLP and Traffic Controller are more or less the API multiplexers to all Dopplers, and RLP Gateway is a gRPC-to-HTTP gateway. So if we're tracking client connections, everything external should point toward the Gorouter, the Gorouter should point inward towards RLP Gateway / Traffic Controller / the Log Cache CF Auth Proxy, RLP Gateway towards RLP, RLP towards Doppler, the Log Cache nozzle towards RLP, the Log Cache Auth Proxy towards the Log Cache Gateway, and the Log Cache Gateway towards Log Cache. If it's data flow, then just flip the arrow of Doppler towards Traffic Controller.
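To illustrate the client-initiated direction: a firehose-style consumer opens the stream itself, and the envelopes then flow back toward it. Below is a rough sketch using go-loggregator's RLP Gateway client; the gateway URL, shard id, and source id are placeholders, a real setup needs an OAuth token attached via a custom HTTP client, and the import paths depend on the go-loggregator major version in use.

```go
package main

import (
	"context"
	"fmt"

	loggregator "code.cloudfoundry.org/go-loggregator" // import path varies by major version
	"code.cloudfoundry.org/go-loggregator/rpc/loggregator_v2"
)

func main() {
	// The consumer dials the RLP Gateway; nothing is pushed to it unsolicited.
	client := loggregator.NewRLPGatewayClient("http://localhost:8088") // placeholder gateway URL

	// Open a stream of log envelopes for a single source id.
	stream := client.Stream(context.Background(), &loggregator_v2.EgressBatchRequest{
		ShardId: "example-shard",
		Selectors: []*loggregator_v2.Selector{
			{
				SourceId: "example-app-guid",
				Message:  &loggregator_v2.Selector_Log{Log: &loggregator_v2.LogSelector{}},
			},
		},
	})

	// Data flows from Doppler through RLP and the gateway back to this client.
	for {
		for _, envelope := range stream() {
			fmt.Println(envelope.GetSourceId(), string(envelope.GetLog().GetPayload()))
		}
	}
}
```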

"As I understand it RLP is the natural extension point for everyone who wants to bring their own monitoring backends." Not really. RLP was developed for its own reasons; as we move towards and encourage shared nothing, we'd encourage using drains for that kind of need, much like how Log Cache no longer has the nozzle to ingest from RLP.

"We could give it a UML stereotype of <<in-memory>> or something like that, if it helps." Sure, if you want to mark somewhere that Log Cache is where the storage happens in memory, that could make sense. I don't think it makes sense to mark it as a separate "component".

"I tried to depict in the Component VMs box that there are other components that push to the forwarder agent (and that some of them may expose StatsD metrics, others may expose Prometheus)." Yeah, my thought was that it might make sense to mark in the "Other VMs" box that there's that third path directly to the forwarder agent.

FWinkler79 commented 1 year ago

@Benjamintf1, thanks again for the feedback!

I uploaded a new version of the diagram. See the picture attached below. Changes:

I kept RLP in the picture. I think it helps the general understanding of how data is made available. Thanks also for the info that you are advocating more for a shared-nothing approach. We currently deploy the setup shown in the diagram. Maybe in the future we might have to create another diagram for a shared-nothing approach.

Would this be fine for you now?

Thanks!

Cloud Foundry Observability Architecture drawio

Cloud Foundry Observability Architecture.drawio.zip

chombium commented 1 year ago

Hi @FWinkler79, thanks for taking the time to draw the diagram and for following @Benjamintf1's suggestions.

I find the diagram pretty good, and I have a few suggestions for improvements:

  1. I've always found the division into Diego and Component VMs a bit confusing, as people might think that the two VM "types" have different components running on them. Actually, the Diego VMs also have CF platform components running on them. I think it would be better if we concentrate on the sources and the paths taken by the logs and metrics from the applications and the platform components. We should also think about simplifying the applications' logs and metrics path, as it has too many details about the CFAR (CF Application Runtime) which are irrelevant in this case.

    There are two important things about the applications:

    1. They have to write their logs to STDOUT or STDERR
    2. Diego, the CFAR, will collect the container metrics

      For platform components:

      1. They can use libraries like go-loggregator or similar to send metrics and logs directly to the Forwarder Agent (a small go-loggregator sketch follows this list)
      2. Some of them expose metrics via a Prometheus-scrapable endpoint, which is collected by the Prometheus Scraper

    Take a look at the loggregator-agent-release for details. I guess we can simplify and abstract a few things ;) Application -> CFAR -> Forwarder Agent ...... Platform Component -> Forwarder Agent/Prom Scraper ....

  2. Syslog Agent
    1. There are two types: application and aggregate:
      1. Application syslog drains are configured by the app developers and can forward application logs, container metrics, or both, based on the drain-type parameter.
      2. Aggregate drains forward all application logs, container metrics, and platform component metrics from all applications and all platform components running on the VM to the specified syslog drain URL. These are used to forward everything to a logs or metrics management system.
    2. The Syslog Agents are the closest point to the applications (they run on the same VM where the apps are running) where the logs and metrics can be read out of the platform. They are the suggested way to pull application logs and container metrics from CF.
    3. The connection from the Syslog Agents to Log Cache is made with aggregate drains.
  3. Log Cache:

    1. It used to be an internal Firehose consumer in the past. Now it consumes logs and metrics from the Syslog Agents via aggregate drains.
    2. There is no connection between the RLPs and the Log Cache nodes anymore.
    3. It has its own API for log consumption, which is a bit more complex to implement, but there is a library available as well as a CF CLI plugin (a small go-log-cache read sketch also follows this list).
  4. There are also other logs and metrics related to applications and platform components that are injected into their "logs and metrics stream", for example Gorouter access logs for the applications.

  5. Using the Firehose route is not recommended: the Dopplers aggregate the logs and metrics from all of the VMs and have limited capacity. Even when everything is scaled properly for the load, a misbehaving application that generates too many logs (for example because something is wrong and it writes too many stack traces) can overflow the buffers, and as a result some logs and metrics from other apps may be dropped. The same is true for the platform component metrics. Therefore it is suggested to use Syslog Drains, as such disturbances are not present in that case. Another thing that can help if the Firehose is used is to set an application log rate limit.
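To make the direct emission path to the Forwarder Agent a bit more concrete, here is a minimal sketch using the go-loggregator library. It is an illustration only, not taken from the docs: the agent address, certificate paths, and metric names are placeholders, and the exact import path depends on the go-loggregator major version in use.

```go
package main

import (
	"log"

	loggregator "code.cloudfoundry.org/go-loggregator" // import path varies by major version
)

func main() {
	// mTLS credentials for the local agent's gRPC ingress endpoint (paths are placeholders).
	tlsConfig, err := loggregator.NewIngressTLSConfig(
		"/var/vcap/jobs/my-component/config/certs/ca.crt",
		"/var/vcap/jobs/my-component/config/certs/client.crt",
		"/var/vcap/jobs/my-component/config/certs/client.key",
	)
	if err != nil {
		log.Fatal(err)
	}

	client, err := loggregator.NewIngressClient(
		tlsConfig,
		loggregator.WithAddr("localhost:3458"), // assumed local agent ingress address
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.CloseSend() // flush buffered envelopes before exiting

	// Emit a counter and a log line; the local agent forwards them downstream.
	client.EmitCounter("example_requests_total", loggregator.WithDelta(1))
	client.EmitLog("component started", loggregator.WithStdout())
}
```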
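And on the consumption side, a rough sketch of reading envelopes back out through the Log Cache API with the go-log-cache client library. Again just an illustration: the endpoint and source id are placeholders, the import path depends on the released major version, and on a real foundation the request goes through the Log Cache CF Auth Proxy and needs an OAuth token attached via a custom HTTP client.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	logcache "code.cloudfoundry.org/go-log-cache" // import path varies by major version
)

func main() {
	// Placeholder endpoint; in CF this would be the Log Cache behind the CF auth proxy,
	// e.g. https://log-cache.<system-domain>, with an OAuth token via a custom HTTP client.
	client := logcache.NewClient("http://localhost:8080")

	// Read envelopes for one source id (an app guid or a platform component name),
	// starting one hour in the past.
	envelopes, err := client.Read(
		context.Background(),
		"example-app-guid", // placeholder source id
		time.Now().Add(-time.Hour),
		logcache.WithLimit(100),
	)
	if err != nil {
		log.Fatal(err)
	}

	for _, e := range envelopes {
		fmt.Println(e.GetTimestamp(), e.GetSourceId())
	}
}
```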

bobbygeeze commented 1 year ago

Hi @chombium, @Benjamintf1, and @FWinkler79, do we have a final diagram for inclusion in the docs after all the suggestions? Also, with the new diagram, are there any doc updates/revisions needed? When the diagram is finalized, we will need it as a .png file. Thanks.

chombium commented 1 year ago

@bobbygeeze I would also prefer if we add some explanation with the diagram. I guess we'll add either some paragraph on the Loggregator Architecture page or even add a new page, when everything is ready to be uploaded.

bobbygeeze commented 1 year ago

@chombium in the diagram above there is an explanation block, but it contains a typo: "The direction of connection setup may be differ." It should be "The direction of the connection setup might differ."

Benjamintf1 commented 1 year ago
image

These things are the same in terms of api connections. Both are streamed data initiated by a client connection.

FWinkler79 commented 1 year ago

These things are the same in terms of api connections. Both are streamed data initiated by a client connection.

But they end up in different boxes (RLP vs. Traffic Controller). Does it harm to have them depicted as being separate, and also noting the push vs. pull nature?

@chombium in the diagram above there is an explanation block, but it contains a typo: "The direction of connection setup may be differ." It should be "The direction of the connection setup might differ."

Thanks for spotting it. This is fixed now.

@bobbygeeze I would also prefer if we add some explanation with the diagram. I guess we'll add either some paragraph on the Loggregator Architecture page or even add a new page, when everything is ready to be uploaded.

Is that something you would expect me to do, or is that something you would do? I'd prefer the latter, since you certainly are the experts on the topic and I am just the "reverse-engineer"... 😉

@chombium Thanks a lot for the explanations and clarifications. I tried to reflect your comments on the syslog drain types and the fact that LogCache no longer connects to RLP. The following things are still open:

  1. I was not aware that Diego Cells are essentially not any different from component VMs. However, I did not quite get what you would suggest instead. Just call both of them "VM", or merge both into one? I am not in favor of merging, as it may become messy. WDYT?
  2. Are you using Diego and CFAR interchangeably, i.e. are they the same thing?
  3. Would you suggest adding Diego to the picture?
  4. When you talk about "simplifying the Applications' logs and metrics path as it has too many details about the CFAR" what exactly did you have in mind?
  5. I did not know how to best reflect your issue number 4, i.e. logs from Gorouter. How are these injected into the application logs?
  6. Your issue number 5 (Firehose route should not be used) is something I would suggest having as a written statement in the documentation, but not depicted in the diagram. We could remove the Firehose altogether, but then the diagram would miss an important piece of legacy(?) which might still be relevant for readers. For us, this is still relevant, for example.

Updated files

Cloud Foundry Observability Architecture drawio

Cloud Foundry Observability Architecture.drawio.zip

Benjamintf1 commented 1 year ago

"But they end up in different boxes (RLP vs. Traffic Controller)." Yes. "Does it harm to have them depicted as being separate," No. "and also noting the push vs. pull nature?" Yes, it does. Both are client-initiated connections streaming data from Doppler. Where are you getting this push vs. pull for Traffic Controller (v1) vs. RLP (v2)?

bobbygeeze commented 1 year ago

@FWinkler79 in the diagram above there is an explanation block, but it still contains a typo: "The direction of connection setup might be differ." It should be "The direction of the connection setup might differ."

bobbygeeze commented 1 year ago

@chombium you said earlier " I would also prefer if we add some explanation with the diagram. I guess we'll add either some paragraph on the Loggregator Architecture page or even add a new page, when everything is ready to be uploaded." Please provide any necessary text you wanted added, thanks!

FWinkler79 commented 1 year ago

@Benjamintf1

Where are you getting this push vs pull for traffic controller(v1) vs rlp(v2)?

Good question. I checked the documentation again, and I must have misinterpreted it. Thanks for clarifying. This is fixed now. I also added one more sentence to the note below that explains this for Firehose clients.

@bobbygeeze

Should be "The direction of the connection setup might differ."

Thanks, and sorry for overlooking it. This should be fixed now. The latest files are below.

Updated Files

Cloud Foundry Observability Architecture drawio

Cloud Foundry Observability Architecture.drawio.zip

Benjamintf1 commented 1 year ago

Personally I'd just leave "pull based" off both. I don't think it clarifies things (at least not in a way that isn't equally true for other connections down the chain).

chombium commented 1 year ago

Hi @FWinkler79,

I'm terribly sorry for the late reply :(

  1. I was not aware that Diego Cells are essentially not any different from component VMs. However, I did not quite get what you would suggest instead. Just call both of them "VM", or merge both into one? I am not in favor of merging, as it may become messy. WDYT?

The Diego cells are no different at all from the other component VMs, except that they are VMs with large resources (number of CPUs and GBs of RAM) and run Diego, which runs the apps.

  2. Are you using Diego and CFAR interchangeably, i.e. are they the same thing?

Yes. Diego is the CFAR

  3. Would you suggest adding Diego to the picture?

We have it already... App Container, Garden, Executor...

  4. When you talk about "simplifying the Applications' logs and metrics path as it has too many details about the CFAR" what exactly did you have in mind?

The important things are that the apps have to write their logs to stdout or stderr and that the "Diego Runtime" emits the container metrics. IMO it is enough if we have something like the following: diego

On the other hand, even though the diagram gets a bit more crowded, it might be useful to have the Diego (CFAR) internals on the diagram, to see which Diego components are involved in log and metric forwarding... Everything depends on the level of abstraction or detail we want to depict. We can leave things the way they are, but I would remove the Route Emitter, which (un)registers application routes in the Gorouter, and the Health Checker, which checks the app state.

  5. I did not know how to best reflect your issue number 4, i.e. logs from Gorouter. How are these injected into the application logs?

There are a few different types of logs written when something happens to the app: logs written when the app is pushed, logs from the staging process (STG), API logs when a command is executed through the CF CLI/API, RTR logs when the Gorouter routes a request to the app, APP logs when the app itself writes logs, etc...

The CF platform components emit logs to Loggregator that are related to an application. Loggregator correlates them based on the source id (app guid or platform component name), and when the log consumers receive the logs, they get everything with the same source_id.

  6. Your issue number 5 (Firehose route should not be used) is something I would suggest having as a written statement in the documentation, but not depicted in the diagram. We could remove the Firehose altogether, but then the diagram would miss an important piece of legacy(?) which might still be relevant for readers. For us, this is still relevant, for example.

I completely agree with you. We won't remove the Firehose from the diagram; we will have to write some explanation of the diagram anyway, and we'll describe the pros and cons of the Syslog Drains and the Firehose.

bobbygeeze commented 1 year ago

Hi @chombium. Once the final version of the diagram is agreed upon, can you please provide the text that will probably be needed around the diagram? Thank you in advance.

FWinkler79 commented 1 year ago

Thanks @chombium!

Here is what I did:

I did not reflect the different logs related to an application and emitted from various components. I totally understood how this is done, but reflecting this in the same picture would likely make it explode.

I have one question though: in the diagram, I depicted the components in the VMs as emitting logs via an rsyslog agent into a box called Platform Log Consumers (at the top of the diagram). I always wondered whether the rsyslog agents in the VMs would also send logs to Log Cache and, if not, how the "Platform Log Consumers" are supposed to receive those logs without the rsyslog agent having to be re-configured whenever a new consumer appears or having to send to multiple consumers. How exactly do platform components expose their logs, what is the rsyslog agent's role, and does it connect to any component in the diagram other than the "Platform Log Consumers"? The documentation is very vague there.

Updated Files

Cloud Foundry Observability Architecture drawio Cloud Foundry Observability Architecture.drawio.zip

chombium commented 1 year ago

Hi @FWinkler79,

thanks for the update and sorry for being late with my reply :(

I did not reflect the different logs related to an application and emitted from various components. I totally understood how this is done, but reflecting this in the same picture would likely make it explode.

That's totally fine, we will add some text description about it ;)

I have one question though: in the diagram, I depicted the components in the VMs as emitting logs via an rsyslog agent into a box called Platform Log Consumers (at the top of the diagram). I always wondered whether the rsyslog agents in the VMs would also send logs to Log Cache and, if not, how the "Platform Log Consumers" are supposed to receive those logs without the rsyslog agent having to be re-configured whenever a new consumer appears or having to send to multiple consumers. How exactly do platform components expose their logs, what is the rsyslog agent's role, and does it connect to any component in the diagram other than the "Platform Log Consumers"? The documentation is very vague there.

Yet another thing that needs documentation improvement ;)

The platform component logs are not collected and processed by Loggregator. All of the platform components are started as BOSH jobs, most of them with bpm (bpm-release), and are "kept running" by monit. By default, all of the jobs are configured to write their logs to the /var/vcap/sys/log/<job-name> folder, with separate log files for stdout and stderr. These logs are handled by the syslog-release: it practically tails all the configured log files, applies include and exclude rules (which lines should be kept or dropped), packs the logs in RFC 5424 syslog format, and hands them to RSYSLOG, which forwards them to the external log consumers. RSYSLOG is installed on every BOSH VM. The syslog-release adds the log parsing rules and the configuration for RSYSLOG forwarding.

I think we are really close to finishing the diagram. I would only change the labels on the Loggregator Agent and everywhere else from "app logs and component metrics" to "app logs, container metrics and platform component metrics". The Loggregator Agent and the Syslog Agent send the same data in different formats.

After we finish the diagram, we can think about where to put it on docs.cloudfoundry.org, add some proper description, and finish this PR ;)

FWinkler79 commented 1 year ago

Hi @chombium, thanks a lot for the explanation and links. That's really helpful. I updated the diagram according to your feedback, so I hope this is now the final version you were looking for. If not, please let me know, of course.

Updated Files

Cloud Foundry Observability Architecture drawio Cloud Foundry Observability Architecture.drawio.zip

chombium commented 1 year ago

Hi @FWinkler79,

Only a few more labels :)

I think it would be better if we go back to the "Diego Cell VMs" and "Other VMs"/"Platform VMs" labels, so that we have the same labels as in the Architecture Diagrams. It would be good if the naming is unified. Sorry for asking you to change the labels earlier :(

I see only a few missing or wrong labels, and then we have to write a good description and put the diagram in the docs :)

Thank you in advance.

Best Regards, Jovan

FWinkler79 commented 1 year ago

Hi @chombium,

no problem at all. I added the missing labels and reverted the names of the VMs. I also beautified the connections a bit. I hope this all works now, and I am really looking forward to seeing this online!

Thanks again! Florian

Updated Files

Cloud Foundry Observability Architecture drawio

Cloud Foundry Observability Architecture.drawio.zip

chombium commented 1 year ago

@FWinkler79, thank you very much for the prompt response and the changes. Everything looks good. I will start writing a description for the diagram and hope to open a PR by the end of the week.

bobbygeeze commented 1 year ago

It seems we have the final iteration of the image. @chombium do we have the final doc updates for CF Loggregator / Observability Architecture Needs a Big Picture #74?

bobbygeeze commented 1 year ago

@chombium here is the topic that needs doc updates/edits to coincide with the new diagram: https://docs.vmware.com/en/VMware-Tanzu-Application-Service/4.0/tas-for-vms/loggregator-architecture.html. Which diagrams need to go?

chombium commented 1 year ago

Hi @bobbygeeze We don't need to remove any diagrams. The existing diagrams are general diagrams that describe the whole Loggregator architecture and the deployment possibilities, and the diagram from this PR is a detailed diagram of the data flow. It's like a technical deep dive into the Loggregator architecture. We will pack it with an appropriate description, either as a paragraph on the page you've linked or as a separate page.

bobbygeeze commented 1 year ago

Thanks @chombium for the explanation. Please open a PR for this work so we can edit the text and know the placement of the information. Also, copy @anita-flegg on this work since I'll be out of the office from 9/22-10/3. Again, thanks...

ctlong commented 1 year ago

Just a couple of notes for the future of this diagram:

No action required at the moment though :)

chombium commented 1 year ago

@ctlong I started writing the description of the diagram and will open a PR next week at the latest. For the time being, I've only added the OTel Collector and will add a description for it as well.

anita-flegg commented 1 year ago

@ctlong , @chombium , @FWinkler79 , @Benjamintf1 Let's close this off so that users can gain the full benefits of these major improvements. Thanks! :)

FWinkler79 commented 1 year ago

@anita-flegg While I totally share that sentiment, I think my part is done here. Or is there anything else you still need from me?

anita-flegg commented 1 year ago

@FWinkler79 - understood. I don't know what might still be needed. If all of the contributors are done with their input, then we can finish it off in the docs. I do see that @chombium was still planning some more description.

chombium commented 1 year ago

Hi @anita-flegg, @FWinkler79, @ctlong, @Benjamintf1,

First of all, I'm terribly sorry for the delay. I've been really busy lately and we had some changes in Loggregator (the addition of the OTel Collector), so I didn't manage to make any significant progress. I've started writing a new paragraph on the Loggregator Architecture page, but then we had some Slack discussions about operating Loggregator, like this one, and I think it would be better if we update the Loggregator guide for Cloud Foundry operators. I want to take @FWinkler79's diagram, add a description of the metrics and monitoring for each of Loggregator's components, and add some scaling suggestions based on the Loggregator performance tests we've run in the past 2-3 years. Feel free to ping me if you have other ideas or if anything is missing.

bobbygeeze commented 1 year ago

@anita-flegg @FWinkler79 @Benjamintf1 and @chombium how about adding some information with the pic here: https://docs.vmware.com/en/VMware-Tanzu-Application-Service/5.0/tas-for-vms/log-ops-guide.html

chombium commented 1 year ago

@bobbygeeze I don't have access to the TAS docs repo, someone else will have to do that.

FWinkler79 commented 1 year ago

Hi everyone, when I started my journey that led me to drawing the picture, I started here: https://docs.cloudfoundry.org/loggregator/architecture.html

There (or on one of the subpages) is where I would have hoped the picture would be a nice addition and would also be found. But of course that depends on how many people actually read these pages vs. the ones @bobbygeeze suggested. I guess you will have better insight into that.

Best regards!

FWinkler79 commented 6 months ago

Folks, what's happening with this issue? Quite frankly, I spent a considerable amount of time with @chombium to get a picture that hits the spot. It would be a shame if that were all a waste of time. Any plans to merge this into the official documentation?

anita-flegg commented 6 months ago

@FWinkler79 , I would love to merge it, but reading the comments, I get the impression that the OTel description to go with the graphic is still incomplete. If I'm wrong, please let me know and I will merge it right away. Thanks!

FWinkler79 commented 6 months ago

@anita-flegg, the picture does not contain any OTel bits yet. It is an overview of the current CF monitoring architecture. I know this may be subject to change in the future, but as far as I know it has not been decided what an OTel-based version would look like. I may be wrong, though, as I am not closely following the discussions. Anyway, the picture I created and refined with @chombium tries to tie the many different subpages of the CF documentation together into a big picture.

anita-flegg commented 6 months ago

@FWinkler79 , Thanks for explaining. I will add this to my list for the next few days.

ctlong commented 6 months ago

Thanks @FWinkler79! Excited to see this great improvement make it into the docs pages 😁

anita-flegg commented 6 months ago

@FWinkler79 , would you mind checking to make sure I put the new graphic in the right place? https://docs-staging.vmware.com/en/VMware-Tanzu-Application-Service/6.0/tas-for-vms/loggregator-architecture.html

If this is right, I will publish it in both the TAS and CF books. Thanks!

anita-flegg commented 6 months ago

Published - closing