Trace budgeting mechanism

svlad-90 commented 1 week ago

In my current project, I want to reduce the risk of trace spam, where one domain within a complex automotive system 'eats up' much of the dlt-daemon logging bandwidth.

I first investigated whether the number of messages dlt-daemon can process per second could be increased. In my current environment, the limit is ~5000 messages/second; beyond that, dlt-daemon drops messages, with quite a significant CPU load. My conclusion was that improving this significantly is not possible. I'm also aware of the best practices: dlt-daemon is not intended for heavy tracing of low-level data.

The other way is to have a trace budgeting mechanism per application ID and/or context ID that suppresses trace-spamming processes/contexts.
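To make the "per application and/or context ID" idea concrete, here is a minimal sketch of how a budget could be resolved for an incoming message. It is only an illustration, not existing dlt-daemon behavior: the `BUDGETS` table, the `resolve_budget` function, the example IDs, and the messages-per-second unit are all assumptions made up for this example.

```python
# Hypothetical default budget (messages per second) for anything not listed below.
DEFAULT_BUDGET = 1000

# Hypothetical budget table: (application ID, context ID or None) -> messages/second.
# A context ID of None means "applies to every context of that application".
BUDGETS = {
    ("NAV", None): 500,   # whole application "NAV"
    ("NAV", "GPS"): 50,   # tighter budget for one noisy context
}


def resolve_budget(apid, ctid, budgets=BUDGETS, default=DEFAULT_BUDGET):
    """Return the most specific budget: context-level, then application-level, then default."""
    if (apid, ctid) in budgets:
        return budgets[(apid, ctid)]
    if (apid, None) in budgets:
        return budgets[(apid, None)]
    return default
```

With this lookup order, a single noisy context can be throttled without touching the rest of its application, and applications that nobody configured still fall under a global default.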

I've seen the following non-merged PR: https://github.com/COVESA/dlt-daemon/pull/134

So, I'm not the only one who wanted such a feature. But it was not merged; thus, before starting development, I want to cross-check with maintainers the following points:

Would it be a new feature for the dlt-daemon, or am I missing something and it already exists? I'm asking because I've seen the following thread on StackOverflow: https://stackoverflow.com/questions/72269739/ubuntu-dlt-tool-trace-load-exceeded-trace-hard-limit-1-messages-discarded

It describes exactly the feature I want, with warning messages when the trace-spamming domain hits the limit. I also remember that when I was working on an OEM project, I had this exact feature in the dlt-daemon, the one described in that StackOverflow thread:

I got to know, this problem is related to trace load limits (soft limit and hard limit) mentioned in Payload column.

These limits should be set in the configuration file dlt-trace-load.conf for each application that uses the dlt daemon. They should be defined with the corresponding application ID.

Soft_limit: The warning limit; if more data than this is logged, a warning is written into dlt.
Hard_limit: If an application surpasses this limit, data will be discarded and a warning will be logged!
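The soft/hard limit behavior described in the quoted answer can be sketched as a fixed-window counter per application ID. This is a hedged illustration only: the class names, the byte-based unit, and the one-second window are assumptions, since the quoted text does not state how the limits are measured or how the original implementation works.

```python
from dataclasses import dataclass


@dataclass
class TraceLoadLimits:
    soft_limit: int  # assumed unit: bytes per window; crossing it only triggers a warning
    hard_limit: int  # assumed unit: bytes per window; crossing it causes messages to be dropped


class TraceLoadMonitor:
    """Fixed-window byte counter for one application ID (hypothetical design)."""

    def __init__(self, limits, window_s=1.0):
        self.limits = limits
        self.window_s = window_s
        self.window_start = 0.0
        self.bytes_in_window = 0
        self.soft_warned = False
        self.discarded = 0  # total number of messages dropped so far

    def on_message(self, size, now):
        """Return 'accept', 'warn' (accepted, soft limit just crossed), or 'drop'."""
        if now - self.window_start >= self.window_s:
            # The window has elapsed: start a new one with a fresh budget.
            self.window_start = now
            self.bytes_in_window = 0
            self.soft_warned = False
        if self.bytes_in_window + size > self.limits.hard_limit:
            # Accepting this message would exceed the hard limit: discard it.
            self.discarded += 1
            return "drop"
        self.bytes_in_window += size
        if self.bytes_in_window > self.limits.soft_limit and not self.soft_warned:
            # Warn only once per window to avoid adding to the spam.
            self.soft_warned = True
            return "warn"
        return "accept"
```

A caller would keep one monitor per application ID, pass each message's size and a monotonic timestamp to `on_message`, and write a warning into the log on "warn" or when `discarded` grows. A token bucket would smooth bursts better than a fixed window; the fixed window is used here only because it is the simplest thing that matches the quoted description.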

That's why I was almost sure it existed in the official delivery. However, I could not find any information regarding the 'dlt-trace-load.conf' file in this repository or elsewhere. So, this feature might exist in a private OEM's patched version of the dlt-daemon, and someone accidentally posted about it on StackOverflow.

=> Do maintainers know something about this implementation? Can we all get this feature without implementing it from scratch?
If not, is it OK to introduce such a feature? Or are there significant objections to having it at all? I'm asking because the previous PR on a similar topic was rejected. That's why I would like to know beforehand that the maintainers are okay with the idea of such a feature, so that my team's efforts are not thrown away.
If you approve of implementing it, would it be OK if I create some architectural diagrams and post them to this thread to align with maintainers on the possible implementation? I want to implement it properly right away, not to spend my and your time on endless reviews.

I am looking forward to getting your feedback! ))

minminlittleshrimp commented 1 week ago

Hello @svlad-90, it is nice of you to raise your concern and show your interest in DLT.

For your proposal, IMHO, I am okay with the feature. The only thing we need to worry about is making sure that the implementation does not affect the current mechanisms or APIs, does not violate the AUTOSAR standard/specification, and does not break any unit tests for current features, etc. I can do the validation, testing, and checking of your implementation later, in the review phase.

You can go ahead with the diagrams, mechanisms, and PRs, and do not worry at all; we will support you. The last PR was closed because the author's account became inactive, and we cannot proceed when a contributor drops out that way.

As for dlt-trace-load.conf, honestly, I have no idea what this file is or what it is for 😀 Maybe you are right that it comes from a commercial version from one of the partners in the alliance.

About this point:

If you approve of implementing it, would it be OK if I create some architectural diagrams and post them to this thread to align with maintainers on the possible implementation? I want to implement it properly right away, not to spend my and your time on endless reviews.

I have not touched DLT tracing much either, just the logging, so I'm fine with getting involved in this topic. I have no objection; let's work together. Looking forward to your response!

svlad-90 commented 1 week ago

Hi @minminlittleshrimp,

Thank you very much for your feedback and for being ready to collaborate!

As I'm working on a customer's project in my company, I'll need to plan these activities properly with my management. So, for your information, it might take 1-4 weeks until this task is part of the sprint, and I'm finally back with the diagrams.

But this feature seems crucial for our customer, who has chosen to use DLT as part of its technology stack, so there is a low chance that we will abandon it. ))

COVESA / dlt-daemon

Trace budgeting mechanism #643