extensive logging significantly change behavior

tomga commented 4 years ago

I'd like to ask for some ideas how to resolve problem of providing big amount of logs from arm device without influencing behavior of the system.

The problem comes from issues with debugging usb driver for rpi (#3247) where I'd like to be able to enable detailed debug of the usb driver and be able to gather and analyze those debug logs. Everyghing seems easy but Genode logging service, while extremely ease to use, is not prepared (or maybe someone will prove me wrong) to handle big amounts of logs. Or in other words current implementation has very big influence on runtime behavior of the system which makes it almost useless when debugging time critical components.

In this concrete case usb driver normally is able to initialize and gather information about connected devices. If I enable detailed debug the driver behaves so differently that it never succeeds in enumerating devices. Delays "produced" by logging cause the interplay of the driver and device impossible in required time constraints.

Generally I think there are two bottlenecks:

rpc call from the driver to core for each logged line of text

A solution to this could be introducing some buffer in the log producer component (usb driver in my case) and enabled optionally in component configuration. If I correctly understand bufferring of messages to a buffer that does not exceed 1KB should be straightforward because it does not involve changing API. Supporting bigger buffers would require extending current log service with some (packet stream based?) api that would allow transfering logs in bigger chunks than currently and would require complex log handling in core and therefore is not acceptable.

blocking writes to UART devices

Currently drivers for UART devices which perform blocking writes without any 'in memory' buffering. Buffers are limited just to builtin buffers in the device which are small and can't even consume one line of log (according to this thread uart buffers on rpi are 8 or 16 bytes depending on version; I don't know if other boards also have such small buffers).

I suspect that this problem cause bigger delays especially that it probably blocks at least whole processor core (I think I'm not mistaken in case of hw kernel but I don't know how it works in case of different kernels).

A resolution to this problem would require introducing a buffer for logs in core (it probably could be configrable too). Probably the most "interesting" part of this would be implementing interrupt driven pushing data from memory buffer to device when it's possible.

questions about this:

are there any errors in my reasoning and/or proposals?
if I implemented buffering (both in producer and in core) would you consider it as a useful feature and merge it?
do you see any other options to solve my original problem of logging that wouldn't have significant impact on working system? If possible I'd like to avoid solutions that require manual work like using JTAG or inspecting memory after restart using boot loader commands.

nfeske commented 4 years ago

Thanks for the well-articulated problem description. For cases like this, Genode features a light-weight tracing mechanism. With this mechanism, data can be logged without any system call or locking. It works by handing out a dedicated trace buffer to each traced thread.

I successfully used the tracing mechanism while enabling the first version of the USB driver for the Raspberry Pi. Without this mechanism, I certainly wouldn't have been able to succeed.

For using the tracing mechanism, please have a look at the test packages os/recipes/pkg/test-trace and os/recipes/src/test-trace_logger. In particular the latter should be of interest to you because it provides a logging infrastructure based on the tracing mechanism:

When started, the trace logger is instructed to trace a configurable set of threads. Those threads receive a local trace buffer each.
The traced threads can produce trace entries either by passing pre-defined trace points or by adding manual instrumentations using the trace function that works analogously to log but targets the trace buffer. All this content is captured in RAM only. Therefore it does not induce any substantial overhead.
After a configurable time span, the trace logger prints out all the trace buffers to the log.

Does this mechanism fit your problem?

For reading more about Genode's tracing mechanism, you may find the following parts of the documentation useful:

chelmuth commented 4 years ago

I have doubts your analysis is entirely correct and, thus in my opionion, leads to the conclusion that it may help to add extra complexity to the LOG mechanism. Therefore I'd like to point out two alternative options for your use case.

Years ago, we identified the delay introduced per LOG operation has significant impact on the system behavior esp. when debugging drivers. In this regard your observation is true: There is a sequential dependency from the logging component to writing the characters to the log backend in core, which is the UART. So, we designed and implemented the TRACE service, which is a combination of shared-memory buffers and controlled access to traced subjects via the core TRACE interface.

The second option (you may look into first) is the interception of the LOG service for your debug target component in a custom log server that caches (maybe Megabytes of) messages and dumps them later to the system LOG service or any other backend (e.g., tcp_terminal). This options breaks the direct dependency of the debug target to the slow UART and can be implemented by much less code than your proposal and also less code than a custom TRACE monitor.

tomga commented 4 years ago

@nfeske @chelmuth thank you both for your responses.

@nfeske Regarding trace mechanism I remembered about it and very briefly looked at it and got the feeling that it is not suitable for processing logs but only for some well defined data structures for events. Given that I'm trying to get differences in behavior of the same code on different devices I wanted to get extensive logs from both cases and compare them later I thought it is not possible with trace. Maybe I did not look long enough.

@chelmuth thank you for your second suggestion. It seems to be very clear solution and I'll definitely try it although it ma still be a problem that each log message requires rpc call. I don't know how important it may be.

tomga commented 4 years ago

@nfeske @chelmuth may I ask you for an advice where would be a good place for a change that would redirect logs from single component to trace? Currently I managed to redirect some portions of logs from lx_emul to go to trace but not all of them. Specifically output from printk goes to logs and my attempts to change it did not work. I'd investigate it further but I think it can be easier to redirect everything if I knew where to put such modification.

chelmuth commented 4 years ago

There's no general configuration option to achieve this. For dde_linux you may change Genode::log to Genode::trace here.

https://github.com/genodelabs/genode/blob/46fa8197a21289f54bb96c814bf62fc14d149c6e/repos/dde_linux/src/lx_kit/printf.cc#L191-L199

tomga commented 4 years ago

@chelmuth I checked your hint and it worked great. Thanks.

Although I could gather required logs for debugging I wanted to use knowledge about tracing I gathered in last days to create a more generic solution for this problem. I have a working solution that allows to redirect logs from components using policies configurable in test_logger component configuration. It is implemented in following series of commits:

27945dbcf1 trace: extend policy with log_output fd5d442963 trace: log_output policy that redirects logs to trace 0108494db2 trace: enable check in logs for trace interception - This commit is the one I'm not comfortable with. I had to change some generic api's to pass information to trace that this call comes from call to log and not trace to enable policy check for redirection and in return received info if trace has been committed. 04973b1b44 trace: enabled log_output policy in usb_hid_raw.run

94239aac80 trace_logger: avoid output of empty lines - This commit is not directly connected with above functionality but eliminates empty lines in output trace_logger for logs intercepted from dde_linux (they have '\n' at end).

If you find the idea useful I'll polish code and make changes that you suggest after review.

nfeske commented 4 years ago

Thanks for sharing this interesting line of work. I agree that this could become quite useful.

This commit is the one I'm not comfortable with.

Likewise. ;-) I think that the commit tries to solve two problems at once that should better be separated:

Introducing a way to suppress messages being sent to the LOG session.
Tracing the calls of the log, warning, and error functions.

By design, the tracing mechanism should not change the behavior of the program. It should be a passive measurement device. In your commit, you make the behavior dependent on whether the tracing is enabled or not, which goes against the grain of the design. The correct way would be to capture the trace of log output in addition to the regular interaction with the LOG session. Following this approach, we should introduce a new trace-point type in include/base/trace/events.h and use this in the default_log.cc, exactly how this is done for the existing trace points (e.g., in rpc_server.h). So the resulting patch would be much less invasive.

Of course, this alone is not what you want right now because you want to use the tracing as a lightweight alternative (not an addition) to the log. So we should introduce a way to suppress calls to the log session by the means of a config argument. When combining both features, you get what you want. I don't know yet how I would implement the toggling of log verbosity though. This would require some further thought.

Does that line of thoughts sound reasonable to you?

tomga commented 4 years ago

@nfeske I have mixed feelings about this idea. Initially you convinced me completely but the more I thought about it I had more doubts.

I'll start from undermining your main argument:

By design, the tracing mechanism should not change the behavior of the program.

Generally it is hard to discuss about it unless you start thinking about logs not as part of the program functionality but only as a means to diagnose program behavior. I can't think of cases where log output is part of program functionality. The more I think about it even output captured for tests is not by any means part of functionality. So I think ability to redirect logs to trace wouldn't contradict with not changing program functionality.

Continuing this line of thought I don't see a good reason to have the same logs/traces in both channels. Moreover having two separate configurations for "configuring" output of logs that are hard or even impossible to change in synchronized fashion is not something good.

On the other hand I like the simplicity in your proposal. But I rather expected that you'll show me a better place to put the condition that would switch output than propose different behavior. I thought that maybe it is possible to make it more deep in implementation that will not require changes in public api.

I wrote points about not treating logs as part of program functionality just to not throw away behavior of my version too quickly. Both variants solve my problem of gathering logs through trace buffer from component for some time from startup.

There could be a reason to choose my version if there are convincing use case for dynamically switching logs to traces on and off during runtime. Currently I cannot find such use case (actually I found a few but for all of them I could find a different solution) so I'll make changes in whatever direction you choose.

One minor thing I'm not sure. Is it possible to guarantee in your version that no logs at startup will be lost? If logs are disabled completely and trace policy is installed in runtime is there a time when logs would be lost? Not that I see it as a big problem. It's just a question if it is possible in current trace mechanism to guarantee policy installation in component at startup.

nfeske commented 4 years ago

@tomga you are presenting a very convincing line of arguments.

I like that you questioned my statement that the LOG output would be part of a program's behavior. The more I think about it, the more I like your thought.
With my suggestion, I completely disregarded the need for a clean hand-over between the direction of output to the LOG or the TRACE services. My proposal has no good solution for that.

That leaves the question of how to make the commit https://github.com/genodelabs/genode/commit/0108494db24416bac181b1ff68a77af88d5b7ae2 a little less invasive.

What do you think about making the log output depend on the outcome of the trace policy (len > 0) instead of introducing the check_policy argument? That would imply that the trace policy should generate at least one byte of information even the logged string is empty. But this should be doable by capturing a zero value.

Instead of adding the bool return value to the existing Thread::trace method, I would prefer the introduction of a second method trace_captured that is called at the only place where the return value is of interest (default_log.cc). As the rationale behind that, in the future, I'd like to statically check all our code for unconsumed return values (which would be potential errors). So if a return value is not of interest in the typical case of an API call, the API should not bother the caller with a return value.

tomga commented 4 years ago

@nfeske I am pleased that I've been able to convince you.

What do you think about making the log output depend on the outcome of the trace policy (len > 0) instead of introducing the check_policy argument?

I think I did not explain the reason for introducing check_policy argument and you did not see that trace policy outcome (len = policy_module->log_output(dst, msg, len);) is already used. Introducing check_policy was needed to not change behavior of existing direct calls to trace. I did not find acceptable changing any current behavior and that's why I couldn't apply policy to all calls to trace.

Instead of adding the bool return value to the existing Thread::trace method, I would prefer the introduction of a second method trace_captured ...

If I correctly understand that would require introducing additional state in Trace::Logger to remember result of last call to policy and two new functions:

bool Thread::trace_captured();
bool Trace::Logger::trace_captured();

to access this information. And there wouldn't be any substitute for check_policy which allows to distinguish direct calls to trace from those from attempt to redirect logs.

So maybe it would be acceptable to introduce:

bool Thread::trace_log(const char*);
bool Trace::Logger::trace_log(char const *msg, size_t len);

Returning bool from new function used only for attempt to trace logs wouldn't be a problem and "checking policy" would be an obvious consequence of using function trace_log instead of trace.

What do you think?

tomga commented 4 years ago

One addition to check_policy. It was introduced to not change current behavior that if tracing is enabled for a component than calls to trace in that component are pushed to trace buffer unconditionally. I find this behavior controversial as trace monitor has no way to specify that it is not interested in this information (e.g. only in rpc).

nfeske commented 4 years ago

bool Thread::trace_log(const char*);
bool Trace::Logger::trace_log(char const *msg, size_t len);

That's what I had in mind, just calling the methods trace_captured instead of trace_log (to avoid tying them to the specific log use case).

I find this behavior controversial as trace monitor has no way to specify that it is not interested in this information (e.g. only in rpc).

The manually placed trace calls are not meant to become regular part of the code but serve as debugging utility only. They accommodate use cases of manual (temporary) instrumentation. Just like debug messages, such instrumentations should never reach the master branch. In contrast, the more formal trace points (for RPC, signals) are always compiled in. They are designated for the monitoring of unchanged components.

tomga commented 4 years ago

That's what I had in mind, just calling the methods trace_captured instead of trace_log (to avoid tying them to the specific log use case).

I completely misunderstood your intentions in this case. Now I think everything is clear and I will prepare patches for review.

tomga commented 4 years ago

@nfeske branch issue_3714 contains adapted changes rebased on current staging.

I added there a commit in trace_logger that you did not comment on yet.

Please review when time permits.

tomga commented 4 years ago

@nfeske I attempted to extend trace_logger test for new policy. Behavior seems to be ok but I failed to find out what output pattern should I add to make test pass. I don't know exact rules for matching there and output on base linux gave reports in different sequence than in expected log output. Maybe you'll be able to quickly spot the problem.

tomga commented 4 years ago

I found a problem in pattern in test and pushed again with fix now.

nfeske commented 4 years ago

@tomga please excuse my delayed response. Your new commit series looks perfect! Thank you for this nice feature and the great discussion. :-)

chelmuth commented 4 years ago

Merged to staging.

genodelabs / genode

extensive logging significantly change behavior #3714