GMLC-TDC / HELICS

Hierarchical Engine for Large-scale Infrastructure Co-Simulation (HELICS)
https://docs.helics.org/en/latest/
BSD 3-Clause "New" or "Revised" License

Improved Broker diagnostics #450

Open · bpalmintier opened this issue 6 years ago

bpalmintier commented 6 years ago

Feature Request / Support Question

The verbose outputs from helics_broker can be very helpful in debugging, but there are currently a few areas for improvement.

phlptp commented 6 years ago

Proposed log level definitions for brokers/cores:

- error/warning: as expected
- summary: connection/init (with summary)/termination
- normal: summary + fed connection/disconnection + some remote connection info for brokers
- debug: normal + interface registration, interface connections, exec entry/time grant for each federate that passes through
- trace: all messages

Proposed log level definitions for federates:

- error/warning: as expected
- summary: connections/state changes
- normal: same as summary
- debug: normal + interface registration, interface connections, exec entry/time grant
- trace: all messages
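
For orientation, here is a minimal sketch of how a verbosity level gets selected when a broker is created programmatically, assuming the C++ BrokerFactory API and a --loglevel init-string flag as found in current HELICS builds; the level name passed is purely illustrative, since the levels above are still proposals:

    #include <chrono>
    #include <thread>

    #include "helics/core/BrokerFactory.hpp"

    int main() {
        // Create a ZMQ broker expecting two federates; the --loglevel value
        // is illustrative because the named levels are still under discussion.
        auto broker = helics::BrokerFactory::create(
            helics::core_type::ZMQ, "mainbroker", "-f 2 --loglevel=debug");
        // Keep the process alive until all federates disconnect.
        while (broker->isConnected()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(200));
        }
        return 0;
    }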

bpalmintier commented 6 years ago

I think this sounds about right, but I am confused about how pub, sub, and messages fit in.

phlptp commented 6 years ago

For pub/sub/endpoints/filters, the registration and connections are part of the interface connections section.

I put those as part of the debug level. The other option is to put the interface connections as part of the normal level, and the actual data communication as part of the debug level instead of trace.

Trace is all messages, but what goes in debug vs. normal vs. summary is in question, as is whether those levels should be renamed to something else.

For my own purposes I want either warnings/errors or trace, since I am often debugging the timing. I just haven't put that much effort into the other levels, so I am trying to figure out what would be useful for different use cases.

I would think seeing the federate connections at a normal level would be useful, but what is appropriate for a debug level that isn't everything?

shwethanidd commented 6 years ago

@trevorhardy might have some suggestions from the use-case point of view. He wanted a diagnostics tool that could provide information about the data traffic at a higher level (i.e., at the normal level).

trevorhardy commented 6 years ago

When troubleshooting with FNCS, the first step is to make sure that all the federates have registered. Having that information available at a higher level (as in your "summary") is good.

The next thing we often do is make sure everybody is requesting and being granted a meaningful time, something like the "normal" level of the broker in the hierarchy you've put together, Phil.

If, after looking at the time requests each federate is making, a problem is still clearly evident, the next step is almost always to start digging into the actual values being exchanged.

From a user standpoint, debugging a federation almost always boils down to watching the time requests, time grants, and message values. This may lead back to the source code of the individual federate, but that is clearly a higher-level problem that we hope most users don't have to dig into (unless they wrote the simulator).

I would propose that we add a "timing + values/messages" level that would be most useful for the user but would avoid all the messages a developer like @phlptp would need. Like @bpalmintier said, we would want federate names displayed at this level, and we would probably need it to be supported both on the broker side and on the federate side.
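
One way to surface exactly the timing and value messages described here, without drowning in a full trace, might be a per-federate logging callback that filters on level. A sketch against the C++ application API follows; setLoggingCallback exists in later HELICS releases, but treat the exact signature as an assumption:

    #include <iostream>
    #include <string>

    #include "helics/application_api/Federate.hpp"

    // Print only log messages at or below maxLevel, tagged with their source
    // (which carries the federate name requested in the discussion above).
    void attachLogWatcher(helics::Federate& fed, int maxLevel) {
        fed.setLoggingCallback(
            [maxLevel](int level, const std::string& source, const std::string& message) {
                if (level <= maxLevel) {
                    std::cerr << '[' << source << " @" << level << "] " << message << '\n';
                }
            });
    }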

phlptp commented 6 years ago

So you are thinking of some level between debug and trace that includes the data values being transmitted?

trevorhardy commented 6 years ago

I will answer that by saying it should be at a level that is above the need for knowing what specific APIs are being called. (I don't know what "interface registrations" are, so I couldn't say whether those would be helpful or not.)

phlptp commented 6 years ago

Interface registrations are a federate defining its publications/subscriptions/inputs/endpoints/filters and then connecting them to other publications/endpoints/inputs.
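
To make that concrete, a rough sketch of the registration step as it looks in the 2.x-era C++ application API (the federate, key, and type names here are made up for illustration):

    #include "helics/application_api/ValueFederate.hpp"

    void registerInterfaces(helics::ValueFederate& fed) {
        // "Interface registration": the federate declares what it publishes
        // and what it wants to receive...
        auto& pub = fed.registerPublication("voltage", "double");
        auto& sub = fed.registerSubscription("other_fed/current");
        // ...the matching of subscriptions to publications (the "interface
        // connections") is then resolved during initialization.
        (void)pub;
        (void)sub;
    }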

trevorhardy commented 6 years ago

Then, yes, I would say below debug and above trace is perfect.

bpalmintier commented 6 years ago

I like where this is headed. One clarifying point: I think the "ping" tracking for confirming uptime is only needed at the most thorough (trace) level. Currently it appears at lower levels and can start to dominate the display, unnecessarily obscuring federate and message tracking.

Looking back at the current breakdown, I wonder if the currently sparse lower levels might be shuffled/combined so that we can fit the data exchange into one of the existing levels. This might look like the following (for both brokers/cores and federates, and using the suggested name revisions):

  1. error/warning (only): silent except for errors/warnings
  2. summary: connection/init (with summary)/termination. The summary at entering the execution state might include counts of various features, such as the number of value feds, message feds, filters, publications, subscriptions, endpoints, events, etc., as well as the min/max of the minimum time steps across federates. Then another summary at critical events, at least at the end, showing the final time and maybe the number of time steps and the number of exchanges by type (see the query sketch after this list).
  3. "connections": summary + fed connection/disconnection + interface registration, interface connections + some remote connection info for brokers
  4. timing: connections + exec entry/time grant for each federate. Might it also be possible to flag which federate took the longest to act?
  5. debug: connections + data sent/received.
  6. trace: all messages, including pings
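
As an aside, some of the counts mentioned in the summary item could also be pulled on demand through the query interface rather than logged; a hedged sketch, assuming the "federation" target and query strings from the documented query set:

    #include <iostream>

    #include "helics/application_api/Federate.hpp"

    // Ask the federation for summary-style information on demand; the query
    // names ("federates", "publications", "endpoints") are assumptions taken
    // from the documented query set.
    void printFederationSummary(helics::Federate& fed) {
        std::cout << "federates:    " << fed.query("federation", "federates") << '\n';
        std::cout << "publications: " << fed.query("federation", "publications") << '\n';
        std::cout << "endpoints:    " << fed.query("federation", "endpoints") << '\n';
    }
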
phlptp commented 6 years ago

Would you want the ability to discriminate between a federate connection and an interface connection? The reason I ask is that there could be a large difference in the number of messages at each of those levels. Say there were 10 federates, but each had 10-20 interfaces; that is a significant difference (20x or more) in the number of messages that display, especially at the broker level.

One thing to keep in mind is that, by design, no level or component will have complete global information, so the brokers are not likely to be aware of much of the traffic going on or be able to compute timing information. We talked about some of this in #438 as well; I think they are related issues.

phlptp commented 6 years ago

I think what we might do, in the interest of not getting bogged down, is tweak some of this in a PR I am working on related to some logging changes, start down this path, and then tune it over time to make it look more like what we are outlining here. I think it is going to be an iterative process to get this to an acceptably useful state.

This is a good discussion!

phlptp commented 6 years ago

Here are the current levels I am looking at:

    enum class log_level : int {
        no_print = -1,    //!< never print
        error = 0,        //!< only print errors
        warning = 1,      //!< print/log warnings and errors
        summary = 2,      //!< print/log summary information
        connections = 3,  //!< print summary + federate-level connection information
        interfaces = 4,   //!< print connections + interface-level connection information
        timing = 5,       //!< print interfaces + timing (exec/grant/disconnect)
        data = 6,         //!< print timing + data transmissions
        trace = 7         //!< trace-level printing (all processed messages)
    };
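
For reference, a sketch of how a federate might request one of these levels once they land, using the integer-property setter from the 2.x API; the property constant and setter names are assumptions from that API:

    #include "helics/application_api/Federate.hpp"
    #include "helics/helics_enums.h"

    // Request the "timing" verbosity (level 5 above) for this federate's log
    // output; helics_property_int_log_level is the 2.x integer property key.
    void setFederateVerbosity(helics::Federate& fed) {
        fed.setProperty(helics_property_int_log_level, 5);
    }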

trevorhardy commented 6 years ago

Looks good to me, at least good enough for us to try using and see how it goes.

phlptp commented 6 years ago

The logging levels have been integrated, along with some logging messages, but more needs to be added to finish this issue. That will probably come in the 2.0 branch, as I am not 100% sure there will be any more substantive releases in the 1.3.X series.