Closed joelagnel closed 7 years ago
Hi Brendan, regarding your comment "Is the idea to avoid building the DataFrame for performance reasons? Is it parsing the trace that takes too long?".
Performance is part of the reason, but it's also impossible to use data frames in all situations. The data frames are built per-event, so you have one df per event, e.g. one for sched_switch and one for sched_wakeup. In order to build an analysis script that processes a state machine, `.apply()` will not solve the problem, because you would `.apply` on the first data frame (which corresponds to the first event) and then `.apply` on the second data frame for the next event.
However, what is needed is something that processes event as a function of time. So for example, if the events are:
```
Time  event
1     switch
2     wakeup
3     switch
```
Then, using `.apply`:

```
switch_df.apply(..)
wake_df.apply(..)
```

the program perceives the events as:

```
apply(1)
apply(3)
apply(2)
```

whereas what is needed is:

```
apply(1)
apply(2)
apply(3)
```
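The difference can be shown with a toy example (hypothetical frames mirroring the table above; assumes a reasonably recent pandas):

```python
import pandas as pd

# Toy per-event frames matching the table above, indexed by timestamp
switch_df = pd.DataFrame({'event': ['switch', 'switch']}, index=[1, 3])
wake_df = pd.DataFrame({'event': ['wakeup']}, index=[2])

seen = []
switch_df.apply(lambda row: seen.append(row.name), axis=1)
wake_df.apply(lambda row: seen.append(row.name), axis=1)
print(seen)  # [1, 3, 2] -- per-frame apply visits events out of time order

# A single time-sorted view of both frames restores the required order
merged = pd.concat([switch_df, wake_df]).sort_index()
print(list(merged.index))  # [1, 2, 3]
```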
I don't see how this can be done using data frames. Even if it can be done with dfs, it will still not be as fast: using data frames in a WIP script [1], I found that it's 3 times slower than [2]. I don't think we should make data frames a rule if something can be solved in a different, cleaner, faster way. Also, citing the above example, I am not convinced that data frames with joins and other operations can solve all analysis problems, but I'd love to be proved wrong!
As such, the API I am proposing is clean, so I don't see an issue with it if someone doesn't want to use data frames for a particular use case, as I am.

[1] https://github.com/joelagnel/lisa/blob/wake-lat-script/get_wlat.py
[2] https://github.com/joelagnel/lisa/blob/wake-lat-script/get_wlat_cbs.py
Ah, I see. I think I will have to take some time to get to grips with your use case and how it compares to the latency analysis stuff in LISA.
It can take a bit of experimentation, but Pandas can be really fast when you find the right way of expressing your operations. The API addition could definitely make sense but first I think we should be really certain that it's necessary.
An alternative and more pandas-friendly way to perform that apply "in order" is to join the two DFs and operate on the new one, which is something I've done here: https://github.com/ARM-software/lisa/blob/master/libs/utils/analysis/latency_analysis.py#L53
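A join-based in-order apply could look roughly like this (toy frames and column names are hypothetical, not LISA's actual code):

```python
import pandas as pd

# Hypothetical per-event frames indexed by timestamp
wake_df = pd.DataFrame({'wake_pid': [10]}, index=[1.0])
switch_df = pd.DataFrame({'next_pid': [10, 20]}, index=[1.5, 3.0])

# Outer-join on the time index: one frame now holds both event streams in
# timestamp order, with NaN marking "no event of this type at this time"
joined = wake_df.join(switch_df, how='outer')

# A single apply over the joined frame sees the events in time order
events = joined.apply(
    lambda row: 'wakeup' if pd.notna(row['wake_pid']) else 'switch', axis=1)
print(list(events))  # ['wakeup', 'switch', 'switch']
```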
BTW, that API returns the latency DF which, AFAIU, is exactly what your example above suggests. IOW, if you are after latency analysis support, perhaps we already have what you need in LISA.
This notebook: https://github.com/ARM-software/lisa/blob/master/ipynb/examples/trace_analysis/TraceAnalysis_TasksLatencies.ipynb shows how to get DF, stats and plots for both wakeup and preemption latencies.
It's totally possible to do this with join and sorting, but this trappy feature is useful for analysis cases where you cannot join, for example, and still need to process in a time-correlated fashion (and also do it fast).
I assigned myself to this PR, since it seems like an interesting feature, will take a look at this later in the day.
I took a look through your example scripts and got up to speed on LISA's latency_analysis module, and they do the same thing. Have you tried LISA's analysis module to see how it performs on the traces you're looking at?
> It's totally possible to do this with join and sorting, but this trappy feature is useful for analysis cases where you cannot join, for example, and still need to process in a time-correlated fashion (and also do it fast).
I can't think of any use cases where you can't solve it using joins. Also, I can't see how the callbacks can be (necessarily) faster than using DataFrame.apply - the only thing that adds overhead is constructing the joined DataFrame, which can be done pretty fast.
I'm not a big fan of the fact that the callback method can only happen when you construct the trace object. I feel the constructor should really be about building & collating data structures, and rich analysis tools should be provided as methods on an existing object. This isn't just an abstract artisanal concern; having methods that call your callbacks is also a more flexible solution, as your callbacks can then be called multiple times.
So.. you start thinking "how would I implement methods to provide these callbacks?" and you pretty quickly get to using DataFrame `join`s and `apply`s..
@bjackman @derkling, how do you feel about an API that looks like: `ftrace_object.apply(callbacks={'sched_switch': switch_cb, 'sched_wakeup': wake_cb})`
Internally, trappy can then go over the dataframes and do whatever it wants with them, like joining them, and then apply the callbacks passed. Do you think that's something you're OK with?
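Internally, such an apply could be sketched roughly as follows (this is not trappy code; the function name, frame layout, and callback signature are made up for illustration):

```python
import pandas as pd

def apply_callbacks(event_dfs, callbacks):
    # Hypothetical sketch of the proposed API.
    # event_dfs: event name -> DataFrame indexed by timestamp
    # callbacks: event name -> function(timestamp, row)
    tagged = [df.assign(**{'__event': name})
              for name, df in event_dfs.items() if name in callbacks]
    # mergesort is stable, so equal timestamps keep their insertion order
    merged = pd.concat(tagged).sort_index(kind='mergesort')
    for ts, row in merged.iterrows():
        callbacks[row['__event']](ts, row)

order = []
apply_callbacks(
    {'sched_switch': pd.DataFrame({'pid': [1, 3]}, index=[1.0, 3.0]),
     'sched_wakeup': pd.DataFrame({'pid': [2]}, index=[2.0])},
    {'sched_switch': lambda t, row: order.append(('switch', t)),
     'sched_wakeup': lambda t, row: order.append(('wakeup', t))})
print(order)  # [('switch', 1.0), ('wakeup', 2.0), ('switch', 3.0)]
```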
> @bjackman @derkling, how do you feel about an API that looks like: `ftrace_object.apply(callbacks={'sched_switch': switch_cb, 'sched_wakeup': wake_cb})` Internally, trappy can then go over the dataframes and do whatever it wants with them, like joining them, and then apply the callbacks passed. Do you think that's something you're OK with?
Yeah, I think if we were to have a callback API, that would be a sensible design. But the reason I brought it up is that if it's just a thin wrapper around DataFrame operations, it's not much use - although it might be useful for encapsulating some awkwardness that's necessary to make it fast.
Posted some more patches to rearrange things and improve the callbacks. I tried to convert the data_dict to a pandas.Series and pass that to the callback, but my 100MB trace parsing for latency analysis went from 16s to 1m 42s, so I am leaving it in data_dict form for now. Also, about the join idea: I don't think that'll work well in cases where you have a lot of events. You would end up with (number of events) * (average columns per event) columns in the final joined frame. Considering the slowness and the memory usage we've seen, I don't even want to go there. At the moment, I think my implementation is the best solution we have performance-wise, and it doesn't affect functionality for folks not using this stuff, so I think it's worth considering. The modifications I made are also simple and non-invasive. I will let you experts decide ;-)
I think I broke a test that expects the normalize_time function, which I removed for lack of need/users. Since this is still an RFC, I'll update it a bit later.
One of the drawbacks of using df joins instead of the method I am proposing is that the join messes up time correlation when several events have the same timestamp. In this case it's not possible to know in which order the events were received on the same CPU. This can lead to incorrect analysis...
Whereas when iterating over a trace file, as done by my patch, you do go over the events in the same order as they were recorded on the local CPU, even if the timestamps of the events were the same on the same CPU.
Here's what happens after a join of the sched_switch and sched_wakeup dfs. You essentially lose all information about whether the switches happened first or the wakeups did. In fact, after the join, the df incorrectly shows that the switch from pid 2700 to 701 happened twice, whereas in the trace it happened just once.
```
time      wake-pid  switch-in-pid  switch-out-pid
0.557056  5176.0    701.0          2700.0
0.557056  5263.0    701.0          2700.0
```
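The duplication can be reproduced in a couple of lines (values taken from the table above; column names are illustrative):

```python
import pandas as pd

# Two wakeups and one switch share timestamp 0.557056
wake = pd.DataFrame({'wake_pid': [5176.0, 5263.0]},
                    index=[0.557056, 0.557056])
switch = pd.DataFrame({'switch_in_pid': [701.0],
                       'switch_out_pid': [2700.0]},
                      index=[0.557056])

# The single switch row is matched against both wakeup rows, so the
# joined frame "shows" two 2700 -> 701 switches
joined = wake.join(switch)
print(len(joined))  # 2
```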
Also, I am now down to 45 seconds and I can't get it any lower (barring the obvious correctness issues). For this and the above reasons, I feel joining is simply inadequate, not scalable, and just plain wrong, so I am still falling back to my proposal for event callbacks. The joined code is also confusing: each row mixes columns from events that have nothing to do with each other, which leads to convoluted logic.
I tried a multi-level index (where I store the line number along with time) and used both as a combined index for the df. While this solves the problem of events with the same timestamp, `.apply` has serious performance issues with multi-level indexes. Even when my `.apply` function just returns and does nothing, `.apply` still takes forever:

```
2017-05-14 18:16:17  <- start
2017-05-14 18:16:38  <- parse trace file and build df (21s)
2017-05-14 18:16:39  <- join (1s)
2017-05-14 18:17:22  <- .apply f() which just does return (43s!)
```
Here are the lengths of the data frames:

```
len(wake)       164045
len(switch_in)  270216
len(df)         434261  (joined)
```
I think I and others have spent enough time trying to get dfs to do what the event callbacks do, and it's clear to me at least that df apply has enough disadvantages that event callbacks are a good idea and there's a need for them.
Hi, so I got it down to 33s using just join! But for that I had to do the following:
With https://github.com/ARM-software/trappy/pull/247, I am down from 37s to 24s in my use case (when I process traces for the second time; the first time is always slow). Almost all of the time is spent in .apply(). This is again with single-level indexing.
Yep, I think using time as a separate column is valid. I wonder if in this particular case we can just use a priori knowledge about the nature of sched_switch and sched_wakeup to solve the issue? I.e. if we see a task waking up and switching in with the same timestamp, we know that the wakeup happened first.
However I think your solution of using the line number as index and making time a column is better - I think it's more generic and simpler.
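A minimal sketch of the line-number-as-index idea (the frames and line numbers here are hypothetical):

```python
import pandas as pd

# Trace line number as the index; time demoted to an ordinary column
wake = pd.DataFrame({'time': [0.5, 0.9], 'pid': [10, 11]}, index=[14, 17])
switch = pd.DataFrame({'time': [0.5], 'next_pid': [10]}, index=[15])

# Even though lines 14 and 15 carry the same timestamp, sorting on the
# line-number index recovers the exact order they appeared in the trace
merged = pd.concat([wake, switch]).sort_index()
print(list(merged.index))  # [14, 15, 17]
```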
There's no way we can make line-number-as-index the default format for TRAPpy's DataFrames; that's just too fundamental a change. But I think it could be worth having some abstraction to solve this kind of issue, i.e. something like this callback API, implemented using the join and apply with a line number index. Or perhaps even just an API to get a DataFrame of combined events with a line number index (i.e. just the join part). I'm interested to get @sinkap's input on this..
So @sinkap , what do you think? ;-)
So, I remember handling such a situation before, where we simply add a very small value to the conflicting index. The problem with this approach, and with using the line number as the index as well, is: what if the sequence is not correctly generated and we end up getting sched_switch events before the sched_wakeup events? That would break correctness assumptions for any behaviour code.
We should probably have an interface to allow the user to specify a way to resolve conflicts. e.g. a list of events in chronological ordering and we could default to using line numbers (or the handle_duplicate_index)
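Such an interface could be sketched like this (the column names, event ordering, and frame are illustrative only, not an actual trappy API):

```python
import pandas as pd

# Flattened event stream with colliding timestamps on lines 13 and 14
df = pd.DataFrame({
    'time':   [0.5, 0.5, 0.6],
    'event':  ['sched_switch', 'sched_wakeup', 'sched_switch'],
    '__line': [13, 14, 15],
})

# User-supplied chronological ordering of events, used to break timestamp
# ties, falling back to trace line number
event_order = ['sched_wakeup', 'sched_switch']
rank = {name: i for i, name in enumerate(event_order)}
resolved = (df.assign(__rank=df['event'].map(rank))
              .sort_values(['time', '__rank', '__line']))
print(list(resolved['event']))
# ['sched_wakeup', 'sched_switch', 'sched_switch']
```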
Ok. To begin with, can we add a `__line_num` column to every event data frame? That will be needed anyway to do the default line-number resolving you're suggesting.
Also, how would the ordering problem apply to line numbers? Maybe you're talking about some other line number? The line number I mentioned is incremented for every line in the trace during parsing, so it's guaranteed to be unique.
I am not saying it won't be unique. What I am saying is that if the timing ends up colliding and we get a sched_switch on a line before the sched_wakeup, we could get semantically incorrect data, which is why we should provide some way for the user to specify a conflict resolution. For example:

```
LineNum  TimeStamp  event         pid
13       000012     sched_switch  8086
14       000012     sched_wakeup  8086
```

We resolve the problem of index collision, but not of semantically correct ordering.
And yes, we can add `__line_num`, but remember we would still get collisions on a graph when plotted against time (e.g. on a timeline). Would it not be better to add an epsilon to the timestamp as well?
Ok, can we do that as a next step, and first add line numbers to all data frames? This is the most common case. Without this, you literally get duplication of events during joins of event dfs.
Sorry, missed your last comment, just refreshed. Can you explain more about the epsilon adding? I didn't get that.
Have a look at https://github.com/ARM-software/trappy/blob/5d488a4f480e12e9276cf8ded997eefd30bb3c94/trappy/utils.py#L29.
Why not just use this to handle index duplication?
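The epsilon idea under discussion can be sketched as follows (a simplified stand-in, not the actual trappy utils code):

```python
import pandas as pd

def dedup_index(df, epsilon=1e-6):
    # Hypothetical sketch: bump each repeated timestamp by a growing
    # epsilon so the resulting index is unique
    ts = df.index.to_series()
    bump = ts.groupby(ts).cumcount() * epsilon
    out = df.copy()
    out.index = ts + bump.values
    return out

df = pd.DataFrame({'value': [1, 2, 3]}, index=[0.5, 0.5, 0.6])
deduped = dedup_index(df)
print(deduped.index.is_unique)  # True
```

The downside, as discussed below, is that the stored timestamps no longer match what is in the trace.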
Oh, I think you were referring to your earlier comment about adding deltas to resolve the time. What I don't like about that is that it kind of screws up the timestamp as reported by the trace, in the sense that it's not what's in the trace anymore, right? I would like to preserve the trace data as much as possible. I do like the idea of users resolving collisions themselves, so they could pass a lambda function if they really want to resolve it their own way - like by adding epsilons and whatnot. (Sorry for the typing errors, I'm on a phone here :)
Regarding only using that index-duplication stuff from the commit you pointed at, instead of line numbers: consider the case where you have multiple different events on the same CPU with the same timestamp. Since you separate them out during parsing, you lose the trace ordering completely. When you join them later, you cannot reconstruct the original ordering, because that information is lost no matter what index-resolution algorithm you choose. But with line numbers added to the dfs, you preserve it and order correctly on join. Line numbers carry hidden information about ordering on the same CPU, so they are kind of a timestamp in some sense.
@sinkap
> What I am saying is that if the timing ends up colliding and we get a sched_switch on a line before the sched_wakeup, we could get semantically incorrect data
Can this actually happen?
We should consider that sched_switch and sched_wakeup events for a given task can be generated on different CPUs, isn't it? In that case we could expect a time delta between the clocks of the two CPUs, but I would say that ftrace takes care of timing differences... what I'm instead not sure about is whether ftrace can somehow force an ordering when it merges events from different CPUs' buffers that appear to have the same timestamp... in that case it's probably just appending events in per-CPU ID order... :-/
About Patrick's point: I don't think ftrace does any intelligent merging in the unlikely event of timestamps colliding across different CPUs. In that case no post-processing index-resolution method can correct the ordering, because the ordering information is lost, so I feel it's beyond the scope of this discussion (it's more of an ftrace issue than a trappy one). Where line numbers will help is timestamps colliding on the same CPU, which we should address first.
So, just summarizing the email discussion to list the cons of apply-with-DF vs event callbacks (this PR). The following are the disadvantages of doing an analysis with DF join+apply instead of the event-callback approach I initially posted in this PR:
It would be nice if we could all agree on an approach by next week, since it has been a while since this was posted (I would even say we get on a call next week with @sinkap and @JaviMerino as well and finalize which direction to go).
Also one more comment: since #243 is also going in the direction of adding callbacks to events for parsing, this could help that PR.
So I tried something like the following on a 100MB systrace, but the trace time went up from 33s to 1 minute. This does solve problems 1, 2 and 3 in the last comment; however, the apply time doubles. Any ideas on how to speed it up? This is for measuring wakeup latencies:
```python
# Merge the two per-event streams in trace order using the __line column
sit = switch_in.iterrows()
wit = wake.iterrows()
s = next(sit, None)
w = next(wit, None)
while s is not None and w is not None:
    if s[1]['__line'] < w[1]['__line']:
        switch_cb(s)
        s = next(sit, None)
    else:
        wake_cb(w)
        w = next(wit, None)
# Drain whichever stream still has events left
while s is not None:
    switch_cb(s)
    s = next(sit, None)
while w is not None:
    wake_cb(w)
    w = next(wit, None)
```
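For comparison, the same two-stream merge can be expressed by tagging the events and sorting once on `__line` instead of hand-merging two iterators (a sketch under the same assumptions; the frames and callbacks here are stand-ins):

```python
import pandas as pd

# Hypothetical frames carrying the __line column from the snippet above
switch_in = pd.DataFrame({'__line': [1, 3], 'pid': [100, 200]})
wake = pd.DataFrame({'__line': [2], 'pid': [100]})

seen = []
callbacks = {'switch': lambda row: seen.append('switch'),
             'wakeup': lambda row: seen.append('wakeup')}

# Tag, concatenate, and sort once, then dispatch in trace order
merged = pd.concat([switch_in.assign(__event='switch'),
                    wake.assign(__event='wakeup')]).sort_values('__line')
for _, row in merged.iterrows():
    callbacks[row['__event']](row)
print(seen)  # ['switch', 'wakeup', 'switch']
```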
> This does solve the problems 1, 2, 3 in the last comment, however the apply time doubles.

Do you mean you tried it with a join and apply like in my example, and it took twice as long?
No, with join and apply it was much faster. But I was wondering how fast it would be if, in the future, we didn't want to use join and apply for the reasons mentioned above. Also, I think your example wasn't working for my particular trace, as it wasn't handling all corner cases (I'm not sure whether join and apply will work there, but I will give it a look soon).
This series adds event callback support to trappy, so that trappy users can register callbacks for trace events, which can be used to drive a state machine to do some analysis.
An example analysis script using this is here: https://github.com/joelagnel/lisa/blob/40047eab82ed43152ef9d587ba89c5145cdb5d5b/get_wlat_cbs.py (disclaimer: the example script in the above link is very WIP and needs to be cleaned up and polished properly).