Closed joelagnel closed 7 years ago
Hi Brendan, regarding your comment "Is the idea to avoid building the DataFrame for performance reasons? Is it parsing the trace that takes too long?".
Performance is part of the reason, but it's also impossible to use data frames in all situations. The data frames are built per-event, so you have one df per event, e.g. one for sched_switch and one for sched_wakeup. In order to build an analysis script that processes a state machine, `.apply()` will not solve the problem, because you would `.apply` on the first data frame (which corresponds to the first event) and then `.apply` on the second data frame for the next event.
However, what is needed is something that processes event as a function of time. So for example, if the events are:
```
Time  event
1     switch
2     wakeup
3     switch
```
Then, using `.apply`:

```
switch_df.apply(..)
wake_df.apply(..)
```

the program perceives the events as:

```
apply(1)
apply(3)
apply(2)
```

whereas what is needed is:

```
apply(1)
apply(2)
apply(3)
```
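The difference can be shown with a toy example (hypothetical frames mirroring the table above; assumes a reasonably recent pandas):

```python
import pandas as pd

# Toy per-event frames matching the table above, indexed by timestamp
switch_df = pd.DataFrame({'event': ['switch', 'switch']}, index=[1, 3])
wake_df = pd.DataFrame({'event': ['wakeup']}, index=[2])

seen = []
switch_df.apply(lambda row: seen.append(row.name), axis=1)
wake_df.apply(lambda row: seen.append(row.name), axis=1)
print(seen)  # [1, 3, 2] -- per-frame apply visits events out of time order

# A single time-sorted view of both frames restores the required order
merged = pd.concat([switch_df, wake_df]).sort_index()
print(list(merged.index))  # [1, 2, 3]
```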
I don't see how this can be done using data frames. Even if it can be done with dfs, it will still not be as fast: using data frames in a WIP script [1], I found that it's 3 times slower than [2]. I don't think we should make data frames a rule if something can be solved in a different, cleaner, faster way. Also, citing the above example, I am not convinced that data frames with joins and other operations can solve all analysis problems, but I'd love to be proved wrong!
As such, the API I am proposing is clean, so I don't see an issue with it if someone doesn't want to use data frames for a particular use case, as I am.

[1] https://github.com/joelagnel/lisa/blob/wake-lat-script/get_wlat.py
[2] https://github.com/joelagnel/lisa/blob/wake-lat-script/get_wlat_cbs.py
Ah, I see. I think I will have to take some time to get to grips with your use case and how it compares to the latency analysis stuff in LISA.
It can take a bit of experimentation, but Pandas can be really fast when you find the right way of expressing your operations. The API addition could definitely make sense but first I think we should be really certain that it's necessary.
An alternative and more pandas-friendly way to perform that apply "in order" is to join the two DFs and operate on the new one, which is something I've done here: https://github.com/ARM-software/lisa/blob/master/libs/utils/analysis/latency_analysis.py#L53
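A join-based in-order apply could look roughly like this (toy frames and column names are hypothetical, not LISA's actual code):

```python
import pandas as pd

# Hypothetical per-event frames indexed by timestamp
wake_df = pd.DataFrame({'wake_pid': [10]}, index=[1.0])
switch_df = pd.DataFrame({'next_pid': [10, 20]}, index=[1.5, 3.0])

# Outer-join on the time index: one frame now holds both event streams in
# timestamp order, with NaN marking "no event of this type at this time"
joined = wake_df.join(switch_df, how='outer')

# A single apply over the joined frame sees the events in time order
events = joined.apply(
    lambda row: 'wakeup' if pd.notna(row['wake_pid']) else 'switch', axis=1)
print(list(events))  # ['wakeup', 'switch', 'switch']
```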
BTW, that API returns the latency DF which, AFAIU, is exactly what your example above suggests. IOW, if you are after latency analysis support, perhaps we already have what you need in LISA.
This notebook: https://github.com/ARM-software/lisa/blob/master/ipynb/examples/trace_analysis/TraceAnalysis_TasksLatencies.ipynb shows how to get DF, stats and plots for both wakeup and preemption latencies.
It's totally possible to do this with join and sorting, but this trappy feature is useful for analysis cases where you cannot join, for example, and still need to process in a time-correlated fashion (and also do it fast).
I assigned myself to this PR, since it seems like an interesting feature, will take a look at this later in the day.
I took a look through your example scripts and got up to speed on LISA's latency_analysis module, and they do the same thing. Have you tried LISA's analysis module to see how it performs on the traces you're looking at?
> It's totally possible to do this with join and sorting, but this trappy feature is useful for analysis cases where you cannot join, for example, and still need to process in a time-correlated fashion (and also do it fast).
I can't think of any use cases where you can't solve it using joins. Also, I can't see how the callbacks can be (necessarily) faster than using DataFrame.apply - the only thing that adds overhead is constructing the joined DataFrame, which can be done pretty fast.
I'm not a big fan of the fact that the callback method can only happen when you construct the trace object. I feel the constructor should really be about building & collating data structures, and rich analysis tools should be provided as methods on an existing object. This isn't just an abstract artisanal concern; having methods that call your callbacks is also a more flexible solution, as your callbacks can then be called multiple times.
So.. you start thinking "how would I implement methods to provide these callbacks?" and you pretty quickly get to using DataFrame `join`s and `apply`s..
@bjackman @derkling, how do you feel about an API that looks like: `ftrace_object.apply(callbacks={'sched_switch': switch_cb, 'sched_wakeup': wake_cb})`
Internally, trappy can then go over the dataframes and do whatever it wants with them, like joining them, and then apply the callbacks passed. Do you think that's something you're OK with?
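Internally, such an apply could be sketched roughly as follows (this is not trappy code; the function name, frame layout, and callback signature are made up for illustration):

```python
import pandas as pd

def apply_callbacks(event_dfs, callbacks):
    # Hypothetical sketch of the proposed API.
    # event_dfs: event name -> DataFrame indexed by timestamp
    # callbacks: event name -> function(timestamp, row)
    tagged = [df.assign(**{'__event': name})
              for name, df in event_dfs.items() if name in callbacks]
    # mergesort is stable, so equal timestamps keep their insertion order
    merged = pd.concat(tagged).sort_index(kind='mergesort')
    for ts, row in merged.iterrows():
        callbacks[row['__event']](ts, row)

order = []
apply_callbacks(
    {'sched_switch': pd.DataFrame({'pid': [1, 3]}, index=[1.0, 3.0]),
     'sched_wakeup': pd.DataFrame({'pid': [2]}, index=[2.0])},
    {'sched_switch': lambda t, row: order.append(('switch', t)),
     'sched_wakeup': lambda t, row: order.append(('wakeup', t))})
print(order)  # [('switch', 1.0), ('wakeup', 2.0), ('switch', 3.0)]
```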
> @bjackman @derkling, how do you feel about an API that looks like: `ftrace_object.apply(callbacks={'sched_switch': switch_cb, 'sched_wakeup': wake_cb})` Internally, trappy can then go over the dataframes and do whatever it wants with them, like joining them, and then apply the callbacks passed. Do you think that's something you're OK with?
Yeah, I think if we were to have a callback API, that would be a sensible design. But the reason I brought it up is that if it's just a thin wrapper around DataFrame operations, it's not much use - although it might be useful for encapsulating some awkwardness that's necessary to make it fast.
Posted some more patches to rearrange things and improve the callbacks. I tried to convert the data_dict to a pandas.Series and pass that to the callback, but my 100MB trace parsing for latency analysis went from 16s to 1m 42s, so I am leaving it in data_dict form for now. Also, about the join idea: I don't think that'll work well in cases where you have a lot of events. You would end up with (number of events) * (average columns per event) columns in the final joined frame. Considering the slowness and the memory usage we've seen, I don't even want to go there. At the moment, I think my implementation is the best solution we have performance-wise, and it doesn't affect functionality for folks not using this stuff, so I think it's worth considering. The modifications I made are also simple and non-invasive. I will let you experts decide ;-)
I think I broke a test that expects the normalize_time function, which I removed for lack of need/users. Since this is still an RFC, I'll update it a bit later.
One of the drawbacks of using df joins instead of the method I am proposing is that the join messes up time correlation when several events have the same timestamp. In this case it's not possible to know in which order the events were received on the same CPU. This can lead to incorrect analysis...
Whereas when iterating over a trace file, as done by my patch, you do go over the events in the same order as they were recorded on the local CPU, even if the timestamps of the events were the same on the same CPU.
Here's what happens after a join of the sched_switch and sched_wakeup dfs. You essentially lose all information about whether the switches happened first or the wakeups did. In fact, after the join, the df incorrectly shows that the switch from pid 2700 to 701 happened twice, whereas in the trace it happened just once.
```
time      wake-pid  switch-in-pid  switch-out-pid
0.557056  5176.0    701.0          2700.0
0.557056  5263.0    701.0          2700.0
```
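The duplication can be reproduced in a couple of lines (values taken from the table above; column names are illustrative):

```python
import pandas as pd

# Two wakeups and one switch share timestamp 0.557056
wake = pd.DataFrame({'wake_pid': [5176.0, 5263.0]},
                    index=[0.557056, 0.557056])
switch = pd.DataFrame({'switch_in_pid': [701.0],
                       'switch_out_pid': [2700.0]},
                      index=[0.557056])

# The single switch row is matched against both wakeup rows, so the
# joined frame "shows" two 2700 -> 701 switches
joined = wake.join(switch)
print(len(joined))  # 2
```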
Also, I am now down to 45 seconds and I can't get it any lower (barring the obvious correctness issues). For this and the above reasons, I feel joining is simply inadequate, not scalable, and just plain wrong, so I am still falling back to my proposal for event callbacks. The joined code is also confusing: each row mixes columns from events that have nothing to do with each other, which leads to convoluted logic.
I tried a multi-level index (where I store the line number along with time) and used both as a combined index for the df. While this solves the problem of events with the same timestamp, `.apply` has serious performance issues with multi-level indexes. Even when my `.apply` function just returns and does nothing, `.apply` still takes forever:

```
2017-05-14 18:16:17  <- start
2017-05-14 18:16:38  <- parse trace file and build df (21s)
2017-05-14 18:16:39  <- join (1s)
2017-05-14 18:17:22  <- .apply f() which just does return (43s!)
```
Here are the lengths of the data frames:

```
len(wake)       164045
len(switch_in)  270216
len(df)         434261  (joined)
```
I think I and others have spent enough time trying to get dfs to do what the event callbacks do, and it's clear to me at least that df apply has enough disadvantages that event callbacks are a good idea and there's a need for them.
Hi, so I got it down to 33s using just join! But for that I had to do the following:
With https://github.com/ARM-software/trappy/pull/247, I am down from 37s to 24s in my use case (when I process traces for the second time; the first time is always slow). Almost all of the time is spent in .apply(). This is again with single-level indexing.
Yep, I think using time as a separate column is valid. I wonder if in this particular case we can just use a priori knowledge about the nature of sched_switch and sched_wakeup to solve the issue? I.e. if we see a task waking up and switching in with the same timestamp, we know that the wakeup happened first.
However I think your solution of using the line number as index and making time a column is better - I think it's more generic and simpler.
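A minimal sketch of the line-number-as-index idea (the frames and line numbers here are hypothetical):

```python
import pandas as pd

# Trace line number as the index; time demoted to an ordinary column
wake = pd.DataFrame({'time': [0.5, 0.9], 'pid': [10, 11]}, index=[14, 17])
switch = pd.DataFrame({'time': [0.5], 'next_pid': [10]}, index=[15])

# Even though lines 14 and 15 carry the same timestamp, sorting on the
# line-number index recovers the exact order they appeared in the trace
merged = pd.concat([wake, switch]).sort_index()
print(list(merged.index))  # [14, 15, 17]
```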
There's no way we can make line-number-as-index the default format for TRAPpy's DataFrames; that's just too fundamental a change. But I think it could be worth having some abstraction to solve this kind of issue, i.e. something like this callback API, implemented using the join and apply with a line number index. Or perhaps even just an API to get a DataFrame of combined events with a line number index (i.e. just the join part). I'm interested to get @sinkap's input on this..
So @sinkap , what do you think? ;-)
So, I remember handling such a situation before, where we simply add a very small value to the conflicting index. The problem with this approach, and with using the line number as the index as well, is: what if the sequence is not correctly generated and we end up getting sched_switch events before the sched_wakeup events? That would break correctness assumptions for any behaviour code.
We should probably have an interface to allow the user to specify a way to resolve conflicts. e.g. a list of events in chronological ordering and we could default to using line numbers (or the handle_duplicate_index)
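Such an interface could be sketched like this (the column names, event ordering, and frame are illustrative only, not an actual trappy API):

```python
import pandas as pd

# Flattened event stream with colliding timestamps on lines 13 and 14
df = pd.DataFrame({
    'time':   [0.5, 0.5, 0.6],
    'event':  ['sched_switch', 'sched_wakeup', 'sched_switch'],
    '__line': [13, 14, 15],
})

# User-supplied chronological ordering of events, used to break timestamp
# ties, falling back to trace line number
event_order = ['sched_wakeup', 'sched_switch']
rank = {name: i for i, name in enumerate(event_order)}
resolved = (df.assign(__rank=df['event'].map(rank))
              .sort_values(['time', '__rank', '__line']))
print(list(resolved['event']))
# ['sched_wakeup', 'sched_switch', 'sched_switch']
```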
Ok. To begin with, can we add a `__line_num` column to every event data frame? That will be needed anyway to do the default line-number resolving you're suggesting.
Also, how would the ordering problem apply to line numbers? Maybe you're talking about some other line number? The line number I mentioned is incremented for every line in the trace during parsing, so it's guaranteed to be unique.
I am not saying it won't be unique. What I am saying is that if the timing ends up colliding and we get a sched_switch on a line before the sched_wakeup, we could get semantically incorrect data, which is why we should provide some way for the user to specify a conflict resolution. For example:

```
LineNum  TimeStamp  event         pid
13       000012     sched_switch  8086
14       000012     sched_wakeup  8086
```

We resolve the problem of index collision, but not of semantically correct ordering.
And yes, we can add `__line_num`, but remember we would still get collisions on a graph when plotted against time (e.g. on a timeline). Would it not be better to add an epsilon to the timestamp as well?
Ok, can we do that as a next step, and first add line numbers to all data frames? This is the most common case. Without this, you literally get duplication of events during joins of event dfs.
Sorry, missed your last comment, just refreshed. Can you explain more about the epsilon adding? I didn't get that.
Have a look at https://github.com/ARM-software/trappy/blob/5d488a4f480e12e9276cf8ded997eefd30bb3c94/trappy/utils.py#L29.
Why not just use this to handle index duplication?
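The epsilon idea under discussion can be sketched as follows (a simplified stand-in, not the actual trappy utils code):

```python
import pandas as pd

def dedup_index(df, epsilon=1e-6):
    # Hypothetical sketch: bump each repeated timestamp by a growing
    # epsilon so the resulting index is unique
    ts = df.index.to_series()
    bump = ts.groupby(ts).cumcount() * epsilon
    out = df.copy()
    out.index = ts + bump.values
    return out

df = pd.DataFrame({'value': [1, 2, 3]}, index=[0.5, 0.5, 0.6])
deduped = dedup_index(df)
print(deduped.index.is_unique)  # True
```

The downside, as discussed below, is that the stored timestamps no longer match what is in the trace.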
Oh, I think you were referring to your earlier comment about adding deltas to resolve the time. What I don't like about that is that it kind of screws up the timestamp as reported by the trace, in the sense that it's not what's in the trace anymore, right? I would like to preserve the trace data as much as possible. I do like the idea of users resolving collisions themselves, so they could pass a lambda function if they really want to resolve it their own way - like by adding epsilons and whatnot. (Sorry for the typing errors, I'm on a phone here :)
Regarding only using that index-duplication stuff from the commit you pointed at, instead of line numbers: consider the case where you have multiple different events on the same CPU with the same timestamp. Since you separate them out during parsing, you lose the trace ordering completely. When you join them later, you cannot reconstruct the original ordering, because that information is lost no matter what index-resolution algorithm you choose. But with line numbers added to the dfs, you preserve it and order correctly on join. Line numbers carry hidden information about ordering on the same CPU, so they are kind of a timestamp in some sense.
@sinkap
> What I am saying is that if the timing ends up colliding and we get a sched_switch on a line before the sched_wakeup, we could get semantically incorrect data
Can this actually happen?
We should consider that sched_switch and sched_wakeup events for a given task can be generated on different CPUs, isn't it? In that case we could expect a time delta between the clocks of the two CPUs, but I would say that ftrace takes care of timing differences... what I'm instead not sure about is whether ftrace can somehow force an ordering when it merges events from different CPUs' buffers that appear to have the same timestamp... in that case it's probably just appending events in per-CPU ID order... :-/
About Patrick's point: I don't think ftrace does any intelligent merging in the unlikely event of timestamps colliding across different CPUs. In that case no post-processing index-resolution method can correct the ordering, because the ordering information is lost, so I feel it's beyond the scope of this discussion (it's more of an ftrace issue than a trappy one). Where line numbers will help is timestamps colliding on the same CPU, which we should address first.
So, just summarizing the email discussion to list the cons of apply-with-DF vs event callbacks (this PR). The following are the disadvantages of doing an analysis with DF join+apply instead of the event-callback approach I initially posted in this PR:
It would be nice if we could all agree on an approach by next week, since it has been a while since this was posted (I would even say we get on a call next week with @sinkap and @JaviMerino as well and finalize which direction to go).
Also one more comment: since #243 is also going in the direction of adding callbacks to events for parsing, this could help that PR.
So I tried something like the following on a 100MB systrace, but the trace time went up from 33s to 1 minute. This does solve problems 1, 2 and 3 in the last comment; however, the apply time doubles. Any ideas on how to speed it up? This is for measuring wakeup latencies:
```python
# Merge the two per-event streams in trace order using the __line column
sit = switch_in.iterrows()
wit = wake.iterrows()
s = next(sit, None)
w = next(wit, None)
while s is not None and w is not None:
    if s[1]['__line'] < w[1]['__line']:
        switch_cb(s)
        s = next(sit, None)
    else:
        wake_cb(w)
        w = next(wit, None)
# Drain whichever stream still has events left
while s is not None:
    switch_cb(s)
    s = next(sit, None)
while w is not None:
    wake_cb(w)
    w = next(wit, None)
```
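For comparison, the same two-stream merge can be expressed by tagging the events and sorting once on `__line` instead of hand-merging two iterators (a sketch under the same assumptions; the frames and callbacks here are stand-ins):

```python
import pandas as pd

# Hypothetical frames carrying the __line column from the snippet above
switch_in = pd.DataFrame({'__line': [1, 3], 'pid': [100, 200]})
wake = pd.DataFrame({'__line': [2], 'pid': [100]})

seen = []
callbacks = {'switch': lambda row: seen.append('switch'),
             'wakeup': lambda row: seen.append('wakeup')}

# Tag, concatenate, and sort once, then dispatch in trace order
merged = pd.concat([switch_in.assign(__event='switch'),
                    wake.assign(__event='wakeup')]).sort_values('__line')
for _, row in merged.iterrows():
    callbacks[row['__event']](row)
print(seen)  # ['switch', 'wakeup', 'switch']
```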
> This does solve the problems 1, 2, 3 in the last comment, however the apply time doubles.

Do you mean you tried it with a join and apply like in my example, and it took twice as long?
No, with join and apply it was much faster. But I was wondering how fast it would be if, in the future, we didn't want to use join and apply for the reasons mentioned above. Also, I think your example wasn't working for my particular trace, as it wasn't handling all corner cases (I'm not sure whether join and apply will work there, but I will give it a look soon).
This series adds event callback support to trappy, so that trappy users can register callbacks for trace events, which can be used to drive a state machine to do some analysis.
An example analysis script using this is here: https://github.com/joelagnel/lisa/blob/40047eab82ed43152ef9d587ba89c5145cdb5d5b/get_wlat_cbs.py (disclaimer: the example script in the above link is very WIP and needs to be cleaned up and polished properly).