Closed joelagnel closed 7 years ago
CC @bjackman
All great suggestions. Will fix up.
About pickle, I still prefer csv because it's easier to inspect the cache as clear text, and it's less tied to the internal binary representation of pandas objects. I'm also not sure how much faster we could get, since we are already so fast with csv.
Also, from this link it looks like csv is faster than pickle: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization
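(For context, the csv-vs-pickle trade-off is easy to measure yourself. This is a hypothetical micro-benchmark, not code from this PR; the file names and DataFrame shape are made up, and results will vary by machine and pandas version.)

```python
# Micro-benchmark sketch: round-trip a DataFrame through csv and pickle
# and compare wall-clock times. Not from the PR; purely illustrative.
import os
import tempfile
import time

import numpy as np
import pandas as pd

tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "cache.csv")
pkl_path = os.path.join(tmpdir, "cache.pkl")

# A DataFrame roughly shaped like parsed trace events.
df = pd.DataFrame(np.random.rand(100000, 4), columns=list("abcd"))

def timed(fn):
    """Return the wall-clock seconds taken by fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

csv_s = timed(lambda: (df.to_csv(csv_path, index=False),
                       pd.read_csv(csv_path)))
pkl_s = timed(lambda: (df.to_pickle(pkl_path),
                       pd.read_pickle(pkl_path)))

print("csv round-trip: %.3fs, pickle round-trip: %.3fs" % (csv_s, pkl_s))
```

Either format can win depending on dtypes and size, which is why the readability argument for csv carries extra weight here.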
So this is the new version with all review comments addressed.
Hopefully this one is merge-ready
I pushed an extra test, along with some microscopic fixes for stuff I noticed while writing it, to the branch on my repo here: https://github.com/bjackman/trappy/tree/for-trappy-cache
So I guess feel free to cherry-pick/squash that into your branch? Or you can leave it and I'll submit it separately once this is merged (I'm ill this week and cannot think clearly at all so the code may be nonsense).
Squashed @bjackman's fixups except his new test case (which IMO can be added after this set is merged; I had to drop it because the test case itself had a bug). Pushed a new version. @JaviMerino @bjackman does it look good to you?
This is good from my point of view, let's see what others think.
I updated my branch (https://github.com/bjackman/trappy/tree/for-trappy-cache) with the missing file for the test I added. I also added another test for something I was confused about, and an extra check in _check_trace_cache, to avoid a warning if the set of parsed events has changed since the cache was written.
Also, AFAICT this currently breaks the `normalize_time`, `window` and `abs_window` params. Probably the simplest solution is just to add them to the md5sum or something like that?
edit: Does it also break the `events` parameter? I think this case is actually broken (as @credp suggested): with `events=['foo']`, the cache is created but all .csv files except the one for "foo" are empty; then with `events=['bar']`, the DataFrame for "bar" is empty. Maybe it's safest to just store & check the whole parameter list for `__init__`.
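(One way to fold the constructor parameters into the cache check could look like the sketch below. The function name and parameter list are my own assumptions for illustration, not trappy's actual code.)

```python
# Sketch: derive the cache checksum from both the trace file contents
# and the parsing parameters, so a change in e.g. `events` or `window`
# invalidates the cache. Names are hypothetical, not trappy's.
import hashlib
import json

def cache_key(trace_path, events=None, window=None, abs_window=None,
              normalize_time=True):
    md5 = hashlib.md5()
    # Hash the raw trace bytes, as the PR already does.
    with open(trace_path, "rb") as f:
        md5.update(f.read())
    # Serialize the parameters deterministically and hash them too.
    params = json.dumps(
        {"events": events, "window": window,
         "abs_window": abs_window, "normalize_time": normalize_time},
        sort_keys=True)
    md5.update(params.encode("utf-8"))
    return md5.hexdigest()
```

With this scheme, parsing the same trace with a different `events` list produces a different key, so the stale per-event CSVs are never reused.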
CC @sinkap
Updated with @bjackman's latest commits. Thanks! I will work on the other corner cases now (new event passed after cache creation). @bjackman, while I do that, could you fix the normalize_time issue you mentioned if you have time? I am new to that code so I could use some help there. Thanks!
@bjackman so I dropped "ftrace: Add explicit check for presence of each .csv file" and added a new commit which reads the cache partially. Your new test case for dynamic events added after caching passes with that.
I updated my branch (https://github.com/bjackman/trappy/tree/for-trappy-cache) with a fix for `finalize_objects` and the `window` parameter.
@bjackman I will review your branch today, since it seems to have some more fixes. @joelagnel
@bjackman I think you didn't update your branch (the last update to https://github.com/bjackman/trappy/tree/for-trappy-cache was June 8, and I already picked up those fixes). Also, could you rebase on top of master when you push again? Thanks!
Eek.. you're right I didn't push, and I can't find the commits on my desktop... I will have to check when I get to my laptop :fearful:
@bjackman feel free to ping when you have the updated commits.
@bjackman not having caching is killing me, can't wait to have this goodness in with your fixes ;D
I think @joelagnel and @bjackman addressed Javi's change request, so I am going to dismiss that review to remove the big red cross :+1:
thanks @credp
So where are we with this PR? Looks just about ready pending a push from @bjackman? (cc @joelagnel )
Yep, we have a sync meeting today. Let's finalize it there.
Hey guys, really sorry for causing delays here, I forgot about it then went on holiday. Will update tonight when I get home (where the lost commits are on my computer).
@bjackman Enjoy your holiday! :smile:
I am merging #253. It would be best to rebase this :)
OK, I've pushed the commits now: https://github.com/bjackman/trappy/tree/for-trappy-cache
Forgotten the details so I'll do a self-review tomorrow :)
@bjackman, could you also rebase on top of @credp's changes which got merged? :) thanks man.
OK, I updated the branch. The rebase was pretty awkward, so I think the patch stack is messed up, but TBH I think we should squash them all anyway instead of having "add X", "fix X" in a single PR.
Thanks @bjackman, squashed as many as made sense to squash. HAPpy SNAPpy TRAPpy!
Merged!
Pandas is extremely fast at parsing csv into data frames. Astonishingly, it takes < 1s to serialize/deserialize 100MB worth of traces with 430000 events to/from csv. We leverage this and write each data frame out to a csv file when it is created for the first time; the next time, we read it back from there if it exists. To make sure the cache isn't stale, we take the md5sum of the trace file and also ensure all CSVs exist before reading from the cache. I get a speed-up from 16s to 1s when parsing a 100MB trace.
Signed-off-by: Joel Fernandes joelaf@google.com
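(For illustration, the caching scheme described in the commit message above can be sketched roughly as follows. The function names, the cache directory layout, and the per-event `parse_fn` callback are my own assumptions, not trappy's actual API.)

```python
# Sketch of md5-keyed csv caching for parsed trace DataFrames.
# Names and layout are hypothetical; only the scheme mirrors the PR:
# key the cache on the trace file's md5sum, store one csv per event,
# and reuse a csv on the next parse if it exists.
import hashlib
import os

import pandas as pd

def trace_md5(trace_path):
    """md5sum of the raw trace file, used to detect a changed trace."""
    with open(trace_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def cache_dir_for(trace_path):
    """Cache directory next to the trace, keyed by its checksum."""
    return os.path.join(os.path.dirname(os.path.abspath(trace_path)),
                        ".trace_cache_" + trace_md5(trace_path))

def load_or_parse(trace_path, event, parse_fn):
    """Return the DataFrame for `event`, parsing only on a cache miss."""
    cdir = cache_dir_for(trace_path)
    csv_path = os.path.join(cdir, event + ".csv")
    if os.path.exists(csv_path):
        # Cache hit: deserializing csv is much cheaper than re-parsing.
        return pd.read_csv(csv_path, index_col=0)
    df = parse_fn(trace_path, event)
    os.makedirs(cdir, exist_ok=True)
    df.to_csv(csv_path)
    return df
```

Because the directory name embeds the checksum, a modified trace file simply misses the old cache rather than reading stale data.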