ARM-software / trappy

This repository has moved to https://gitlab.arm.com/tooling/trappy
Apache License 2.0
60 stars 39 forks source link

ftrace: fix windowing KeyErrors #277

Closed derkling closed 5 years ago

derkling commented 6 years ago

For some traces and events (e.g. SchedSwitch) we get a KeyError in case the first event index is bigger then window[0]. For example:

trace_class: <trappy.sched.SchedSwitch object at 0x7f3afe425f10> window: [11793.005501, None] comm cpu line pid next_comm next_pid [...] Time 11793.005507 <...> 0 2 6803 shell svc 6799 6800 [...] 11793.005565 <...> 0 4 6800 adbd 2648 [...] 11793.005701 adbd 0 7 2648 ->transport 2650 [...] 11793.005782 ->transport 0 11 2650 trace-cmd 6803 [...] 11793.005831 <...> 0 17 6803 kworker/u17:0 367 [...]

[...]

KeyError: 11793.005501

Where the min window value [11793.005501] is defined by the first valid event in the trace. Oddly this does not always happens, expecially it seems to not happen with trappy.dynamic events.

However, since in these cases the dataframe should not be sliced at all, let's add some boundary checks to ensure we always provide "valid" slicing index values. This has been verified to fix the above slicing issue in some test traces.

Signed-off-by: Patrick Bellasi patrick.bellasi@arm.com

valschneider commented 6 years ago

I need to take some time to have a better look at this (and fix Travis)- in the meantime, could you add a test (trappy/tests) that highlights this issue ? Thanks.

valschneider commented 6 years ago

I don't have anything against the patch, the thing is Pandas should be able to handle this:

data = [
    42,
    43,
    55
]

index = [
    10.2,
    10.5,
    10.8
]

df = pd.DataFrame(data, index=index)

print df[10.1:]
print df[10.2:]
print df[10.3:]

Outputs:

       0
10.2  42
10.5  43
10.8  55

       0
10.2  42
10.5  43
10.8  55

       0
10.5  43
10.8  55

@derkling had encountered this problem a few months back, and AFAIR it was related to this: https://github.com/pandas-dev/pandas/issues/6066 IOW, his dataframe wasn't sorted which would cause errors during the slice. I can't find it now but I remember reading that the handling of FloatIndex slicing was changed in a more recent pandas version, and after he updated his Pandas it seemingly fixed it.

I think this is a nastier issue than just slicing outside of a dataframe boundary, which is why I'd like to a) get to the bottom of the bug. b) have a test that verifies it doesn't happen again if the fix is non-obvious/hacky.

derkling commented 6 years ago

Here is the trace which fails for us:

https://www.dropbox.com/s/l0f15i3h8j6wr00/trace.dat?dl=0

Sadly, I can get it parsed correctly or failing depending on whether I'm within a venv or not... thus I suspect the issue is possibly still related to a specific version of some library.

Library         System      VEnv
===================================

More likely:
matplotlib      2.1.0       2.2.2
scipy           0.17.0      1.0.1

Less likely:
numpy           1.14.0      1.14.2
bart-py         1.7.0       2.0.0

Not likely:
devlib          0.0.3       0.0.4
pandas          0.22.0      0.22.0
TRAPpy          5.9.0       5.9.0

However, with the fix proposed by the patch I'm able to get the trace always parsed... thus I would say it makes sense to have in general to be a bit more robust wrt library versions.

Unfortunately, if I add the topmost events of the trace to new test in tests/test_ftrace.py... something like:

+    def test_ftrace_window_bigger(self):
+        """Test slicing with window bigger then the actual trace"""
+        in_data = """
+           <...>-6803  [000] 11793.005494: bprint:                ip=0xffffff863a705a24 fmt=0xffffff863bebdbd0 buf=0
+           <...>-6803  [000] 11793.005501: sched_wakeup:          comm=shell svc 6799 pid=6800 prio=120 success=1 target_cpu=0
+           <...>-6803  [000] 11793.005507: sched_switch:          prev_comm=trace-cmd prev_pid=6803 prev_prio=120 prev_state=2048 next_comm=shell svc 6799 next_pid=6800 next_prio=120"""
+
+        with open("trace.txt", "w") as fout:
+            fout.write(in_data)
+
+        try:
+            trace = trappy.FTrace(events=["sched_wakeup", "sched_switch"])
+        except TypeError as e:
+            self.fail("trace slicing failed with {} exception".format(e.message))
+               # The sched_switch even is the second valid in the trace, thus it's index must be normalized
+        self.assertEquals(float("{0:.6f}".format(trace.sched_switch.data_frame.index[0])), 0.000006)
+        self.assertEquals(len(trace.sched_switch.data_frame), 1)
+

This test always passes... independently from being executed from within or outside the venv... and independently from having or not the patch :/

... I'm also wondering if it really makes any sense to have such a test... since the issue seems more related to a library version, while the proposed patch is just to make the slicing a bit more robust?

valschneider commented 6 years ago

That's a big difference in SciPy versions. Do you get the exception with or without the venv ?

As for making slicing more robust, I don't entirely disagree. What I'm not too keen on is adding code to support library versions that are not used within the venv. IMHO it kind of defeats the purpose of the venv.

Otherwise, if the error shows up in the venv then we should fix the libraries used, and if the error persists (e.g. the expected behaviour was only there in older versions), then it would make sense to change the code.

Admittedly this is me being stubborn and saying:

why bother adding code to support old libraries versions that do not exhibit the expected behaviour when we can just tell the user to use the venv / update the required libraries

while you're saying

why bother with the venv when you can fix the problem in a few lines

valschneider commented 6 years ago

@derkling I forgot about this PR. I think https://github.com/ARM-software/trappy/pull/279 fixes the error you were seeing.