base.py: handle any duplicated indices when creating dataframe

qais-yousef commented 5 years ago

When running Lisa tests on fastmodels duplicated timestamps happen frequently and cause some failures down the line.

The duplicated timestamps are not bad themselves and a side effect of simulated time not having good enough resolution.

Handling this duplication at the dataframe creation seems to be the most generic and future proof solution.

Signed-off-by: Qais Yousef qais.yousef@arm.com

qais-yousef commented 5 years ago

Hmm tests/test_idle.py fails now because the test has duplicated entries in the series being tested.

derkling commented 5 years ago

@valschneider I think to remember that at a certain point we considered making use of the line number in conjunction with the timestamp to ensure a unique index. Still wondering it that should not be the way instead of playing with fixing up timestamps handle_duplicate_index

qais-yousef commented 5 years ago

I think to remember that at a certain point we considered making use of the line number in conjunction with the timestamp to ensure a unique index. Still wondering it that should not be the way instead of playing with fixing up timestamps handle_duplicate_index

A quick search on pandas returns multi-index support but still the impact on performance and how much of the code will need to change to deal with that warrants a proper investigation..

Removing the duplicates seems an easier way forward? Can you think of disadvantages or things that could break?

valschneider commented 5 years ago

@valschneider I think to remember that at a certain point we considered making use of the line number in conjunction with the timestamp to ensure a unique index. Still wondering it that should not be the way instead of playing with fixing up timestamps handle_duplicate_index

It would ensure uniqueness, yes, but using {line number, timestamp} as a dual index is not an option for us - that'd break the API (and sorting/slicing performance would take a significant hit). Or did you mean something else?

We could use the line number to ensure correct ordering of events with duplicate timestamps, at least in regards to the trace, but I think that's already the case by construction.

derkling commented 5 years ago

Yes, I guess using the line number at least to ensure correct ordering of fixed timestamp is the bare minimum we can do without breaking APIs... that would require to have a look at how handle_duplicate_index handle duplications and add in support for the line number as an additional check. BTW: how kernelshark handle and visualize those duplicated events?

qais-yousef commented 5 years ago

BTW: how kernelshark handle and visualize those duplicated events?

Nothing special about them. You just see them all appearing at the same point of time.

qais-yousef commented 5 years ago

I added a new comment to handle __line column if it exists. Running some tests against it now to make sure that fastmodel doesn't break in new ways now :)

JaviMerino commented 5 years ago

duplicated timestamps happen frequently and cause some failures down the line

Can you elaborate on those "failures down the line". We never used it in dataframe generation because we didn't see the need for it.

qais-yousef commented 5 years ago

Can you elaborate on those "failures down the line". We never used it in dataframe generation because we didn't see the need for it.

My original attempt to fix the problem with a verbose commit messages with the exact points of failures is here: https://github.com/ARM-software/lisa/pull/864

I could expand the commit message here too I guess but felt it'd be rather noisy. Would be happy to include the backtrace from the 2 commit messages in the lisa PR referenced above if you want.

JaviMerino commented 5 years ago

Would be happy to include the backtrace from the 2 commit messages in the lisa PR referenced above if you want.

That or just a reference to Arm-software/lisa#864 . Either is fine.

While you are at it, can you remove the call to handle_duplicate_index in stats/grammar.py? It shouldn't be needed now.

qais-yousef commented 5 years ago

While you are at it, can you remove the call to handle_duplicate_index in stats/grammar.py? It shouldn't be needed now.

Sure done, I updated the PR to include that and I reference https://github.com/ARM-software/lisa/pull/864 in the commit message that introduces handle_duplicate_index() in base.py

My run with the fastmodel seem to be happy with this. I'll have another run to verify the new change to grammer.py isn't causing any unexpected behavior (although I'm not sure whether the code is being exercised or not).

JaviMerino commented 5 years ago

I'll have another run to verify the new change to grammer.py isn't causing any unexpected behavior (although I'm not sure whether the code is being exercised or not).

Ok, let me know when you are happy and I'll merge it.

qais-yousef commented 5 years ago

Thanks! Lisa ran without a problem. So hopefully this means it doesn't break anything obviously at least and good to go :)

ARM-software / trappy

base.py: handle any duplicated indices when creating dataframe #292