Pump Chunk Marking - Githubissues

fleerdayo commented 3 years ago

For each resampled (chunked) pump csv file, did you only mark 1 chunk as True?

E.g. if there is a pump at 2019-03-01 17:00 and I chunk my csv data into 5-second chunks (taking into consideration only the pump day and 1 day before and after), I mark only the chunk from 17:00:00 to 17:00:05 as True. This leaves me with an extremely imbalanced dataset, so a RandomForestClassifier ends up predicting every chunk as False.
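To make the imbalance concrete, here is a minimal sketch of the labeling described above, using only the Python standard library. The function name `label_chunks` and its parameters are hypothetical, and this is not necessarily how the dataset authors mark pumps; it only reproduces the "one True chunk per pump" scheme from the question.

```python
from datetime import datetime, timedelta

CHUNK = timedelta(seconds=5)  # chunk size taken from the example above

def label_chunks(pump_time, days_around=1, chunk=CHUNK):
    """Return (chunk_start, is_pump) pairs covering the pump day
    plus `days_around` days before and after (hypothetical helper)."""
    day_start = datetime(pump_time.year, pump_time.month, pump_time.day)
    start = day_start - timedelta(days=days_around)
    end = day_start + timedelta(days=days_around + 1)
    n_chunks = (end - start) // chunk  # timedelta // timedelta gives an int
    labeled = []
    for i in range(n_chunks):
        chunk_start = start + i * chunk
        # Only the chunk that contains the pump timestamp is marked True.
        is_pump = chunk_start <= pump_time < chunk_start + chunk
        labeled.append((chunk_start, is_pump))
    return labeled

chunks = label_chunks(datetime(2019, 3, 1, 17, 0, 0))
positives = sum(1 for _, is_pump in chunks if is_pump)
print(len(chunks), positives)  # 51840 chunks, 1 positive
```

With three days of 5-second chunks, a single positive sits among 51,840 rows, so a classifier that always answers False is already about 99.998 % accurate.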

What am I missing here?

------- Offtopic ----------- Also thank you guys for your effort to collect all the data. I enjoyed reading your paper too and got lots of useful information out of it. It's a welcome distraction to fiddle around with your data during all the restrictions :)

RaibekTussupbekov commented 3 years ago

Hello, @fleerdayo :)

I've been trying to reverse engineer the paper model for the last two weeks:)

I've been able to achieve 77.907 % recall but a ridiculously low 0.185 % precision :(
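For readers comparing numbers, this is how precision and recall follow from confusion counts. It is a minimal sketch; the counts in the example call are made up for illustration and are not the counts behind the figures quoted above.

```python
def precision_recall(tp, fp, fn):
    """Compute (precision, recall) from true positives, false positives
    and false negatives. Returns 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts only: most pumps are found (high recall),
# but almost every alert is a false positive (low precision).
precision, recall = precision_recall(tp=3, fp=997, fn=1)
print(f"precision={precision:.3%} recall={recall:.3%}")
```

High recall combined with very low precision means the model flags most real pumps but also raises a large number of false alerts.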

I use imblearn.ensemble.BalancedRandomForestClassifier to undersample the data.
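As a rough illustration of what undersampling does, here is a standard-library sketch that randomly drops majority-class rows until both classes have the same size. This is not imblearn's implementation: `BalancedRandomForestClassifier` undersamples each bootstrap sample per tree, whereas the hypothetical `undersample` helper below does it once over the whole dataset.

```python
import random

def undersample(samples, labels, seed=0):
    """Randomly undersample the majority class so both classes have
    equal size (hypothetical helper, one-shot, whole dataset)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row and an equally sized random majority subset.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]

X = list(range(100))
y = [True] * 2 + [False] * 98
X_bal, y_bal = undersample(X, y)
print(len(X_bal), sum(y_bal))  # 4 rows, 2 of them positive
```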

I tried to cut off 30 minutes after each pump chunk because the paper says that "...Once a pump is detected we pause our classifier for 30 minutes to avoid multiple alerts for the same event..."

However it does not help.
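One way to read the quoted sentence is as a post-processing step on predictions rather than a change to the training data. A minimal sketch of that reading, assuming predictions arrive as time-sorted `(timestamp, predicted_pump)` pairs; the function name `suppress_alerts` is hypothetical and the paper's actual procedure may differ.

```python
from datetime import datetime, timedelta

def suppress_alerts(predictions, pause=timedelta(minutes=30)):
    """Return alert timestamps, ignoring every prediction that falls
    inside the pause window after a raised alert.

    predictions: iterable of (timestamp, predicted_pump), sorted by time.
    """
    alerts = []
    paused_until = None
    for ts, predicted in predictions:
        if paused_until is not None and ts < paused_until:
            continue  # classifier is paused; skip this chunk
        if predicted:
            alerts.append(ts)
            paused_until = ts + pause
    return alerts

t0 = datetime(2019, 3, 1, 17, 0, 0)
preds = [
    (t0, True),
    (t0 + timedelta(seconds=5), True),    # suppressed (inside pause)
    (t0 + timedelta(minutes=10), True),   # suppressed (inside pause)
    (t0 + timedelta(minutes=30), True),   # pause over, new alert
    (t0 + timedelta(minutes=31), False),
]
print(suppress_alerts(preds))
```

Under this reading the pause merges repeated alerts for the same event, but it does nothing about false alerts that are more than 30 minutes apart.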

So I believe that the data should be filtered before training. The paper says that the authors picked only 104 samples out of 175.

Maybe this is the main reason for so many false positives?

Let me know if you're still interested. We could collaborate:) I see that the authors are not responding here:) Maybe they are too busy...or too rich:) Just kidding:)

Btw I'm ready to share my code and collaborate with whoever is interested including the authors:)

RazcoDev commented 3 years ago

Hey @RaibekTussupbekov, did you manage to make this work? I have also encountered many issues with the dataset. Thanks!

RaibekTussupbekov commented 3 years ago

@RazcoDev Not yet

SystemsLab-Sapienza / pump-and-dump-dataset

Pump Chunk Marking #2
