garethjns / Kaggle-EEG

Seizure prediction from EEG data using machine learning. 3rd place solution for Kaggle/Uni Melbourne seizure prediction competition.

Run Time #15

Closed by YasminAMassoud 3 years ago

YasminAMassoud commented 4 years ago

Hello

The train files have been running for more than 16 hours and feature extraction is still not finished, so I want to run it using the Parallel Computing Toolbox. I have gone through the MATLAB videos and saw the presentation for the 3rd algorithm on thread division, but I am not sure how to implement this. Could you please guide me on this issue? @garethjns

Thanks a lot

garethjns commented 4 years ago

Hi Yasmin,

What hardware are you running on? I can't remember exactly how long it takes to run, but I think it was at least 12 hours on a 4-core i7-7700K loading from an SSD.

It might be possible to speed things up a bit using the Parallel Computing Toolbox, but a lot of the time spent in feature extraction goes on FFTs, which are already multithreaded. Looking at the code, these are in loops across channels (here, for example). These loops are possible candidates for parallelizing with parfor; however, if I remember correctly, the fft function will still use all threads even for a single channel. That would mean there's little advantage to calling it from a parfor loop.
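
A quick way to check this on your own machine is to time the same per-channel loop with for and with parfor. The snippet below is only a sketch and is not code from the repository: the segment size, channel count, and variable names are made up for illustration, and random data stands in for an EEG segment.

```matlab
% Illustrative stand-in for one EEG segment (samples x channels).
% These sizes are assumptions, not taken from the competition data.
nSamples = 2^16;
nChannels = 16;
data = randn(nSamples, nChannels);

% Serial loop over channels; fft may already use several threads internally.
specSerial = zeros(nSamples, nChannels);
tic
for ch = 1:nChannels
    specSerial(:, ch) = abs(fft(data(:, ch)));
end
tSerial = toc;

% Same loop with parfor. This only runs in parallel if the Parallel
% Computing Toolbox is installed; otherwise MATLAB runs it serially.
% The first parfor call may include pool start-up time, so run the
% script twice before comparing the timings.
specPar = zeros(nSamples, nChannels);
tic
parfor ch = 1:nChannels
    specPar(:, ch) = abs(fft(data(:, ch)));
end
tPar = toc;

fprintf('for: %.3f s, parfor: %.3f s\n', tSerial, tPar);
```

If the two timings come out close, that suggests the FFTs are already using the available cores and parfor is unlikely to help much for this part of the code.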

One other option might be to cache the output of the FFTs in an earlier step, then reload them as needed during the feature extraction steps. This would remove a lot of time spent in redundant calculations.
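
As a minimal sketch of that caching idea (again, not code from the repository), the FFT of a segment can be saved to a .mat file the first time it is computed and loaded on later passes. The file name, data sizes, and variable names here are hypothetical; in practice the file name would need to identify the segment and the window settings.

```matlab
% Illustrative data: one segment, samples x channels (made-up sizes).
data = randn(4096, 16);

% Hypothetical cache location; start from a clean state for the demo.
cacheFile = fullfile(tempdir, 'segment_0001_fft.mat');
if exist(cacheFile, 'file') == 2
    delete(cacheFile);
end

loadedFromCache = false(1, 2);
for pass = 1:2
    tic
    if exist(cacheFile, 'file') == 2
        % Later passes: reload the stored spectrum instead of recomputing.
        s = load(cacheFile, 'spec');
        spec = s.spec;
        loadedFromCache(pass) = true;
    else
        % First pass: compute the FFT of every channel (column) and store it.
        spec = fft(data);
        save(cacheFile, 'spec');
    end
    fprintf('pass %d: %.4f s (from cache: %d)\n', pass, toc, loadedFromCache(pass));
end
```

Whether loading beats recomputing depends on disk speed and segment size, so it is worth timing both on real segments before committing to this.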

YasminAMassoud commented 4 years ago

Can you point me to where I can read more about this step: "One other option might be to cache the output of the FFTs in an earlier step, then reload them as needed during the feature extraction steps. This would remove a lot of time spent in redundant calculations"?

Thanks, Dr. Gareth @garethjns

garethjns commented 4 years ago

The featuresObject has the methods for calculating each of the feature groups. The features that are generated in the frequency domain all call fft, for example (there may also be more):

This is poor design; it would be better to have a step that calculates and saves the fft data for each channel before the feature generation, then simply reloads it where needed. Here it affects performance a lot because the features are calculated 3 times for different window sizes (i.e. 3 x 3 fft calculations). Ideally it should only calculate the fft once for each window size (3 x 1 fft calculations). I expect this would improve performance a lot.
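
A rough sketch of that restructuring is below. It is not the featuresObject code: the sampling rate, window lengths, and the three feature functions are all placeholders chosen for illustration, and it assumes R2016b or later for implicit expansion. The point is only that fft is called once per window size and the result is shared by every feature group.

```matlab
fs = 400;                    % assumed sampling rate in Hz
data = randn(fs * 60, 16);   % illustrative segment: 60 s x 16 channels
windowSecs = [10, 20, 60];   % hypothetical window lengths in seconds

% Placeholder feature groups that all work on the same power spectrum.
totalPower      = @(P) sum(P, 1);
meanLogPower    = @(P) mean(log10(P + eps), 1);
spectralEntropy = @(P) -sum((P ./ sum(P, 1)) .* log2(P ./ sum(P, 1) + eps), 1);

nChannels = size(data, 2);
nFFTCalls = 0;
features = cell(numel(windowSecs), 1);
for w = 1:numel(windowSecs)
    winLen = windowSecs(w) * fs;
    nWin = floor(size(data, 1) / winLen);

    % Split each channel into non-overlapping windows:
    % windows is winLen x nWin x nChannels.
    windows = reshape(data(1:winLen * nWin, :), winLen, nWin, nChannels);

    % The only fft call for this window size (operates along dimension 1).
    spec = fft(windows);
    nFFTCalls = nFFTCalls + 1;

    % Power in the first half of the bins (DC up to just below Nyquist).
    P = abs(spec(1:floor(winLen / 2), :, :)).^2;

    % Every feature group reuses P instead of calling fft again.
    features{w} = struct( ...
        'totalPower',   reshape(totalPower(P),      nWin, nChannels), ...
        'meanLogPower', reshape(meanLogPower(P),    nWin, nChannels), ...
        'entropy',      reshape(spectralEntropy(P), nWin, nChannels));
end

fprintf('fft calls: %d for %d window sizes\n', nFFTCalls, numel(windowSecs));
```

Each entry of features then holds one nWin x nChannels matrix per feature group for that window size.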