emanuel-metzenthin / Lime-For-Time

Application of the LIME algorithm by Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin to the domain of time series classification

Discussion - does the time dimension really matter in these datasets? #3

Closed sometimescasey closed 3 years ago

sometimescasey commented 5 years ago

Hi guys! Thanks so much for putting this work up. Codebase runs great - all the notebooks run fine for me and the outputs look good. I have a more conceptual question about how LIME is applied to your two chosen datasets - hope this is the right place to ask :)

Do you think it's reasonable to say that the "time" dimension of the two datasets you have chosen is not actually important to the classification problem? As in, I could apply the same random permutation to the order of the "slices" in every series to generate a new training set, retrain, and the model behaviour would not change? And your LIME explainer would still highlight the same "slices" as the most important ones, albeit in the newly shuffled order?
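The shuffle experiment described above can be sketched as follows. This is a minimal illustration, not code from the repository: the data is synthetic and integer-valued, the classifier is a hand-written 1-nearest-neighbour with squared Euclidean distance (chosen because it provably ignores sample order), and `shuffle_slices`, `predict_1nn`, and `make_series` are hypothetical helper names. It assumes the same slice permutation is applied to every training and test series.

```python
import random

def shuffle_slices(series, order, num_slices):
    """Reorder equal-length contiguous slices of one series according to `order`."""
    size = len(series) // num_slices  # assumes len(series) is divisible by num_slices
    slices = [series[i * size:(i + 1) * size] for i in range(num_slices)]
    return [value for idx in order for value in slices[idx]]

def predict_1nn(train_x, train_y, query):
    """1-nearest-neighbour with squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in train_x]
    return train_y[dists.index(min(dists))]

rng = random.Random(0)
num_slices, length = 4, 24

def make_series(label):
    # Synthetic stand-in data: class 1 has a bump inside the second slice.
    base = [rng.randint(0, 3) for _ in range(length)]
    if label == 1:
        for t in range(6, 12):
            base[t] += 5
    return base

train_y = [i % 2 for i in range(20)]
train_x = [make_series(y) for y in train_y]
test_y = [i % 2 for i in range(10)]
test_x = [make_series(y) for y in test_y]

# One permutation of slice positions, shared by every series.
order = list(range(num_slices))
rng.shuffle(order)

train_shuffled = [shuffle_slices(s, order, num_slices) for s in train_x]
test_shuffled = [shuffle_slices(s, order, num_slices) for s in test_x]

preds_original = [predict_1nn(train_x, train_y, q) for q in test_x]
preds_shuffled = [predict_1nn(train_shuffled, train_y, q) for q in test_shuffled]

# Identical predictions: the distance is a sum over positions, so a shared
# reordering of positions leaves every pairwise distance unchanged.
print(preds_original == preds_shuffled)  # True
```

For a classifier like this one, the predictions match exactly, which is the behaviour the question anticipates. Whether the same holds for the models in the notebooks would need to be checked by rerunning them on shuffled data.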

I looked through your .pdf presentation and I still have some questions:

wrt the coffee dataset: do you have any more context on how this dataset was generated / preprocessed? Based on your cited source (https://www.researchgate.net/publication/229137574_Discrimination_between_Arabica_and_Robusta_green_coffee_using_visible_micro_Raman_spectroscopy_and_chemometric_analysis) I am guessing that the x-axis of the data is the averaged wavenumber from the Raman spectra of the coffee bean, not a time series...is that correct?

I don't have a great grasp on the details, but my grossly oversimplified understanding is that different wavenumbers correspond to different chemical components (lipids, acids, caffeine, etc.) within the bean. In this sense the different "slices" correspond to different "chemical" dimensions, not time buckets...is that correct? Please let me know if I'm missing something!

wrt the ECG dataset: I assume these are measures of heart activity over time. (Which specific dataset is this?) However, since the class target for each row remains constant throughout the entire data series, I believe we should be able to shuffle the "slice order" and still obtain the exact same classification outputs. Is that correct?
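One caveat worth checking: whether shuffling leaves the outputs unchanged depends on what the model computes from each row, not only on the label being constant per row. The sketch below uses a made-up ramp signal (not the ECG data) and two hypothetical features, `mean` and `roughness`, to show that an order-free statistic survives slice shuffling while one built from neighbouring samples does not. The slice order `[2, 0, 3, 1]` is an arbitrary choice for illustration.

```python
def shuffle_slices(series, order, num_slices):
    """Reorder equal-length contiguous slices of one series according to `order`."""
    size = len(series) // num_slices  # assumes len(series) is divisible by num_slices
    slices = [series[i * size:(i + 1) * size] for i in range(num_slices)]
    return [value for idx in order for value in slices[idx]]

def mean(series):
    """Order-free feature: unaffected by any reordering of the samples."""
    return sum(series) / len(series)

def roughness(series):
    """Order-dependent feature: sum of absolute differences between neighbours."""
    return sum(abs(b - a) for a, b in zip(series, series[1:]))

ramp = list(range(20))                               # steadily rising toy signal
ramp_shuffled = shuffle_slices(ramp, [2, 0, 3, 1], 4)

print(mean(ramp) == mean(ramp_shuffled))             # True
print(roughness(ramp), roughness(ramp_shuffled))     # 19 55
```

A model that only uses features of the first kind would behave as described in the question; a model that picks up on local shape (as convolutional or distance-warping approaches can) might not.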

Please let me know if I'm misunderstanding something! Thank you :)