TNTLFreiburg / braindecode

Outdated, see new https://github.com/braindecode/braindecode
BSD 3-Clause "New" or "Revised" License

Implementation misunderstanding - cropped iterator #33

Closed erap129 closed 5 years ago

erap129 commented 6 years ago

Hello, regarding your paper "Deep Learning With Convolutional Neural Networks for EEG Decoding and Visualization": I am currently re-implementing your experiments on the BCI-IV-2a dataset in Keras, and I stumbled upon something I didn't understand. In the paper you state that the cropped version of classification takes crops of 522 samples, which means 1125 - 522 + 1 = 604 training examples per trial. What I see in your implementation (specifically CropsFromTrialsIterator) is that it creates the supercrops and uses them directly as training data! For example, from what I saw, the CNN is fed 2 crops per trial: for dataset IV-2a the trial length is 1125 and the supercrop size is 1000, so the CNN receives the crops [0, 1000] and [125, 1125] for each trial. Is this a mistake on my part, or did I misunderstand the process described in the paper?
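For concreteness, here is a small sketch (my own code, not the library's) of the two windows I mean:

```python
# Two overlapping supercrops covering a 1125-sample trial with size 1000
n_trial, n_supercrop = 1125, 1000
starts = [0, n_trial - n_supercrop]               # [0, 125]
windows = [(s, s + n_supercrop) for s in starts]
print(windows)                                    # [(0, 1000), (125, 1125)]
```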

Sorry if this is irrelevant as a github issue, but I didn't find a more appropriate place to ask. Thanks in advance, Elad

Mahelita commented 6 years ago

Hi Elad,

This documentation might help in understanding the crop/supercrop decoding: https://robintibor.github.io/braindecode/notebooks/Cropped_Decoding.html If this does not answer your question, then maybe Robin can give you an intuition.

Best, Lukas

robintibor commented 6 years ago

Hi Elad,

so basically you have to distinguish between what happens conceptually and how it is implemented.

Conceptually: Crops (let's say of size 522) are taken out of the trials, leading to 604 training samples per trial as you stated, and these are used to train the network.

Concrete Implementation: "Super crops" are fed to the network (of size 1000 in this case), and dilated convolutions are used to get dense outputs, i.e. the 1000 - 522 + 1 = 479 predictions, one for each of the 479 crops contained in the super crop. The crucial thing to realize is that, for our ConvNets without padding, these predictions will be identical to the ones you would get if you passed the 479 crops individually through the ConvNet (using strides, not dilated convolutions). Now, as you correctly state, we use 2 super crops, from 0-1000 and from 125-1125. So you will get some predictions twice; these duplicates can be removed to compute the mean prediction for a trial for accuracy evaluations, and this is what our CroppedTrialMisclassMonitor does https://robintibor.github.io/braindecode/source/braindecode.experiments.html#braindecode.experiments.monitors.CroppedTrialMisclassMonitor . So the differences to the conceptual version are only that the predictions are computed jointly on super crops via dilated convolutions rather than crop by crop, and that duplicate predictions from the overlapping super crops have to be removed.
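As a rough sketch of the second point (toy numbers and my own names, not the actual braindecode API), you can think of mapping each super crop prediction back to its absolute crop position within the trial and then averaging over the 604 positions:

```python
import numpy as np

n_trial, n_supercrop, n_crop, n_classes = 1125, 1000, 522, 4
n_preds_per_supercrop = n_supercrop - n_crop + 1      # 479
starts = [0, n_trial - n_supercrop]                   # [0, 125]

rng = np.random.RandomState(0)
# stand-in for the dense network outputs of the two super crops
supercrop_preds = rng.rand(len(starts), n_classes, n_preds_per_supercrop)

# place each prediction at its absolute crop position; with the real
# padding-free ConvNet the predictions in the overlap are identical,
# so keeping one copy per position removes the duplicates
trial_preds = np.empty((n_classes, n_trial - n_crop + 1))  # 604 crop positions
for preds, start in zip(supercrop_preds, starts):
    trial_preds[:, start:start + n_preds_per_supercrop] = preds

predicted_label = trial_preds.mean(axis=1).argmax()
```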

Some of this is explained in Methods -> Cropped Training in the paper, and also in the link from the comment above: https://github.com/robintibor/braindecode/issues/33#issuecomment-428190779

Do you have any further questions? Is it clearer now?

erap129 commented 6 years ago

Ah, I gracefully avoided the topic of dilated convolutions and now it's coming back at me. So you're saying that using this method we get 479 predictions from one forward pass? Or do we get the averaged result as output? Anyway, I should probably get my head around dilated convolutions and study the to_dense_prediction_model function before I ask any more. Thanks for the detailed response. Elad

robintibor commented 6 years ago

We get 479 predictions. During training we average these predictions before applying the loss function; however, for our choice of nonlinearity (log softmax) and loss function (categorical cross entropy), this leads to identical gradients compared to applying the loss function to each individual prediction and averaging the losses. But yes, the model itself outputs 479 predictions.
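A quick numerical check of that equivalence (a plain PyTorch sketch, not braindecode code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, n_crops = 4, 479
logits = torch.randn(1, n_classes, n_crops)
log_probs = F.log_softmax(logits, dim=1)      # per-crop log softmax outputs
target = torch.tensor([2])

# (a) average the per-crop predictions, then apply the loss
loss_a = F.nll_loss(log_probs.mean(dim=2), target)

# (b) apply the loss to each crop prediction, then average the losses
loss_b = F.nll_loss(log_probs.permute(0, 2, 1).reshape(-1, n_classes),
                    target.repeat(n_crops))

print(torch.allclose(loss_a, loss_b))         # True: same loss, hence same gradients
```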

erap129 commented 6 years ago

OK, will investigate more. Thanks!

robintibor commented 6 years ago

If you want to understand dilated convolutions in this context, one idea could be to take a look at figure 4 in our paper: https://onlinelibrary.wiley.com/doi/full/10.1002/hbm.23730 and to redraw the lower part of the figure with dilated convolutions, as in the "filters with holes" conceptualization. In the lower part of the figure, the "Split Stride Offsets" and "Interleave" are actually a way to implement dilated convolutions, which is the way we used in our previous Lasagne/Theano code. It is also explained in the documentation for TensorFlow: https://www.tensorflow.org/api_docs/python/tf/nn/atrous_conv2d
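If it helps, here is a minimal sketch (my own example in PyTorch, not from the Lasagne/Theano code) of the "split stride offsets + interleave" view of a dilated convolution:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 3                           # dilation factor
x = torch.randn(1, 1, 30)       # (batch, channels, time)
w = torch.randn(1, 1, 4)        # one 1D filter of length 4

# direct dilated ("filter with holes") convolution
out_dilated = F.conv1d(x, w, dilation=d)

# split stride offsets: ordinary convolution on each of the d offset
# subsequences, then interleave the results along the time axis
parts = [F.conv1d(x[:, :, j::d], w) for j in range(d)]
interleaved = torch.stack(parts, dim=-1).reshape(1, 1, -1)

print(torch.allclose(out_dilated, interleaved))   # True
```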

erap129 commented 6 years ago

Awesome! I'll look into dilated convolutions in general first and then try what you suggested. Thank you.

erap129 commented 6 years ago

Hi, from figure 4 in the paper I understand that the dilation is actually performing the steps: (i) split stride offsets, (ii) convolution, (iii) interleave. But I'm still trying to grasp what is going on in the bigger network. I see from the code that all maxpool layers are given a stride of 1, and that the dilation in each compatible (maxpool, convolution) layer increases by a factor of 3 for every maxpool layer along the way. Is there any further intuition on why this is equivalent to cropping? Many toy examples on paper have let me down... Thanks in advance

robintibor commented 6 years ago

Hmm, too bad the toy examples have let you down. I mean, I tried to show a minimal example in our paper https://onlinelibrary.wiley.com/doi/full/10.1002/hbm.23730 in Figure 3. Does https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf help you? Note that "filter rarefication" -> dilated convolutions. As for other works: the first paper I know of that used this kind of trick, but only for testing, is http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.468.6169&rep=rep1&type=pdf and later https://arxiv.org/pdf/1312.6229.pdf (see sections 3.3-3.5).

But toy examples should definitely work! Keep in mind never to use padding. You can also create some toy inputs and a simple network like conv-pool-conv-pool-conv-pool, with each pool using some stride, and apply it to different windows within a larger input. Then remove the strides and add dilations as I do, so that each dilation is the product of the strides of the previous layers, and verify that the output is the same, as in the sketch below.
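For example, a minimal sketch along those lines (a toy network, not one of the braindecode models) could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv1 = nn.Conv1d(1, 4, kernel_size=3)
conv2 = nn.Conv1d(4, 2, kernel_size=3)

def per_crop_net(x):
    # original network: conv -> maxpool(stride 2) -> conv, no padding anywhere
    return conv2(F.max_pool1d(conv1(x), kernel_size=2, stride=2))

def dense_net(x):
    # stride removed from the pool; dilation of the following conv set to the
    # product of the removed strides (here: 2)
    pooled = F.max_pool1d(conv1(x), kernel_size=2, stride=1)
    return F.conv1d(pooled, conv2.weight, conv2.bias, dilation=2)

x = torch.randn(1, 1, 20)    # a "trial" of 20 samples
crop_size = 8                # receptive field of per_crop_net (one output per crop)
dense_out = dense_net(x)     # one prediction per possible crop start

for start in range(x.shape[-1] - crop_size + 1):
    crop_out = per_crop_net(x[:, :, start:start + crop_size])
    assert torch.allclose(crop_out[:, :, 0], dense_out[:, :, start], atol=1e-6)
print("dense and per-crop predictions match")
```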

Can you show me some toy example on paper that does not work? Maybe I can help you better then.

erap129 commented 5 years ago

Hi, sorry for the 2-month delay, but I got back to this now. I think I understood the process via toy examples and some toy code. My problem, though, is that I am trying to generalize this process to work on any given CNN. I noticed, for instance, that the final convolution size is hardcoded to 2 for the deep model and 30 for the shallow model, and I'm trying to understand why this couldn't be inferred automatically. My goal is to find a way to infer the final convolution kernel size for any given CNN (from that I will also get the required crop size by running a dummy data sample). Do you have an idea how I should do this? Thanks!

robintibor commented 5 years ago

Well, you can use the code I use for trial-based/non-cropped decoding to automatically determine the wanted final conv size: https://github.com/robintibor/braindecode/blob/master/braindecode/models/deep4.py#L28 (keep in mind that the model does not have its final conv layer at that point; but you could also create the full model, then remove and re-add that layer).
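Roughly, the idea is the following (a sketch with assumed names, not the exact deep4.py code):

```python
import torch

def infer_final_conv_length(partial_model, in_chans, input_time_length):
    # Run a dummy input through the model without its final conv layer and
    # read off the remaining temporal length; that is the kernel size the
    # final convolution needs in order to reduce it to a single output.
    # (Old braindecode models expect input of shape (batch, chans, time, 1).)
    dummy = torch.zeros(1, in_chans, input_time_length, 1)
    with torch.no_grad():
        out = partial_model(dummy)
    return out.shape[2]
```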