Hello,
Regarding your paper "Deep Learning With Convolutional Neural Networks for EEG Decoding and Visualization": I am currently re-implementing your experiments on the BCI-IV-2a dataset in Keras, and I stumbled upon something I didn't understand. In the paper you state that the cropped version of classification takes crops of size 522 samples, which means 1125 - 522 + 1 = 604 training examples per trial. What I see in your implementation (specifically in CropsFromTrialsIterator) is that it creates the super crops and uses them directly as training data! So, for example, from what I saw, the CNN is fed 2 crops per trial: for dataset IV-2a the trial length is 1125 and the super crop size is 1000, so the CNN receives the crops [0, 1000] and [125, 1125] for each trial. Is this a mistake on my part? Or did I not understand the process described in the paper?
Sorry if this is irrelevant as a github issue, but I didn't find a more appropriate place to ask. Thanks in advance, Elad
Hi Elad,
this documentation might help in understanding the crop/supercrop decoding: https://robintibor.github.io/braindecode/notebooks/Cropped_Decoding.html If this does not answer your question, then maybe Robin can give you an intuition.
Best, Lukas
Hi Elad,
so basically you have to distinguish between what happens conceptually and how it is implemented.
Conceptually: Crops (let's say of size 522) are taken out of the trials, leading to 604 training samples per trial as you stated, and these are used to train the network.
Concrete Implementation: "Super crops" are fed to the network (of size 1000 in that case), and dilated convolutions are used to get dense outputs, i.e. the 1000 - 522 + 1 = 479 predictions, one for each of the 479 crops contained in the super crop. The crucial thing to realize is that for our ConvNets without padding, these predictions will be identical to the ones you would get if you passed the 479 crops individually through the ConvNet (using strides, not dilated convolutions). Now, as you correctly state, we use 2 super crops, from 0-1000 and from 125-1125. So you will get some predictions twice, and these can be removed to compute the mean prediction per trial for accuracy evaluations; this is what our CroppedTrialMisclassMonitor does: https://robintibor.github.io/braindecode/source/braindecode.experiments.html#braindecode.experiments.monitors.CroppedTrialMisclassMonitor So, the differences to the conceptual version are: crops are not passed through the network one by one but jointly inside super crops (via dilated convolutions), and the overlap between the two super crops produces duplicate predictions, which are removed before averaging.
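To make the bookkeeping concrete, here is a small standalone sketch (illustrative only, not the CropsFromTrialsIterator code itself) of how the two super crops cover all 604 crops of a trial and how many predictions are duplicated:

```python
# Toy bookkeeping for cropped decoding on BCI-IV-2a-sized trials.
trial_len, supercrop_len, crop_len = 1125, 1000, 522
n_preds_per_supercrop = supercrop_len - crop_len + 1      # 479

supercrop_starts = [0, trial_len - supercrop_len]         # [0, 125]
crop_starts = [
    list(range(s, s + n_preds_per_supercrop)) for s in supercrop_starts
]
# Super crop 1 yields predictions for crops starting at 0..478,
# super crop 2 for crops starting at 125..603.
all_starts = crop_starts[0] + crop_starts[1]
unique_starts = sorted(set(all_starts))
assert len(unique_starts) == trial_len - crop_len + 1     # all 604 crops covered
print(len(all_starts) - len(unique_starts), "duplicate predictions")  # 354
```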
Some of this is explained in Methods -> Cropped training in the paper, and also in the link mentioned above, in comment https://github.com/robintibor/braindecode/issues/33#issuecomment-428190779
Do you have any further questions? Is it clearer now?
Ah, I gracefully avoided the topic of dilated convolutions and now it's coming back at me. So you're saying that using this method we get 479 predictions from one forward pass? Or do we get the averaged result as output? Anyway, I should probably get my head around dilated convolutions and study the to_dense_prediction_model function before I ask any more. Thanks for the detailed response. Elad
We get 479 predictions. During training, we average these predictions before applying the loss function. However, for our choice of nonlinearity (log softmax) and loss function (categorical cross entropy), this leads to gradients identical to applying the loss function to each individual prediction and averaging the losses. But the model itself, yes, outputs 479 predictions.
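Since the negative log likelihood of log softmax outputs is linear in the log-probabilities, averaging before or after the loss gives the same value and therefore the same gradients. A quick numerical check in PyTorch (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(479, 4, requires_grad=True)       # one row per crop, 4 classes
targets = torch.full((479,), 2, dtype=torch.long)      # same true class for every crop
log_probs = F.log_softmax(logits, dim=1)

# Variant 1: average the per-crop log-probabilities, then apply the loss.
loss_mean_first = F.nll_loss(log_probs.mean(dim=0, keepdim=True), targets[:1])
# Variant 2: apply the loss to each crop prediction, then average the losses.
loss_loss_first = F.nll_loss(log_probs, targets)

grad_a, = torch.autograd.grad(loss_mean_first, logits, retain_graph=True)
grad_b, = torch.autograd.grad(loss_loss_first, logits)
print(torch.allclose(grad_a, grad_b))  # True
```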
OK, will investigate more. Thanks!
If you want to understand dilated convolutions in this context, one idea could be to take a look at figure 4 in our paper: https://onlinelibrary.wiley.com/doi/full/10.1002/hbm.23730 and to draw the lower part of the figure with dilated convolutions, as in the "filters with holes" conceptualization. In the lower part of the figure, the "Split Stride Offsets" and "Interleave" steps are actually a way to implement dilated convolutions, the way we used in our previous Lasagne/Theano code. It is also explained in the documentation for tensorflow: https://www.tensorflow.org/api_docs/python/tf/nn/atrous_conv2d
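In case a runnable example helps, here is a minimal sketch (toy sizes, random weights) of the "Split Stride Offsets" + "Interleave" idea for a single strided convolution, where the dilation is still 1: running the strided convolution once per stride offset and interleaving the results reproduces the dense, stride-1 output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 20)          # (batch, channels, time)
w = torch.randn(1, 1, 3)           # one kernel of size 3
stride = 2

# Dense output: plain convolution with stride 1.
dense = F.conv1d(x, w, stride=1)   # length 20 - 3 + 1 = 18

# "Split Stride Offsets": run the strided conv once per offset ...
per_offset = [F.conv1d(x[:, :, o:], w, stride=stride) for o in range(stride)]
# ... then "Interleave" the per-offset outputs back into one dense sequence.
interleaved = torch.stack(per_offset, dim=-1).flatten(start_dim=2)

print(torch.allclose(dense, interleaved))  # True
```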
Awesome! I'll look into dilated convolutions in general first and then try what you suggested. Thank you.
Hi, from figure 4 in the paper I understand that the dilation is actually performing these steps: (i) split stride offsets, (ii) convolution, (iii) interleave. But I'm still trying to grasp what is going on in the bigger network. I see from the code that all maxpool layers are given a stride of 1, and that the dilation increases in each compatible (maxpool, convolution) layer by a factor of 3 for every maxpool layer along the way. Is there any further intuition on why this is equivalent to cropping? Many toy examples on paper have let me down... Thanks in advance
Hmm, too bad toy examples have let you down. I mean, I tried to show a minimal example in Figure 3 of the paper https://onlinelibrary.wiley.com/doi/full/10.1002/hbm.23730. Does https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf help you? Note, "filter rarefication" -> dilated convolutions. Other works: the first paper I know that used this kind of trick, but only for testing, is http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.468.6169&rep=rep1&type=pdf and later https://arxiv.org/pdf/1312.6229.pdf (see 3.3-3.5).
But toy examples should definitely work! Keep in mind to never use padding. You can also program some toy inputs and make a simple network like conv pool conv pool conv pool, the pools always with some stride, and apply it to different windows within some larger input. Then change the network to remove the strides and add dilations as I do, so that each layer's dilation is the product of the strides of the previous layers, and verify the output is the same; something like the sketch below.
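For instance (a sketch with random weights and toy sizes, not code from the repository), a conv-pool-conv-pool network without padding, once with strided pooling applied crop by crop and once with stride-1 pooling plus dilations on the full input:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 40)                  # (batch, channels, time)
w1 = torch.randn(4, 1, 3)                  # conv1: 4 filters, kernel size 3
w2 = torch.randn(4, 4, 3)                  # conv2: kernel size 3

def strided_net(inp):
    # conv -> pool -> conv -> pool, no padding, pools with stride 2
    h = F.max_pool1d(F.conv1d(inp, w1), kernel_size=2, stride=2)
    return F.max_pool1d(F.conv1d(h, w2), kernel_size=2, stride=2)

def dilated_net(inp):
    # same weights, all strides set to 1, each layer's dilation equal to
    # the product of the strides of the layers before it (here 1, 1, 2, 2)
    h = F.max_pool1d(F.conv1d(inp, w1), kernel_size=2, stride=1)
    return F.max_pool1d(F.conv1d(h, w2, dilation=2),
                        kernel_size=2, stride=1, dilation=2)

crop_len = 10                              # receptive field of strided_net
dense = dilated_net(x)                     # one prediction per possible crop
per_crop = torch.cat(
    [strided_net(x[:, :, t:t + crop_len])
     for t in range(x.shape[-1] - crop_len + 1)], dim=-1)
print(torch.allclose(dense, per_crop))     # True
```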
Can you show me some toy example on paper that does not work? Maybe I can help you better then.
Hi, sorry for the 2-month delay, but I got back to this now. I think I understood the process via toy examples and some toy code. My problem, though, is that I am trying to generalize this process to work on any given CNN. I noticed, for instance, that the final convolution size is hardcoded to 2 for the deep model and 30 for the shallow model, and I'm trying to understand why this couldn't be inferred automatically. My goal is to find a way to infer the final convolution kernel size for any given CNN (from that I will also get the required crop size by running a dummy data sample). Do you have an idea how I should do this? Thanks!
Well, you can use the code I use for trial-based/non-cropped decoding to automatically determine the wanted final conv size: https://github.com/robintibor/braindecode/blob/master/braindecode/models/deep4.py#L28 (keep in mind the model does not have the final conv layer at that point, but you could also create the full model, then remove and re-add that layer). See the sketch below.
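The idea behind that code, in generic form (a sketch under assumptions, not braindecode's exact implementation; infer_final_conv_length is a name made up here): run a dummy forward pass through the model without its final conv layer and read off how many time steps remain.

```python
import torch

def infer_final_conv_length(model_body, n_chans, input_time_length):
    """Return the kernel length the final conv layer needs so that
    `model_body` followed by that conv yields a single prediction.

    `model_body` is the network without its final conv layer and is
    assumed here to take input of shape (batch, channels, time)."""
    model_body.eval()
    with torch.no_grad():
        dummy = torch.ones(1, n_chans, input_time_length)
        out = model_body(dummy)
    return out.shape[-1]   # time steps left just before the final conv
```

With an input of the intended crop length, the returned value is exactly the kernel size that makes the final conv output a single prediction per crop.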