ardaduz / cil-road-segmentation

Road Segmentation from Aerial Images - Computational Intelligence Lab, ETHZ, Spring 2019

Improvements for baseline-cnn #2

Open jonashein opened 5 years ago

jonashein commented 5 years ago

Hey guys, I want to use this issue to give you a short overview of the work I did so far on the baseline-cnn. After I found the tutorial + code (see the link in the README.md), my main goal was to get it running on our data set in order to get a first idea of how good it performs.

Since the model performed really well out of the box, we could use it as a foundation and try to improve it. While working on the code, I found several ways to improve the model. In this issue I want to document them so that we can work on them later. Feel free to open new, separate issues for some of the items; I'll just give an overview here:

Juan: the patch idea is good I believe, and we could resolve overlaps by averaging, or by an average weighted with the distance to the border of each patch.
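A minimal NumPy sketch of that blending idea (all names and shapes are placeholders, not code from our repo): each overlapping patch prediction is weighted by its distance to the patch border before averaging.

```python
import numpy as np

def blend_patches(patch_preds, positions, image_shape, patch_size):
    """Blend overlapping patch predictions into one full-size mask.

    Each patch pixel is weighted by its distance to the patch border, so pixels
    near the centre of a patch count more than pixels near its edge.
    (Hypothetical helper; argument names and shapes are assumptions.)
    """
    # Weight map: highest in the patch centre, falling off towards the border.
    ys, xs = np.mgrid[0:patch_size, 0:patch_size]
    dist_to_border = np.minimum.reduce([ys, xs, patch_size - 1 - ys, patch_size - 1 - xs])
    weights = dist_to_border.astype(np.float32) + 1.0  # avoid zero weight at the border

    acc = np.zeros(image_shape, dtype=np.float32)
    norm = np.zeros(image_shape, dtype=np.float32)
    for pred, (y, x) in zip(patch_preds, positions):
        acc[y:y + patch_size, x:x + patch_size] += pred * weights
        norm[y:y + patch_size, x:x + patch_size] += weights
    return acc / np.maximum(norm, 1e-8)
```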

Juan: probably irrelevant, but we could also try the cv2.resize function instead of TensorFlow's, you never know.
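In case we try it, a quick sketch (dummy arrays only, not our data pipeline; the interpolation choice might also matter):

```python
import cv2
import numpy as np

# Dummy inputs just for the sketch; in practice these come from the dataset.
image = np.random.rand(400, 400, 3).astype(np.float32)
mask = np.random.rand(400, 400).astype(np.float32)

# INTER_AREA tends to work well for downscaling, INTER_LINEAR/CUBIC for upscaling.
image_small = cv2.resize(image, (256, 256), interpolation=cv2.INTER_AREA)
mask_big = cv2.resize(mask, (608, 608), interpolation=cv2.INTER_LINEAR)
```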

Juan: I think this way we would be missing a lot of information. My intuition is that the bigger the patches the better, because there is more context for the network to extract patterns such as lines or circles.

Juan: I am not sure this would be legal. And also we can't train this parameter. We might try a couple of values for the final submission anyway.

Juan: I have tried training with RMSE as loss out of curiosity and it doesn't work. All the experiments I've tried so far also used dice loss or a combination with cross entropy, but I couldn't notice any difference between them.
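For reference, a soft dice loss and an equal-weight combination with binary cross-entropy could look roughly like this in TensorFlow (just a sketch; not necessarily the exact loss used in these experiments):

```python
import tensorflow as tf

def soft_dice_loss(y_true, y_pred, eps=1e-6):
    """Soft dice loss on per-pixel road probabilities."""
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + eps) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)
    return 1.0 - dice

def combined_loss(y_true, y_pred):
    """Equal-weight combination of binary cross-entropy and soft dice loss."""
    bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return tf.reduce_mean(bce) + soft_dice_loss(y_true, y_pred)
```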

Juan: Yes you are right, but perhaps it is not so useful if we can use Google's API data.

Feel free to comment on the points or add more points. Oh, and we should also have a look at the literature.

Juan: https://arxiv.org/pdf/1711.10684.pdf I found this paper (probably you did too) and tried their model on our data. In particular, I copied the model they had in a repo (which is different from the one in the paper, I also tried that one) and played with kernel sizes and depth. The performance wasn't better than our baseline, and in fact the model is very similar to what Jonas already implemented. I believe we are on the right track taking U-Net as the base model, but in case we get stuck we can use some of these other state-of-the-art methods as inspiration: https://medium.com/@arthur_ouaknine/review-of-deep-learning-algorithms-for-image-semantic-segmentation-509a600f7b57

jonashein commented 5 years ago

Juan: Yes, you are totally right. I've already tested this, reducing the depth of U-Net by 1, and the results were very close. We could also try cropping the training images to 384x384 (which would also work as data augmentation), so that we could keep depth 5.
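A possible random-crop helper (sketch only; 384 is divisible by 2^5, so a U-Net with five pooling stages still fits):

```python
import numpy as np

def random_crop(image, mask, crop_size=384):
    """Randomly crop the same window from an image and its groundtruth mask.

    384 is divisible by 2**5, so five pooling steps still produce integer sizes.
    """
    h, w = image.shape[:2]
    y = np.random.randint(0, h - crop_size + 1)
    x = np.random.randint(0, w - crop_size + 1)
    return (image[y:y + crop_size, x:x + crop_size],
            mask[y:y + crop_size, x:x + crop_size])
```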

juanlofer commented 5 years ago
jonashein commented 5 years ago

I believe that some postprocessing might help. We can somehow exploit the geometry of roads by thickening/extending lines in the predicted mask.

tl;dr: Looking at the predictions of the baseline-cnn, I also think that this is a good idea. The width of the predicted roads varies a lot and the roads are often interrupted. There are many parallel streets and 90° intersections in the images which we could try to exploit.

I'll attach one example below with the original image, the prediction, and a thresholded prediction (which I just created manually to check which roads are detected even with a very low probability). In this test image, the roads on the parking space are all detected, but the model assigned a very low probability to most of the parallel roads, which is why they are barely visible in the prediction. Only in the thresholded prediction (threshold=4, i.e. a probability of 4/256 ≈ 1.6%) are they clearly visible (of course the thresholded image is a bit over the top, it's just for illustration purposes). From this single example and a couple of other test images that I looked at in the last ~~30min~~ ~~hour~~ two hours, I'd say that the model is definitely tending towards predicting the background class, i.e. we probably have more false negatives than false positives. I couldn't find a good example of false positives, but false negatives are in almost every image. However, I didn't work on this in the last weeks and I only had a look at a fraction of the test images (and none of the training/validation images), so take this with a grain of salt.

Of course we can create an ensemble out of multiple models.

Original Test_25 image: test_25_image
Baseline-CNN prediction: test_25_prediction
Thresholded prediction: test_25_threshold4
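For reference, the thresholded prediction above can be reproduced with something like this (filenames are just the attachment names above; cv2.threshold with thresh=3 marks every pixel with value >= 4 as road):

```python
import cv2

# Load the 8-bit prediction and keep every pixel with probability >= 4/256 (~1.6 %).
pred = cv2.imread("test_25_prediction.png", cv2.IMREAD_GRAYSCALE)
_, thresholded = cv2.threshold(pred, 3, 255, cv2.THRESH_BINARY)
cv2.imwrite("test_25_threshold4.png", thresholded)
```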

juanlofer commented 5 years ago

I see, then we could try lowering the value of foreground_threshold for a preliminary submission to get an idea of the impact of this. Have a look at these two sets of predictions: the left one has more background and is smoother than the one on the right. I didn't write down which models these outputs belong to, so I can't tell what produced the differences. It could be the depth of the network, the input size (128/256), the loss (dice or combined), or just the number of epochs.

image
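For context, the per-patch labelling in mask_to_submission works roughly like this (a sketch based on the usual CIL helper; the exact threshold value in our repo may differ), which is why lowering foreground_threshold would mark more patches as road:

```python
import numpy as np

foreground_threshold = 0.25  # fraction of road pixels needed to call a 16x16 patch "road" (assumed default)

def patch_to_label(patch):
    """Label a 16x16 patch as road (1) if its mean probability exceeds the threshold."""
    return 1 if np.mean(patch) > foreground_threshold else 0

def mask_to_patch_labels(mask, patch_size=16):
    """Convert a full-size probability mask into per-patch labels, as the submission format expects."""
    h, w = mask.shape
    return np.array([[patch_to_label(mask[y:y + patch_size, x:x + patch_size])
                      for x in range(0, w, patch_size)]
                     for y in range(0, h, patch_size)])
```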

jonashein commented 5 years ago

The predictions on the right look like black-and-white images, as they have only strong gradients and no gray pixels (i.e. ~50% probabilities). In comparison, the model on the left is less sure about what is part of a road and what isn't, but it seems to detect more roads than the one on the right, even if these roads have a relatively low probability. The one on the right looks like overfitting to me, what do you think?

Juan: I am not sure if there is overfitting or not. What really confuses me, and what would explain why these apparently very different predictions get similar scores on Kaggle, is that in the end mask_to_submission rounds off the predictions. This could also explain why the validation score is much better, since it takes uncertainty into account. I have just tested submitting a prediction with probabilities and it doesn't throw any errors, but perhaps it just rounds off the predictions internally.

About the postprocessing: One idea that I have is to exploit the grid-like layout by looking at the major orientation of the gradients in the predicted images, as well as the gradients which are orthogonal to this orientation (i.e. via a Hough transformation? idk, it's late). Both the major gradient orientation and the orthogonal gradient orientation are basically "voting" for the same grid orientation. If we find this dominant grid orientation, we could boost all gradients which are aligned with this grid. Of course we cannot suppress any gradients which do not align with this grid (since there might be diagonal roads), but this way we could probably improve the detection of (straight) parallel and orthogonal roads. (I hope it's clear what I mean 😄 )
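A rough sketch of that idea with OpenCV (all parameter values are guesses; folding angles modulo 90° makes parallel and orthogonal roads vote for the same grid orientation):

```python
import cv2
import numpy as np

def dominant_grid_orientation(prediction, prob_threshold=128):
    """Estimate the dominant road orientation (in degrees) from a predicted mask.

    Lines found by the Hough transform vote for their angle modulo 90 degrees,
    so parallel and orthogonal roads support the same grid orientation.
    """
    binary = (prediction >= prob_threshold).astype(np.uint8) * 255
    edges = cv2.Canny(binary, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 80)
    if lines is None:
        return None
    angles = np.degrees(lines[:, 0, 1]) % 90.0  # fold parallel/orthogonal lines together
    hist, bin_edges = np.histogram(angles, bins=90, range=(0.0, 90.0))
    return bin_edges[np.argmax(hist)]
```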

Juan: Never heard of the Hough transformation, but I think your idea is good 👍. Another possibility, probably infeasible, is to use another ML method to transform predictions, for example style transfer. I would bet we would lose a lot of accuracy there, but it would be super cool if it works. We could also try to introduce a penalty/reward term in the loss, though maybe doing this pixelwise is not possible.

Btw, I forgot to comment on one of your comments, regarding my idea to reduce the output image size:

Juan: I think this way we would be missing a lot of information. My intuition is that the bigger the patches the better, because there is more context for the network to extract patterns such as lines or circles.

I agree that we shouldn't reduce the input image size (since the context is very important), but we could still try to remove the deconvolutional layers and just output a 38x38 image (i.e. output one pixel for each 16x16 patch, and skip this postprocessing/mask_for_submission algorithm). But as I said, I'm not sure whether that's allowed (since we would not predict full-size images anymore), nor whether it improves the performance. Technically, we'd have fewer trainable parameters and fewer layers to train.
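For training such a model, the full-resolution groundtruth masks would need to be reduced to one value per 16x16 patch, roughly like this (sketch only, names are hypothetical):

```python
import numpy as np

def downscale_mask_to_patches(mask, patch_size=16):
    """Average-pool a full-resolution groundtruth mask into one value per 16x16 patch.

    A 608x608 mask becomes a 38x38 target; each value is the fraction of road
    pixels in that patch, which could be binarised or used as a soft target.
    """
    h, w = mask.shape
    patches = mask[:h - h % patch_size, :w - w % patch_size]
    patches = patches.reshape(h // patch_size, patch_size, w // patch_size, patch_size)
    return patches.mean(axis=(1, 3))
```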

Juan: I'm not sure we would get a meaningful prediction, but it won't hurt to try. But even assuming it works, we would be losing accuracy when downscaling the masks for training, right?