ardaduz / cil-road-segmentation

Road Segmentation from Aerial Images - Computational Intelligence Lab, ETHZ, Spring 2019

Improvements for baseline-cnn #2

Open jonashein opened 5 years ago

jonashein commented 5 years ago

Hey guys, I want to use this issue to give you a short overview of the work I did so far on the baseline-cnn. After I found the tutorial + code (see the link in the README.md), my main goal was to get it running on our data set in order to get a first idea of how good it performs.

Since the model performed really well out of the box, we could use it as a foundation and try to improve it. While working on the code, I found several ways to improve the model. In this issue I want to document them so that we can work on them later. Feel free to open new, separate issues for some of the items; I'll just give an overview here:

Juan: the patch idea is good I believe, and we could resolve overlaps by averaging, or by an average weighted with the distance to the border of each patch.
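A minimal NumPy sketch of that blending idea (all names and shapes are placeholders, not code from our repo): each overlapping patch prediction is weighted by its distance to the patch border before averaging.

```python
import numpy as np

def blend_patches(patch_preds, positions, image_shape, patch_size):
    """Blend overlapping patch predictions into one full-size mask.

    Each patch pixel is weighted by its distance to the patch border, so pixels
    near the centre of a patch count more than pixels near its edge.
    (Hypothetical helper; argument names and shapes are assumptions.)
    """
    # Weight map: highest in the patch centre, falling off towards the border.
    ys, xs = np.mgrid[0:patch_size, 0:patch_size]
    dist_to_border = np.minimum.reduce([ys, xs, patch_size - 1 - ys, patch_size - 1 - xs])
    weights = dist_to_border.astype(np.float32) + 1.0  # avoid zero weight at the border

    acc = np.zeros(image_shape, dtype=np.float32)
    norm = np.zeros(image_shape, dtype=np.float32)
    for pred, (y, x) in zip(patch_preds, positions):
        acc[y:y + patch_size, x:x + patch_size] += pred * weights
        norm[y:y + patch_size, x:x + patch_size] += weights
    return acc / np.maximum(norm, 1e-8)
```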

Juan: probably irrelevant, but we could also try the cv2.resize function instead of TensorFlow's, you never know.
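In case we try it, a quick sketch (dummy arrays only, not our data pipeline; the interpolation choice might also matter):

```python
import cv2
import numpy as np

# Dummy inputs just for the sketch; in practice these come from the dataset.
image = np.random.rand(400, 400, 3).astype(np.float32)
mask = np.random.rand(400, 400).astype(np.float32)

# INTER_AREA tends to work well for downscaling, INTER_LINEAR/CUBIC for upscaling.
image_small = cv2.resize(image, (256, 256), interpolation=cv2.INTER_AREA)
mask_big = cv2.resize(mask, (608, 608), interpolation=cv2.INTER_LINEAR)
```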

Juan: I think this way we would be missing a lot of information. My intuition is that the bigger the patches the better, because there is more context for the network to extract patterns such as lines or circles.

Juan: I am not sure this would be legal. And also we can't train this parameter. We might try a couple of values for the final submission anyway.

Juan: I have tried training with RMSE as loss out of curiosity and it doesn't work. All the experiments I've tried so far also used dice loss or a combination with cross entropy, but I couldn't notice any difference between them.
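For reference, a soft dice loss and an equal-weight combination with binary cross-entropy could look roughly like this in TensorFlow (just a sketch; not necessarily the exact loss used in these experiments):

```python
import tensorflow as tf

def soft_dice_loss(y_true, y_pred, eps=1e-6):
    """Soft dice loss on per-pixel road probabilities."""
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + eps) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)
    return 1.0 - dice

def combined_loss(y_true, y_pred):
    """Equal-weight combination of binary cross-entropy and soft dice loss."""
    bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return tf.reduce_mean(bce) + soft_dice_loss(y_true, y_pred)
```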

Juan: Yes you are right, but perhaps it is not so useful if we can use Google's API data.

Feel free to comment on the points or add more points. Oh, and we should also have a look at the literature.

Juan: https://arxiv.org/pdf/1711.10684.pdf I found this paper (probably you did too) and tried their model on our data. In particular, I copied the model they had in a repo (which is different from the one in the paper, I also tried that one) and played with kernel sizes and depth. The performance wasn't better than our baseline, and in fact the model is very similar to what Jonas already implemented. I believe we are on the right track taking U-Net as the base model, but in case we get stuck we can use some of these other state-of-the-art methods as inspiration: https://medium.com/@arthur_ouaknine/review-of-deep-learning-algorithms-for-image-semantic-segmentation-509a600f7b57

jonashein commented 5 years ago

Juan: Yes, you are totally right. I've already tested this, reducing the depth of U-Net by 1, and the results were very close. We could also try cropping the training images to 384x384 (which would also work as data augmentation), so that we could keep depth 5.
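A possible random-crop helper (sketch only; 384 is divisible by 2^5, so a U-Net with five pooling stages still fits):

```python
import numpy as np

def random_crop(image, mask, crop_size=384):
    """Randomly crop the same window from an image and its groundtruth mask.

    384 is divisible by 2**5, so five pooling steps still produce integer sizes.
    """
    h, w = image.shape[:2]
    y = np.random.randint(0, h - crop_size + 1)
    x = np.random.randint(0, w - crop_size + 1)
    return (image[y:y + crop_size, x:x + crop_size],
            mask[y:y + crop_size, x:x + crop_size])
```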

juanlofer commented 5 years ago
jonashein commented 5 years ago

I believe that some postprocessing might help. We can somehow exploit the geometry of roads by thickening/extending lines in the predicted mask.

tl;dr: Looking at the predictions of the baseline-cnn, I also think that this is a good idea. The width of the predicted roads varies a lot and the roads are often interrupted. There are many parallel streets and 90° intersections in the images which we could try to exploit.

I'll attach one example below with the original image, the prediction, and a thresholded prediction (which I just created manually to check which roads are detected even with a very low probability). In this test image, the roads on the parking space are all detected, but the model assigned a very low probability to most of the parallel roads, which is why they are barely visible in the prediction. Only in the thresholded prediction (threshold=4, i.e. a probability of 4/256 ≈ 1.6%) are they clearly visible (of course the thresholded image is a bit over the top, it's just for illustration purposes). From this single example and a couple of other test images that I looked at in the last ~~30min~~ ~~hour~~ two hours, I'd say that the model is definitely tending towards predicting the background class, i.e. we probably have more false negatives than false positives. I couldn't find a good example of false positives, but false negatives are in almost every image. However, I didn't work on this in the last weeks and I only had a look at a fraction of the test images (and none of the training/validation images), so take this with a grain of salt.

Of course we can create an ensemble out of multiple models.

Original Test_25 image: test_25_image
Baseline-CNN prediction: test_25_prediction
Thresholded prediction: test_25_threshold4
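For reference, the thresholded prediction above can be reproduced with something like this (filenames are just the attachment names above; cv2.threshold with thresh=3 marks every pixel with value >= 4 as road):

```python
import cv2

# Load the 8-bit prediction and keep every pixel with probability >= 4/256 (~1.6 %).
pred = cv2.imread("test_25_prediction.png", cv2.IMREAD_GRAYSCALE)
_, thresholded = cv2.threshold(pred, 3, 255, cv2.THRESH_BINARY)
cv2.imwrite("test_25_threshold4.png", thresholded)
```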

juanlofer commented 5 years ago

I see, then we could try lowering the value of foreground_threshold for a preliminary submission to get an idea of the impact of this. Have a look at these two sets of predictions: the left one has more background and is smoother than the one on the right. I didn't write down which models these outputs belong to, so I can't tell what produced the differences. It could be the depth of the network, the input size (128/256), the loss (dice or combined), or just the number of epochs.

image
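For context, the per-patch labelling in mask_to_submission works roughly like this (a sketch based on the usual CIL helper; the exact threshold value in our repo may differ), which is why lowering foreground_threshold would mark more patches as road:

```python
import numpy as np

foreground_threshold = 0.25  # fraction of road pixels needed to call a 16x16 patch "road" (assumed default)

def patch_to_label(patch):
    """Label a 16x16 patch as road (1) if its mean probability exceeds the threshold."""
    return 1 if np.mean(patch) > foreground_threshold else 0

def mask_to_patch_labels(mask, patch_size=16):
    """Convert a full-size probability mask into per-patch labels, as the submission format expects."""
    h, w = mask.shape
    return np.array([[patch_to_label(mask[y:y + patch_size, x:x + patch_size])
                      for x in range(0, w, patch_size)]
                     for y in range(0, h, patch_size)])
```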

jonashein commented 5 years ago

The predictions on the right look like black-and-white images, as they have only strong gradients and no gray pixels (i.e. ~50% probabilities). In comparison, the model on the left is less sure about what is part of a road and what isn't, but it seems to detect more roads than the one on the right, even if these roads have a relatively low probability. The one on the right looks like overfitting to me, what do you think?

Juan: I am not sure if there is overfitting or not. What really confuses me, and what would explain why these apparently very different predictions get similar scores on Kaggle, is that in the end mask_to_submission rounds off the predictions. This could also explain why the validation score is much better, since it takes uncertainty into account. I have just tested submitting a prediction with probabilities and it doesn't throw any errors, but perhaps it just rounds off the predictions internally.

About the postprocessing: One idea that I have is to exploit the grid-like layout by looking at the major orientation of the gradients in the predicted images, as well as the gradients which are orthogonal to this orientation (i.e. via a Hough transformation? idk, it's late). Both the major gradient orientation and the orthogonal gradient orientation are basically "voting" for the same grid orientation. If we find this dominant grid orientation, we could boost all gradients which are aligned with this grid. Of course we cannot suppress any gradients which do not align with this grid (since there might be diagonal roads), but this way we could probably improve the detection of (straight) parallel and orthogonal roads. (I hope it's clear what I mean 😄 )
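A rough sketch of that idea with OpenCV (all parameter values are guesses; folding angles modulo 90° makes parallel and orthogonal roads vote for the same grid orientation):

```python
import cv2
import numpy as np

def dominant_grid_orientation(prediction, prob_threshold=128):
    """Estimate the dominant road orientation (in degrees) from a predicted mask.

    Lines found by the Hough transform vote for their angle modulo 90 degrees,
    so parallel and orthogonal roads support the same grid orientation.
    """
    binary = (prediction >= prob_threshold).astype(np.uint8) * 255
    edges = cv2.Canny(binary, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 80)
    if lines is None:
        return None
    angles = np.degrees(lines[:, 0, 1]) % 90.0  # fold parallel/orthogonal lines together
    hist, bin_edges = np.histogram(angles, bins=90, range=(0.0, 90.0))
    return bin_edges[np.argmax(hist)]
```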

Juan: Never heard of the Hough transformation, but I think your idea is good 👍. Another possibility, probably infeasible, is to use another ML method to transform predictions, for example style transfer. I would bet we would lose a lot of accuracy there, but it would be super cool if it works. We could also try to introduce a penalty/reward term in the loss, though maybe doing this pixelwise is not possible.

Btw, I forgot to comment on one of your comments, regarding my idea to reduce the output image size:

Juan: I think this way we would be missing a lot of information. My intuition is that the bigger the patches the better, because there is more context for the network to extract patterns such as lines or circles.

I agree that we shouldn't reduce the input image size (since the context is very important), but we could still try to remove the deconvolutional layers and just output a 38x38 image (i.e. output one pixel for each 16x16 patch, and skip this postprocessing/mask_for_submission algorithm). But as I said, I'm not sure whether that's allowed (since we would not predict full-size images anymore), nor whether it improves the performance. Technically, we'd have fewer trainable parameters and fewer layers to train.
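For training such a model, the full-resolution groundtruth masks would need to be reduced to one value per 16x16 patch, roughly like this (sketch only, names are hypothetical):

```python
import numpy as np

def downscale_mask_to_patches(mask, patch_size=16):
    """Average-pool a full-resolution groundtruth mask into one value per 16x16 patch.

    A 608x608 mask becomes a 38x38 target; each value is the fraction of road
    pixels in that patch, which could be binarised or used as a soft target.
    """
    h, w = mask.shape
    patches = mask[:h - h % patch_size, :w - w % patch_size]
    patches = patches.reshape(h // patch_size, patch_size, w // patch_size, patch_size)
    return patches.mean(axis=(1, 3))
```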

Juan: I'm not sure we would get a meaningful prediction, but it won't hurt to try. But even assuming it works, we would be losing accuracy when downscaling the masks for training, right?