jonashein opened 5 years ago
Juan: Yes, you are totally right. I've already tested this, reducing the depth of the U-Net by 1, and the results were very close. We could also try cropping the training images to 384 (which would also work as data augmentation), so this way we would keep having depth 5.
I believe that some postprocessing might help. We can somehow exploit the geometry of roads by thickening/extending lines in the predicted mask
tl;dr: Looking at the predictions of the baseline-cnn, I also think that this is a good idea. The width of the predicted roads varies a lot, and the roads are often interrupted. There are many parallel streets and 90° intersections in the images which we could try to exploit.
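The thickening/extending idea could be sketched with standard morphological operations: closing to bridge small interruptions in predicted roads, then a slight dilation to thicken thin segments. This is just a minimal illustration (the function name and parameter values are made up, not from our codebase):

```python
import numpy as np
from scipy import ndimage

def postprocess_mask(prob_mask, threshold=0.5, close_size=5, dilate_size=2):
    """prob_mask: 2D float array of road probabilities in [0, 1].
    Returns a cleaned-up binary mask (uint8, values 0/1)."""
    binary = prob_mask >= threshold
    # closing = dilation followed by erosion: fills small gaps/interruptions
    binary = ndimage.binary_closing(binary, structure=np.ones((close_size, close_size)))
    # thicken the remaining road pixels slightly
    binary = ndimage.binary_dilation(binary, structure=np.ones((dilate_size, dilate_size)))
    return binary.astype(np.uint8)
```

The structuring-element sizes would need tuning on the validation set; too large a closing window would merge parallel roads.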
I'll attach one example below with the original image, the prediction, and a thresholded prediction (which I just created manually to check which roads are detected even with a very low probability). In this test image, the roads on the parking space are all detected, but the model assigned a very low probability to most of the parallel roads, which is why they are barely visible in the prediction. Only in the thresholded prediction (threshold=4, i.e. a probability of 4/256 ≈ 1.6%) are they clearly visible (of course the thresholded image is a bit over the top, it's just for illustration purposes). From this single example and a couple of other test images that I looked at in the last ~30min~ ~hour~ two hours, I'd say that the model definitely tends towards predicting the background class, i.e. we probably have more false negatives than false positives. I couldn't find a good example of false positives, but false negatives are in almost every image. However, I didn't work on this in the last weeks and I only had a look at a fraction of the test images (and none of the training/validation images), so take this with a grain of salt.
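For reference, the manual thresholded view described above can be reproduced in a couple of lines (my reconstruction; the threshold value of 4 on an 8-bit prediction image is taken from the comment above):

```python
import numpy as np

def threshold_view(pred_8bit, threshold=4):
    """pred_8bit: uint8 prediction image (0..255).
    Returns a binary visualization where every pixel with
    probability >= threshold/256 is painted white."""
    return np.where(pred_8bit >= threshold, 255, 0).astype(np.uint8)
```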
Of course we can create an ensemble out of multiple models.
(Images attached: original Test_25 image, baseline-CNN prediction, thresholded prediction.)
I see, then we could try lowering the value of foreground_threshold for a previous submission to get an idea of the impact. Have a look at these two sets of predictions: the left one has more background and is smoother than the one on the right. I didn't write down which models these outputs belong to, so I can't tell what produced the differences. It could be the depth of the network, the input size (128/256), the loss (dice or combined), or just the number of epochs.
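Just so we're on the same page about what foreground_threshold does: as far as I understand the submission pipeline, each 16x16 patch is labeled "road" when the mean predicted probability in the patch exceeds the threshold. A minimal sketch (the helper name and default value here are illustrative, our repo's may differ):

```python
import numpy as np

foreground_threshold = 0.25  # fraction of road probability needed to call a patch "road"

def patch_to_label(patch, threshold=foreground_threshold):
    """patch: 2D array of per-pixel road probabilities for one 16x16 patch.
    Returns 1 (road) or 0 (background)."""
    return 1 if patch.mean() > threshold else 0
```

Lowering the threshold would turn more borderline patches into roads, which is exactly the knob we'd want given the false-negative tendency discussed above.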
The predictions on the right look like black-and-white images, as they have only strong gradients and no gray pixels (i.e. ~50% probabilities). In comparison, the model on the left is less sure about what is part of a road and what isn't, but it seems to detect more roads than the one on the right, even if these roads have a relatively low probability. The one on the right looks like overfitting to me, what do you think?
Juan: I am not sure if there is overfitting or not. What really confuses me, and would explain why these apparently very different predictions get similar scores on Kaggle, is that in the end mask_to_submission rounds off the predictions. This could also explain why the validation score is much better, since it takes uncertainty into account. I have just tested submitting a prediction with probabilities and it doesn't throw any errors, but perhaps it just rounds off the predictions internally.
About the postprocessing: One idea that I have is to exploit the grid-like layout by looking at the major orientation of the gradients in the predicted images, as well as the gradients which are orthogonal to this orientation (i.e. via Hough transformation? idk, it's late). Both the major gradient orientation and the orthogonal gradient orientation are basically "voting" for the same grid orientation. If we find this dominant grid orientation, we could boost all gradients which are aligned with this grid. Of course we cannot suppress any gradients which do not align with this grid (since there might be diagonal roads), but this way we could probably improve the detection of (straight) parallel and orthogonal roads. (I hope it's clear what I mean 😄 )
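A rough sketch of the "dominant grid orientation" part of this idea using a straight-line Hough transform from skimage. The helper name and the way the orthogonal votes are combined are my own guesses at what we'd want, not a worked-out method:

```python
import numpy as np
from skimage.transform import hough_line

def dominant_grid_angle(binary_mask, n_angles=180):
    """Estimate the grid orientation (radians, in [-pi/2, pi/2)) of a
    binary road mask: for each angle, take the strongest single Hough
    line, then add the votes of the orthogonal angle, since parallel
    and perpendicular roads support the same grid."""
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_angles, endpoint=False)
    hspace, angles, _ = hough_line(binary_mask, theta=thetas)
    # strongest line per angle (summing all votes per angle would be
    # constant, since every set pixel votes once at every angle)
    votes = hspace.max(axis=0).astype(np.float64)
    # add the votes of the angle 90 degrees away (wrapping around)
    grid_votes = votes + np.roll(votes, n_angles // 2)
    return angles[int(np.argmax(grid_votes))]
```

Once we have this angle, "boosting" aligned gradients is the open part; maybe something like adding probability mass along detected Hough lines before thresholding.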
Juan: Never heard of the Hough transformation, but I think your idea is good 👍. Another possibility, probably infeasible, is to use another ML method to transform the predictions, for example style transfer. I would bet we would lose a lot of accuracy there, but it would be super cool if it works. We could also try to introduce a penalty/reward term in the loss, though maybe doing this pixelwise is not possible.
Btw, I forgot to comment on one of your comments, regarding my idea to reduce the output image size:
Juan: I think this way we would be missing a lot of information. My intuition is that the bigger the patches the better, cause there is more context for the network to extract patterns such as lines or circles.
I agree that we shouldn't reduce the input image size (since the context is very important), but we could still try to remove the deconvolutional layers and just output a 38x38 image (i.e. output one pixel for each 16x16 patch, and skip the postprocessing/mask_to_submission algorithm). But as I said, I'm not sure whether that's allowed (since we would no longer predict full-size images), nor whether it improves the performance. Technically, we'd have fewer trainable parameters and fewer layers to train.
Juan: I'm not sure we would get a meaningful prediction, but it won't hurt to try. But even assuming it works, we would be losing accuracy when downscaling the masks for training, right?
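On the downscaling question: if the network outputs one value per 16x16 patch, the ground-truth masks need the same reduction for training. A minimal numpy sketch of what I have in mind (function name and threshold are illustrative); the only information lost is sub-patch detail, which the patch-level metric discards anyway:

```python
import numpy as np

def mask_to_patch_labels(mask, patch_size=16, foreground_threshold=0.25):
    """mask: (H, W) binary ground-truth mask.
    Returns an (H // patch_size, W // patch_size) array of patch labels,
    where a patch is 'road' if more than foreground_threshold of its
    pixels are road."""
    h, w = mask.shape
    # crop to a multiple of patch_size, then view as a grid of patches
    cropped = mask[: h - h % patch_size, : w - w % patch_size]
    grid = cropped.reshape(
        cropped.shape[0] // patch_size, patch_size,
        cropped.shape[1] // patch_size, patch_size,
    )
    road_fraction = grid.mean(axis=(1, 3))
    return (road_fraction > foreground_threshold).astype(np.uint8)
```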
Hey guys, I want to use this issue to give you a short overview of the work I did so far on the baseline-cnn. After I found the tutorial + code (see the link in the README.md), my main goal was to get it running on our data set in order to get a first idea of how good it performs.
Since the model performed really well out of the box, we could use it as a foundation and try to improve it. While working on the code, I found several ways to improve the model. In this issue I want to document them so that we can work on them later. Feel free to open new, separate issues for some of the items; I'll just give an overview here:
Feel free to comment on the points or add more points. Oh, and we should also have a look at the literature.