That's correct.
Transfer learning is common practice and there are much better analyses out there than anything I might be able to blurt out here. ;) Here goes, though.
The reason it is preferred is that image classification is an easier task than semantic segmentation, and the first layers of a network store very basic information that is shared across most vision tasks; instead of trying to learn a complex task right away, it often helps to solve a simple one first. Since classification is an established, easily learnable task and ImageNet is the ubiquitous dataset for it, learning a representation with the two is almost a no-brainer: the procedure is far more standardized than for segmentation, and the auxiliary classification loss gives you a second way to check the performance of your network. Once that auxiliary loss converges, you can fine-tune the learned weights on another task; in this case, semantic segmentation.
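To make that concrete, here is a minimal, hypothetical PyTorch-style sketch of the two-stage recipe. The `Encoder`/`Decoder` modules, channel counts, and class counts are placeholders, not this repository's actual code:

```python
# Hypothetical two-stage recipe: pretrain the encoder on classification,
# then fine-tune encoder + decoder on segmentation. All modules below are
# stand-ins, not this repository's actual ENet implementation.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stand-in for the ENet encoder (downsamples the input by 4x)."""
    def __init__(self, channels=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.features(x)

class Decoder(nn.Module):
    """Stand-in for the ENet decoder (upsamples back to full resolution)."""
    def __init__(self, channels=64, num_classes=19):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.upsample(x)

# Stage 1: classification pretraining, encoder only (dummy data stands in for ImageNet).
encoder = Encoder()
clf_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1000))
opt = torch.optim.Adam(list(encoder.parameters()) + list(clf_head.parameters()))
images, labels = torch.randn(4, 3, 224, 224), torch.randint(0, 1000, (4,))
loss = nn.functional.cross_entropy(clf_head(encoder(images)), labels)
loss.backward()
opt.step()
opt.zero_grad()

# Stage 2: segmentation fine-tuning, encoder + decoder, starting from the
# encoder weights that stage 1 converged to (dummy data stands in for Cityscapes).
decoder = Decoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
images, masks = torch.randn(4, 3, 224, 224), torch.randint(0, 19, (4, 224, 224))
loss = nn.functional.cross_entropy(decoder(encoder(images)), masks)
loss.backward()
opt.step()
opt.zero_grad()
```

The point is simply that stage 2 starts from weights the classification stage has already shaped, rather than from a random initialization.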
I decided not to train on ImageNet, but this is definitely something that could have been useful. The last time I ran an evaluation script on ENet, I got an AP@0.5 of less than 10%, which led me to believe there is a problem with my implementation. Since then, I have been busy with the preprocessing pipeline to make sure nothing is wrong there, and that has taken a large chunk of my time on ENet (I'm still not satisfied with the result). By the way, there are some modifications in the code that I am going through (I'm fixing the evaluation scripts), so expect some more commits by the end of the week and, if all goes well, a pretrained model.
As the next step, I should probably add support for the datasets from the paper (Cityscapes and CamVid) and run ENet on those to make sure I've not made some horrible, stupid mistake.
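For reference, the usual sanity check on Cityscapes/CamVid is per-class IoU computed from a confusion matrix. A minimal sketch (just an illustration, not the evaluation script being reworked in this repo) could look like this:

```python
# Hypothetical mean-IoU sanity check from a confusion matrix; not this
# repository's evaluation script.
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix from label maps."""
    valid = (target >= 0) & (target < num_classes)      # skip void/ignore labels
    idx = num_classes * target[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes that appear."""
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return np.nanmean(iou)

# Dummy usage with random label maps and the 19 Cityscapes training classes.
pred = np.random.randint(0, 19, size=(512, 1024))
gt = np.random.randint(0, 19, size=(512, 1024))
print(mean_iou(confusion_matrix(pred, gt, num_classes=19)))
```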
The current codebase uses the same weights that the original authors used for their implementation, so the question of end-to-end training is, for practical purposes, now more or less irrelevant; therefore I'm closing the issue. However, feel free to reopen it if you think that's necessary and/or to ask more questions.
The model that I'm using right now has been pretrained on ImageNet and fine-tuned on Cityscapes.
Technically, I could add the option to pretrain the network on classification (using only the encoder), but that's a really low-priority feature for me right now.
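If that option ever lands, one way it could be exposed (purely a sketch with invented names, in the same placeholder style as the earlier two-stage snippet) is a single flag that swaps the head attached to the shared encoder:

```python
# Hypothetical interface for an encoder-only classification pretraining
# option; the function name and arguments are invented, not the repo's API.
import torch.nn as nn

def build_model(encoder: nn.Module, *, pretrain_classification: bool,
                num_classes: int, feat_channels: int) -> nn.Module:
    """Attach either a classification head (encoder-only pretraining)
    or a segmentation head (fine-tuning) to the shared encoder."""
    if pretrain_classification:
        head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                             nn.Linear(feat_channels, num_classes))
    else:
        # Placeholder decoder: a single 4x upsampling layer instead of the real one.
        head = nn.ConvTranspose2d(feat_channels, num_classes, kernel_size=4, stride=4)
    return nn.Sequential(encoder, head)

# e.g. build_model(Encoder(), pretrain_classification=True,
#                  num_classes=1000, feat_channels=64)
```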
I see you are training the model end-to-end, while in the original paper they first train the encoder on its own to categorize downsampled regions and then append the decoder. What are your thoughts on this? Do you have any intuition why it might be better to train it encoder-then-decoder style rather than end-to-end?