I'm interested in using dlib for semantic segmentation. I think the only necessary features would be:

- a new loss layer
- some kind of upsampling layer

Is this something you're interested in supporting? I'm planning to add this functionality regardless, just wanted to check in. Regarding implementation, is the upsampling something you would want added to the pooling class, or as its own upsample class?
Those are definitely things we want in dlib. As you say, you just need a new loss layer and some kind of upsampling layer. Definitely create new layer classes rather than adding more stuff into existing classes.
You mean you want to implement the things @davidmascharka asked about specifically? Or something else?
I would start by implementing an upsampling layer.
Should loss_mean_squared_multioutput basically work for binary semantic segmentation tasks? I do understand that it wouldn't be an ideal choice for classification – but other than that?
I too really want to do semantic segmentation using dlib. In fact, I am almost desperate: I have already created a network that can classify individual image patches of size 28x28 or so. I can then convert any image to a list of patches and classify each patch. This more or less gives the desired result, but of course it is terribly slow. But as I said, I am desperate...
Any improvement over the above approach would be great for me. I would be willing to contribute too, but I'm not quite at a level where I can just start typing. If the stars were aligned, however, I might be able to take loss_multiclass_log and loss_mean_squared_multioutput and combine the two into a loss_multiclass_log_multioutput (or something like that), if that makes sense? (I am looking at the loss layer mainly because it looks like there's already some significant progress on #336.)
You could do that. But I would code up some loss from a recent paper that got good results. I doubt MSE is what gives the best results in recent papers.
I've had good results in segmentation from the mean of a pixelwise multiclass log loss. That might be a way to start – adapting loss_multiclass_log to take the mean over pixels.
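To make that concrete, here's a minimal standalone sketch of the computation I mean. This is not dlib's loss-layer interface (a real loss layer also has to write gradients back into the network and handle batches); the memory layout and function name are just assumptions for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean over all pixels of the multiclass log loss (softmax cross-entropy).
// The network is assumed to output one score per class per pixel, stored as
// scores[k*rows*cols + r*cols + c] for class k at pixel (r,c).
double mean_pixelwise_log_loss(
    const std::vector<float>& scores,        // num_classes*rows*cols scores
    const std::vector<unsigned long>& truth, // rows*cols ground-truth labels
    std::size_t num_classes, std::size_t rows, std::size_t cols)
{
    double loss = 0;
    for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
    {
        // -log softmax(label) = log(sum_k exp(s_k)) - s_label, computed with
        // the usual max-subtraction for numerical stability.
        double max_score = scores[r*cols + c];
        for (std::size_t k = 1; k < num_classes; ++k)
            max_score = std::max<double>(max_score, scores[k*rows*cols + r*cols + c]);
        double sum = 0;
        for (std::size_t k = 0; k < num_classes; ++k)
            sum += std::exp(scores[k*rows*cols + r*cols + c] - max_score);
        const double s_label = scores[truth[r*cols + c]*rows*cols + r*cols + c];
        loss += max_score + std::log(sum) - s_label;
    }
    return loss / (rows*cols); // the mean over pixels suggested above
}
```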
Thanks for the feedback. I already started to look at implementing a loss_multiclass_log_multioutput – let's see if that will lead to anything.
There is now a cont_ layer in dlib (from this PR https://github.com/davisking/dlib/pull/476). It implements the "deconvolution" operation you need for the sort of upsampling these tasks require.
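For illustration, a minimal sketch of how it could be used. The filter counts and kernel sizes are arbitrary placeholders, not a recommended architecture:

```cpp
#include <dlib/dnn.h>
using namespace dlib;

// con<...,2,2,...> halves the spatial resolution; the matching
// cont<...,2,2,...> (transposed / "deconvolution" layer) doubles it again.
template <typename SUBNET> using downsample2x = relu<con<32,3,3,2,2,SUBNET>>;
template <typename SUBNET> using upsample2x   = relu<cont<32,3,3,2,2,SUBNET>>;

// Stacking one of each gives an output map back at (roughly) the input
// resolution, which is what a per-pixel loss would eventually consume.
using toy_stack = upsample2x<downsample2x<input<matrix<rgb_pixel>>>>;
```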
In my humble opinion, PixelNet (http://www.cs.cmu.edu/~aayushb/pixelNet/) seems like a far better approach to semantic segmentation, simply because fully convolutional networks are very memory-intensive to train. PixelNet gets around this by having a plain MLP part at the end and randomly sampling pixel locations.
I have already made some headway toward getting the required hypercolumn layer into dlib, which is not as easy as it seems.
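For anyone unfamiliar with the idea, this is roughly what a hypercolumn amounts to, as a standalone sketch independent of dlib's layer interface (the struct and function are purely illustrative):

```cpp
#include <cstddef>
#include <vector>

// One convolutional feature map, channel-major: data[(k*rows + r)*cols + c].
struct feature_map
{
    std::size_t channels, rows, cols;
    std::vector<float> data;
};

// For a single sampled pixel of the input image, gather the feature vectors
// from several (coarser) feature maps and concatenate them. The concatenated
// "hypercolumn" then goes through an ordinary MLP, so only the sampled pixels
// need to flow through the dense part of the network.
std::vector<float> hypercolumn(
    const std::vector<feature_map>& maps,
    std::size_t img_rows, std::size_t img_cols,
    std::size_t r, std::size_t c) // pixel location in the input image
{
    std::vector<float> column;
    for (const auto& m : maps)
    {
        // Nearest-neighbor mapping into this feature map; PixelNet actually
        // interpolates bilinearly, which would refine this.
        const std::size_t mr = r*m.rows/img_rows;
        const std::size_t mc = c*m.cols/img_cols;
        for (std::size_t k = 0; k < m.channels; ++k)
            column.push_back(m.data[(k*m.rows + mr)*m.cols + mc]);
    }
    return column;
}
```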
With #476 and #540 now merged, wouldn't it be great to have a complete, self-contained example program that solves some semantic segmentation problem?
You read my mind. Do you have one ready you want to contribute? :)
No – unfortunately I have nothing ready. I haven't even tried the upsampling layer yet. That said, I'm willing to contribute, but it may take a while before I get around to it. So if anyone is able to put something together sooner, that would be great.
I'm also interested in this, and would like to help, but might be more willing than able in the very near term -- I have a few things to catch up on. I really like the C++ framework, btw 😄. I will share any updates as I have them, and can help with testing any recent smaller network. I'm currently looking at this SqueezeNet-based model, which seems to strike a good balance of size and performance, and is at least somewhat mobile-friendly. The reference network seems to be worked out here: https://github.com/antikantian/facedet-squeezenet/blob/master/squeezenet_train.cpp, but there are a number of pieces missing (branching + refinement modules, etc.). Maybe a more standard hourglass network, for which the pieces are more or less available, would make a better initial example/tutorial.
Sounds good. Any example program that shows an easy to follow but interesting way of doing segmentation would be great. It's on my short list of things to do as well. I'm keeping my GPUs busy on object detection experiments right now so I'm not going to do it immediately. I'm also probably going to add some densenet layers first as well.
@headupinclouds
Looks interesting. As far as I can see you need an ELU layer (added to cuDNN 6) and dilated convolution. cuDNN already has a dilation option for convolution, I think. It's a shame they don't mention how easy it is to train from scratch (it is not always easy to get your hands on the full ImageNet dataset to pre-train your network), or why there are four fire modules at the end of the encoder.
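For reference, the ELU itself is simple to state. A minimal CPU sketch of the forward pass and its derivative (alpha is usually 1):

```cpp
#include <cmath>

// elu(x) = x for x > 0, and alpha*(exp(x)-1) otherwise.
inline float elu(float x, float alpha = 1.0f)
{
    return x > 0 ? x : alpha*(std::exp(x) - 1);
}

// The gradient can reuse the forward output: for x <= 0,
// d/dx elu(x) = alpha*exp(x) = elu(x) + alpha.
inline float elu_gradient(float x, float elu_of_x, float alpha = 1.0f)
{
    return x > 0 ? 1.0f : elu_of_x + alpha;
}
```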
@OranjeeGeneral : Thanks for the pointers.
@headupinclouds
I am keen to try this type of segmentation network out as an alternative to what I am currently trying. SegNet feels way too wasteful and I don't like the structure of ENet as a thinner alternative. This one looks more elegant. So I am keen to know how you're getting on. Do you have a good segmentation training set?
I will also be looking into adding those layers to my fork. ELU should be easy; dilated convolution is a bit trickier, as the current conv layer would need to be extended.
I started to add an example where a segmentation net is trained on the PASCAL VOC 2012 dataset.
If you are interested, see the currently latest commit in this branch. There are instructions on where to download the data and how to run the program(s). It's not yet ready for a PR to master, but maybe you'll nevertheless want to use it instead of starting from scratch. At least the code for reading in the dataset might be useful. On the other hand, the net structure surely should be improved – I admit I was rather clueless when writing it. Also, I am a little constrained in terms of GPU resources, so I haven't really been able to train it for very long yet, so it is quite likely that the results really are not great at all. But even so, it would be great to have a more or less unified test setup where different network structures could be compared relatively easily. I'm not saying my branch has to be the foundation for such a test setup, but it's available in case anyone wants to try it out and maybe improve it further. (Although I'll be testing and hopefully improving it too.)
@OranjeeGeneral
I am keen to try this type of segmentation network out as an alternative to what I am currently trying.
Great. Maybe this is a good candidate for the example section. Let's try it.
Do you have a good segmentation training set?
I prefer a simple single class problem for a core example. I think something similar to this portrait segmentation application would be of general interest in a future release. (There are links to training data.) The PASCAL VOC 2012 sample @reunanen posted above seems to be the standard for this domain, and it would also be a nice addition. I will most likely grab that code as a starting point (minus VOC specific stuff).
This is a pretty casual project for me. I have no pressing need for the functionality, so I'll be spending time on weekends and such. Please don't block on my account. I scanned some of the current layers and sketched out the CPU ELU layer. The GPU part should be fairly easy based on your pointers above. I can probably push that this weekend. I haven't looked at dilated convolution at all. I'll share any updates as I have them.
Definitely, we want any example program to be runnable in a reasonable amount of time and not require downloading large datasets. So smaller examples are good.
My issue with PASCAL VOC 2012 as a dataset is that I am not sure it is actually big enough to train a full network from scratch and get any meaningful/useful results. But I guess one has to try first. Thanks for the reading code, that sure comes in handy (@reunanen)
Well, it doesn't really matter if the model generalizes. You should think of the example like an essay that explains something. It has an introduction, a middle part, and some conclusion. It's also executable, so users can start with a program that works and edit it from there. It also illustrates the code patterns and the general way things are designed to work together, which might not be obvious from reading the API reference documentation for each component separately.
The point of the examples isn't to be a useful tool, but rather an educational document. Part of being useful as an educational document is that someone can run them in a reasonable amount of time. If it takes a really long time to train then that's not as good as an example that runs in a few minutes. To that point, you could make a segmentation example that trained on one image and badly overfit to it. That's fine, so long as the example explains this and the general issues around semantic segmentation in a clear way.
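To sketch what I mean, something like the following shape of program would do. The network is a meaningless toy, and since dlib has no pixelwise loss yet, plain loss_multiclass_log with one label per image stands in for it; the point is only that the example trains on a tiny, repeated dataset and overfits quickly.

```cpp
#include <dlib/dnn.h>
#include <vector>
using namespace dlib;

// A toy network: two fully connected layers on top of the raw image.
using net_type = loss_multiclass_log<fc<2,relu<fc<10,input<matrix<rgb_pixel>>>>>>;

int main()
{
    // A single synthetic "training image", replicated many times. The network
    // will memorize it almost immediately -- which is fine for an example, as
    // long as the text explains that this is overfitting by design.
    matrix<rgb_pixel> img(32,32);
    for (long r = 0; r < img.nr(); ++r)
        for (long c = 0; c < img.nc(); ++c)
            img(r,c) = rgb_pixel(128,128,128);

    std::vector<matrix<rgb_pixel>> images(100, img);
    std::vector<unsigned long> labels(100, 1);

    net_type net;
    dnn_trainer<net_type> trainer(net);
    trainer.set_learning_rate(0.01);
    trainer.set_min_learning_rate(1e-4);
    trainer.be_verbose();
    trainer.train(images, labels); // runs until the learning rate has decayed
}
```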
@headupinclouds
Sorry, I had been offline for quite a while due to some personal issues, but I am now back and following up on this.
I'm not sure if you've made any progress in the meantime, but I've got an implementation of an ELU layer using either cuDNN >= 6 or, if that's not available, my own CUDA kernel. I'm also nearly there with dilated convolution, which will only work with cuDNN >= 6; this one is a bit trickier, as I think it introduces a few new constraints on the convolution. But dilated convolution is used in so many of the segmentation networks I've come across that I think it's worth the pain.
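For anyone following along, part of the trick is just size arithmetic: a dilated kernel behaves like a larger kernel with effective size dilation*(k-1)+1, so the usual output-size formula changes accordingly. A small sketch (the function name is illustrative; as far as I know this matches how cuDNN 6 computes the output dimensions):

```cpp
#include <cstddef>

// Output size of a dilated convolution along one dimension. The effective
// kernel size is dilation*(k-1)+1, so
//   out = (in + 2*pad - (dilation*(k-1)+1))/stride + 1
// (requires in + 2*pad >= dilation*(k-1)+1).
std::size_t dilated_conv_output_size(
    std::size_t in, std::size_t k, std::size_t stride,
    std::size_t pad, std::size_t dilation)
{
    const std::size_t effective_k = dilation*(k - 1) + 1;
    return (in + 2*pad - effective_k)/stride + 1;
}

// e.g. a 3x3 kernel with dilation 2 covers a 5x5 window, so
// dilated_conv_output_size(64, 3, 1, 0, 2) == 60 instead of 62.
```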
@OranjeeGeneral : That sounds promising. No progress here. Haven't managed to fit it in.
Well, I updated my fork with it. It contains the new ELU layer and the updated convolution layer. The dilated convolution only works with cuDNN 6 for now. I'm having a bit of trouble getting a CPU version working; I am not sure if I'd have to rewrite larger bits underneath, which I am a bit reluctant to do.
Next I will try to build a combined example from the SqueezeNet code and reunanen's VOC2012 segmentation example code.
@headupinclouds
Hmm, I now have a SqueezeNet-SegNet setup from that paper kind of running, but something odd is happening: using the ELU layer seems to cause the gradients to go to NaN after the first training batch, so I decided to go back to the original SqueezeNet v1.1 implementation, which uses ReLU.
I think it might also be a good idea to add batch normalization to all the convolution layers.
Well, after 4 days of training the network (I think I was a bit too conservative with the training parameters) on the combined train+eval set of PASCAL VOC 2012, I have to say the results are not a total disaster – you can actually see some decent labeling. But there is some trouble with the SqueezeSeg network structure and my implementation in dlib. I'm not sure how to solve it: the network uses inception layers to mix upscaled layers with earlier intermediate conv layers, but the tensor sizes don't always match for arbitrary input image dimensions, simply due to different rounding/integer math, which then causes assert failures in the inception layers / copy tensor.
Well after 4 days of training the network
Sounds painful. Still, this is encouraging. Thanks for moving it forward. I'd be interested in trying it this weekend on a small single-class problem (now that you've done all the heavy lifting). Do you have a pointer you can share?
the tensor sizes don't always match for arbitrary input image dimensions simply due to different rounding/integer math which then causes assert failures in the inception layers / copy tensor
If the supported sizes are documented, maybe in combination with dlib asserts, that would seem to be a reasonable limitation.
Well, as I said, I think my learning parameters are probably a bit too conservative, but it is known that segmentation networks take a lot of training time, especially when they are learned end-to-end from scratch. Anyway, I will push my test program to my clone in the next few days so you can have a look at it if you like. But beware: it isn't the prettiest or cleanest example – I just hacked it together to see if it works at all. And lots of credit to reunanen as well; I took his VOC2012 reader code.
the tensor sizes don't always match for arbitrary input image dimensions simply due to different rounding/integer math which then causes assert failures in the inception layers / copy tensor
If the supported sizes are documented, maybe in combination with dlib asserts, that would seem to be a reasonable limitation.
I am not sure about that. I think the inception layer, or in this case the copy tensor code, might have to be loosened up a bit here to accept tensors that are slightly different in shape, and then copy the smaller dimensions and leave the rest blank. When you have encoder/decoder networks that share layers between the encoder and decoder parts, you often end up with this kind of situation.
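Roughly something like this, as a sketch of the loosened copy (plain vectors stand in for dlib::tensor; purely illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Copy only the overlapping region of two 2-D maps whose shapes may differ
// slightly, leaving the rest of the destination untouched, instead of
// asserting that the shapes match exactly.
void copy_overlap(
    const std::vector<float>& src, std::size_t src_rows, std::size_t src_cols,
    std::vector<float>& dst, std::size_t dst_rows, std::size_t dst_cols)
{
    const std::size_t rows = std::min(src_rows, dst_rows);
    const std::size_t cols = std::min(src_cols, dst_cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[r*dst_cols + c] = src[r*src_cols + c];
}
```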
Hi everyone. Could someone show the simplest working example of segmentation? I built reunanen's example dnn_semantic_segmentation_train_ex.cpp (Windows 10, CUDA) and it fails during training at the 2nd or 3rd step; only a miracle could make it run correctly. Has anyone been able to get it running?
@vrodionovpro
You can check out my branch: I've got a "working" example of SqueezeSegNet. It currently trains on the VOC2012 dataset (be warned: it trains for days), but unfortunately that dataset is way too small. I found a better dataset in the meantime (it has 17k labeled images) and will probably rewrite the example for this new dataset in a couple of days; that should produce a far better example, and I am going to tweak some parameters as well.
It's all cool, but it doesn't work. It takes too long to compile (all night, and it cannot finish). Intel Xeon 2 GHz, 24 GB RAM, Windows 10, CUDA 8. And if I try to simplify the network, it fails in train_step like the other segmentation examples.
You must be using Visual Studio 2017. Visual Studio 2017 has a bug in it that causes the compiler to hang forever when it hits certain C++11 code. This is a bug in Visual Studio. You should use Visual Studio 2015, which, surprisingly, has better C++11 support and won't hang.
Unfortunately, I did use VS2015 Update 3 and its C++11 compiler. The network template is just too hard for the compiler. Davis, are you planning to include a simple working example of segmentation in the next version of dlib?
@vrodionovpro Not sure what exactly you are trying to do. But this branch (whose segmentation network structure I just simplified a little) at least should compile in VC++ 2015 (using MSBuild; the 32-bit compiler invoked by the IDE most probably can't allocate enough memory) and do some segmentation pretty much out of the box (although you'll need to load the Pascal VOC 2012 image data set – short instructions are included).
@davisking Should I make a PR of this? It includes at least the VOC 2012 data parser, and some network architecture for doing semantic segmentation. I'm not saying the architecture is great (in particular I'm a little lost regarding whether the residual upsampling makes any sense – I really just copied it from the downsampling counterpart, without giving it almost any thought), but at least it does something relatively meaningful, and could therefore perhaps serve as a baseline for any future segmentation improvements that different people will hopefully come up with.
But this branch (whose segmentation network structure I just simplified a little) at least should compile in VC++ 2015
It compiled fine on one machine, but appeared to take very long on another (both Visual Studio 2015). So I created this even more simplified version.
@vrodionovpro, feel free to try also this new branch.
Maybe the time it worked you were using the older version of visual studio 2015 (See http://dlib.net/faq.html#WhycantIusetheDNNmodulewithVisualStudio)
Sure, you can make a PR if you want. But if you do, you need to turn the example into something that is more like an essay than a program. A good example program has an introduction, then some walkthrough that explains a concept. The reader should be able to read it from top to bottom and get a coherent narrative that explains whatever the example is supposed to illustrate. That means, in particular, that a good example has comments with full sentences, and paragraphs when appropriate :)
Maybe the time it worked you were using the older version of visual studio 2015
Actually the versions are identical (14.0.25431.01 Update 3). But maybe I just wasn't patient enough on the other machine. I guess the compiler just ran more slowly for some reason (swapping to disk, or something – the slower machine does have less RAM).
Sure, you can make a PR if you want. But if you do you need to turn the example into something that is more like an essay than a program.
Got it, thanks.
Oops, I meant to say, maybe the time it didn't work you were using an older version.
But in any case, VC2015 is also flaky with C++11; it's just less flaky than VC2017. Also, for a non-broken compiler like clang or gcc the compile times are negligible. I can't recommend using another compiler highly enough.
Yeah, sure. But if you're stuck on Windows (for whatever reason), then there's no choice, right?
Yeah, I know you are stuck. I was saying that more for other readers :)
Yep, just wanted to confirm that it is still the case that there's no choice. Would happily try something else. :)
Yeah, I think if you want to use CUDA on windows you have to use Visual Studio unfortunately.
Btw, for other readers who may be stuck on Windows: it helps a lot to create compilation firewalls. Put all the network definitions in a file that your other application logic doesn't see. That way, you can at least change most of your software without having to wait for hours for the compiler to finish. (Unless you want to, of course.)
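For illustration, the pattern looks something like this. The file names, the function, and the toy network type are all hypothetical; the point is only that nothing outside segmenter.cpp ever includes <dlib/dnn.h>:

```cpp
// ---- segmenter.h : what the rest of the program includes ----------------
#pragma once
#include <cstdint>
#include <dlib/matrix.h>
#include <dlib/pixel.h>

// Only plain types cross this boundary, so including this header is cheap.
dlib::matrix<uint16_t> segment_image(const dlib::matrix<dlib::rgb_pixel>& img);

// ---- segmenter.cpp : the only translation unit that sees the net type ---
#include "segmenter.h"
#include <dlib/dnn.h>
using namespace dlib;

// Toy stand-in for the real, slow-to-compile network definition. Changing it
// recompiles only this one file.
using net_type = loss_multiclass_log<fc<2,relu<con<8,5,5,2,2,input<matrix<rgb_pixel>>>>>>;

dlib::matrix<uint16_t> segment_image(const dlib::matrix<dlib::rgb_pixel>& img)
{
    static net_type net; // in a real program: deserialize("net.dat") >> net;
    const uint16_t label = static_cast<uint16_t>(net(img));
    // Toy behavior: one label for the whole image. A real per-pixel network
    // would fill the matrix from the network's output tensor instead.
    return uniform_matrix<uint16_t>(img.nr(), img.nc(), label);
}
```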
Wow, do you really wait for hours?
That's awful. I literally recompile these things over and over. I don't even think about it because GCC isn't slow. I get annoyed if the compile takes more than like 20 seconds.
Well, maybe not hours. Usually something like half an hour to one hour, perhaps.
It is annoying, but you learn to plan your work accordingly. If you really want to try different network architectures, then the actual training time still dominates (usually) – so you leave it running overnight anyway, or something like that. OTOH, if you develop other parts of your program, or even change some of the NN hyperparameters (such as the learning rate), then the compilation firewall means that the test cycle is fast (or at least not slowed down by this problem).
Moreover, at least for me, the compiler quickly reports if there's a syntax error (like not enough or too many >s). So if the compiler doesn't report an error within a few seconds, I can pretty safely leave it compiling – once it's done, a script will start the training, which (usually) takes hours anyway.
Not that I wouldn't want it to compile more quickly, though.
PS. I do agree it's awful. But it is still much better than doing DNNs without dlib. The work you have done is truly impressive.
Thanks :)
At some point maybe I'll add a non-templated network building API. I've thought about it, but it's a fair bit of work and the only reason I really want to do it is to work around visual studio and to make it easier for python users to load networks. Probably the visual studio devs will fix VC before I get around to it, but maybe it's worth it for python. We will see. I haven't really figured out how to make a non-template API that I like yet.
Hmm, have you ever tried using ICC? I am not sure if it is compatible with CUDA/cuDNN, but usually ICC works with MSVC-compiled libraries/DLLs. Might be worth a shot.
So I finally came around to updating my example. It is a bit more complicated: I had to make some changes to dlib, so it won't work with the standard mainline. I added dilated convolution and ELU support (although ELU isn't used, as it doesn't seem to work for me on backpropagation – I get infinite gradients and have yet to investigate why that happens). And since I use a SqueezeNet architecture where layers from the encoder part are mixed into the decoder part, I also had to change some asserts in the copy tensor code, as the deconv layers do not necessarily restore the exact tensor size. Here is the link to the example if anyone is interested.
The training dataset I use, the human parsing dataset, contains 17k images – about 3 times the size of VOC2012. Unfortunately it is a bit hard to track down, so I might park it somewhere.
While looking for alternative segmentation networks as I wait for my current one to train, I stumbled across this rather intriguing solution:
https://arxiv.org/pdf/1702.08502.pdf
This looks like it simplifies the decoding part of the network significantly, and it can easily be bolted onto an existing ResNet.