BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Dense per pixel label output #1019

Closed: ankurhanda closed this issue 7 years ago

ankurhanda commented 10 years ago

Hi everyone,

I have finally unit tested the changes I made to caffe-public to enable per-pixel labels for the whole image. So far, only the inner product and softmax layers have needed densified versions of their forward and backward passes, and those changes now pass the unit tests that come with the repository. It would be very useful, both for me and for anyone who wants to build on this, to have the feature tested thoroughly by the community to catch any remaining bugs. I will hopefully release the code along with our research work later this year or next.
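Concretely, in the standard per-pixel formulation this means the softmax loss gets one term per spatial location rather than one per image: with predictions $x \in \mathbb{R}^{N \times C \times H \times W}$ and labels $y \in \{1, \dots, C\}^{N \times H \times W}$,

$$L = -\frac{1}{NHW} \sum_{n=1}^{N} \sum_{i=1}^{H} \sum_{j=1}^{W} \log \frac{\exp(x_{n,\, y_{nij},\, i, j})}{\sum_{c=1}^{C} \exp(x_{n, c, i, j})}.$$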

Also, looking at the growing body of research on dense semantic labelling with conv-nets for image segmentation and indoor scene understanding, I presume someone else may have done this conversion as well. What is the best way to collate the contributions? Also, has anyone used depth image input (e.g. the NYU dataset) to a conv-net for indoor scene understanding?

Look forward to hearing from you.

Thank you.

BlGene commented 10 years ago

Hi Ankur,

This sounds interesting. I am working on a very similar problem and chose to start by re-implementing the approach taken by [1] using Caffe. I would be interested to know how your approach works: are you learning a convnet for the whole image and then backpropagating labels, or are you windowing the images? It's quite difficult to say much without implementation details and code.

As I am quite new to Caffe, I will open a new issue to outline how I plan to re-implement the paper, in order to get feedback on the best way to do this. Maybe you can comment on that too.

Not to state the obvious, but Caffe development moves quickly, so if you plan to release code at some later point, make sure not to diverge too much from the dev branch.

1) What's the best way to collate the contributions? Probably to create a branch with the added functionality, together with an example that makes use of it. After that you can discuss with people whether it is a sensible implementation and make the necessary improvements until it's mergeable.

2) Has anyone used depth image input (e.g. the NYU dataset) to a conv-net for indoor scene understanding? No, but I am planning to.

BR, Max

[1] Farabet et al., "Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers"

BlGene commented 10 years ago

See here for my work.

ankurhanda commented 10 years ago

Hi everyone,

Following up on that, I wrote a new layer, "Upsample", which takes a "scale" parameter and rescales a low-resolution n-channel image by that factor. My code compiles, but when I run training with a network architecture that now contains the additional layer

```
layers {
  name: "upsample1"
  type: UPSAMPLE
  bottom: "ip2"
  top: "ups1"
  scale: 1
}
```

I get the following error: Message type "caffe.LayerParameter" has no field named "scale".

The UpsampleParameter message is declared in caffe.proto as follows:

```
// Message that stores parameters used by UpsampleLayer
message UpsampleParameter {
  // UpsampleLayer computes outputs y = Sx, where S is the upsampling matrix.
  optional float scale = 1 [default = 1.0];
}
```

Could someone please take a look and advise me on how to diagnose this error?

Regards, Ankur.

BlGene commented 10 years ago

You probably need to either add the "scale" field directly to LayerParameter, or add the UpsampleParameter message to LayerParameter, like AccuracyParameter is added in the line `optional AccuracyParameter accuracy_param = 27;`.
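For illustration, that registration in caffe.proto might look like the following; the field number 42 is just a placeholder and must not collide with any existing LayerParameter field:

```
message LayerParameter {
  // ... existing fields ...
  optional AccuracyParameter accuracy_param = 27;
  // Hypothetical: register the new message under an unused field number.
  optional UpsampleParameter upsample_param = 42;
}
```

Remember to rebuild after editing caffe.proto so the generated protobuf code is refreshed.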

ankurhanda commented 10 years ago

I had already done that, and in spite of it I still get the error that the scale parameter isn't defined.

BlGene commented 10 years ago

Then maybe try something like this in your layer:

```
upsample_param {  # or whatever you named the parameter field
  scale: 1
}
```
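Putting the two together, and assuming the field you added to LayerParameter is named upsample_param, the layer from above would then read:

```
layers {
  name: "upsample1"
  type: UPSAMPLE
  bottom: "ip2"
  top: "ups1"
  upsample_param {
    scale: 1
  }
}
```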

ankurhanda commented 10 years ago

I tried this and it works! Great help, many thanks - I could have spent the whole day on this.

BlGene commented 10 years ago

Np :)

ankurhanda commented 10 years ago

Hello @BlGene,

I wonder if I could get your email address to discuss the possibility of merging your additions with the ones I'm working on. We are both aiming at similar ideas and codebases, so I thought it might be a good idea to get in contact! My webpage is here: http://www.doc.ic.ac.uk/~ahanda/

BlGene commented 10 years ago

Sounds good. For now, most of my work is located here.

dasguptar commented 9 years ago

Hi @ankurhanda, @BlGene,

I am actually planning to implement semantic segmentation/scene parsing using convolutional networks. So far, I have only used CNNs, and Caffe by extension, for image classification purposes.

I was wondering if I could collaborate with you guys, since I am kind of clueless as to how to start implementing or modifying the existing code to enable dense feature extraction at each pixel of an image.

I understand that each pixel is now a data sample, and a window/patch surrounding the pixel is sent to the CNN to be classified as one of the possible classes. But the naive way of simply sliding the window/patch across the image would be too inefficient and time-consuming, right? What would be a possible remedy to do this more efficiently?

BlGene commented 9 years ago


See the net surgery Python example for how to extract dense features.

There are several approaches to scene parsing; you should probably first look around at what people have done before. The OverFeat paper is quite recent, so maybe have a look at it, as well as the other papers it cites. It would probably be a safer bet to choose one of these approaches and reimplement it.
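To make that concrete: the trick the net surgery example demonstrates is recasting fully connected layers as convolutions, so one forward pass over the whole image yields a grid of predictions instead of one prediction per crop. A sketch in the old prototxt format, using CaffeNet's fc6 as the example (the kernel_size of 6 matches pool5's 6x6 spatial size in that net; adjust for yours):

```
# Original classifier head: one prediction per fixed-size crop.
layers {
  name: "fc6"
  type: INNER_PRODUCT
  bottom: "pool5"
  top: "fc6"
  inner_product_param { num_output: 4096 }
}

# Fully convolutional equivalent: accepts larger inputs and emits
# a spatial map of outputs in a single forward pass.
layers {
  name: "fc6-conv"
  type: CONVOLUTION
  bottom: "pool5"
  top: "fc6-conv"
  convolution_param {
    num_output: 4096
    kernel_size: 6
  }
}
```

The existing fc6 weights can be copied into fc6-conv after a reshape, since the two layers compute the same function on a 6x6 input.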

Best of luck, Max


dasguptar commented 9 years ago

Hi @BlGene, thanks for the tips. I have seen the net surgery example, and I understand it explains how to extract dense features. I was actually thinking of implementing something like [1]. Most papers I have read simply state that they trained a convnet for dense feature extraction but skim over the actual implementation details. Could you suggest some papers on scene parsing that employ convnets, so that I could have a look? From what I have seen, none of the papers on this topic have released their source code, which makes it difficult to understand how they tackled problems like sampling pixels to account for highly skewed class frequencies.

[1] Farabet et al, "Learning Hierarchical Features for Scene Labeling", Pattern Analysis and Machine Intelligence

BlGene commented 9 years ago

You picked the paper I am working with too :). IIRC, the paper mentions that they frequency-balance categories while training their first neural net.
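One concrete way to do that kind of balancing in Caffe is the infogain loss with a diagonal matrix of inverse class frequencies. A sketch (the weights file name is made up, and you have to compute and serialize H = diag(1/freq_c) yourself):

```
layers {
  name: "loss"
  type: INFOGAIN_LOSS
  bottom: "prob"    # per-class probabilities from a softmax layer
  bottom: "label"
  infogain_loss_param {
    # Hypothetical file holding the class-weighting matrix H.
    source: "class_weights.binaryproto"
  }
}
```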

Best Regards, Max Argus


dasguptar commented 9 years ago

Hi @BlGene,

Another, probably dumb, question. This is what I am thinking of doing, and I was wondering if you could tell me whether I'm heading in the right direction:

  1. Define a convnet model with fully convolutional layers to extract dense features.
  2. Replicate this model at different scales, and enable weight sharing by setting the appropriate param names (a rough sketch follows this list).
  3. Concatenate all the dense features from the different convnets and pass them to a softmax classifier.
  4. Frequency balance the pixels from each category during training. This could be done using something like a WindowDataLayer, I believe.
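The weight sharing in step 2 could look roughly like this in the old prototxt format (layer and blob names are made up; the sharing comes from both layers referencing the same param names, as in the MNIST siamese example):

```
layers {
  name: "conv1_scale1"
  type: CONVOLUTION
  bottom: "data_scale1"
  top: "conv1_scale1"
  param: "conv1_w"    # shared weight blob
  param: "conv1_b"    # shared bias blob
  convolution_param { num_output: 16 kernel_size: 7 }
}
layers {
  name: "conv1_scale2"
  type: CONVOLUTION
  bottom: "data_scale2"
  top: "conv1_scale2"
  param: "conv1_w"    # same names, so the filters are shared across scales
  param: "conv1_b"
  convolution_param { num_output: 16 kernel_size: 7 }
}
```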

Am I missing something, or is this alright?

Thanks a ton for your help.

BlGene commented 9 years ago

Seems to be a reasonable approach. I'm no expert on Caffe, though, so you are probably asking the wrong guy.

Best Regards, Max Argus


JianlongFu commented 9 years ago

Hello @riddhiman-dasgupta,

I also wonder how to implement the details of ref. [1]. My understanding after reading the paper is that they treat each patch around a pixel as one input to the CNN, but that is definitely an inefficient way to train. What is your idea?

Thanks and best, Jason.

gaobb commented 9 years ago

Hi,

Thanks for your answer. I think the problem can be considered a regression problem.

Happy new year!

Best regards! Bin-Bin Gao



jingweiz commented 9 years ago

Hi @ankurhanda, is it possible for you to share the Upsample layer you wrote? Thanks in advance!

shelhamer commented 9 years ago

#1615 adds a deconvolution / backward convolution layer that can do upsampling.
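For example, a learnable 2x upsampling of a score map with that layer looks roughly like this (new-format prototxt; num_output and the kernel/stride values are illustrative):

```
layer {
  name: "upscore"
  type: "Deconvolution"
  bottom: "score"
  top: "upscore"
  convolution_param {
    num_output: 21   # e.g. one channel per class
    kernel_size: 4
    stride: 2        # doubles the spatial resolution
  }
}
```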

yuLiu24 commented 9 years ago

Hi @ankurhanda, how did you convert the ground truth images? With LevelDB? In fact, I do not know how to save these GT images and compare them with the final conv outputs. Thanks!

igorb871 commented 9 years ago

Hi @ankurhanda ,

Is your layer available yet? I'm doing a class project on classifying objects in maps, and I think it could be really valuable.

Thanks!

je310 commented 9 years ago

Hello @ankurhanda, did you manage to find out how to arrange the ground truths to be tested against a fully convolutional output?

shelhamer commented 7 years ago

Closing, as this is handled by fully convolutional networks and their implementation in Caffe through reshaping, coordinate mapping, and cropping. See fcn.berkeleyvision.org for the reference implementation.
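For reference, the cropping step in those nets aligns the upsampled score map with the input. In the FCN reference prototxts it looks like this (the offset is specific to each net's padding; 19 is the FCN-32s value):

```
layer {
  name: "score"
  type: "Crop"
  bottom: "upscore"   # coarse, upsampled predictions
  bottom: "data"      # reference blob that defines the output size
  top: "score"
  crop_param {
    axis: 2       # crop the spatial axes only
    offset: 19
  }
}
```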