BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Sliding Window, Varying input/output size and Dense, multiscale extraction #189

Closed akosiorek closed 7 years ago

akosiorek commented 10 years ago

[1] enables varying input/output sizes in order to perform multiscale, multiview image processing, both to bolster classification confidence and to perform localisation and object detection. I wonder whether and how it could be implemented in Caffe?

One possibility would be to set blob sizes to their maximum expected values and then account for the actual input size during computation at each layer. I am not familiar enough with Caffe sources to predict the overhead this approach might cause. I imagine it can lead to redundant memory copying and involved index arithmetic in order to access the right data.

What are other possibilities? I would be happy to PR it should we be able to work out a decent solution.

[1] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. arXiv:1312.6229 [cs.CV].

nian-liu commented 10 years ago

I am also concerned about similar issues: does Caffe support multiple input data layers, and situations where multiple layers feed into a higher layer and vice versa?

mavenlin commented 10 years ago

As for convolution, Caffe processes images one by one. In this sense, the size of each image can vary, and the im2col buffer can be preallocated to fit the largest image. For the inner product layer, batch mode will no longer work (but then, a network involving multiscaling would have no inner product layer anyway). Dropout is also not a problem. I haven't read the pooling code, so I have no idea whether it is a problem there.

shelhamer commented 10 years ago

@kosiorekadam To vary the output size with the input size, the inner product layers for classification can be made convolutional too, such that the network produces a spatial output map. This can be done in Caffe as-is with the proper network definition. We will try to include an example.
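
As a rough illustration of why the conversion works (a toy numpy check with made-up shapes, not Caffe code): an inner product over a C x H x W input computes the same dot products as a convolution with K filters of kernel size H x W, so the weights are identical up to a reshape, and larger inputs simply yield a spatial map of outputs.

```python
import numpy as np

# Toy check: an inner product layer equals a convolution whose kernel
# covers its whole input. Shapes here are illustrative, not AlexNet's.
rng = np.random.default_rng(0)
C, H, W, K = 4, 6, 6, 10                      # channels, size, num outputs
feat = rng.standard_normal((C, H, W))
fc_w = rng.standard_normal((K, C * H * W))    # inner product weights

# Inner product view: flatten the features, then matrix-multiply.
ip_out = fc_w @ feat.reshape(-1)

# Convolutional view: the same weights reshaped into K filters of shape
# C x H x W, applied at the single valid position of an H x W input.
conv_w = fc_w.reshape(K, C, H, W)
conv_out = np.array([(conv_w[k] * feat).sum() for k in range(K)])

assert np.allclose(ip_out, conv_out)
```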

Dense, multiscale feature extraction (that's fast!) is afforded by convolutional architectures if done right, and has been done within the BVLC in Caffe. We hope to publicly release this enhancement before long.

@mavenlin while convolution is the bottleneck in the current pipeline, images of varying dimensions and scale can be accommodated in a single convolutional pass with the right indexing. Essentially, one packs a pyramid or image set into a "plane" for processing through the net. This amortizes the convolutional computation across windows. By careful indexing one can extract the features/output as if they were processed one-by-one.
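
A minimal numpy sketch of the packing idea (the function and layout are hypothetical; real packers such as Torch7's PyramidPacker arrange scales more compactly, and padding between images must be at least the kernel radius to avoid cross-talk between windows):

```python
import numpy as np

def pack_plane(images, pad=0):
    """Pack images of varying sizes side by side into one 'plane',
    separated by `pad` zero-filled columns, recording each image's
    offset so per-image outputs can be unpacked after the forward pass."""
    C = images[0].shape[0]
    H = max(im.shape[1] for im in images)
    W = sum(im.shape[2] for im in images) + pad * (len(images) - 1)
    plane = np.zeros((C, H, W), dtype=images[0].dtype)
    offsets = []
    x = 0
    for im in images:
        h, w = im.shape[1:]
        plane[:, :h, x:x + w] = im
        offsets.append((x, h, w))
        x += w + pad
    return plane, offsets

# Two scales of a 3-channel input packed into a single 3 x 16 x 28 plane.
imgs = [np.ones((3, 8, 8)), np.ones((3, 16, 16)) * 2]
plane, offsets = pack_plane(imgs, pad=4)
```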

@forresti is a BVLC member working on amortized computation, reduced memory usage, and further efficiency improvements to Caffe among other projects.

shelhamer commented 10 years ago

@nian-liu Caffe layers can have multiple inputs and outputs. Caffe networks can have any DAG (directed acyclic graph) structure #114 #129 , so many kinds of branching are supported. Although there aren't examples included yet, it is done by listing multiple outputs in the network definition, which are then automatically connected by inserting split layers.

For multiple inputs, you might find the concatenation layer helpful #125. This combines multiple input images into a single input blob. This could be used for example to process consecutive frames of video together.

akosiorek commented 10 years ago

I've done a little bit of code reading and as I understand both convolution and pooling layers can work with changing image sizes. They only have to be preallocated to fit the biggest image anticipated, just as @mavenlin mentioned.

However, this approach results in convolving and pooling a small image together with a lot of padding (corresponding to the maximum image size). In order to restrict the computation to the area of the currently processed image I need to store the size somewhere. I can either feed the image to the network and compute the size after each layer inside Net::forward, or add a couple of fields to the Blob to store the size. Of course I would have to change the API to allow input of a different size than indicated in the layer_param. Am I correct?
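
For reference, the per-layer size bookkeeping is just Caffe's output-size arithmetic (floor rounding for convolution, ceil for pooling); a sketch propagating a 227x227 input through CaffeNet's spatial layers:

```python
def conv_out(n, k, s=1, p=0):
    # Convolution output size: floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

def pool_out(n, k, s, p=0):
    # Pooling output size rounds up so border regions are still pooled:
    # ceil((n + 2p - k) / s) + 1
    return -(-(n + 2 * p - k) // s) + 1

n = 227
n = conv_out(n, 11, 4)    # conv1 -> 55
n = pool_out(n, 3, 2)     # pool1 -> 27
n = conv_out(n, 5, p=2)   # conv2 -> 27
n = pool_out(n, 3, 2)     # pool2 -> 13
n = conv_out(n, 3, p=1)   # conv3 -> 13
n = conv_out(n, 3, p=1)   # conv4 -> 13
n = conv_out(n, 3, p=1)   # conv5 -> 13
n = pool_out(n, 3, 2)     # pool5 -> 6
n = conv_out(n, 6)        # fc6 as a 6x6 convolution -> a 1x1 output map
```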

kloudkl commented 10 years ago

Torch7's PyramidPacker does exactly what we want. @shelhamer, is your internal BVLC implementation the same as Torch7's? If it is and you cannot open-source it shortly for some reason, we understand and would like to implement one to benefit everyone as soon as possible.

shelhamer commented 10 years ago

@kloudkl the BVLC implementation is the same as the Torch7 PyramidPacker at least in spirit; I have not read the Torch7 code yet to compare the details.

Our pipeline is not quite identical, but it is a pack, plane, unpack method.

I agree it is time for dense extraction in Caffe. Since there are several design choices, it is unlikely that the implementation planned here will be identical to the private (and still experimental) implementation. I suggest we move ahead on a public implementation, and then we can compare and draw from the strengths of both implementations in the end. The BVLC pyramid team agrees with this path and will continue work on their implementation too.

PRs for dense + pyramid extraction are welcome!

sguada commented 10 years ago

@shelhamer making inner product layers into convolutional layers slows down the process a lot. I made some tests by changing the inner product layers to convolution layers with 4096 filters, and the running time goes from 1.25 seconds per batch (of 256 227x227 images) to 4.37 seconds, so almost 4x slower.

When I increase the size of the images to 454x454 I have to reduce the batch size to 128, otherwise it doesn't fit in the 12 GB of memory of the K40, and then the time per batch is 4.23 seconds, which means the time to process 256 images would be 8.47 seconds. That would make the network impractical for training, since training would take ~30 days; however, it could be used for testing or deployment.

Maybe a different way to do the convolutions could help in that case. Also by changing the size of the inputs like #195 one could pass multiple scales independently instead of all together.

shelhamer commented 10 years ago

Thanks for the timing evaluation @sguada. The convolutional bottleneck is an important target for improvement. @forresti, I think you had some ideas for this?

However, it's important to note the overall efficiency of this scheme. In the 454x454 case an 8x8 classification map is computed, so the convolutional fully-connected net is doing 64x the work in ~8x the time (I did this math in my head, so someone might check me on this).
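
A back-of-the-envelope check of that accounting using the batch timings reported above (assuming an 8x8 map, i.e. 64 windows; with a 9x9 map the work factor would be 81x):

```python
# Per-image times from the reported batch timings (assumptions: baseline
# is the 1.25 s / 256-image batch at 227x227; dense is the 4.23 s /
# 128-image batch at 454x454, each image yielding an 8x8 window map).
windows = 8 * 8
t_single = 1.25 / 256          # seconds per 227x227 classification
t_dense = 4.23 / 128           # seconds per 454x454 dense pass

# Speedup of the dense pass over classifying each window independently.
speedup = windows * t_single / t_dense   # roughly 9.5x
```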

Further, one need not necessarily densely compute the classifier. One could fuse the dense and selective approaches by densely extracting features (across space and scale as desired), then selectively computing the fully-connected layers at selective search from the image mapped into feature space coordinates.

Perhaps #194 might help alleviate the issue if instead of convolution fully-connected layers one tiles the inner product layer weights to compute the classification map as one massive multiplication, although this is of course wasteful in memory.
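
A toy numpy sketch of that one-big-multiplication idea (illustrative sizes, not the real layers; window contents are duplicated into columns, which is exactly where the memory cost comes from):

```python
import numpy as np

# Gather every 6x6 window of a pool5-like feature map into a column,
# then apply the inner product weights to all windows in one matmul.
rng = np.random.default_rng(0)
C, H, W, k, K = 4, 13, 13, 6, 10
feat = rng.standard_normal((C, H, W))
w = rng.standard_normal((K, C * k * k))       # inner product weights

n = H - k + 1                                  # 8 window positions per side
cols = np.stack([feat[:, y:y + k, x:x + k].reshape(-1)
                 for y in range(n) for x in range(n)], axis=1)

# One massive multiplication yields the whole classification map at once.
scores = (w @ cols).reshape(K, n, n)           # K x 8 x 8 output map
```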

sguada commented 10 years ago

@shelhamer your math was almost correct: the final output map is 9x9, so it is in fact doing 81x the work in ~8x the time. However, there is something to look into in the convolutional layer, since when it is given the same image size and has to do exactly the same work it requires 4x the time.

However, this approach would work for big images, since the extra cost is amortized very quickly.

@forresti and I have been looking into how to speed up the convolution, but so far without success. However, maybe it could work for this case, where there are a lot of filters with many channels.

forresti commented 10 years ago

@sguada oh cool, thanks for doing some benchmarking with the convolutional fc6 and fc7.

To begin with, I'll see if I can discern why conv is slower than innerproduct for the standard 227x227 setup.

@kloudkl Do you have any thoughts on the computational efficiency of Torch7 for Alexnet and similar deep models? Are there any particularly interesting scenarios where Caffe is much faster or slower than Torch7?

rodrigob commented 10 years ago

It might be interesting to look into the implementation details of OverFeat since it is supposedly optimized for the "dense sliding window" use case. https://github.com/sermanet/OverFeat

rodrigob commented 10 years ago

Scratch my last comment: for now OverFeat has only released source code for the CPU version, and binaries for the GPU version (?!).

Then for now we can look at Torch's GPU code

https://github.com/torch/cunn https://github.com/torch/cutorch/blob/master/lib/THC/THCTensorConv.cu

rodrigob commented 10 years ago

A nice demo for this new feature would be a face detector similar to http://eblearn.sourceforge.net/face_detector.html

kloudkl commented 10 years ago

@forresti, I have read some of the Torch7 code but never run it. Unlike Caffe, only a small part of Torch7 is written in CUDA. Torch7 and Theano were benchmarked against each other in the pre-Caffe era. The results largely depend on who does the benchmarking, when (both teams never stop optimizing performance), and on which GPU device the benchmarks are run.

To inspire further discussion, I excerpt the following points, which can be found in many CUDA courses. To gain insight into performance bottlenecks and their root causes, the orthogonal method is profiling. The usual suspects are device utilization and memory bus utilization. The former can be tuned with the launch config (#111); the latter is higher when memory access is coalesced. Latency hiding can also increase throughput, while warp divergence does the opposite.

If optimization becomes a really high priority, systematically studying the related professional techniques will help a lot.

shelhamer commented 10 years ago

I do not see Torch7 and Theano so much as guides for our computational pipeline and convolutional architecture, but as machine learning / deep learning libraries we can take as inspiration for features.

The central feature relevant to dense and pyramid processing in Torch7 is pyramid packing and unpacking. While optimization of the indexing, convolution, and fully-connected layers will be important for a widely-useful implementation, first we must have an implementation. From pyramid processing we can go in many directions, including for problems other than detection, and of course work on a Caffe reference implementation of OverFeat.

Thanks @kloudkl for the review of CUDA optimization and benchmark history. Perhaps we could have an "on CUDA optimization" section of the developer documentation to keep your pointers together.

The face detector highlighted by @rodrigob would be a nice demo for pyramid processing.

shelhamer commented 10 years ago

The BVLC pyramid team is working on integrating their implementation into dev ASAP. The only hold-up is the usual integration hacking and a license complication that is being hammered out now. Thanks all for your patience while this feature coalesces.

However, I stand by my original suggestion that a Torch7 style pyramid pack/plane/unpack method be pursued in the community so that we can analyze and improve on the differences. There are many design choices in such a feature.

shelhamer commented 10 years ago

Re: @forresti's https://github.com/BVLC/caffe/issues/189#issuecomment-37227810, the convolutional implementation is slowed by the roll / unroll and copy instead of straight dgemm as in the InnerProduct layer.

rodrigob commented 10 years ago

Thanks @shelhamer for looking into the topic. Any update on the BVLC pyramid integration plans? Is there a branch where we can track progress on this topic?

kloudkl commented 10 years ago

It is scheduled for the 1.0 milestone. There is no PR or branch to track yet, but anyone should feel free to develop one.

shelhamer commented 10 years ago

The BVLC pyramid team hopes to make a public PR in the next week. That said, it was developed somewhat independently of Caffe and will take serious effort to integrate, so the appearance of the PR doesn't signal that the feature is ready.

My honest suggestion is that anyone interested pursue the Torch7 pyramid pack/plane/unpack line of thought. There is understanding, and there are improvements, to be had in comparing implementations. As @kloudkl noted, this is a milestone feature for us, so we could help review and discuss any contributions in this direction.

One could even prototype it in Python instead of coding it directly into the library, to first understand the choices to be made. For instance, the Torch7 packing, when convolved, will not produce the same filter activations as running separate inputs; there will be border effects according to the kernel sizes. Likewise, how should one pad to avoid false edge responses along the negative space where no image is packed? Yet another issue is that a mean image will not work unless it is scaled and applied to each packed image; one might instead use a channel mean that is spatially uniform.
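
On the mean-subtraction point, a small numpy sketch (shapes illustrative): a fixed-size mean image cannot be subtracted from a plane of arbitrary geometry, but a spatially uniform per-channel mean broadcasts to any size.

```python
import numpy as np

# A packed plane of arbitrary geometry and a conventional 227x227 mean
# image (values here are random placeholders).
rng = np.random.default_rng(0)
plane = rng.random((3, 16, 28)).astype(np.float32)
mean_image = rng.random((3, 227, 227)).astype(np.float32)

# Collapse the mean image to one value per channel, then broadcast it
# over the plane's height and width regardless of the packed layout.
channel_mean = mean_image.mean(axis=(1, 2))
plane -= channel_mean[:, None, None]
```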

These options are worth exploring in more than a single thread.

kloudkl commented 10 years ago

Although it was already mentioned in a comment two weeks ago, I think it is still very relevant and useful to post the links related to @clementfarabet's implementation.

  1. torch7-demos / face-detector / PyramidPacker.lua
  2. torch7-demos / face-detector / PyramidUnPacker.lua
  3. Purdue's demonstrative tutorial

rodrigob commented 10 years ago

Now that DenseNet is out, we should be able to close this item soon?

http://arxiv-web3.library.cornell.edu/abs/1404.1869

moskewcz commented 10 years ago

I just pushed the DenseNet code public and opened a PR #308 (not #307, wrong target branch) to discuss the various TODOs and/or integration plans.

bhack commented 10 years ago

Please take a look here: http://arxiv.org/abs/1405.3866v1

shelhamer commented 10 years ago

#455 might be of interest in the meantime. It shows how to make a fully convolutional model for dense feature extraction or sliding-window classification inference.

bhack commented 9 years ago

This could also include regression support on a variable-length set of bounding box coordinates and sizes.

dasguptar commented 9 years ago

I was a bit curious regarding the current status of sliding window based dense multiscale extraction. Any plans to integrate it into Caffe anytime soon?

melgor commented 9 years ago

Since the talks started 7 months ago, has there been any progress in this field? Maybe someone has implemented the "efficient sliding window" as in OverFeat?

EvanWeiner commented 9 years ago

Echo @melgor -- any progress on the "Efficient Sliding Window" like in OverFeat in Caffe?

melgor commented 9 years ago

@EvanWeiner, everything is now implemented in Caffe, in a similar fashion to OverFeat. To run it you need two things:

These two things are merged in Caffe, so you can use them. As an example of the output, take a look here: http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/net_surgery.ipynb

I think that this issue can be closed.

EvanWeiner commented 9 years ago

@melgor Thank you. But how can I use an output matrix like:

[[282 282 281 281 281 281 277 282]
 [281 283 283 281 281 281 281 282]
 [283 283 283 283 283 283 287 282]
 [283 283 283 281 283 283 283 259]
 [283 283 283 283 283 283 283 259]
 [283 283 283 283 283 283 259 259]
 [283 283 283 283 259 259 259 277]
 [335 335 283 259 263 263 263 277]]

to locate an object within the photo? These values correspond to ImageNet class indices, but the output has the same or a similar class in all locations. How do I discern a particular object?

melgor commented 9 years ago

@EvanWeiner you can find more information on the Caffe mailing list: https://groups.google.com/forum/#!searchin/caffe-users/Object$20Detection/caffe-users/5TyzPCEjuRs/7sJA0DXhJ-kJ

There I outline what you can do to detect objects. Read the OverFeat paper; it has all the information.
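
One way to go beyond an argmax class map like the one quoted above (a hypothetical numpy sketch; it assumes a softmax output `probs` of shape (classes, height, width) from the fully convolutional forward pass, with random placeholder values here): inspect the probability map of a class of interest and locate its strongest window.

```python
import numpy as np

# Placeholder for the softmax output of a fully convolutional net:
# 1000 ImageNet classes over an 8x8 grid of window positions.
probs = np.random.default_rng(0).random((1000, 8, 8))
probs /= probs.sum(axis=0, keepdims=True)     # normalize over classes

cls = 281                          # hypothetical class index of interest
heat = probs[cls]                  # 8x8 confidence map for that class
y, x = np.unravel_index(heat.argmax(), heat.shape)
# (y, x) indexes the strongest window; map it back to image coordinates
# via the network's overall stride and receptive field.
```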

shelhamer commented 7 years ago

Closing, as this is handled by fully convolutional networks and their implementation in Caffe through coordinate mapping and cropping.