AlasterMeehan opened this issue 4 years ago
Ideally we would like to detect objects of interest as soon as they appear on our rolling data rather than waiting for 416 new pixels, but doing so involves reprocessing the same data several times and is far from optimal.
I think this is the easiest, fastest and the most accurate way to do this.
Just as an example, the receptive field of the final activation in yolov4.cfg is ~1500x1500, so a new part of the image appearing at the edge will affect all final activations and most of the intermediate features, meaning you would have to recalculate all feature layers to keep good accuracy. You can try to optimize it, but I think it will be very difficult.
Thanks for your quick reply. I have been doing some further reading into receptive fields.
There are a few things I would like to understand better. It seems valid to take trained weights from a model trained with a small image (e.g. 416x416) and import them into a larger model with a bigger image size defined in the .cfg file (e.g. 2080x2080). How/why is this possible? I would have thought there would be missing weight values. And if the receptive field is ~1500x1500, should we be training models with bigger input images?
How/why is this possible? I would have thought there are missing weight values?
Weights size depends on kernel_size, input channels, number of filters, and conv groups. It doesn't depend on the width or height of the layer.
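As a sketch of why this works, the parameter count of a convolutional layer can be computed from exactly the quantities listed above, with the layer's spatial resolution appearing nowhere in the formula (the numbers below are illustrative, not taken from a specific yolov4.cfg layer):

```python
# Parameter count of a convolutional layer:
#   weights = kernel_size^2 * (input_channels / groups) * filters
# plus one bias (or batch-norm scale) per filter. Width and height of
# the feature map do not appear, which is why the same .weights file
# can be loaded into a .cfg with a different network resolution.

def conv_weight_count(kernel_size, in_channels, filters, groups=1):
    return kernel_size ** 2 * (in_channels // groups) * filters + filters

# A 3x3 conv from 256 to 512 channels has the same number of weights
# whether its feature map is 13x13 (416-wide net) or 65x65 (2080-wide net).
print(conv_weight_count(3, 256, 512))  # 1180160 parameters either way
```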
If the receptive field is ~1500x1500 then should we be training models with bigger input images?
https://arxiv.org/pdf/2004.10934.pdf
The influence of the receptive field with different sizes is summarized as follows:
• Up to the object size - allows viewing the entire object
• Up to the network size - allows viewing the context around the object
• Exceeding the network size - increases the number of connections between the image point and the final activation
Set in cfg-file
[net]
show_receptive_field=1
and run detection - you will see receptive field for each layer.
Once the network is loaded onto a GPU it takes up much more memory than the saved weights used to build it, and the required memory also increases with the network/image size used. Why does the model take up so much more memory than the weights it is built from? Is this an optimization for improving processing times?
The amount of memory on the GPU limits the image sizes that YOLO can process. Are there any ways to limit the amount of memory a model will use for large images/networks? It would be nice to be able to feed very large images into YOLO without having to use a tile/window approach.
I should also say that I have only tested on YOLOv3 up until now. Are there any major ways that YOLOv4 differs in parameters discussed on this thread?
Why would the model take up so much more memory than the weights that it is built from?
No. Weights size doesn't depend on network resolution; output layer size does depend on network resolution.
The amount of memory on the GPU limits the image sizes that can YOLO can process, are there any ways to limit the amount of memory a model will use for large images/networks?
No.
@AlasterMeehan @AlexeyAB , Hello!
I am working on the same issue you had. It seems you were able to solve it. I'm trying to take very big images and process them by tiling them. Could you please show me how you did it or what code you changed? I would be incredibly thankful. It sounds like you're using it for video, but I only need it for big images. Thank you!
The image below shows the GPU memory usage from loading 2 different YOLO models. The .weights files are the same, at 235MB. The only change is in the .cfg file, where width is 416 in the first model and 2080 in the second; height is 416 in both. The first model uses ~2.3GB and the second ~5.1GB.
I am using the C++ dll interface. (It's actually wrapped in C# and then used in Matlab, but essentially it's the C++ interface included here.)
I have been assuming that the network size needs to match the image size you are using. (If they are different, will input images be resized?) Shrinking our images or reducing the output layer resolution would reduce accuracy, and would not be practical when image sizes are in the tens of thousands of pixels.
Can this GPU memory be reduced so large network sizes can be used?
@Mogarbobac I preprocessed and tiled the images in Matlab. It is best to overlap these tiles, with the overlap being at least as big as the objects you want to detect. Detected objects are then mapped back to the original image. This can result in multiple detections of the same object, which can be reduced in post-processing with a non-max suppression algorithm. I would share my code, but it is currently heavily customized and written in Matlab. At some point I would like to generalize our code and share the Matlab wrapper once I find the time.
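The approach described above can be sketched in a few functions. This is a minimal Python illustration, not the author's Matlab code: tile size, overlap, and the IoU threshold are illustrative, and the detector call itself is left as a placeholder:

```python
# Sketch of overlapped tiling plus greedy non-max suppression (NMS).
# Per tile, run your detector and shift each box by the tile origin
# (x0, y0) before pooling all boxes into one list for NMS.

def make_tiles(img_w, img_h, tile=416, overlap=200):
    """(x0, y0) tile origins covering the image; the last tile in each
    axis is snapped to the image edge so nothing is missed."""
    step = tile - overlap
    def positions(length):
        last = max(length - tile, 0)
        ps = list(range(0, last + 1, step))
        if ps[-1] != last:
            ps.append(last)
        return ps
    return [(x, y) for y in positions(img_h) for x in positions(img_w)]

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

def nms(dets, thresh=0.5):
    """dets: list of (score, box); keep highest-score boxes, dropping
    any box that overlaps an already-kept box above `thresh`."""
    keep = []
    for d in sorted(dets, reverse=True):
        if all(iou(d[1], k[1]) < thresh for k in keep):
            keep.append(d)
    return keep
```

Because the overlap is at least one object wide, an object cut by one tile boundary appears whole in the neighbouring tile; the duplicate detection is then removed by `nms`.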
Can this GPU memory be reduced so large network sizes can be used?
No.
(If they are different will input images be resized?)
Yes.
https://github.com/AlexeyAB/darknet#how-to-improve-object-detection
General rule - your training dataset should include such a set of relative sizes of objects that you want to detect:
train_network_width * train_obj_width / train_image_width ~= detection_network_width * detection_obj_width / detection_image_width
train_network_height * train_obj_height / train_image_height ~= detection_network_height * detection_obj_height / detection_image_height
I.e. for each object from the Test dataset there must be at least 1 object in the Training dataset with the same class_id and about the same relative size:
object width in percent from Training dataset ~= object width in percent from Test dataset
That is, if only objects that occupied 80-90% of the image were present in the training set, then the trained network will not be able to detect objects that occupy 1-10% of the image.
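The rule quoted above can be written as a quick check. This is an illustrative sketch of the formula, not part of darknet; it also shows why tiling sidesteps the problem, since a tile fed at native resolution keeps the object's relative size the same at training and detection time:

```python
# The relative-size rule from the darknet README: an object's size in
# network-input pixels should be similar at training and detection time.

def obj_size_in_network_px(network_px, obj_px, image_px):
    """network_width * obj_width / image_width (same for heights)."""
    return network_px * obj_px / image_px

# Training: 100 px objects in 416 px tiles, 416-wide network.
train = obj_size_in_network_px(416, 100, 416)
# Detection: the same 100 px objects in 2080 px tiles, 2080-wide network.
detect = obj_size_in_network_px(2080, 100, 2080)
print(train, detect)  # 100.0 100.0 -> the rule is satisfied
```

By contrast, feeding a whole 2080 px image into a 416-wide network would shrink the same object to 20 network pixels, violating the rule unless equally small objects were present in training.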
@AlasterMeehan When you said "I preprocessed and tiled the images in Matlab", did you save the individual tiles as new images in a folder and then run them as a batch, or did you feed them directly from RAM one at a time? I may be able to figure out the first one, but not the RAM one. Thank you!
@Mogarbobac For training I save the tiles as images; when doing detection I use the dll interface. You can read more about the interface here.
https://github.com/AlexeyAB/darknet#how-to-use-yolo-as-dll-and-so-libraries There are custom image structures that you can write data to before calling the YOLO detect function directly. If detection performance is not critical then you can just save the tiles as images. If you are using Python then there is a wrapper already made for you; otherwise there is a bit of a learning curve to using the C/C++ interfaces.
@AlasterMeehan I'm having a lot of problems with the dll setup. Are there any guides that show something close to what I'd like to do? If you'd like to PM me instead, my email is mogarbobac@aim.com.
Any assistance would be greatly appreciated. Thank you.
I have an application with a potentially very large image (4,000 to 40,000 pixels wide) where new rows of pixels come in at 10-20 per second.
The model is trained using overlapping 416 x 416 tiles with an overlap of 100 to 200 pixels. The training data is usually small snippets of the whole sensor, in the hundreds of pixels.
For detection we can use wider tiles/windows (416 x 2080) so there is less overlap and data is not processed multiple times. Surprisingly, this seems to work well.
Ideally we would like to detect objects of interest as soon as they appear on our rolling data rather than waiting for 416 new pixels, but doing so involves reprocessing the same data several times and is far from optimal.
Would it be possible to process the feature layers of the network, save these values in a buffer, then run the YOLO detection layers every time a new block of data comes in? Would this require much change to the code?
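Since darknet offers no built-in way to cache intermediate feature layers (as the first reply in this thread explains, the large receptive field means new edge pixels would invalidate most of them anyway), a practical workaround is to buffer rows on the input side and re-run detection on a trailing window at a chosen stride. A hedged sketch, where `detect_tile` is a placeholder for whatever detector call you use and the window/stride values are illustrative:

```python
# Sketch of a rolling row buffer for a line-scan sensor: accumulate
# incoming rows and call the detector on the trailing window each time
# `stride` new rows have arrived. Smaller stride -> lower latency, but
# each pixel is reprocessed roughly window/stride times.
from collections import deque

class RollingDetector:
    def __init__(self, detect_tile, window=416, stride=64):
        self.detect_tile = detect_tile  # placeholder: your detector call
        self.window = window            # tile height fed to the network
        self.stride = stride            # new rows between detections
        self.rows = deque(maxlen=window)
        self.pending = 0

    def push_rows(self, new_rows):
        """Feed newly arrived rows; returns detections produced so far."""
        results = []
        for row in new_rows:
            self.rows.append(row)       # oldest row falls off automatically
            self.pending += 1
            if len(self.rows) == self.window and self.pending >= self.stride:
                self.pending = 0
                results.extend(self.detect_tile(list(self.rows)))
        return results
```

The stride is the latency/throughput knob: stride = 416 is the "wait for a full new tile" behaviour described above, while stride = 64 detects objects ~6x sooner at ~6x the per-pixel compute, which matches the trade-off this question is asking about.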