NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

How can I use DetectNet for custom size data ? #980

Open erogol opened 8 years ago

erogol commented 8 years ago

I am trying to use DetectNet on third-party data with 448x448 images. Which parameters need to be changed for this custom problem?

lukeyeager commented 8 years ago

It's certainly possible to adjust DetectNet to work with other image sizes, but not easy. @jbarker-nvidia got it to work with 1024x512 images in his blog post: https://devblogs.nvidia.com/parallelforall/detectnet-deep-neural-network-object-detection-digits/

Unfortunately, it's not a simple process. And even if you get it to run without errors, you need to understand what's going on pretty well to get it to actually converge to a solution for your data.

Off the top of my head, here are some places to start:

fchouteau commented 8 years ago

I am also trying to adapt DetectNet to my own dataset (1024x1024 images, for example) with custom object sizes (around 192x192).

The issue is that in the blog post, the full modified prototxt is not published so I'm having a lot of trouble recalculating what I need to modify:

If I'm correct:

Adjust image size: in L79-80 and L118-119, set { (...) xsize: myImageXSize (or myCropSize if cropping) ysize: myImageYSize (or myCropSize if cropping) }

Adjust stride for detecting custom classes: in L73, set { stride: myMinObjectSize }. But there I can't understand which parameters I need to tune, as 1248 x 352 looks like the original image size, but not quite. In L2504 { param_str : '1248, 352, 16, 0.6, 3, 0.02, 22' } I would then guess param_str = 'xSize, ySize, stride, ?, ?, ?, ?', but the rest...

Same for L2519 and L2545

However, I can't understand what I would need to modify: L2418 does not seem to need modification, as it is the bounding box regressor, so it should output 4 values (unless I'm mistaken).

I would love to add documentation on using DetectNet & DIGITS with a custom dataset; however, I can't really understand everything yet.

Regards

jon-barker commented 8 years ago

For 1024x1024 images and target objects around 192x192 you probably don't need to adjust the stride initially. DetectNet with default settings should be sensitive to objects in the range 50-400px. That means that you can just replace the 1248x348/352 everywhere by 1024x1024 and it should "just work".
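That includes the cluster layers' param_str. For reference, my reading of its seven values (a best guess from the clustering code, not verified) is:

    # param_str fields, in order (best guess):
    # 'image_size_x, image_size_y, stride,
    #  gridbox_cvg_threshold, gridbox_rect_thresh,
    #  gridbox_rect_eps, min_height'
    param_str : '1248, 352, 16, 0.6, 3, 0.02, 22'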

Something I found that helped accuracy when I modified image sizes was to use random cropping in the "train_transform" - modify the image_size_x and image_size_y parameters to, say, 512 and 512 and set crop_bboxes: false.

szm-R commented 8 years ago

@jbarker-nvidia Hi, I did what you said (set crop_bboxes: false) and it improved my mAP from 1.6 to 14 percent. Kindly take a look at my question #1011. Thank you.

fchouteau commented 8 years ago

@jbarker-nvidia Thank you for your input, much appreciated. I have one more question, however: I was also thinking about sampling random crops from the image (in my case 512x512), i.e. setting image_size_x: 512, image_size_y: 512, crop_bboxes: false in detectnet_groundtruth_param. However, in the deploy data and later layers, should I specify 1024x1024 or 512x512 as the image size? My guess would be to put 1024x1024 before the train/val transform and at the end when calculating mAP and clustering bboxes, but I just wanted to be sure.

Regards

jon-barker commented 8 years ago

@fchouteau Set image_size_x: 512 image_size_y: 512 crop_bboxes: false in name: "train_transform", i.e. the type: "DetectNetTransformation" layer applied at training time only. Everywhere else leave the image size as 1024x1024. That way cropping will only be applied at training time, and validation and test will use the full-size 1024x1024 images. This works fine because the heart of DetectNet is a fully-convolutional network, so it can be applied to varying image sizes.
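As a sketch, only the fields that change (everything else stays as in the standard network):

    layer {
      name: "train_transform"
      type: "DetectNetTransformation"
      detectnet_groundtruth_param: {
        image_size_x: 512   # random 512x512 crops, training only
        image_size_y: 512
        crop_bboxes: false
        # ... remaining fields unchanged ...
      }
      include: { phase: TRAIN }
    }
    # val_transform, deploy_data and the clustering/mAP layers keep 1024x1024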

JVR32 commented 8 years ago

Hello everyone,

I want to use DIGITS (DetectNet) + Caffe to detect objects in my own dataset. I read some posts about adapting some settings in DetectNet to use it for training and detection on a custom dataset. But apparently, most of the mentioned datasets consist of images with more or less the same dimensions for all images. My case is a bit different from the comments that I found …

I have 3 different object classes which I want to detect in images : classA, classB and classC.

For each object class, I have 3000 training images available (so 9000 in total) and 1500 validation images (4500 in total). Those images are ROIs (regions of interest from other images) that I manually cropped in the past, so the whole (training) image consists of one specific object. The smallest dimension of a training or validation image is always 256 (e.g. 256x256, 256x340, 256x402, 256x280, 340x256, …; note: not a perfect square, but never a long rectangle like 256x1024 or 256x800; always a more or less square shape). Since all images are cropped regions (around an object) from other images, the label files look like this:

    108 0.0 0 0.0 0.000000 0.000000 391.000000 255.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    108 0.0 0 0.0 0.000000 0.000000 255.000000 459.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    108 0.0 0 0.0 0.000000 0.000000 411.000000 255.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    etc.

-> image class = ‘108’ and bounding box of object in the image = image dimensions
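For reference, as far as I understand, the fields of a KITTI label line are the following (DetectNet only uses the class name and the four bounding-box values):

    type truncated occluded alpha bbox_left bbox_top bbox_right bbox_bottom
    dim_height dim_width dim_length loc_x loc_y loc_z rotation_y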

I want to train an object detection model so I can detect those 3 objects (if present) in unknown test images, images that were not cropped beforehand. Dimensions of those images can differ (e.g. 800x600, 1200x800, 1486x680, …; they can be about anything). Remark: in these unknown images (if the object appears in them at all), the whole image can consist of the object, or the object can be a smaller part of the image (not covering the whole image).

My first question : is it necessary to make all the training / validation images have the same dimensions (e.g. 256 x 256), or can I solve it by setting some parameters (pad image? resize image?) to a specific dimension while creating a dataset? It’s not clear to me what those parameters exactly imply.

Second question : how about the test images that can have about any dimension ; do I have to resize them before analyzing or not?

If I get it right, I have to make some changes :

A] While creating a dataset, in the DIGITS box, change :

B] In detectnet_network.prototxt (dim:384 and dim:1248), here.

In the following lines, image_size_x:1248 and image_size_y:384 and crop_bboxes true/false are mentioned : here and here.

And in the following line, dimensions (1248, 352) are also used : here, here and here.

At this moment, it is not clear to me how to set these options for my specific case …

With kind regards.

jon-barker commented 8 years ago

@JVR32 DetectNet is not designed to work with datasets of the kind that you describe. A dataset for DetectNet should be images where the object you wish to detect is some smaller part of the image and has a bounding box label that is a smaller part of the image. Some of these images could have objects that take up a large part of the image, but not all of them, as it is important for DetectNet to be able to learn what "non-object" pixels look like around a bounding box. That ability to learn a robust background model is why DetectNet can work well. Also note that you will need to modify the standard DetectNet to work for multi-class object detection.
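For the multi-class case, the usual starting point is the class-to-coverage-map mapping in detectnet_groundtruth_param (plus a matching num_output on the coverage classifier); a rough sketch, with illustrative class indices:

    detectnet_groundtruth_param: {
      # ...
      object_class: { src: 1 dst: 0 }  # label class 1 -> coverage map 0
      object_class: { src: 2 dst: 1 }  # label class 2 -> coverage map 1
    }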

If you have access to the original dataset that you cropped the objects from then you should create a training dataset from those images and use the crop locations as the bounding box annotations to use DetectNet.

If you only have the cropped images to train on then you should just train an image classification network, but make sure you train a Fully Convolutional Network (FCN). See the Caffe net surgery example: https://github.com/BVLC/caffe/blob/master/examples/net_surgery.ipynb. An FCN for image classification can then be applied to a test image of any size and the output will be a "heatmap" of where objects might be present in the image. Note that this approach will not be as accurate as DetectNet and will suffer from a higher false alarm rate unless you also add non-object/background training samples to your dataset.
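The core trick in that example is replacing the final InnerProduct layers with equivalent Convolution layers; a sketch for CaffeNet's fc6, where the kernel size matches the spatial extent of the layer's input:

    layer {
      name: "fc6-conv"
      type: "Convolution"
      bottom: "pool5"
      top: "fc6-conv"
      # pool5 is 6x6 for a 227x227 input, so a 6x6 kernel reproduces fc6
      convolution_param { num_output: 4096 kernel_size: 6 }
    }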

JVR32 commented 8 years ago

I guess I got stuck then. I already used the (cropped) training/validation images before (together with a set of negative images that didn't contain the specified objects) to do image classification. => 4 image classes : negative, classA, classB, classC

And that worked quite well, but ... it worked best when the whole image, or most of it, was taken up by the object. If the object was only a smaller part of the image, the image was often classified as 'negative' (= not an image of the object we look for).

That's why I hoped that using detection instead of classification would improve the results. Main purpose is to detect if a certain object is present in an image (taking the whole image or only a smaller part of the image). Multi-class is not important, it can be done in multiple checks (-> classA present or not? ; classB present or not? ; classC present or not?).

Unfortunately, I manually cropped all the images in the past, so I don't have the crop locations in the original images :-( .

JVR32 commented 8 years ago

Suppose I had done it differently and had 9000 training images and 4500 validation images with dimensions 640x640, where the wanted objects were smaller parts of those images:

e.g.
image1 = 640x640 with object ROI = (10, 10, 200, 250) = (top, left, bottom, right)
image2 = 640x640 with object ROI = (200, 10, 400, 370)
image3 = 640x640 with object ROI = (150, 150, 400, 400)
...

My test images could still have different dimensions : e.g. 800x600, 1200x800, 1486x680, …

Which settings should I provide while creating a dataset in the DIGITS box :

Are those totally independent of the possible dimensions of the test images (-> leave pad image empty and put 640x640 for resize image) or not?

And what about the dim, image_size_x and image_size_y parameters in detectnet_network.prototxt? Now 384/352 and 1248 are used, but what if the dimensions of the test images are different; what do I have to put for those parameters?

sherifshehata commented 8 years ago

@jbarker-nvidia I explored the code and it is not clear to me why you set "crop_bboxes: false".

As I understand it, the function pruneBboxes() (in detectnet_coverage_rectangle.cpp) adjusts the boxes according to the applied transformation. What happens when crop_bboxes is set to false?

JVR32 commented 8 years ago

Hello,

Could you please point me in the right direction before I spend a lot of time annotating images that cannot be used in later processing?

You told me that I cannot use cropped images, and I can see why ...

But I would like to use object detection in Digits, so I'm willing to start over, and annotate the images again (determine bounding box coordinates around object), but I want to be sure I do it the right way this time.

So, this is my setup :

Suppose I want to detect if a certain object (let's call it classA) is present in an unknown image.

I start with collecting a number of images, e.g. 1000 images that contain objects of classA.

All those images can have different dimensions : 480x480 ; 640x640 ; 800x600 ; 1024x1024 ; 3200x1800 ; 726x1080 ; 1280x2740 ; ...

First question : how do I start?

a] Keep the original dimensions, and get the bounding box coordinates for the object of classA in the image ?

b] Resize the images, so they all have comparable dimensions (e.g. resize so the smallest or largest dimension is 640), and after that get the bounding box coordinates for the object of classA in the resized image ?

c] None of the options above; all images must have exactly the same dimensions, so resize all images to the same dimensions, and after that get the bounding box coordinates.

Options a] and b] can be done without a problem; c] is not that flexible, so I'd rather avoid it if it's not necessary.

So, that's the first thing I need to know: how do I start, can I get bounding boxes for the original images, or do I have to resize the images before determining the bounding boxes?

And then the second question: if I follow option a], b] or c] ... I will have 1000 images with, for each image, the bounding boxes around the objects of classA.

After that I'm ready to create the database.

For parameter 'custom classes', I can use 'dontcare,classA'.

But how do I use the 'padding image' and 'resize image'?

I hope you can help me, because I really want to try to detect objects on my own data, but it's not clear to me how to get started ...

With kind regards,

Johan.



jon-barker commented 8 years ago

@JVR32 You can annotate bounding boxes on the images in their original size - this is probably desirable so that you can use them in that form in the future. DIGITS can resize the images and bounding box annotations during data ingest.

There's no definitive way to use 'padding image' and 'resize image', but to use DetectNet without modification you want to ensure that most of your objects are within the 50x50 to 400x400 pixel range. The benefit of padding is that you maintain aspect ratio and pixel resolution/object scaling. Having said that, if you have large variation in your input image sizes it is not desirable to pad too much around small images, so you may choose to resize all images to some size in the middle.
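A quick worked example of the trade-off:

    # object scale under the two options (numbers illustrative):
    # 3200x1800 image, 200px object, resized to 640x360 -> object becomes 200 * (640/3200) = 40px (below the ~50px floor)
    # same image padded instead of resized              -> object stays 200px (in range)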

JVR32 commented 8 years ago

Thank you very much for the information.

In that case, I think it is best that I resize all images to have more or less the same dimensions before starting to process them.

=> I will resize all images so the smallest dimension is 640.

Then, input images will have dimensions 640x640 ; 640x800 ; 490x640 ; 380x640 ; 640x500; ... and then normally, the object sizes will be in the range 50x50 to 400x400.

=> After resizing, I can start annotating, and determine the bounding boxes in the resized images.

Note : I think the bounding boxes don't have to be square?!

And when I'm done annotating, I will have a set of images with the smallest dimension 640 and bounding boxes in those images.

Maintaining the aspect ratio is important, so since I will have resized the images before annotating them, I suppose it is better to use padding (instead of resize) while creating the dataset?

I'll have to use padding if I'm correct, because all the input images must have the same dimensions, right? So is it correct to leave the 'resize' parameters empty in that case, and set the padding so that all images (with dimensions 640x640 ; 640x800 ; 490x640 ; 380x640 ; 640x500; ...) will fit in it -> e.g. 800 x 800?

Or do I have to set a bigger padding (e.g. 1024 x 1024) and set resize to 800x800?

I guess I have to use at least one of the two parameters (padding or resizing), and that I cannot just input the images with various dimensions 640x640 ; 640x800 ; 490x640 ; 380x640 ; 640x500; ... without setting one of them?



jon-barker commented 8 years ago

@JVR32

I think the bounding boxes don't have to be square?!

Correct

Or do I have to set a bigger padding (e.g. 1024 x 1024) and set resize to 800x800?

You don't need to do this, you can just pad up to the standard size. If all of your images are already 640 in one dimension then I would just pad the other dimension to the size of your largest image in that dimension. That minimizes further unnecessary manipulation of the data.

fdesmedt commented 8 years ago

I am trying to train a model for pedestrians (using the TownCentre annotations) based on the KITTI example for cars. First I tried using the original resolution (1920x1080), but changing the network parameters according to the comments above (replacing 1248x348/352 with the new resolution) led to the error "bottom[i]->shape == bottom[0]->shape", which I was not able to solve.

To avoid having to change the network parameters, I just rescaled all training images (and annotations accordingly) to the same resolution as KITTI, but the accuracy remains very low (even after 350 epochs). When I tried the advice of using cropping from the images, I ran into the same error message about the shape.

Is there some other example available for object detection with different resolution input that reaches acceptable results?

sherifshehata commented 8 years ago

Which layer gives this error? My guess is that your resolution is not divisible by 16, so you should replace 1248x348/352 with 1920x1080/1088.

fdesmedt commented 8 years ago

The problem is always on "bboxes-masked"

I will try your suggestion. The original size in the network is actually 1248x384 (I copied the values from above, which turned out to be incorrect). The 384 value is, however, divisible by 16, so what is the reason for using 352 there?

Another question: is it important to scramble the data? The training data I have are the images of a long sequence, so consecutive frames contain a lot of the same pedestrians. Is this data scrambled before training? Or should I do this myself?

fdesmedt commented 8 years ago

I have tried your suggestion, but still get an error on the shape-issue. I attach the resulting log-file: caffe_output.txt

sherifshehata commented 8 years ago

Did you make any other changes? Your bboxes shape is 3 4 67 120, while I think it should be 3 4 68 120.

fdesmedt commented 8 years ago

I did not change anything else, just replaced all instances of 1248 by 1920, 384 by 1080 and 352 by 1088. Does the last one make sense?

It seems indeed that the 67 is the problem. I think it comes from the 1080 size, which is pooled 4 times (leading to dimensions 540, 270, 135 and 67, of which the last one is truncated). I am now recreating the dataset with padding to 1088 to avoid the truncation. Hope this helps ;)

JVR32 commented 8 years ago

Hello,

I trained a detection network as follows :

All training images (containing the objects I want to detect) can have different dimensions : 480x480 ; 640x640 ; 800x600 ; 1024x1024 ; 3200x1800 ; 726x1080 ; 1280x2740 ; 5125x3480 ; ... Before annotating (determining the bounding boxes around the objects in the images -> needed for KITTI format), I resized all those images so the largest dimension is 640. Then, input images will have dimensions 640x640 ; 640x400 ; 490x640 ; 380x640 ; 640x500; ... and then normally, the object sizes will be in the range 50x50 to 400x400. After resizing, I can start annotating, and determine the bounding boxes in the resized images. And when I'm done annotating, I have a set of images with the largest dimension 640 and bounding boxes around the objects of interest in those images.

I use those resized images and the bounding boxes around the objects for building a dataset, using these settings: (screenshot) So I padded the images to 640 x 640.

In 'detectnet_network.prototxt', I replaced 384/352 and 1248 by 640.

After training, I want to test the network.

The images I want to test can also have different dimensions. I can resize those images so the largest dimension is 640, but I don't know if that is necessary? And since the images can have different dimensions, it seems logical to me to set the Do not resize input image(s) flag to TRUE? (screenshot)

I created a text file with the paths to the images I would like to test. If I use this file for 'test many', it will generate some results if the Do not resize input image(s) is not set. If I set this flag to TRUE, it generates an error :

    Couldn't import dot_parser, loading of dot files will not be possible.
    2016-09-30 10:13:21 [ERROR] ValueError: could not broadcast input array from shape (3,480,640) into shape (3,640,640)
    Traceback (most recent call last):
      File "C:\Programs\DIGITS-master\tools\inference.py", line 293, in args['resize']
      File "C:\Programs\DIGITS-master\tools\inference.py", line 167, in infer resize=resize)
      File "C:\Programs\DIGITS-master\digits\model\tasks\caffe_train.py", line 1394, in infer_many resize=resize)
      File "C:\Programs\DIGITS-master\digits\model\tasks\caffe_train.py", line 1434, in infer_many_images 'data', image)
    ValueError: could not broadcast input array from shape (3,480,640) into shape (3,640,640)

What I don't understand : if I put only 1 file in the images list (and press test many), there is no error. If I put multiple files in the list, I get the error. But only if the 'do not resize' flag is checked ; if not checked -> no error?

Is this a bug, or is there a logical explanation? Anyhow, I guess it must be possible to process a list of (test)images without resizing them before object detection? If it works for a single image in the list, it should also be possible for multiple images?

gheinrich commented 8 years ago

Hi @JVR32, thanks for the detailed report! This is a bug indeed, sorry about that. I agree the error is certainly not explicit! We have a GitHub issue for this: #1092. In short, the explanation is: when you test a batch of images, they must all have the same size, otherwise you can't fit them all into a tensor.

JVR32 commented 8 years ago

Hello,

I trained a network for object detection -> setup was described in my previous post (2 posts above this). As you can see, I resized (before annotating) all the training and validation images so the largest dimension is 640 pixels, but the other dimension isn't always the same, it can vary -> 640x480;640x402;640x380;312x640...

While building the dataset, I set the option to pad the images to 640 x 640 (I didn't touch the 'resize' option).

2 questions :

A] Since the test images can also have different dimensions, I thought I should check the flag 'do not resize input images', especially since I didn't use 'resize' while creating the dataset. But somehow the detection seems better if the 'do not resize input images' flag is unchecked, although the aspect ratio changes (a test image of 640 x 480 becomes an image of 640 x 640). Is this logical?

B] For the training, I used the following settings :

(screenshot of the training settings)

Plotting the precision for different numbers of (training) images: (plot)

As you can see, increasing the number of images improves the precision, but at a certain point, increasing the number of images produces a worse precision (red curve). My question: does anyone have a suggestion on what to try first to improve the model -> other solver type, other learning rate, other parameter value ... what should I try first?

varunvv commented 7 years ago

This is my first experiment with DetectNet. I built a dataset as specified in https://github.com/NVIDIA/DIGITS/blob/master/examples/object-detection/README.md. The resulting dataset properties are given below:

DB backend: lmdb
Create train_db DB
    Entry Count: 645
    Feature shape (3, 800, 1360)
    Label shape (1, 7, 16)
Create val_db DB
    Entry Count: 96
    Feature shape (3, 800, 1360)
    Label shape (1, 5, 16)

I tried to train the model with the above dataset. The configurations are done as specified in the above link.

(screenshot of the model configuration)

The initial part of the .prototxt file is given below:

name: "DetectNet" layer { name: "train_data" type: "Data" top: "data" data_param { batch_size: 4 } include: { phase: TRAIN } } layer { name: "train_label" type: "Data" top: "label" data_param { batch_size:4 } include: { phase: TRAIN } } layer { name: "val_data" type: "Data" top: "data" data_param { batch_size: 4 } include: { phase: TEST stage: "val" } } layer { name: "val_label" type: "Data" top: "label" data_param { batch_size:4 } include: { phase: TEST stage: "val" } } layer { name: "deploy_data" type: "Input" top: "data" input_param { shape { dim: 1 dim: 3 dim: 800 dim: 1360 } } include: { phase: TEST not_stage: "val" } }

layer { name: "train_transform" type: "DetectNetTransformation" bottom: "data" bottom: "label" top: "transformed_data" top: "transformed_label" detectnet_groundtruth_param: { stride: 10 scale_cvg: 0.4 gridbox_type: GRIDBOX_MIN coverage_type: RECTANGULAR min_cvg_len: 20 obj_norm: true image_size_x: 1360 image_size_y: 800 crop_bboxes: true object_class: { src: 1 dst: 0} # obj class 1 -> cvg index 0 } detectnet_augmentation_param: { crop_prob: 1 shift_x: 32 shift_y: 32 flip_prob: 0.5 rotation_prob: 0 max_rotate_degree: 5 scale_prob: 0.4 scale_min: 0.8 scale_max: 1.2 hue_rotation_prob: 0.8 hue_rotation: 30 desaturation_prob: 0.8 desaturation_max: 0.8 } transform_param: { mean_value: 127 } include: { phase: TRAIN } } layer { name: "val_transform" type: "DetectNetTransformation" bottom: "data" bottom: "label" top: "transformed_data" top: "transformed_label" detectnet_groundtruth_param: { stride: 10 scale_cvg: 0.4 gridbox_type: GRIDBOX_MIN coverage_type: RECTANGULAR min_cvg_len: 20 obj_norm: true image_size_x: 1360 image_size_y: 800 crop_bboxes: false object_class: { src: 1 dst: 0} # obj class 1 -> cvg index 0 } transform_param: { mean_value: 127 } include: { phase: TEST stage: "val" } } layer { name: "deploy_transform" type: "Power" bottom: "data" top: "transformed_data" power_param { shift: -127 } include: { phase: TEST not_stage: "val" } }

layer { name: "slice-label" type: "Slice" bottom: "transformed_label" top: "foreground-label" top: "bbox-label" top: "size-label" top: "obj-label" top: "coverage-label" slice_param { slice_dim: 1 slice_point: 1 slice_point: 5 slice_point: 7 slice_point: 8 } include { phase: TRAIN } include { phase: TEST stage: "val" } } layer { name: "coverage-block" type: "Concat" bottom: "foreground-label" bottom: "foreground-label" bottom: "foreground-label" bottom: "foreground-label" top: "coverage-block" concat_param { concat_dim: 1 } include { phase: TRAIN } include { phase: TEST stage: "val" } } layer { name: "size-block" type: "Concat" bottom: "size-label" bottom: "size-label" top: "size-block" concat_param { concat_dim: 1 } include { phase: TRAIN } include { phase: TEST stage: "val" } } layer { name: "obj-block" type: "Concat" bottom: "obj-label" bottom: "obj-label" bottom: "obj-label" bottom: "obj-label" top: "obj-block" concat_param { concat_dim: 1 } include { phase: TRAIN } include { phase: TEST stage: "val" } } layer { name: "bb-label-norm" type: "Eltwise" bottom: "bbox-label" bottom: "size-block" top: "bbox-label-norm" eltwise_param { operation: PROD } include { phase: TRAIN } include { phase: TEST stage: "val" } } layer { name: "bb-obj-norm" type: "Eltwise" bottom: "bbox-label-norm" bottom: "obj-block" top: "bbox-obj-label-norm" eltwiseparam { operation: PROD } include { phase: TRAIN } include { phase: TEST stage: "val" } }

While training I am getting an error:

    ERROR: Check failed: bottom[i]->shape() == bottom[0]->shape()

Details:

    Creating layer coverage/sig
    Creating Layer coverage/sig
    coverage/sig <- cvg/classifier
    coverage/sig -> coverage
    Setting up coverage/sig
    Top shape: 3 1 50 85 (12750)
    Memory required for data: 4344944304
    Creating layer bbox/regressor
    Creating Layer bbox/regressor
    bbox/regressor <- pool5/drop_s1_pool5/drop_s1_0_split_1
    bbox/regressor -> bboxes
    Setting up bbox/regressor
    Top shape: 3 4 50 85 (51000)
    Memory required for data: 4345148304
    Creating layer bbox_mask
    Creating Layer bbox_mask
    bbox_mask <- bboxes
    bbox_mask <- coverage-block
    bbox_mask -> bboxes-masked
    Check failed: bottom[i]->shape() == bottom[0]->shape()

I changed the batch size, stride size, etc., but nothing helped. What should I do?

Regards,

varunvv commented 7 years ago

Apologies, I made a mistake. The name used in the label (.txt) files and in 'Custom classes' were different.

regards,

ShervinAr commented 7 years ago

@lukeyeager Hello, is there a complete explanation of exactly which parameters one needs to adjust to train DetectNet on custom-sized data where the stride parameter also requires modification? So far I have been able to change the image sizes in the prototxt file, and the training started without any errors, but changing the stride parameter looks quite challenging; changing it results in various errors ...

aprentis commented 7 years ago

Hi, everybody!

I've cloned this repo: https://github.com/skyzhao3q/NvidiaDigitsObjDetect

and did everything as mentioned in the Readme (made a dataset for object detection, ran the network). But all I got was this:

(screenshot)

What's the problem with one class? The full KITTI dataset trains OK.

jon-barker commented 7 years ago

@aprentis Can you hover over the graph so that we can see the actual numeric results for the metrics? It matters greatly whether those numbers are just small or exactly zero.

Looking at the repo you cloned, I noticed that the model has explicit "dontcare" regions marked. Whilst this can be useful (e.g. for masking out the sky when you only care about the road), it is not necessary. I'm not sure what regions are being marked as "dontcare" for this data, but if it includes the sidewalks where the pedestrians are, then you're going to have problems.

aprentis commented 7 years ago

@jbarker-nvidia Those numbers are exactly zero. Right now I'm training another model (with one class in it); unfortunately it has the same problem.

In this repo I've found a result screenshot which says that mAP is OK after 10 epochs. Does anybody know any success story about training DetectNet with only one class?

aprentis commented 7 years ago

@jbarker-nvidia I've tried the network which @gheinrich published (for two classes), but mAP was still zero.

jon-barker commented 7 years ago

@aprentis The fluctuations in those graphs suggest that not all of those numbers are exactly zero. Can you specify which ones are zero and which ones are not?

The basic Kitti example is for one class - just cars. There are lots of examples of successfully training DetectNet on other one class problems too.

aprentis commented 7 years ago

@jbarker-nvidia

Here are the zoomed graphs: (screenshot)

jon-barker commented 7 years ago

@aprentis Thanks - which optimizer, learning rate and learning rate decay policy are you using? I think you may want to try a smaller learning rate and/or more aggressive learning rate decay - I like to use Adam and an exponential learning rate decay.

aprentis commented 7 years ago

@jbarker-nvidia I've run the network step by step. What's wrong?

Here I used Adam with exponential decay with a gamma of 0.95.

Does it depend on my CPU\GPU configuration?

(screenshots)

jon-barker commented 7 years ago

@aprentis "Does it depend on my CPU\GPU configuration?" - No, that shouldn't matter unless you are using multiple GPUs in which case you may need to adjust your learning rate to accomodate the larger effective batch size.

From the information you've posted it appears that you have a correctly configured dataset and model definition. You may want to try a more aggressive learning rate decay schedule, say exponential with 0.99 decay.
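In Caffe solver terms that corresponds to something like this (values illustrative):

    type: "Adam"
    base_lr: 0.0001
    lr_policy: "exp"
    gamma: 0.99   # lr = base_lr * gamma^iter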

aprentis commented 7 years ago

Now with 0.99. I don't understand what's wrong with it. =(

(screenshot)

AleiLmy commented 7 years ago

@jbarker-nvidia Hi, I saw your article https://devblogs.nvidia.com/parallelforall/exploring-spacenet-dataset-using-digits/ and I have some questions to ask you about it. When you trained the net, what did your data format look like? How did you convert the SpaceNet format into a data format that DetectNet can use? Thank you!

jon-barker commented 7 years ago

@AleiLmy For the object detection/DetectNet approach the data follows the standard Kitti format for bounding boxes. We used Python scripts to convert the geoJSON files to Kitti format text files. Obviously the building footprints are not all rectangular and don't all have sides parallel to the input image, so we used the minimum enclosing rectangle with those properties.

For the segmentation approach we again used Python scripts to convert the geoJSON files to .PNG files for the segmentation masks.

ontheway16 commented 7 years ago

@jbarker-nvidia Hello, for my project I am trying to detect small objects (30-60 pixels). DetectNet is making detections very nicely: mAP is about 65, test accuracy is over 90%, no problems there. The only thing I cannot figure out how to solve is the detection of nearby objects. If there are two objects 5-10 pixels apart, DetectNet fails to distinguish them and gives them a single bounding box. And I guess detecting overlapping objects individually is totally impossible.

I changed the stride to 8, but it didn't help much with this problem. Might visualizations help me detect where the problem actually starts across the network? Can you advise some modification points in the network for this purpose?

AleiLmy commented 7 years ago

@jbarker-nvidia My labels look like this:

    building 0.0 0 0.0 325 68 358 104 0 0 0 0 0 0 0   (label for an image that contains buildings)
    dontcare 0.0 0 0.0 0 0 50 50 0 0 0 0 0 0 0        (label for an image that doesn't contain buildings)

I resized the images to 1280x1280; the mAP is zero. (screenshot)

And I tried to follow your steps, but got a really bad output like this:

    bbox-list [[ 0.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  0.]
     ...
     [ 0.  0.  0.  0.  0.]]   (all 50 rows are zero)

Where did I go wrong?

bfreskura commented 7 years ago

@ontheway16 Can you please explain what you changed to make it work with stride=8?

ontheway16 commented 7 years ago

@Barty777 Please check the following discussion:

https://groups.google.com/forum/m/#!topic/digits-users/zx_UYu3jlt8

sam-pochyly commented 7 years ago

@varunvv Can you clarify what you fixed? I'm getting the same error as you had.

juiceboxjoe commented 6 years ago

Thank you all for your ongoing support. I am currently training DetectNet on aerial images. Recall is at about 47% so far and precision at about 23% (epoch 800). I'm using a base learning rate of 1e-4 with exponential decay and gamma 0.99 on one GPU. After 350 epochs I stopped training to restart using two GPUs (I set the base learning rate to the last learning rate used on the previous run to continue training). So far mAP, precision, and recall are still increasing (slowly).

I have a question about one of @jbarker-nvidia 's previous comments in this thread about using multiple GPUs:

@jbarker-nvidia "@aprentis "Does it depend on my CPU\GPU configuration?" - No, that shouldn't matter unless you are using multiple GPUs in which case you may need to adjust your learning rate to accomodate the larger effective batch size."

For the sake of clarity:

I know that when using (for example) 1 GPU with batch size of 5 and batch accumulation of 2 I get an effective batch size of 10 and I should adjust my learning rate accordingly, but do you mean that using additional GPUs would also change the effective batch size? For example if I'm using 2 GPUs with those same parameters (batch size 5 and accumulation 2) the effective batch size is actually going to be 20 instead of 10?

I would naturally think that backprop would take place every time 2 batches of 5 images have been processed regardless of which GPU was used to process each batch, and not every time all the GPUs in use have processed 2 batches of 5 images each. Meaning that my learning rate adjustments should only be based on batch size + batch accumulation, whilst totally disregarding the amount of GPUs in use.

I'm very confused as to what you mean by adjusting the learning rate to accommodate effective batch size when using multiple GPUs.

So my question is: why is the learning rate affected by the number of GPUs in use?

Thank you in advance for your help.

jon-barker commented 6 years ago

@juiceboxjoe You are correct that backprop would take place on each GPU independently after 2 batches of 5 images have been processed. But the learning rate is not used in backprop; it is used in the optimizer. The optimizer aggregates the gradients from all of the GPUs, performs a gradient descent update and then broadcasts the new parameters back to the GPUs. So that aggregation step across the GPUs effectively increases the batch size to 20 (in the case of 2 GPUs).
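In your example:

    effective_batch_size = num_gpus * batch_size * batch_accumulation
                         = 2 * 5 * 2
                         = 20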

juiceboxjoe commented 6 years ago

@jbarker-nvidia Thank you so much for your quick and enlightening response! Sorry for my confusion. I also found a nice, quick reference here that further clarifies your point about the difference between backprop and optimizers.

This means that I was wrong to think that my effective batch size did not change when I paused training on one GPU and then continued on two GPUs.

Can you also point me in the right direction regarding learning rate adjustments according to effective batch size when training DetectNet (found through experimentation - like DetectNet's ideal 40-500px detection range)?

Thanks in advance.

jon-barker commented 6 years ago

@juiceboxjoe There's not really any solid rule for how much to adjust the learning rate by as a function of batch size. But if you do need to change it you will need to decrease the learning rate as the batch size grows.

rsandler00 commented 6 years ago

This seems relevant to post here:

When I naively changed the sizes in the DetectNet prototxt file to my current image size, I got the error: "Check failed: bottom[i]->shape() == bottom[0]->shape()"

This is because the image dimensions have to be an integer multiple of the stride. So with stride 16, I resized my 1920x1080 images to 1920x1088 and the error was resolved.
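The arithmetic, for reference:

    # with stride 16, both dimensions must be multiples of 16:
    # 1920 / 16 = 120    (ok)
    # 1080 / 16 = 67.5   (not ok) -> pad or resize to 1088 = 68 * 16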

cesare-montresor commented 6 years ago

@jon-e-barker Same issues here, but I'm using a small custom dataset (500 images coming from the same video); could this be the cause of the mAP = 0 (as well as every other metric)? I've tried adjusting sizes in the model, padding, resizing, etc. The 12 GB KITTI dataset trains with no issues.