AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Multi modal model #6359

Closed readicculus closed 3 years ago

readicculus commented 4 years ago

Hi @AlexeyAB and anyone else who might be able to offer some insight. I have a dataset of color and thermal images from animal surveys. The thermal images show blobs when animals are present, and using hardware alignment we've come quite close to outputting IR-color pairs that are aligned, meaning the blob in the thermal image is in the same location as the animal in the color image. Because of some issues in the data acquisition, many alignments are off by 40-100 px (full images are 4k x 6k, but I chip them into 832x832). I've been trying to improve the alignment, but I'm starting to think I should develop a model that is more robust to these small misalignments, since it seems quite rare to be off by a full 100 px, and the blob usually ends up touching or right next to the animal in the color image.

I tried training a 4-channel YOLOv4, but without pretrained weights, and stopped it when it was stuck around 30% mAP for 3k iterations (batch=64). For reference, on the color images alone I can get a mAP of around 0.8 with YOLOv3 using pre-trained weights, so I should probably try transfer learning with YOLOv4, but I also think the alignments are a big part of the issue.
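
For reference, the 4-channel setup amounts to changing the input depth in the cfg. A minimal sketch of the relevant [net] fields (whether the data loader fully supports 4 channels for a given image format is a separate question):

[net]
batch=64
subdivisions=16
width=832
height=832
channels=4   # RGB + IR stacked into a single 4-channel input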

My very rough idea, which I would like to hear your thoughts on, is to use a version of either YOLOv3 or YOLOv4-tiny with 3 yolo layers and 6 anchors.

I will use the same convolution/batch-norm structure that is currently in YOLO and feed this into the first yolo layer.

The first yolo layer will be fed only the single-channel IR image; after that, upsampling occurs, a residual connection is kept, and the 3-channel color image is introduced. This then continues down through 2 more yolo layers. The first yolo layer (IR only) will use anchors 3,4,5, and the 2nd (concatenated upsample, residual, and color image) will also use anchors 3,4,5. The final layer will use the smaller anchors 0,1,2.

My hope here is to make a model that can still benefit from the infrared image, since it provides very useful features, but at the same time not depend on perfect alignment. I think I would also have to lower the IoU threshold for objectness on the first yolo layer to something very low; a rough cfg sketch of what I mean is below.
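
Roughly, the three heads would look something like this (a minimal sketch only: the layer indices and the IR/color routing of the backbone are omitted, classes=2 matches my two classes, and the lowered ignore_thresh value is just a guess at "very low"):

[yolo]              # head 1: fed IR-only features
mask = 3,4,5
anchors = ...       # the 6 anchors from calc_anchors, sorted smallest to largest
classes = 2
num = 6
ignore_thresh = .3  # lowered from the usual .7 so slightly misaligned blobs still count

[yolo]              # head 2: concatenated upsample + residual + color features
mask = 3,4,5        # same anchors/classes/num as head 1
ignore_thresh = .7

[yolo]              # head 3: the smallest anchors
mask = 0,1,2
ignore_thresh = .7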

Anyway, hopefully this makes sense... I'm struggling to think of other methods that would let me train on the aligned images when many of the thermal blobs are only partially inside the box, or sometimes just next to it. I understand that I could align these with some pre-processing to detect blobs, but the goal is to run this live in the field, and it would simplify an already very complex pipeline if I could train a model to be invariant to this.

readicculus commented 4 years ago

Anchor sizes may be useful; inputs are 832x832.

./darknet detector calc_anchors /data/s3_cache/yboss/datasets/1/yolo.data  -num_of_clusters 6 -final_width 26 -final_heigh 26 -width 832 -height 832 -show
 CUDA-version: 10020 (10020), cuDNN: 7.4.1, GPU count: 2  
 OpenCV version: 3.4.2

 num_of_clusters = 6, width = 832, height = 832 
 read labels from 8356 images 
 loaded      image: 8356     box: 16526
 all loaded. 

 calculating k-means++ ...

 iterations = 60 

counters_per_class = 13904, 2622

 avg IoU = 79.16 % 

Saving anchors to the file: anchors.txt 
anchors =  29, 40,  55, 38,  37, 62,  76, 44,  55, 66,  91, 95
./darknet detector calc_anchors /data/s3_cache/yboss/datasets/1/yolo.data  -num_of_clusters 9 -final_width 26 -final_heigh 26 -width 832 -height 832 -show
 CUDA-version: 10020 (10020), cuDNN: 7.4.1, GPU count: 2  
 OpenCV version: 3.4.2

 num_of_clusters = 9, width = 832, height = 832 
 read labels from 8356 images 
 loaded      image: 8356     box: 16526
 all loaded. 

 calculating k-means++ ...

 iterations = 40 

counters_per_class = 13904, 2622

 avg IoU = 82.51 % 

Saving anchors to the file: anchors.txt 
anchors =  26, 32,  32, 54,  55, 35,  47, 49,  37, 68,  76, 40,  52, 72,  69, 56,  90, 98
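
For reference, these anchors then get pasted into every [yolo] section of the cfg, and mask selects which subset each head predicts (a sketch assuming a standard 3-head template; note the head-to-mask order differs between the yolov3 and yolov4 template cfgs):

anchors = 26, 32,  32, 54,  55, 35,  47, 49,  37, 68,  76, 40,  52, 72,  69, 56,  90, 98
num = 9
# one head gets mask = 0,1,2 (smallest boxes), one gets mask = 3,4,5, one gets mask = 6,7,8
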
velastin commented 3 years ago

Hello. I am curious how you managed to train a model with 4 channels. I have PNG files with RGBA (so 4 channels), but I just get silly results when trying to train with them on YOLOv3. Do you know if YOLOv4 fully supports 4 channels, and if so, how? Thanks

stephanecharette commented 3 years ago

I have PNG files with RGBA (so 4 channels), but I just get silly results when trying to train with them on YOLOv3. Do you know if YOLOv4 fully supports 4 channels, and if so, how?

No, YOLOv[1234] does not support 4-channel images. If you have RGBA images, you'll need to flatten them (e.g., mogrify) prior to using them for training or detection.
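
For example, one way to do it with ImageMagick (the background color to composite against is your choice):

mogrify -background black -alpha remove -alpha off *.png   # composite the alpha onto black, then drop the channel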

stephanecharette commented 3 years ago

I have a dataset of color and thermal images from animal surveys. The thermal images show blobs when animals are present, and using hardware alignment we've come quite close to outputting IR-color pairs that are aligned, meaning the blob in the thermal image is in the same location as the animal in the color image.

Sounds like this would be great when pre-processing the images to write out the bounding boxes for use in training. But if the blobs are less than perfect, then expect the trained results to be only as good as the training material.

Because of some issues in the data acquisition, many alignments are off by 40-100 px (full images are 4k x 6k, but I chip them into 832x832).

A 100 px offset in an 832x832 image is HUGE. If the animal is 50x50 but the IR shows it 100 px to the side, the two boxes don't overlap at all (IoU = 0), so training won't work. You'll be training on tree trunks instead of the deer.

Wouldn't it be easier to use the IR to create preliminary bboxes, then go through and manually fix up what needs to be fixed, and train with the normal 3-channel images? Or are they all off by 40-100 px, so 100% of them would need to be fixed up? Because if so, at that point I'd say you have 2 unrelated data sets.

Either way, I'd be curious to see more of the images to better understand. Can you attach some examples to this ticket, or drop into the discord to discuss further: https://discord.gg/zSq8rtW