AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Detect on many varied-size people #2155

Open chungkaihsieh opened 5 years ago

chungkaihsieh commented 5 years ago

@AlexeyAB Hi,

Thanks for your help in advance. I am trying to detect people of varied sizes (people only) with yolov3-tiny.cfg (608x608). The number of people per image ranges from roughly 1 to 100. The images include selfies (large people), crowds (small people), and selfies taken in front of crowds (both large and small people). I have followed the instructions and also recalculated the anchors, and I found that the model performs well on small people. However, large people are not detected.

I have tested several pre-trained models on my data and concluded that yolov3-tiny is probably what I want:

  1. yolov3.cfg -> detection performance is good enough for me, but it is too slow for my use case.
  2. yolov2-tiny-voc.cfg -> performs well on large people, but poorly on small people, and the false-positive rate is too high even after I added negative samples.
  3. yolov3-tiny.cfg -> performs poorly on large people compared to small people.

Could you give me some advice on detecting people of varied sizes?

Thanks for your time and consideration. CK Hsieh

Deadmin1 commented 5 years ago

Did you try yolov3-tiny_3l? It's a new config that was recently added; it has one more YOLO layer. It should perform better, but be a little slower.

AlexeyAB commented 5 years ago

@chungkaihsieh Hi,

Try to use: https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-tiny_3l.cfg with recalculated anchors:

 ./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 608 -height 608
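For anyone wondering what calc_anchors does: it is essentially k-means clustering over the (width, height) pairs of all ground-truth boxes, scaled to the network input size. Below is a minimal sketch of that idea in Python; this is an illustration under simplifying assumptions (Euclidean distance, random initialization), not darknet's exact implementation, which clusters with an IoU-based distance.

```python
import random

def kmeans_anchors(boxes, k=9, width=608, height=608, iters=100):
    """Cluster normalized (w, h) pairs from YOLO label files into k anchors,
    roughly like darknet's calc_anchors (which uses 1 - IoU as the distance)."""
    scaled = [(w * width, h * height) for w, h in boxes]
    centers = random.sample(scaled, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in scaled:
            # assign each box to the nearest center (squared Euclidean distance)
            i = min(range(k),
                    key=lambda j: (w - centers[j][0]) ** 2 + (h - centers[j][1]) ** 2)
            clusters[i].append((w, h))
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append((sum(w for w, _ in cluster) / len(cluster),
                                    sum(h for _, h in cluster) / len(cluster)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's old center
        centers = new_centers
    return sorted(centers)
```

The resulting k pairs, rounded to integers, are what you paste into the `anchors=` line of each [yolo] section.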

chungkaihsieh commented 5 years ago

Thank you, @AlexeyAB @Deadmin1!

I tried the new yolov3-tiny_3l.cfg yesterday. The good news is that some large bounding boxes now appear, compared to the 2-layer version. However, the results are still not good enough, even when I test on training images. E.g., in a selfie of 5 people (upper bodies only), just 2 out of 5 were detected.

Below are my settings and status:

Pros & Cons of the current weights:

  1. Performs well on small people, or on people whose whole body is visible in the image.
  2. Performs poorly on large people with only the upper body visible.

I would like to improve detection of people with only the upper body visible in the image. Could you kindly give me some tips? Thanks a lot for your help : )

AlexeyAB commented 5 years ago

@chungkaihsieh

  1. Check that your Training dataset contains enough images of people with only the upper body visible.

  2. Try to use yolov3-tiny.cfg and change these lines: https://github.com/AlexeyAB/darknet/blob/fd0df9297c86a272f0bf0841291bc4565e90a7cd/cfg/yolov3-tiny.cfg#L107-L121

to these lines, and train from the beginning:

 [convolutional] 
 batch_normalize=1 
 filters=512
 size=1 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=1024
 size=3 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=512
 size=1 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=1024
 size=3 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=512
 size=1 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=1024
 size=3 
 stride=1 
 pad=1 
 activation=leaky 

  3. Or try to use yolov3-tiny_3l.cfg and change these lines: https://github.com/AlexeyAB/darknet/blob/fd0df9297c86a272f0bf0841291bc4565e90a7cd/cfg/yolov3-tiny_3l.cfg#L108-L122

to these lines, and train from the beginning:

 [convolutional] 
 batch_normalize=1 
 filters=512
 size=1 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=1024
 size=3 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=512
 size=1 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=1024
 size=3 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=512
 size=1 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=1024
 size=3 
 stride=1 
 pad=1 
 activation=leaky 

chungkaihsieh commented 5 years ago

Hi @AlexeyAB, thanks again for your help. I will check the dataset, try these configurations, and update you with the results for further discussion. :)

chungkaihsieh commented 5 years ago

@AlexeyAB Hi,

Sorry for the late update, and thanks again for your kind help. I would like to show you some results and ask a few questions. The good news is that after following your suggestions, training from scratch with more conv layers can detect people of varied sizes. 💯 But there are some drawbacks compared to the pre-trained weights (2-layer feature map).

Goals

  1. Detect large people with very few misses.
  2. Detect as many small people as possible.
  3. Ideally, run in real time without a GPU.

Results

  1. YOLOv3-tiny, 2 YOLO layers (with pre-trained weights): a. FLOPS 5.56 Bn; b. AP 44.09%; c. can only detect small people.
  2. YOLOv3-tiny, 3 YOLO layers (with pre-trained weights): a. FLOPS 15.165 Bn; b. AP 52.54%; c. can only detect small people.
  3. YOLOv3-tiny, 2 YOLO layers (without pre-trained weights, with added convolutional filters): a. FLOPS 10.291 Bn; b. AP 42.21% with high FP (50,000 steps); c. can detect people of varied sizes, but some large people are still missed, and chairs are detected as people.
  4. YOLOv3-tiny, 3 YOLO layers (without pre-trained weights, with added convolutional filters): a. FLOPS 25.510 Bn; b. AP 52.91% with lower FP (24,000 steps); c. recall is very low; misses many large people.

Questions

  1. Are there any methods to fine-tune pre-trained models to reach my goals?
  2. How can I reduce the FLOPS of result 3 while keeping its performance?
  3. What ratio of negative samples is reasonable for fine-tuning and for training from scratch, respectively? Does it matter? I found that if the ratio of negative samples is too large, recall decreases dramatically.
  4. In my case, how many steps would you recommend for fine-tuning and for training from scratch?
  5. Any other suggestions would be appreciated.

Thanks a lot for your time and kindness. CK Hsieh

AlexeyAB commented 5 years ago

@chungkaihsieh Hi,

As I understand it, this is the best cfg-file for you, but you want to reduce BFLOPs:

  3. YOLOv3-tiny, 2 YOLO layers (without pre-trained weights, with added convolutional filters): a. FLOPS 10.291 Bn; b. AP 42.21% with high FP (50,000 steps); c. can detect people of varied sizes, but some large people are still missed, and chairs are detected as people.

So use the same number of convolutional layers, but with 2x fewer filters.

That is, use yolov3-tiny.cfg and change these lines: https://github.com/AlexeyAB/darknet/blob/fd0df9297c86a272f0bf0841291bc4565e90a7cd/cfg/yolov3-tiny.cfg#L107-L121

to these lines, and train from the beginning:

 [convolutional] 
 batch_normalize=1 
 filters=256
 size=1 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=512
 size=3 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=256
 size=1 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=512
 size=3 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=256
 size=1 
 stride=1 
 pad=1 
 activation=leaky 

 [convolutional] 
 batch_normalize=1 
 filters=512
 size=3 
 stride=1 
 pad=1 
 activation=leaky 
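To see why halving the filters cuts BFLOPs so much: a conv layer's cost scales with in_filters × out_filters, so halving both roughly quarters that layer's cost. A rough back-of-the-envelope estimate (an assumption-based sketch, not darknet's exact BFLOPs counter):

```python
def conv_flops(h, w, c_in, c_out, k=3, stride=1):
    """Approximate FLOPs of a convolutional layer:
    2 ops per multiply-accumulate, 'same' padding assumed."""
    out_h, out_w = h // stride, w // stride
    return 2 * out_h * out_w * k * k * c_in * c_out

# 3x3 conv on a 19x19 feature map (608 / 32 = 19):
full = conv_flops(19, 19, 512, 1024)  # a filters=1024 block as in the cfg above
half = conv_flops(19, 19, 256, 512)   # the same layer with 2x fewer filters
print(full / half)  # halving both in and out filters quarters the cost
```

The same 4x reduction applies to every modified layer, which is why this change shrinks total BFLOPs substantially while keeping the layer count (and hence the receptive field) intact.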

What ratio of negative samples is reasonable for fine-tuning and training from scratch, respectively? Does it matter? Since I found that if the ratio of negative samples is too large, recall will decrease dramatically.

Usually 1:1. But it depends on what is more important for you.


In my case, how many steps would you recommend for fine-tuning and training from scratch?

You should train until the mAP stops increasing: https://github.com/AlexeyAB/darknet#when-should-i-stop-training
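The "stop when mAP stops increasing" rule amounts to a simple patience check over the mAP values logged during training. A hypothetical helper sketching that logic (not part of darknet; you would feed it the mAP values printed at each evaluation):

```python
def should_stop(map_history, patience=5, min_delta=0.01):
    """Return True once the best mAP of the last `patience` evaluations
    has not improved on the earlier best by at least min_delta."""
    if len(map_history) <= patience:
        return False  # not enough evaluations to judge a plateau
    best_before = max(map_history[:-patience])
    recent_best = max(map_history[-patience:])
    return recent_best < best_before + min_delta
```

In practice you can get the same effect by training with the -map flag and keeping the weights snapshot with the highest reported mAP.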

umbralada commented 5 years ago

@AlexeyAB Hi,

Thank you for your research and network improvement.

I am working with your yolov3-tiny_3l.cfg network. It's amazing! However, the average loss after all 200,000 iterations is still ~0.8 (the anchors were recalculated). For my 22 classes, this is most likely not enough iterations. Is it possible to improve network performance by changing the learning-rate schedule? What values would you recommend?

Thanks so much.

AlexeyAB commented 5 years ago

@umbralada Hi,

I am working with your yolov3-tiny_3l.cfg network. It's amazing! However, the average loss after all 200,000 iterations is still ~0.8 (the anchors were recalculated). For my 22 classes, this is most likely not enough iterations. Is it possible to improve network performance by changing the learning-rate schedule? What values would you recommend?

The more layers, the higher the accuracy (mAP), but also the higher the loss. So don't worry about the loss; check the mAP instead.

umbralada commented 5 years ago

Thank you, @AlexeyAB