AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Fluctuating mAP for custom dataset! #2961

Open aditbhrgv opened 5 years ago

aditbhrgv commented 5 years ago

[chart: Loss & mAP]

Hi all,

I am trying to use yolov3-tiny_3l.cfg for my custom dataset with 2 classes. I changed the classes and filters in the .cfg file and updated the obj.data file. I also generated anchors for my custom dataset and put them into the .cfg file.

no_train_images = 5400

no_test_images = 1200

I can see the loss going down, but the mAP fluctuates a lot (see the mAP chart above). How can I solve this problem? Any suggestions? Thanks

AlexeyAB commented 5 years ago

@aditbhrgv Hi,

aditbhrgv commented 5 years ago

@AlexeyAB Thanks for your reply!

  1. 2 classes
  2. The training and validation datasets are separate; there is no intersection between them.
  3. The .cfg file is attached:

    [net]
    # Testing
    # batch=1
    # subdivisions=1
    # Training
    batch=64
    subdivisions=32
    width=608
    height=608
    channels=3
    momentum=0.9
    decay=0.0005
    angle=0
    saturation = 1.5
    exposure = 1.5
    hue=.1

    learning_rate=0.0005
    burn_in=2000
    max_batches = 35000
    policy=steps
    steps=360000,380000
    scales=.1,.1

    [convolutional]
    batch_normalize=1
    filters=16
    size=3
    stride=1
    pad=1
    activation=leaky

    [maxpool]
    size=2
    stride=2

    [convolutional]
    batch_normalize=1
    filters=32
    size=3
    stride=1
    pad=1
    activation=leaky

    [maxpool]
    size=2
    stride=2

    [convolutional]
    batch_normalize=1
    filters=64
    size=3
    stride=1
    pad=1
    activation=leaky

    [maxpool]
    size=2
    stride=2

    [convolutional]
    batch_normalize=1
    filters=128
    size=3
    stride=1
    pad=1
    activation=leaky

    [maxpool]
    size=2
    stride=2

    [convolutional]
    batch_normalize=1
    filters=256
    size=3
    stride=1
    pad=1
    activation=leaky

    [maxpool]
    size=2
    stride=2

    [convolutional]
    batch_normalize=1
    filters=512
    size=3
    stride=1
    pad=1
    activation=leaky

    [maxpool]
    size=2
    stride=1

    [convolutional]
    batch_normalize=1
    filters=1024
    size=3
    stride=1
    pad=1
    activation=leaky

    ###########

    [convolutional]
    batch_normalize=1
    filters=256
    size=1
    stride=1
    pad=1
    activation=leaky

    [convolutional]
    batch_normalize=1
    filters=512
    size=3
    stride=1
    pad=1
    activation=leaky

    [convolutional]
    size=1
    stride=1
    pad=1
    filters=21
    activation=linear

    [yolo]
    mask = 6,7,8
    anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
    classes=2
    num=9
    jitter=.3
    ignore_thresh = .7
    truth_thresh = 1
    random=1

    [route]
    layers = -4

    [convolutional]
    batch_normalize=1
    filters=128
    size=1
    stride=1
    pad=1
    activation=leaky

    [upsample]
    stride=2

    [route]
    layers = -1, 8

    [convolutional]
    batch_normalize=1
    filters=256
    size=3
    stride=1
    pad=1
    activation=leaky

    [convolutional]
    size=1
    stride=1
    pad=1
    filters=21
    activation=linear

    [yolo]
    mask = 3,4,5
    anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
    classes=2
    num=9
    jitter=.3
    ignore_thresh = .7
    truth_thresh = 1
    random=1

    [route]
    layers = -3

    [convolutional]
    batch_normalize=1
    filters=128
    size=1
    stride=1
    pad=1
    activation=leaky

    [upsample]
    stride=2

    [route]
    layers = -1, 6

    [convolutional]
    batch_normalize=1
    filters=128
    size=3
    stride=1
    pad=1
    activation=leaky

    [convolutional]
    size=1
    stride=1
    pad=1
    filters=21
    activation=linear

    [yolo]
    mask = 0,1,2
    anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
    classes=2
    num=9
    jitter=.3
    ignore_thresh = .7
    truth_thresh = 1
    random=1

AlexeyAB commented 5 years ago

The training and validation datasets are separate; there is no intersection between them.

Did you divide it uniformly at random, or not?

Did you check your dataset by using Yolo_mark?

Can you show the cloud.png image after this command?

./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 608 -height 608 -show
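
For anyone reproducing this: cloud.png is a scatter of the (width, height) of every training box, together with the 9 cluster centers that become the anchors. Below is a rough Python approximation of what calc_anchors computes and plots, assuming YOLO-format label files under a hypothetical data/obj/ folder and using plain k-means in place of darknet's own clustering:

    # Approximate darknet's calc_anchors: cluster the (w, h) of all training
    # boxes and plot them with the 9 cluster centers (the anchor candidates).
    # NOTE: darknet uses its own clustering; plain k-means is only a proxy.
    import glob
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.cluster import KMeans

    W, H = 608, 608  # network input size from the cfg
    wh = []
    for label_file in glob.glob("data/obj/*.txt"):  # hypothetical label folder
        for line in open(label_file):
            _, _, _, w, h = map(float, line.split())  # YOLO format: cls cx cy w h
            wh.append([w * W, h * H])                 # scale to network pixels
    wh = np.array(wh)

    km = KMeans(n_clusters=9, n_init=10).fit(wh)
    anchors = km.cluster_centers_[np.argsort(km.cluster_centers_.prod(axis=1))]
    print("anchors =", ", ".join(f"{w:.0f},{h:.0f}" for w, h in anchors))

    plt.scatter(wh[:, 0], wh[:, 1], s=2, alpha=0.3)
    plt.scatter(anchors[:, 0], anchors[:, 1], c="red", marker="x")
    plt.xlabel("box width (px)")
    plt.ylabel("box height (px)")
    plt.savefig("cloud_approx.png")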

aditbhrgv commented 5 years ago

No. I checked the dataset using Yolo_mark and it shows the correct bounding boxes on the images. Attached is the cloud.png:

[image: cloud.png]

AlexeyAB commented 5 years ago

@aditbhrgv

Try to train using these mask and filters settings from the beginning:

filters=7

[yolo]
mask = 8
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
.....

filters=14

[yolo]
mask = 6,7
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
...

filters=42

[yolo]
mask = 0,1,2,3,4,5
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
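
These filters values follow from darknet's rule filters = (classes + 5) * <number of masks> on the convolutional layer before each [yolo] layer; a quick check for the 2-class case:

    # filters before each [yolo] layer = (classes + 5) * number of masks
    classes = 2
    for masks in ([8], [6, 7], [0, 1, 2, 3, 4, 5]):
        print(f"{len(masks)} mask(s) -> filters = {(classes + 5) * len(masks)}")
    # 1 mask(s) -> filters = 7
    # 2 mask(s) -> filters = 14
    # 6 mask(s) -> filters = 42
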
aditbhrgv commented 5 years ago

Thanks! I'll try that. Can you please tell me the reasoning behind doing this? It would be really helpful! Thanks

AlexeyAB commented 5 years ago

After training, show your Loss & mAP chart.

https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

recalculate anchors for your dataset for the width and height from the cfg-file: darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416, then set the same 9 anchors in each of the 3 [yolo]-layers in your cfg-file. But you should change the indexes of the anchor masks (mask=) for each [yolo]-layer, so that the 1st [yolo]-layer has anchors larger than 60x60, the 2nd larger than 30x30, and the 3rd the remaining ones. Also you should change the filters=(classes + 5)*<number of mask> before each [yolo]-layer. If many of the calculated anchors do not fit under the appropriate layers, then just try using all the default anchors.
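
A sketch of that mask-assignment rule, reading "larger than 60x60" as both sides exceeding the threshold (one plausible reading of the quote, not darknet code):

    # Partition 9 anchors across the 3 [yolo] layers by size; the resulting
    # index lists become the mask= values (1st = coarsest output layer).
    anchors = [(8, 10), (11, 12), (14, 11), (18, 14), (25, 15),
               (36, 18), (49, 23), (71, 25), (93, 42)]

    layer1 = [i for i, (w, h) in enumerate(anchors) if w > 60 and h > 60]
    layer2 = [i for i, (w, h) in enumerate(anchors)
              if w > 30 and h > 30 and i not in layer1]
    layer3 = [i for i in range(len(anchors)) if i not in layer1 + layer2]
    print(layer1, layer2, layer3)  # [] [8] [0, 1, 2, 3, 4, 5, 6, 7]

For anchors as small as these, the strict rule leaves the first layer empty, which is presumably why the hand-picked masks earlier in the thread (8 / 6,7 / 0,1,2,3,4,5) relax it, and why the quote falls back to the default anchors when too few fit.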

aditbhrgv commented 5 years ago

@AlexeyAB Can you please let me know the possible reasons for this fluctuating mAP? I have currently set random=0 in the .cfg file and started training; this led to less fluctuating behavior than in the previously attached graph.
I have started training with the changed anchors you described before and will share the results once it is done. Also, could you please give me a bit more interpretation of cloud.png? And I tried to train the same dataset with a PyTorch implementation, and my mAP converged after 23 epochs. My initial LR was 0.01, decreased by a factor of 10 after 20, 50 and 100 epochs. Can I set the same LR schedule in the .cfg file? Thanks

AlexeyAB commented 5 years ago

Can you please let me know the possible reasons for this fluctuating mAP?

There can be many reasons.

And I tried to train the same dataset with a PyTorch implementation, and my mAP converged after 23 epochs. My initial LR was 0.01, decreased by a factor of 10 after 20, 50 and 100 epochs. Can I set the same LR schedule in the .cfg file?

If you have 5400 training images and set batch=64, then 1 epoch = 5400/64 ≈ 84 iterations. So:

20 epochs = 1680 iterations
50 epochs = 4200 iterations
100 epochs = 8400 iterations

Set

 steps=1680,4200,8400
 scales=0.1,0.1,0.1

instead of https://github.com/AlexeyAB/darknet/blob/099b71d1de6b992ce8f9d7ff585c84efd0d4bf94/cfg/yolov3.cfg#L22-L23

and learning_rate=0.01 instead of https://github.com/AlexeyAB/darknet/blob/099b71d1de6b992ce8f9d7ff585c84efd0d4bf94/cfg/yolov3.cfg#L18
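
As a quick check of that conversion and of the schedule it produces (numbers taken from this thread):

    # Convert epoch milestones to darknet iterations, then trace the LR decay.
    train_images, batch = 5400, 64
    iters_per_epoch = train_images // batch          # = 84, as above
    steps = [e * iters_per_epoch for e in (20, 50, 100)]
    print(steps)                                     # [1680, 4200, 8400]

    lr, scales = 0.01, [0.1, 0.1, 0.1]
    for step, scale in zip(steps, scales):
        lr *= scale                                  # each scale multiplies the LR
        print(f"after iteration {step}: lr = {lr:g}")
    # after iteration 1680: lr = 0.001
    # after iteration 4200: lr = 0.0001
    # after iteration 8400: lr = 1e-05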

aditbhrgv commented 5 years ago

Hi @AlexeyAB, I got the result below after following the above LR schedule:

learning_rate=0.01
steps=1680,4200,8400
scales=0.1,0.1,0.1

[chart: Loss & mAP]

But I trained this without the random option in the .cfg file; I can try training with the random option again and obtain new results. Looking at the mAP graph, I think I reduced the LR too quickly, as it finally converged to 75% mAP, whereas around 82% looks achievable (as seen from the graph). I will try setting scales=0.05,0.05,0.05 in the .cfg file and see the results. Do you have any other suggestions?

Also, can I generate a video of the predictions on the validation set using my trained model? I can use the "./build/darknet detector test" option to see the visualizations, but it shows one image at a time. I want to feed in the whole validation set and save the output.

AlexeyAB commented 5 years ago

Also, can I generate a video of the predictions on the validation set using my trained model? I can use the "./build/darknet detector test" option to see the visualizations, but it shows one image at a time. I want to feed in the whole validation set and save the output.

Are your validation images frames from a video? If so, just run detection on that video.


Also you can download http://mplayerwin.sourceforge.net/downloads.html and run this command in the folder that contains only the validation images:

mencoder mf://*.jpg -mf w=1280:h=720:fps=15:type=jpg -ovc lavc -lavcopts vcodec=mpeg4:vbitrate=4000:mbd=2:trell -oac copy -o conveyor_valid.avi

The video file conveyor_valid.avi will be generated.
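
If mencoder is inconvenient, OpenCV can build the same video from the validation frames; a minimal sketch, with a hypothetical image folder, the 15 fps from the mencoder command above, and the frame size read from the first image:

    # Alternative to mencoder: stitch sorted .jpg frames into an .avi file.
    import glob
    import cv2

    frames = sorted(glob.glob("valid_images/*.jpg"))  # hypothetical folder
    h, w = cv2.imread(frames[0]).shape[:2]
    out = cv2.VideoWriter("conveyor_valid.avi",
                          cv2.VideoWriter_fourcc(*"MJPG"), 15.0, (w, h))
    for f in frames:
        out.write(cv2.imread(f))                      # assumes uniform size
    out.release()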

Then run: ./darknet detector demo data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights conveyor_valid.avi -out_filename out_conveyor_valid.avi


Also you can try

./darknet detector test data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights < data/conveyor_valid.txt

aditbhrgv commented 5 years ago

Are your validation images frames from a video?

No, they are .jpg files located in a folder.

aditbhrgv commented 5 years ago

Also you can download http://mplayerwin.sourceforge.net/downloads.html and run this command in the folder that contains only the validation images

Is there a similar tool for Ubuntu?

AlexeyAB commented 5 years ago

@aditbhrgv https://tecadmin.net/install-mencoder-and-mplayer-on-linux/

aditbhrgv commented 5 years ago

./darknet detector demo data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights conveyor_valid.avi -out_filename out_conveyor_valid.avi

@AlexeyAB I used this command to draw the BBs on the .avi, but I see a bit of an offset on the detected objects. What could be the problem?

AlexeyAB commented 5 years ago

Maybe wrong annotations; check your dataset by using https://github.com/AlexeyAB/Yolo_mark

aditbhrgv commented 5 years ago

Maybe wrong annotations; check your dataset by using https://github.com/AlexeyAB/Yolo_mark

I tested on a single image and the BB is perfectly overlaid on the image using the "./darknet detector test" command. It seems to be a problem only when I give an input .avi video: I see the offsets when the objects are relatively close, and not when they are some distance away. Maybe I can try ./darknet detector test data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights < data/conveyor_valid.txt instead of

./darknet detector demo data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights conveyor_valid.avi -out_filename out_conveyor_valid.avi

aditbhrgv commented 5 years ago

@AlexeyAB How can I reduce the fps of the generated output video? It is too fast at the moment.

AlexeyAB commented 5 years ago

Change the 1st line and comment out the 2nd: https://github.com/AlexeyAB/darknet/blob/099b71d1de6b992ce8f9d7ff585c84efd0d4bf94/src/demo.c#L186-L187
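
An alternative that avoids rebuilding darknet: re-write the finished demo output at a lower frame rate with OpenCV. The input name below is the one from this thread; the output name and the 5 fps target are just examples:

    # Slow down playback by re-writing the demo output at a lower fps.
    import cv2

    cap = cv2.VideoCapture("out_conveyor_valid.avi")
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter("out_conveyor_valid_slow.avi",
                          cv2.VideoWriter_fourcc(*"MJPG"), 5.0, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        out.write(frame)
    cap.release()
    out.release()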

aditbhrgv commented 5 years ago

[chart: Loss & mAP]

@AlexeyAB Now I get a new mAP, which converged around 81%: Precision = 84%, Recall = 71%, F1 = 77%. However, I got these results without using the "random" flag. I think the results could be better with the multi-scale option.

AlexeyAB commented 5 years ago

Yes, try to train with random=1

aditbhrgv commented 5 years ago

[chart: Loss & mAP]

@AlexeyAB I tried the random=1 option, but mAP, precision, recall and F1 decreased instead of increasing. Could you please suggest something? Thanks

aditbhrgv commented 5 years ago

[image: cloud.png]

@AlexeyAB I have a new dataset, for which the cloud.png above is shown. How can I set the masks for the anchors according to this distribution? Is there any link where I can better understand how to interpret cloud.png?

DarylWM commented 5 years ago

Hi @aditbhrgv - I found this explanation helpful for determining custom anchors.

aditbhrgv commented 5 years ago

Hi @DarylWM, thank you! Can you please explain the significance of cloud.png? I can see the anchors and the training data points distributed around them. Is my understanding correct? If yes, how will the training samples lying outside these anchors be detected? Thanks again!
