Training goes slow - what is allowed to change during training?

ido-ran commented 6 years ago

I've tagged about 183 images with 1 class. All the images are grayscale, so I've changed some parameters according to your suggestion in #660. I've set batch=64 and subdivisions=8. I started by copying the yolov3.cfg file.

Every iteration takes about 16 seconds. I'm running it on an AWS p2.xlarge machine, which has a Tesla K80 GPU. That means every 100 iterations take about 26-30 minutes, and after 300 iterations the avg loss value is around 2.3 and detection is not doing well at all: it mostly finds things at random, with 1%-2% confidence at most.
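
As a quick sanity check on the numbers above, the wall-clock time follows directly from the per-iteration time. A minimal sketch (the function name is only illustrative, and the 16-second figure is the one reported in this comment):

```python
def training_minutes(iterations, seconds_per_iteration=16.0):
    """Estimate wall-clock training time in minutes."""
    return iterations * seconds_per_iteration / 60.0


# Print estimates for a few iteration counts at 16 s per iteration.
for n in (100, 300, 1000):
    print(f"{n} iterations: about {training_minutes(n):.0f} minutes")
```

At 16 seconds per iteration, 100 iterations come to roughly 27 minutes, which matches the lower end of the range observed here.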

To make things worse, it is now crashing with:

CUDA Error: out of memory
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)

I was wondering if it is OK to change the configuration to subdivisions=16? Also, is it OK to add more images? I've read that adding negative images (images without marks) can also help.
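
For context on the subdivisions question: darknet splits each batch into `subdivisions` chunks and processes one chunk at a time on the GPU, so raising subdivisions reduces the number of images held in GPU memory at once while the weights are still updated once per full batch. A rough sketch of that arithmetic (the function name is hypothetical, not part of darknet):

```python
def images_per_gpu_pass(batch, subdivisions):
    """Number of images processed on the GPU at once (batch / subdivisions)."""
    if subdivisions <= 0 or batch % subdivisions != 0:
        raise ValueError("batch must be a positive multiple of subdivisions")
    return batch // subdivisions


# Settings from this thread versus the proposed change.
for subdivisions in (8, 16):
    chunk = images_per_gpu_pass(64, subdivisions)
    print(f"batch=64 subdivisions={subdivisions}: {chunk} images per GPU pass")
```

With batch=64, going from subdivisions=8 to subdivisions=16 halves the chunk from 8 images to 4, which is why it is a common first response to an out-of-memory error.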

Is it possible that the grayscale images are the cause of the slow learning? When I trained YOLO about a month ago to find milk cartons in color images, I remember being able to detect them with 76% confidence after only 200 iterations.

AlexeyAB commented 6 years ago

Currently you should set channels=1 in the cfg-file to use 1-channel images or video: https://github.com/AlexeyAB/darknet/pull/936
What params do you use in the Makefile?

You should set GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 DEBUG=0 and run training with flag -dont_show

ido-ran commented 6 years ago

I'm trying now with all the flags you suggested except OPENCV=1, because OpenCV is not installed on the AWS AMI I'm using.

Is there any gain of using OpenCV if I don't want to see the visual result?

AlexeyAB commented 6 years ago

Is there any gain of using OpenCV if I don't want to see the visual result?

OpenCV accelerates training by removing a CPU bottleneck (data augmentation), but only if you use a Tesla V100 (p3.2xlarge). It isn't required if you use a Tesla K80 (p2.xlarge).

ido-ran commented 6 years ago

Thank you, I'm using p2.xlarge so I guess it's not relevant.

The results of running with channels=1 were not good. These are the results of running darknet detector map on my 300-iteration weights:

detections_count = 0, unique_truth_count = 93  
class_id = 0, name = license-plate,      ap = 0.00 % 
 for thresh = 0.25, precision = -nan, recall = 0.00, F1-score = -nan 
 for thresh = 0.25, TP = 0, FP = 0, FN = 93, average IoU = 0.00 %
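
The -nan values in this output are a consequence of having zero detections rather than a separate failure: precision is TP / (TP + FP), which is 0 / 0 here, and F1 depends on precision. A small sketch of the standard formulas, not darknet's actual C code (the function name is made up for illustration):

```python
import math


def detection_metrics(tp, fp, fn):
    """Return (precision, recall, f1); undefined ratios are reported as NaN."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else math.nan
    recall = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    if math.isnan(precision) or math.isnan(recall) or precision + recall == 0:
        f1 = math.nan
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the output above: no detections, 93 ground-truth boxes.
precision, recall, f1 = detection_metrics(tp=0, fp=0, fn=93)
print(precision, recall, f1)  # nan 0.0 nan
```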

I've run detector test on all of my training files with a threshold of 0.01, but there wasn't even a single detection.

Training still crashed after 360 iterations 😞

AlexeyAB commented 6 years ago

300 iteration

300 iterations is too few. Train for at least 1000 iterations. You can try training with random=0 for faster training without the out-of-memory crash.
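
A possible reason random=0 helps with the crash: random=1 enables multi-scale training, where the network input is resized periodically, so a crash can appear only once a large size is picked. The sketch below assumes a base input of 416 and candidate sizes from 320 to 608 in steps of 32 (an assumed range; the exact values depend on the darknet version and cfg) and assumes activation memory grows roughly with input area:

```python
BASE_SIZE = 416  # assumed width/height in the cfg


def relative_activation_memory(size, base=BASE_SIZE):
    """Approximate activation memory relative to the base input size."""
    return (size / base) ** 2


# Assumed multi-scale candidates: 320, 352, ..., 608.
for size in range(320, 609, 32):
    ratio = relative_activation_memory(size)
    print(f"{size}x{size}: about {ratio:.2f}x the memory of {BASE_SIZE}x{BASE_SIZE}")
```

Under these assumptions the largest size needs roughly twice the activation memory of the base size, so a configuration that fits at 416 may still run out of memory later in training.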

Also try training with both channels=1 and channels=3. Which of these gives mAP > 0?

I didn't test training with 1 channel after this fix: https://github.com/AlexeyAB/darknet/pull/936

ido-ran commented 6 years ago

I've previously trained darknet to detect milk bottles, and after 300 iterations I was able to detect them in at least some of the images, so I thought that not detecting anything after 300 iterations was a bad sign.

I'm running it again with random=0.

Two questions:

Is it OK to change random=0 or subdivisions after I've had partial weights (the weights file generated every 100 iterations)?
When I try to run 2 trainings at the same time, I get an out-of-memory error right at the start of the second one. Is there any way to run 2 trainings on the same machine or it requires too much GPU memory?

AlexeyAB commented 6 years ago

Is it OK to change random=0 or subdivisions after I've had partial weights (the weights file generated every 100 iterations)?

Better to train from the beginning.

or it requires too much GPU memory?

It requires too much GPU memory.

AlexeyAB / darknet, issue #1302