AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Trade-off between memory and speed #2966

Open · xiaohai12 opened this issue 5 years ago

xiaohai12 commented 5 years ago

Hello, I am trying to find the cheapest way to use a GPU to train YOLOv3 on custom data. There are four GPU choices: an M60 (8 GB memory), a V100 (16 GB) and a K80 (24 GB), all three on AWS, plus a 1080 Ti (10 GB) in my own machine. In the cfg file I set batch=64 and subdivisions=32 (if I set subdivisions=16 it runs out of memory, even with 16 GB). I also set CUDNN_HALF=1 on the V100 server to compare them.

But the result is that the 1080 Ti took 2 hours to train 1000 iterations and save a weights file, the M60 took around 4 hours, and the V100 was a little slower than the 1080 Ti, which is not what I expected, since I thought the V100 should be faster than the 1080 Ti, especially when using Tensor Cores to speed things up.

I would prefer to use my 1080 Ti, but I have no idea why the server always breaks down when I run the YOLO code, so I have to choose a GPU on AWS, which costs a lot of money.

So I would like to ask whether I can adjust subdivisions, batch, height and width in the cfg file to speed up training without running out of memory, and also spend less money.
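For reference, the relevant part of my cfg looks roughly like this (the width/height values below are only placeholders, I did not write the exact resolution above):

```
# [net] section of the yolov3 cfg used for training
[net]
batch=64
subdivisions=32      # subdivisions=16 runs out of memory even on the 16 GB V100
width=608            # placeholder value, not necessarily the resolution I used
height=608           # placeholder value
```

and I start training with the usual command, something like `./darknet detector train data/obj.data cfg/yolov3-custom.cfg darknet53.conv.74` (the data/cfg file names here are just examples).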

AlexeyAB commented 5 years ago

@xiaohai12 Hi,

I would prefer to use my 1080 Ti, but I have no idea why the server always breaks down when I run the YOLO code, so I have to choose a GPU on AWS, which costs a lot of money.

What error do you get?


The V100 was a little slower than the 1080 Ti, which is not what I expected, since I thought the V100 should be faster than the 1080 Ti, especially when using Tensor Cores to speed things up.

[image attachment]

Also, Tensor Cores usually require lower subdivisions (i.e. a higher mini_batch = batch / subdivisions) to give a real acceleration.
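For example, something like this (only a sketch; whether it fits in memory depends on the GPU):

```
[net]
batch=64
subdivisions=8       # mini_batch = batch / subdivisions = 64 / 8 = 8
# A larger mini_batch lets the Tensor Cores (CUDNN_HALF=1) speed things up,
# but it needs much more GPU memory than subdivisions=32.
```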


So I would like to ask whether I can adjust subdivisions, batch, height and width in the cfg file to speed up training without running out of memory, and also spend less money.

Lower subdivisions accelerates training but can lead to out-of-memory errors. Lower width & height accelerates training and detection but leads to worse accuracy.
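For example (the values are only illustrative):

```
[net]
# Faster training: fewer subdivisions (needs more GPU memory)
batch=64
subdivisions=16

# Faster training and detection, but lower accuracy: smaller network resolution
# (width and height must remain multiples of 32)
width=416
height=416
```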

xiaohai12 commented 5 years ago

@xiaohai12 Hi,

I would prefer to use my 1080 Ti, but I have no idea why the server always breaks down when I run the YOLO code, so I have to choose a GPU on AWS, which costs a lot of money.

What error do you get?

Thanks for your reply. The ./darknet command sometimes breaks my server (inside the Docker container it breaks down). When that happens, running nvidia-smi outside the container also crashes and I can do nothing about it; in addition, when I use the "top" command, I see process 13426 (irq/129-nvidia) occupying 100% CPU. But since I am not the server administrator, I have to ask them to reboot the server.

Also, thanks for the suggestions about the cfg file.