AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Training on Amazon EC2 #1380

Open silvernine209 opened 5 years ago

silvernine209 commented 5 years ago

I have some credit on AWS EC2 and would like to train using it. I did some research, but I'm still clueless about how to use AWS EC2 to train my darknet model from my Windows setup... any help, please?

PeterQuinn925 commented 5 years ago

I haven't used AWS in a while but last time I did, it was just like any other VM. Just install darknet and train. What kind of help are you looking for?

silvernine209 commented 5 years ago

@PeterQuinn925 Thank you for the reply. I have 1.6 million images with 500 categories to train on. I've already trained up to 43,000 iterations on my local machine, and I'm looking to run many more iterations.

I'm looking to continue training on an AWS instance; does this mean I have to upload or move all of my weights, images, repo, etc. to the VM you are talking about?

I was anticipating something like this with an AWS EC2 instance:
1.) Launch an AWS EC2 instance
2.) Connect to the EC2 instance from the command prompt on my Windows machine
3.) Begin training on the EC2 instance's GPU using "darknet.exe detector train data/obj.data yolo-obj.cfg backup-transfer_learning/yolo-obj_43000.weights"

I feel like I'm missing something fundamental..

PeterQuinn925 commented 5 years ago

Yeah, I think you do need to copy the exe, cfg, weights, and images to your AWS instance. Perhaps someone with experience doing this on AWS can answer.

AlexeyAB commented 5 years ago

To train on Amazon AWS EC2:


To connect to the running Amazon EC2 instance you can use PuTTY: https://putty.org.ru/download.html


Training:


To train with the -mjpeg_port 8090 flag for remote viewing of the Loss & mAP chart in your web browser (Chrome/Firefox), you should open port 8090 on your Amazon EC2 instance: https://postmarkapp.com/support/article/1026-resolving-aws-port-25-throttling

[screenshot: adding additional ports in the EC2 security group]
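
Putting those steps together, a minimal sketch of the workflow, assuming an Ubuntu-based instance, a key pair my-key.pem, and the file names used earlier in this thread (all paths here are hypothetical):

    # connect to the running instance (PuTTY on Windows, or ssh from a terminal)
    ssh -i my-key.pem ubuntu@<ec2-public-dns>

    # copy the repo, cfg, data and the partially trained weights up to the instance
    scp -i my-key.pem -r darknet/ obj.data yolo-obj.cfg yolo-obj_43000.weights ubuntu@<ec2-public-dns>:~/

    # build darknet on the instance and resume training;
    # -mjpeg_port 8090 serves the Loss & mAP chart (open TCP 8090 in the security group first),
    # then point Chrome/Firefox at http://<ec2-public-dns>:8090
    cd ~/darknet && make GPU=1 CUDNN=1 OPENCV=1
    ./darknet detector train obj.data yolo-obj.cfg yolo-obj_43000.weights -dont_show -mjpeg_port 8090 -map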
silvernine209 commented 5 years ago

@AlexeyAB My jaw just dropped in amazement at such wonderful help. Thank you!! I will give it a try today and report back on how it goes.

AlexeyAB commented 5 years ago

@silvernine209

silvernine209 commented 5 years ago

@AlexeyAB Thank you for your follow-up comments. On top of your thorough explanations, I also went through Amazon's tutorials (link). I'm currently uploading my training set to my virtual machine. After this, I will upload darknet and my weights. I shouldn't have any problems from this point on, and thank you for saving my day again!

silvernine209 commented 5 years ago

@AlexeyAB p2.xlarge is $0.90/hour and p3.2xlarge is $3.06/hour. Would training be at least 3.06/0.90 = 3.4 times faster with p3.2xlarge? If not, I will just stick with p2.xlarge.


Thanks

AlexeyAB commented 5 years ago

@silvernine209

So theoretically p3.2xlarge is about 15x faster than p2.xlarge. I didn't test the K80, but I think in practice it is about 5-8x faster. A little bit about the Tesla V100 in p3.2xlarge: https://github.com/AlexeyAB/darknet/issues/407

Also follow my recommendations for the 1st 1000 iterations: https://github.com/AlexeyAB/darknet/issues/1380#issuecomment-412380005
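
As a rough back-of-the-envelope check on the pricing question above, assuming the practical 5-8x speedup rather than the theoretical 15x:

    p2.xlarge:  $0.90/hr
    p3.2xlarge: $3.06/hr, assumed ~5x faster in practice (low end of the 5-8x estimate)
    cost of one "p2.xlarge-hour worth of training progress" on p3.2xlarge: 3.06 / 5 = ~$0.61 < $0.90

So even at the low end of the speedup estimate, p3.2xlarge should come out cheaper per unit of training done.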

silvernine209 commented 5 years ago

@AlexeyAB Should I use -gpus 0,1 flag after 1000 iterations even for p3.2xlarge?

AlexeyAB commented 5 years ago

@silvernine209

silvernine209 commented 5 years ago

@AlexeyAB Thank you!!

silvernine209 commented 5 years ago

@AlexeyAB I've been trying to train on p2.xlarge with -gpus 0,1 and I keep getting this error:

I think p2.xlarge has just one GPU, not two.



AlexeyAB commented 5 years ago

@silvernine209 Yes, it looks like p2.xlarge uses only half of a K80 GPU, which contains 2 GK210 chips. I fixed my post.
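
A quick way to check how many GPUs an instance actually exposes before passing a -gpus flag (assuming the NVIDIA driver that ships with the Deep Learning AMI is installed):

    # list the visible CUDA devices; p2.xlarge reports a single Tesla K80 device
    nvidia-smi -L
    # only pass -gpus 0,1 (or 0,1,2,3) if that many devices are actually listed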

mmartin56 commented 4 years ago

Hi Alexey, thanks for your valuable help here. I'm using p3.2xlarge (Tesla V100). I have compiled darknet with the Makefile settings GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 and everything runs smoothly. However, when I set these values on the command line instead of modifying the Makefile (i.e. make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1), I get errors both during training and testing. Testing with debug gives a further error (error screenshots not reproduced here).

I would prefer not to modify the Makefile, if possible. Do you know why I get an error when passing the options as command-line arguments?
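
One thing worth ruling out (an assumption on my part, since the error screenshots are not visible here): make does not rebuild object files left over from an earlier build just because the variables changed on the command line, so a stale build without CUDNN/OpenCV support can linger. Cleaning first makes the two invocation styles behave the same:

    # discard objects from any previous build, then rebuild with the flags on the command line
    make clean
    make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1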

earcz commented 4 years ago
  • Compile Darknet with GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 in the Makefile

I have never managed to see the Loss & mAP chart on port 8090, although I have tried everything you said in the issues. Is it because I didn't install OpenCV on the EC2 instance? I am using p3.8xlarge and the Deep Learning AMI (Ubuntu 16.04) Version 29.0.

I compiled darknet with OpenCV on my local PC. Afterwards, I just ran training on the EC2 instance without installing OpenCV on it. It never gives an error, even if I re-compile darknet with OPENCV=1 on the EC2 instance, and training goes on, but there is no way to see the chart. I have tried lots of combinations for the security groups; for example, please see the screenshot below.

Training command is: ./darknet detector train /home/ubuntu/bdd100k/yolo_files/bdd100k.data cfg/option_1.cfg darknet53.conv.74 -dont_show -mjpeg_port 8090 -map

[screenshot: EC2 security group inbound rules]

Also, although I set learning_rate to 0.001, it changes to 0.00003 or other values around 100 iterations. Following your advice and my own experience with multi-GPU training, I set CUDNN_HALF=0 and don't use the -gpus 0,1,2,3 argument until 1000 iterations. So why is the learning rate changing?

AlexeyAB commented 4 years ago

@earcz Yes, you should install OpenCV on the EC2 instance and recompile darknet there. Without OpenCV, darknet cannot draw or serve the Loss & mAP chart.
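
A possible sequence on the instance (package name and paths are assumptions based on the Ubuntu 16.04 Deep Learning AMI mentioned above):

    # install the OpenCV development package, then rebuild darknet on the instance itself
    sudo apt-get update && sudo apt-get install -y libopencv-dev
    cd ~/darknet && make clean && make GPU=1 CUDNN=1 OPENCV=1
    # with TCP 8090 open in the security group and training running with -mjpeg_port 8090,
    # the chart should be reachable at http://<instance-public-ip>:8090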

So why is the learning rate changing?

This behavior comes from the original repository; it is there to level out the increased effective batch size when training on several GPUs.

earcz commented 4 years ago

@AlexeyAB

Yes, you should install OpenCV.

Okay, I will try with OpenCV on EC2. As I understand from your response, I have set the security groups correctly, right?

This behavior comes from the original repository; it is there to level out the increased effective batch size when training on several GPUs.

Yes, but I don't use several GPUs, just one, and it still happens. After 1000 iterations, I was planning to train with multi-GPU. Should I use p3.2xlarge to get rid of this problem? Otherwise, what is the point of training a network uncontrollably; am I wrong? Moreover, I have created a P-diagram and a configuration matrix for benchmarking hyper-parameters to find the best training configuration. In this scenario, the learning rate changes unpredictably. How can I control it without trading off training speed?

AlexeyAB commented 4 years ago

Also, although I set learning_rate to 0.001, it changes to 0.00003 or other values around 100 iterations.

It's burn_in (warm-up) for the first 1000 iterations, don't worry.

For training on 4x GPUs:

  • sometimes it is better to set batch/4, subdivisions/4, learning_rate/4, max_batches*4, steps*4
  • sometimes just train as usual

matt-sharp commented 3 years ago

@silvernine209

  • p2.xlarge (1xGPU) - train with CUDNN_HALF=0
  • p3.2xlarge (1xGPU) - train 1st 1000 iterations with CUDNN_HALF=0, then continue with CUDNN_HALF=1
  • p3.8xlarge (4xGPU) - train 1st 1000 iterations with CUDNN_HALF=0 and without -gpus 0,1,2,3, then continue with CUDNN_HALF=1 and with flag -gpus 0,1,2,3

@AlexeyAB If we are expected to change CUDNN_HALF after 1000 iterations, what is the best way to go about that in practice? Should we set max_batches to 1000 in the cfg file initially and then change it? Do we need to re-compile darknet after 1000 iterations before resuming training? Is there a better approach than making these changes manually?
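
Since CUDNN_HALF is a compile-time Makefile flag, one practical (if manual) sequence is the sketch below; the backup path and weight file name are assumptions and depend on the backup= entry in obj.data:

    # phase 1: build without mixed precision and start training; checkpoints go to the backup dir
    make clean && make GPU=1 CUDNN=1 CUDNN_HALF=0 OPENCV=1
    ./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74 -dont_show -map

    # stop training (Ctrl+C) once ~1000 iterations are done, rebuild with CUDNN_HALF=1,
    # then resume from the last checkpoint instead of the pretrained conv weights
    make clean && make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1
    ./darknet detector train data/obj.data yolo-obj.cfg backup/yolo-obj_1000.weights -dont_show -map

There is no need to touch max_batches in the cfg for this; only the binary is rebuilt between the two phases.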

avn3r commented 2 years ago

Also, although I set learning_rate to 0.001, it changes to 0.00003 or other values around 100 iterations.

It's burn_in (warm-up) for the first 1000 iterations, don't worry.

for training on 4x GPUs:

  • sometimes it is better to set batch/4, subdivisions/4, learning_rate/4, max_batches*4, steps*4
  • sometimes just train as usual

Thanks, @AlexeyAB!

Shouldn't the batch size be bigger given that we are now using 4 GPUs? For data-parallel multi-GPU training I would usually expect batch=64*4 instead of batch=64/4. Am I missing something? Thanks for the clarification.
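
For what it's worth, a hedged reading of darknet's multi-GPU training loop: each GPU appears to process the full cfg batch per iteration, so the effective batch already scales with the number of GPUs, and the batch/4 suggestion keeps the effective batch at its single-GPU value:

    cfg batch=64,      4 GPUs: images per iteration = 64 * 4 = 256 (effective batch grows 4x)
    cfg batch=64/4=16, 4 GPUs: images per iteration = 16 * 4 = 64  (same effective batch as single-GPU)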