silvernine209 opened 5 years ago
I haven't used AWS in a while but last time I did, it was just like any other VM. Just install darknet and train. What kind of help are you looking for?
@PeterQuinn925 Thank you for the reply. I have 1.6 million images with 500 categories to train on. I've already trained up to 43,000 iterations on my local machine, and I'm looking to run many more iterations.
I'm looking to continue training using an AWS instance. Does this mean I have to upload or move all of my weights, images, repo, etc. to the VM you are talking about?
I was anticipating this workflow with an AWS EC2 instance:
1. Initiate an AWS EC2 instance
2. Connect to the EC2 instance from the command prompt on my Windows machine
3. Begin training using the GPU of the EC2 instance with `darknet.exe detector train data/obj.data yolo-obj.cfg backup-transfer_learning/yolo-obj_43000.weights`
I feel like I'm missing something fundamental..
Yeah. I think you do need to copy the exe, cfg, weights, and images to your AWS instance. Perhaps someone with experience doing this on AWS can answer.
To train on Amazon AWS EC2:
- Use `t2.micro` and AMI (Ubuntu) Version 3.0 (ami-38c87440) - a Deep Learning base AMI with NVidia drivers (CUDA 8 and 9, CuDNN 6 and 7)
- Use `p3.2xlarge` (Tesla V100) for training, but you may need to contact support to set a limit
- Set `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1` in the Makefile
- For connection to the running Amazon EC2 instance you can use Putty: https://putty.org.ru/download.html
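The Makefile flags above can be flipped with `sed` before building. A minimal sketch, demonstrated on a stub file so it is self-contained; the real darknet Makefile has the same `FLAG=0` lines at the top, so the same `sed` command (pointed at `Makefile`) works there, followed by `make`:

```shell
# Stub with darknet's default build flags; the same sed works on the real Makefile.
printf 'GPU=0\nCUDNN=0\nCUDNN_HALF=0\nOPENCV=0\n' > Makefile.demo

# Flip the four flags the instructions call for, then (on the real repo) run make.
sed -i -e 's/^GPU=0/GPU=1/' -e 's/^CUDNN=0/CUDNN=1/' \
       -e 's/^CUDNN_HALF=0/CUDNN_HALF=1/' -e 's/^OPENCV=0/OPENCV=1/' Makefile.demo

cat Makefile.demo
```

Editing the Makefile this way (rather than passing the flags as `make` arguments) matches how the build is described throughout this thread.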
Copy the file `yolo_43000.weights` from your computer to the EC2 instance:
`pscp -i t1-free.ppk yolo_43000.weights ubuntu@ec2-35-160-228-91.us-west-2.compute.amazonaws.com:/home/ubuntu/`
Copy the file `dataset.zip` from your computer to the EC2 instance:
`pscp -i t1-free.ppk dataset.zip ubuntu@ec2-35-160-228-91.us-west-2.compute.amazonaws.com:/home/ubuntu/`
Use `putty.exe` and your Amazon EC2 private key (for example `t1-free.ppk`) to connect to the `t2.micro` Linux console.
To copy the file `/home/ubuntu/a.weights` from EC2 to your computer at the path `C:\Users\Alex\Desktop\Amazon\results`, use this command:
`pscp -i t1-free.ppk ubuntu@ec2-35-160-228-91.us-west-2.compute.amazonaws.com:/home/ubuntu/a.weights C:\Users\Alex\Desktop\Amazon\results`
Training:
- for `p2.xlarge` (Tesla K80) use `GPU=1 CUDNN=1 CUDNN_HALF=0 OPENCV=0` in the Makefile and train with the `-gpus 0,1` flag after 1000 iterations
- for `p3.2xlarge` (Tesla V100) use `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1` in the Makefile and train with the `-dont_show` flag
For training with the `-mjpeg_port 8090` flag, for remote viewing of the Loss & mAP chart in your web browser (Chrome/Firefox), you should open port 8090 in your Amazon EC2 instance: https://postmarkapp.com/support/article/1026-resolving-aws-port-25-throttling
@AlexeyAB My jaw just dropped in amazement at such wonderful help. Thank you!! I will give it a try today and report on how it goes.
@silvernine209
- for `p2.xlarge` (Tesla K80) use `GPU=1 CUDNN=1 CUDNN_HALF=0 OPENCV=0` in the Makefile and train with the `-gpus 0,1` flag after 1000 iterations
- for `p3.2xlarge` (Tesla V100) use `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1` in the Makefile and train with the `-dont_show` flag
@AlexeyAB Thank you for your follow-up comments. On top of your thorough explanations, I also went through Amazon's tutorials (link). I'm currently uploading my train set to my virtual machine. After this, I will upload darknet and my weights. I shouldn't have any problems from this point. Thank you for saving my day again!
@AlexeyAB p2.xlarge is $0.90/hour and p3.2xlarge is $3.06 per hour. Would you train 3.06/0.9 = 3.4 times faster with p3.2xlarge? If not, I will just stick with p2.xlarge.
Thanks
@silvernine209
p2.xlarge is $0.90/hour - GPU nVidia K80 - 8,736 Gflops: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Tesla
p3.2xlarge is $3.06/hour - GPU nVidia V100 - 134,092 Gflops if `GPU=1 OPENCV=1 CUDNN=1 CUDNN_HALF=1` is used (like a Titan V with Tensor Cores): https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Volta_series
So theoretically p3.2xlarge is 15x faster than p2.xlarge. I didn't test the K80, but I think in practice it is about 5-8x. A little bit about the Tesla V100 in p3.2xlarge: https://github.com/AlexeyAB/darknet/issues/407
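Putting the price question above next to these throughput numbers, a quick back-of-the-envelope check (using only the Gflops and prices quoted in this thread):

```shell
# Theoretical V100-vs-K80 speedup versus the price ratio of the two instances.
awk 'BEGIN {
  speedup = 134092 / 8736   # peak Gflops ratio (FP16 Tensor Cores vs K80)
  cost    = 3.06 / 0.90     # hourly price ratio, p3.2xlarge vs p2.xlarge
  printf "theoretical speedup: %.1fx, cost: %.1fx, perf per dollar: %.1fx\n",
         speedup, cost, speedup / cost
}'
```

So even at the theoretical peak, p3.2xlarge costs 3.4x more but is ~15x faster, i.e. roughly 4.5x more training per dollar; with the realistic 5-8x speedup estimate it still comes out ahead of p2.xlarge on cost per iteration.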
Also follow my recommendations for the 1st 1000 iterations: https://github.com/AlexeyAB/darknet/issues/1380#issuecomment-412380005
@AlexeyAB Should I use the `-gpus 0,1` flag after 1000 iterations even for `p3.2xlarge`?
@silvernine209
- p2.xlarge (2xGPU) - train with `CUDNN_HALF=0`
- p3.2xlarge (1xGPU) - train the 1st 1000 iterations with `CUDNN_HALF=0`, then continue with `CUDNN_HALF=1`
- p3.8xlarge (4xGPU) - train the 1st 1000 iterations with `CUDNN_HALF=0` and without `-gpus 0,1,2,3`, then continue with `CUDNN_HALF=1` and with the flag `-gpus 0,1,2,3`
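A sketch of how the p3.2xlarge schedule could be carried out in practice. The file names (`yolo-obj.cfg`, `darknet53.conv.74`, `backup/yolo-obj_1000.weights`) are assumptions based on darknet's usual layout, not taken from this thread, and the Makefile is edited with `sed` rather than passing flags to `make`:

```shell
# Stage 1: build with CUDNN_HALF=0 and train the first ~1000 iterations
sed -i 's/^CUDNN_HALF=1/CUDNN_HALF=0/' Makefile
make clean && make
./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74 -dont_show

# Stage 2: stop training (Ctrl-C), rebuild with CUDNN_HALF=1,
# and resume from the weights darknet saved to the backup folder
sed -i 's/^CUDNN_HALF=0/CUDNN_HALF=1/' Makefile
make clean && make
./darknet detector train data/obj.data yolo-obj.cfg backup/yolo-obj_1000.weights -dont_show
```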
@AlexeyAB Thank you!!
@AlexeyAB I've been trying to train on p2.xlarge with `-gpus 0,1` and I kept getting this error. I think p2.xlarge has just one GPU, not two.
@silvernine209 Yes, it looks like `p2.xlarge` uses only half of a K80 card, which contains 2 GK210 chips. I fixed my post.
Hi Alexey, thanks for your valuable help here. I'm using p3.2xlarge (Tesla V100). I have compiled darknet with the Makefile settings `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1` and everything runs smoothly. However, when instead of modifying the Makefile I set these values on the command line (i.e. `make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1`), I get errors both during training:
and testing. Testing with debug gives:
I would prefer to not modify the Makefile, if possible. Do you know why I get an error when passing options as command line arguments?
- Compile Darknet with `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1` in the Makefile
I have never managed to see the Loss-mAP chart on port 8090, although I have tried everything you said in the issues. Is it because I didn't install OpenCV on the EC2 instance? I am using p3.8xlarge and the Deep Learning AMI (Ubuntu 16.04) Version 29.0.
I compile darknet with OpenCV on my local PC. Afterwards, I just run training on the EC2 instance without installing OpenCV on it. It never gives an error, even if I re-compile darknet with `OPENCV=1` on the EC2 instance, and training goes on. But there's no chance to see the chart. I have tried lots of combinations for the security groups; for example, please see the screenshot below.
The training command is: `./darknet detector train /home/ubuntu/bdd100k/yolo_files/bdd100k.data cfg/option_1.cfg darknet53.conv.74 -dont_show -mjpeg_port 8090 -map`
Also, although I set the `learning_rate` to 0.001, it changes to 0.00003 or other values around 100 iterations. Following your advice and my experience with multi-GPU training, I set `CUDNN_HALF=0` and don't use the `-gpus 0,1,2,3` argument until 1000 iterations. So why is the learning rate changing?
@earcz Yes, you should install OpenCV. Without OpenCV:
> So why is the learning rate changing?

This comes from the original repository - it levels out the increased batch size when training on several GPUs.
@AlexeyAB
> Yes, you should install OpenCV.

Okay, I will try with OpenCV on EC2. As I understand from your response, I set the security groups correctly, right?

> This comes from the original repository - it levels out the increased batch size when training on several GPUs.

Yes, but I don't use several GPUs, just one, and it still happens. After 1000 iterations, I was planning to train with multi-GPU. Should I use p3.2xlarge to get rid of this problem? Otherwise, what is the point of training a network uncontrollably - am I wrong? Moreover, I have created a P-Diagram and a configuration matrix for benchmarking hyper-parameters to find the best training configuration. In this scenario, the learning rate is changing unpredictably. How can I control it without trading off training speed?
> Also, although I set the learning_rate as 0.001, it is changing to 0.00003 or some different values around 100 iterations.

It's the `burn_in` (warm-up) for the first 1000 iterations, don't worry.
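The changing value is exactly this warm-up ramp: during `burn_in`, darknet scales the rate as `learning_rate * (iteration / burn_in)^power`, with `power=4` by default. A quick sketch of the ramp, assuming `learning_rate=0.001` and `burn_in=1000` (the values discussed above):

```shell
# Learning-rate warm-up curve: lr * (iter/burn_in)^4 over the first 1000 iterations
awk 'BEGIN {
  lr = 0.001; burn_in = 1000
  for (i = 100; i <= 1000; i += 300)
    printf "iter %4d: lr = %.2e\n", i, lr * (i / burn_in) ^ 4
}'
```

Under this formula the rate only reaches the configured 0.001 at iteration 1000; the ~0.00003 reported above corresponds to roughly iteration 400 (0.001 * 0.4^4 ≈ 0.0000256).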
For training on 4x GPUs: `batch/4`, `subdivisions/4`, `learning_rate/4`, `max_batches*4`, `steps*4`
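Applied to typical starting values, those /4 and *4 rules work out as follows (the starting values here - `batch=64`, `subdivisions=16`, `learning_rate=0.001`, `max_batches=500200` - are common cfg defaults assumed for illustration, not taken from this thread):

```shell
# Scale the cfg values for 4 GPUs per the rules quoted above
awk 'BEGIN {
  gpus = 4
  batch = 64; subdivisions = 16; lr = 0.001; max_batches = 500200
  printf "batch=%d subdivisions=%d learning_rate=%g max_batches=%d\n",
         batch / gpus, subdivisions / gpus, lr / gpus, max_batches * gpus
}'
```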
@silvernine209
- p2.xlarge (2xGPU) - train with `CUDNN_HALF=0`
- p3.2xlarge (1xGPU) - train the 1st 1000 iterations with `CUDNN_HALF=0`, then continue with `CUDNN_HALF=1`
- p3.8xlarge (4xGPU) - train the 1st 1000 iterations with `CUDNN_HALF=0` and without `-gpus 0,1,2,3`, then continue with `CUDNN_HALF=1` and with the flag `-gpus 0,1,2,3`
@AlexeyAB If we are expected to change `CUDNN_HALF` after 1000 iterations, what is the best way to go about that in practice? Should we set max iterations to 1000 in the cfg file initially and then change it? Do we need to re-compile darknet after 1000 iterations before resuming training? Is there a better approach than making these changes manually?
> Also, although I set the learning_rate as 0.001, it is changing to 0.00003 or some different values around 100 iterations.

It's the `burn_in` (warm-up) for the first 1000 iterations, don't worry.

For training on 4x GPUs:
- sometimes it is better to set `batch/4`, `subdivisions/4`, `learning_rate/4`, `max_batches*4`, `steps*4`
- sometimes just train as usual
Thanks, @AlexeyAB!
Shouldn't the batch size be bigger given that we are now using 4 GPUs? Usually for data-parallel multi-GPU training I would expect `batch=64*4` instead of `batch=64/4`. Am I missing something? Thanks for the clarification.
I have some credit on AWS EC2 and would like to train using it. I did some research, but I'm still clueless about how I can utilize AWS EC2 to train my darknet model from my Windows setup... any help please?