CUDA Error Prev: an illegal memory access was encountered

vamsiduranc commented 4 years ago

We have downloaded latest master branch code and compiled darknet using Cmake-GUI. We are encountering an error "CUDA Error Prev: an illegal memory access was encountered" at specific interval of time. Can you please let us know how can we fix this issue?

Below are the details:

 7218: 0.624855, 0.829652 avg loss, 0.002000 rate, 1.766000 seconds, 923904 images
Resizing to initial size: 416 x 416  try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!

 calculation mAP (mean average precision)...
4
 CUDA Error Prev: an illegal memory access was encountered

CUDA Error Prev: an illegal memory access was encountered: No error

Current Server Details: Operating System: Windows Server 2016 Processor: Intel Xeon E5-2690 v3 2.6GHz RAM: 112GB GPU Card: Tesla K80 - 2 Nos.

 CUDA-version: 10010 (10010), cuDNN: 7.6.5, GPU count: 2
 OpenCV version: 3.4.0
 compute_capability = 370, cudnn_half = 0
net.optimized_memory = 0

Please let us know if you need more information. Thanks in advance!

AlexeyAB commented 4 years ago

Can you successfully calculate the mAP?
Can you successfully train or calculate the mAP with flag in command line -cuda_debug_sync ?

vamsiduranc commented 4 years ago

@AlexeyAB Thank you for your reply...

I have tried starting training using -cuda_debug_sync option you suggested and we were able to capture error message when training fails. Below is the error received when training exited:

 (next mAP calculation at 6046 iterations)
 6046: 1.269026, 1.584092 avg loss, 0.002000 rate, 3.268000 seconds, 773888 images
Resizing to initial size: 416 x 416  try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!

 calculation mAP (mean average precision)...
4
 cudaError_t status = cudaDeviceSynchronize() Error in: file: C:/TrainingSoftware/darknet-master/src/convolutional_kernels.cu : forward_convolutional_layer_gpu() : line: 544 : build time: Mar 20 2020 - 04:06:14 
CUDA status = cudaDeviceSynchronize() Error: file: C:\TrainingSoftware\darknet-master\src\dark_cuda.c : cuda_get_device() : line: 46 : build time: Mar 20 2020 - 04:06:20

 CUDA Error: an illegal memory access was encountered

CUDA Error: an illegal memory access was encountered: No error

I was able to calculate mAP for 6000 weight using -cuda_debug_sync option mAP Result:

 for conf_thresh = 0.25, precision = 0.80, recall = 0.62, F1-score = 0.70
 for conf_thresh = 0.25, TP = 30345, FP = 7729, FN = 18289, average IoU = 60.73 %

 IoU threshold = 50 %, used Area-Under-Curve for each unique Recall
 mean average precision (mAP@0.50) = 0.645502, or 64.55 %
Total Detection Time: 394 Seconds

Request you please look into this and let us know how can we fix this issue.

AlexeyAB commented 4 years ago

I have tried starting training using -cuda_debug_sync option you suggested and we were able to capture error message when training fails.

Can you show the error message if you train with 2 flags -cuda_debug_sync -benchmark_layers ?
Can you show the error message if you Compile with CUDNN=0 CUDNN_HALF=0 and train with 2 flags -cuda_debug_sync -benchmark_layers ?
Attach your cfg-file
Do you train with 1 or 2 GPUs?
Show output of commands
```
nvidia-smi
nvcc --version
```

So it seems that the error is there: https://github.com/AlexeyAB/darknet/blob/92e6e8eece3b789616430cad51a5afdb5a3153fc/src/convolutional_kernels.cu#L532-L544

vamsiduranc commented 4 years ago

Dear @AlexeyAB

Below are the responses:

Can you show the error message if you train with 2 flags -cuda_debug_sync -benchmark_layers ?

Here is the output of the error: CudaError_CUDNN=1.txt

Can you show the error message if you Compile with CUDNN=0 CUDNN_HALF=0 and train with 2 flags -cuda_debug_sync -benchmark_layers ?

We are using Windows Server 2016 for training (Not Linux). I am not sure how to set CUDNN=0

Attach your cfg-file

Here is the CFG duranc_all_tiny_3l_v9_11_LPR.txt

Do you train with 1 or 2 GPUs?

Currently using 2 GPUs (2 - K80 Cards). Trained for 1000 iterations with single GPU and then used -gpus 0,1 option

Show output of commands

I am using Windows 2016 Server.

AlexeyAB commented 4 years ago

@vamsiduranc Thanks! I added minor fix.

Download the latest version of Darknet.

Ca you again show error message in two cases:

Can you show the error message if you train with 2 flags -cuda_debug_sync -benchmark_layers ?
Then open \darknet.sln in MSVS -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions remove CUDNN; and CUDNN_HALF; recompile, run training with 2 flags -cuda_debug_sync -benchmark_layers and show error message

Also

Show content of obj.data file
Show both command from Training and for mAP calculation

vamsiduranc commented 4 years ago

@AlexeyAB here are the details:

I have downloaded latest version and compiled by removing CUDNN and CUDNN_HALF Darknet_CUDNN=0_CUDNN_HALF=0

When I start training using below command, training did NOT start...

darknet.exe detector train cfg/duranc_all_tiny_3l_v9_11_LPR.data cfg/duranc_all_tiny_3l_v9_11_LPR.cfg yolov3-tiny.conv.15 -map -cuda_debug_sync -benchmark_layers

Error Output: ERROR_cudnn=0_cudnn_half=0_23Mar2020.txt

Show content of obj.data file

classes = 44
train = cfg/duranc_all_tiny_3l_v9_11_LPR_train.txt
valid = cfg/duranc_all_tiny_3l_v9_11_LPR_valid.txt
names = cfg/duranc_all_tiny_3l_v9_11_LPR.names
backup = backup

Show both command from Training and for mAP calculation

Training:

darknet.exe detector train cfg/duranc_all_tiny_3l_v9_11_LPR.data cfg/duranc_all_tiny_3l_v9_11_LPR.cfg yolov3-tiny.conv.15 -map -cuda_debug_sync -benchmark_layers

mAP Calculation:

darknet.exe detector map cfg/duranc_all_tiny_3l_v9_11_LPR.data cfg/duranc_all_tiny_3l_v9_11_LPR_416.cfg backup/duranc_all_tiny_3l_v9_11_LPR_1000.weights -cuda_debug_sync -benchmark_layers

AlexeyAB commented 4 years ago

open \darknet.sln in MSVS -> (right click on project) -> properties -> CUDA C/C++ -> Device-> Code Generation Add at the begining of line compute_30,sm_30;
Add line printf(" fill_ongpu: N = %d, X = %p \n", N, X); between these two lines https://github.com/AlexeyAB/darknet/blob/864d1062f875e7fb61be053c784d86f9105d4c6d/src/blas_kernels.cu#L812-L813

Recompile and run training with -cuda_debug_sync -benchmark_layers and show error message
Also show again such message

CUDA-version: 10010 (10010), cuDNN: 7.6.5, GPU count: 2 OpenCV version: 3.4.0 compute_capability = 370, cudnn_half = 0 net.optimized_memory = 0

vamsiduranc commented 4 years ago

@AlexeyAB Here are the details:

open \darknet.sln in MSVS -> (right click on project) -> properties -> CUDA C/C++ -> Device-> Code Generation Add at the begining of line compute_30,sm_30;

These two options are already set before. So I did not add them. Below is the screenshot

Add line printf(" fill_ongpu: N = %d, X = %p \n", N, X); between these two lines

Added printf in _src/blaskernels.cu file as suggested...

Recompiled darknet and started training with -cuda_debug_sync -benchmark_layers flags. Here is the error message: Darknet_Error_23Mar2020.txt

If you would like to connect to server at look into the issue, I will be happy to share zoom link to join. OR if you want to continue offline via Git... I am okay to provide information as and when needed. Please let me know.. Thanks in Advance!

AlexeyAB commented 4 years ago

Try to download new code

Add at the begining of line compute_30,sm_30;

These two options are already set before

Sorry, I meant compute_37,sm_37;

recompile and run with -cuda_debug_sync -benchmark_layers flags.

Also try to train with 1 GPU, do you get this error?

vamsiduranc commented 4 years ago

@AlexeyAB

Steps Performed:

Downloaded latest darknet software
Removed CUDNN and CUDNN_HALF
Added compute_37,sm_37;
Added printf after Line 816 in src\blas_kernels.cu file
Recompiled darknet and started training with -cuda_debug_sync -benchmark_layers flags flags

Here is the error output: Darknet_Error_23Mar2020_1.txt

Also try to train with 1 GPU, do you get this error?

We have one more server with single GPU-QuadroP6000. We downloaded latest code you committed last week and started training on March 21, 2020 and it is still going on without any error.

Side Note: Currently the server which we are using is from Microsoft Azure (Standard NC12_Promo (12 vcpus, 112 GiB memory) which is having dual GPU - K80 Cards.

vamsiduranc commented 4 years ago

@AlexeyAB Any further updates on this issue?

AlexeyAB commented 4 years ago

@vamsiduranc This is very strange issue with GPU.

Try several options and start training with each of them with a flags -cuda_debug_sync -benchmark_layers flags and show error message for each case:

Un-comment this line: https://github.com/AlexeyAB/darknet/blob/a234a5022333c930de08f2470184ef4e0c68356e/src/blas_kernels.cu#L815
Comment this line: https://github.com/AlexeyAB/darknet/blob/a234a5022333c930de08f2470184ef4e0c68356e/src/blas_kernels.cu#L816
Don't remove CUDNN;CUDNN_HALF;

vamsiduranc commented 4 years ago

@AlexeyAB Here are the test results...

Point No.1: Un-comment the line:

Error Details: Darknet_Error_24Mar2020.txt

Point 2: Comment the line:

Error Details: Darknet_Error_24Mar2020_1.txt

AlexeyAB commented 4 years ago

@vamsiduranc Thanks! Can you do the same with CUDNN; without CUDNN_HALF;

vamsiduranc commented 4 years ago

@AlexeyAB here are the details without CUDNN_HALF

Option 1: Error Details: Darknet_Error_25Mar2020.txt

Option 2: Error Details: Darknet_Error_25Mar2020_1.txt

AlexeyAB commented 4 years ago

@vamsiduranc

Do you get this error if you train with random=0 in the last [yolo] layer?
Just to find an error place, try to train with 1 GPU, and show error message.

Also try to train with 1 GPU, do you get this error?

We have one more server with single GPU-QuadroP6000. We downloaded latest code you committed last week and started training on March 21, 2020 and it is still going on without any error.

Side Note: Currently the server which we are using is from Microsoft Azure (Standard NC12_Promo (12 vcpus, 112 GiB memory) which is having dual GPU - K80 Cards.

vamsiduranc commented 4 years ago

@AlexeyAB

Do you get this error if you train with random=0 in the last [yolo] layer?

YES. We see error and training is not getting started. Here is the error message: Darknet_Error_26Mar2020.txt

Just to find an error place, try to train with 1 GPU, and show error message.

Currently there is training going on that server and we do not see error in that server. Just curious to know, we have not started training with -gpus 0,1 on this current server. We are just trying to start with 1 GPU only. Will the command darknet.exe detector train cfg/duranc_all_tiny_3l_v9_11_LPR.data cfg/duranc_all_tiny_3l_v9_11_LPR.cfg yolov3-tiny.conv.15 -map -cuda_debug_sync -benchmark_layers still use 2 GPUs?

AlexeyAB commented 4 years ago

Will the command darknet.exe detector train cfg/duranc_all_tiny_3l_v9_11_LPR.data cfg/duranc_all_tiny_3l_v9_11_LPR.cfg yolov3-tiny.conv.15 -map -cuda_debug_sync -benchmark_layers still use 2 GPUs?

No, this command uses only 1 GPU.

Did you start it on the same GPU K80 (2 x chips) server?

If yes, then may be the issue in the multi-GPU training on old GPUs.

Do you get this error if you train with random=0 in the last [yolo] layer?

YES. We see error and training is not getting started. Here is the error message: Darknet_Error_26Mar2020.txt

Did you start training with commented line fill_ongpu() ?

vamsiduranc commented 4 years ago

@AlexeyAB ,

Did you start it on the same GPU K80 (2 x chips) server?

Yes. I got this error when I started training in multi-gpu server (K80 x 2 cards)

If yes, then may be the issue in the multi-GPU training on old GPUs.

What we should do now? Do we need to change the server to single K80 card? We see memory error with single K80 card server when we set subdivisions=4 so we have set it to subdivisions=16 and started training but training is becoming we slow and taking lot of time to process images. That is the reason, we took server with 2 x K80 cards to speedup the training process. But this seems to be not working. Can you please suggest what can be done now?

Did you start training with commented line fill_ongpu() ?

Below is the screenshot from blas_kernels.cu file. Let me know if I have to try something else...

AlexeyAB commented 4 years ago

@vamsiduranc Thanks!

Just to find an error place, try to train with 1 GPU, and show error message.

Currently there is training going on that server and we do not see error in that server.

Did you start it on the same GPU K80 (2 x chips) server?

Yes. I got this error when I started training in multi-gpu server (K80 x 2 cards)

I mean did you start training on GPU K80 (2 x chips) server by using only 1 GPU (wihtout -gpus 0,1) and you didn't get any error?
Do you get an error if you use -gpus 0 ?
Do you get an error if you use -gpus 0,0 (increase subdivisions)?
Also why did you chose to use old K80 server (september 2014) 5 Tflops rather than RTX 2080 (september 2018) 10 TFlops $800?

vamsiduranc commented 4 years ago

@AlexeyAB

I mean did you start training on GPU K80 (2 x chips) server by using only 1 GPU (wihtout -gpus 0,1) and you didn't get any error?

Yes. I started training on 2xK80 GPU Server without -gpus 0,1 and got error.

Do you get an error if you use -gpus 0 ?

We see below error:

Resizing, random_coef = 1.40

 608 x 608
 try to allocate additional workspace_size = 517.00 MB
 CUDA allocate done!
Loaded: 0.000000 seconds
 fill_ongpu: N = 94633984, X = 0000001F74880000
CUDA status Error: file: E:/darknet-master/src/blas_kernels.cu : simple_copy_ongpu() : line: 717 : build time: Mar 25 2020 - 04:00:43

 CUDA Error: invalid device function
CUDA Error: invalid device function: No error
Assertion failed: 0, file ..\..\src\utils.c, line 325

Do you get an error if you use -gpus 0,0 (increase subdivisions)?

We see below error:

Resizing, random_coef = 1.40

 608 x 608
 try to allocate additional workspace_size = 517.00 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 517.00 MB
 CUDA allocate done!
Loaded: 0.000000 seconds
 fill_ongpu: N = 94633984, X = 000000202F400000
 fill_ongpu: N = 94633984, X = 0000001F74880000
CUDA status Error: file: E:/darknet-master/src/blas_kernels.cu : simple_copy_ongpu() : line: 717 : build time: Mar 25 2020 - 04:00:43
CUDA status Error: file: E:/darknet-master/src/blas_kernels.cu : simple_copy_ongpu() : line: 717 : build time: Mar 25 2020 - 04:00:43

 CUDA Error: invalid device function

 CUDA Error: invalid device function

CUDA Error: invalid device function: No error
Assertion failed: 0, file ..\..\src\utils.c, line 325

Also why did you chose to use old K80 server (september 2014) 5 Tflops rather than RTX 2080 (september 2018) 10 TFlops $800?

We could not find RTX2080 cards in Azure Cloud GPU Server Family. Only Below GPU cards are available on Azure...

Tesla K80 GPU
Tesla P100 GPU
Tesla V100 GPU
Tesla P40 GPU
Tesla M60 GPU

Note: I had a zip file of OLD Darknet downloaded on 21-Nov-2019. So I ran darknet using this version and I was able to start training on this 2xK80 Server. Hope this version might be helpful in debugging what is conflicting with your latest code. Here is the link to download zip file https://1drv.ms/u/s!Ak40GK_JLROjgtMF-pk9ikUroPqvdA?e=RfAwWS

AlexeyAB commented 4 years ago

I would suggest you to use P100 or V100.
Also can you try successfully if you use random=0 in the last [yolo] layer?

I had a zip file of OLD Darknet downloaded on 21-Nov-2019. So I ran darknet using this version and I was able to start training on this 2xK80 Server. Hope this version might be helpful in debugging what is conflicting with your latest code. Here is the link to download zip file https://1drv.ms/u/s!Ak40GK_JLROjgtMF-pk9ikUroPqvdA?e=RfAwWS

Thanks, I will check what was changed. Do you use the same server/OS/CUDA/drivers... for both Old and New Darknet?

vamsiduranc commented 4 years ago

@AlexeyAB

I would suggest you to use P100 or V100.

Thank you. We will check the pricing and feasibility...

Also can you try successfully if you use random=0 in the last [yolo] layer?

Yes. I tried with this option but still we get this error... Please check if there is any issue in CFG as well. K80_CFG.txt

Do you use the same server/OS/CUDA/drivers... for both Old and New Darknet?

Yes. I have tested both versions on same 2xK80 server.

Request: On a separate note, our CTO wants to interact with you personally over email. Can you please send a test email to my email id vamsi@duranc.com. Once I receive your email, I will have our CTO interact with you. Thanks in Advance!

vamsiduranc commented 4 years ago

@AlexeyAB any update for today?

AlexeyAB commented 4 years ago

Check that you set compute_37,sm_37; in open \darknet.sln in MSVS -> (right click on project) -> properties -> CUDA C/C++ -> Device-> Code Generation

Try to change these lines: https://github.com/AlexeyAB/darknet/blob/9a2344759b119faf9df98dbeed81650c03650ecd/src/blas_kernels.cu#L731-L736 To these:

extern "C" void simple_copy_ongpu(int size, float *src, float *dst)
{
cudaError_t status = cudaMemcpyAsync(dst, src, size*sizeof(float), cudaMemcpyDefault, get_cuda_stream());
CHECK_CUDA(status);
const int num_blocks = size / BLOCK + 1;
//simple_copy_kernel << <num_blocks, BLOCK, 0, get_cuda_stream() >> >(size, src, dst);
CHECK_CUDA(cudaPeekAtLastError());
}

compile with GPU;CUDNN; and without CUDNN_HALF, open \darknet.sln in MSVS -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions
Try to train with random=1 and -gpus 0,1

Show me the error message

vamsiduranc commented 4 years ago

@AlexeyAB

Downloaded latest software from Git.

Check that you set compute_37,sm_37; in open \darknet.sln in MSVS -> (right click on project) -> properties -> CUDA C/C++ -> Device-> Code Generation

Changes done in darknet/src/blas_kernels.cu

compile with GPU;CUDNN; and without CUDNN_HALF, open \darknet.sln in MSVS -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions

Try to train with random=1 and -gpus 0,1

Training Command Used:

darknet.exe detector train cfg/duranc_all_tiny_3l_v9_11_LPR.data cfg/duranc_all_tiny_3l_v9_11_LPR.cfg yolov3-tiny.conv.15 -map -cuda_debug_sync -benchmark_layers -gpus 0,1

Here is the Error: Darknet_Error_27Mar2020.txt

AlexeyAB commented 4 years ago

Thanks. Try also comment this line: https://github.com/AlexeyAB/darknet/blob/a234a5022333c930de08f2470184ef4e0c68356e/src/blas_kernels.cu#L816

vamsiduranc commented 4 years ago

@AlexeyAB

I do not see fill_kernel in Line 816. Below is the screenshot for reference.

I see fill_kernel in line 842

Let me know which line to comment?

AlexeyAB commented 4 years ago

This

I see fill_kernel in line 842

vamsiduranc commented 4 years ago

@AlexeyAB

I have commented fill_kernel in line 842 and compiled darknet, then started training. Here is the error message: Darknet_Error_30Mar2020.txt

Also, what I observed when I tried to start training on this 2 x K80 server using old darknet which was downloaded on 21-Nov-2019 is that the training is failing for every 6000 to 7000 iterations with below error message.

v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 30 Avg (IOU: 0.618155, GIOU: 0.603350), Class: 0.845800, Obj: 0.330447, No Obj: 0.000843, .5R: 0.750000, .75R: 0.166667, count: 24
Syncing... Done!

 (next mAP calculation at 29958 iterations)
 29960: 1.595268, 1.345012 avg loss, 0.002000 rate, 1.568000 seconds, 3834880 images
Resizing to initial size: 416 x 416
 try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!

 calculation mAP (mean average precision)...
4CUDA Error Prev: an illegal memory access was encountered

CUDA Error Prev: an illegal memory access was encountered: No error
Assertion failed: 0, file ..\..\src\utils.c, line 295

AlexeyAB commented 4 years ago

@vamsiduranc

Request: On a separate note, our CTO wants to interact with you personally over email. Can you please send a test email to my email id vamsi@duranc.com. Once I receive your email, I will have our CTO interact with you. Thanks in Advance!

I wrote you an email.

This is strange that you get an error for the simplest function in which there are definitely no errors, even if synchronization occurs before it.

Perhaps something is wrong with the: CUDA, GPU driver, CUDA-Arch, or setting some flags.
Do you compile Darknet on the same server where you run it?
Try to comment this line: https://github.com/AlexeyAB/darknet/blob/0e063371500bc998584aa58313cee04b5cf354c4/src/darknet.c#L465

CUDA status Error: file: E:/darknet-master/src/blas_kernels.cu : fill_ongpu() : line: 843 : build time: Mar 27 2020 - 14:59:12

CUDA Error: invalid device function CUDA status Error: file: E:/darknet-master/src/blas_kernels.cu : fill_ongpu() : line: 843 : build time: Mar 27 2020 - 14:59:12

CUDA Error: invalid device function

CUDA status Error: file: E:/darknet-master/src/blas_kernels.cu : simple_copy_ongpu() : line: 717 : build time: Mar 25 2020 - 04:00:43

CUDA Error: invalid device function

CUDA status Error: file: E:/darknet-master/src/blas_kernels.cu : fill_ongpu() : line: 843 : build time: Mar 27 2020 - 14:59:12

CUDA Error: invalid device function CUDA status Error: file: E:/darknet-master/src/blas_kernels.cu : fill_ongpu() : line: 843 : build time: Mar 27 2020 - 14:59:12

Loaded: 0.000000 seconds CUDA status Error: file: E:/darknet-master/src/activation_kernels.cu : activate_array_ongpu() : line: 399 : build time: Mar 29 2020 - 16:05:13

vamsiduranc commented 4 years ago

@AlexeyAB

I wrote you an email.

Thank You. Received your email and forwarded your details to our CTO.

Today I have uninstalled GPU Drivers, CUDA and CUDNN from this K80 server and reinstalled. Then downloaded latest software version from GIT and compiled just by removing CUDNN_HALF and inserting compute_37,sm_37;

Started training with darknete.exe and it started without any error. But received error at regular intervals.

 (next mAP calculation at 5958 iterations)
 5958: 0.859459, 0.952438 avg loss, 0.002000 rate, 2.359000 seconds, 762624 images, 211.002743 time left
Resizing to initial size: 416 x 416  try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!

 calculation mAP (mean average precision)...
4
 cudaError_t status = cudaDeviceSynchronize() Error in: file: E:/darknet-master/src/convolutional_kernels.cu : forward_convolutional_layer_gpu() : line: 544 : build time: Mar 30 2020 - 14:37:09
CUDA status = cudaDeviceSynchronize() Error: file: ..\..\src\dark_cuda.c : cuda_get_device() : line: 46 : build time: Mar 30 2020 - 14:37:16

 CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: No error
Assertion failed: 0, file ..\..\src\utils.c, line 325

 (next mAP calculation at 17958 iterations)
 17958: 0.781567, 0.846414 avg loss, 0.002000 rate, 3.618000 seconds, 2298624 images, 190.783013 time left
Resizing to initial size: 416 x 416  try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!

 calculation mAP (mean average precision)...
4
 cudaError_t status = cudaDeviceSynchronize() Error in: file: E:/darknet-master/src/convolutional_kernels.cu : forward_convolutional_layer_gpu() : line: 544 : build time: Mar 30 2020 - 14:37:09
CUDA status = cudaDeviceSynchronize() Error: file: ..\..\src\dark_cuda.c : cuda_get_device() : line: 46 : build time: Mar 30 2020 - 14:37:16

 CUDA Error: an illegal memory access was encountered

CUDA Error: an illegal memory access was encountered: No error
Assertion failed: 0, file ..\..\src\utils.c, line 325

AlexeyAB commented 4 years ago

So try to train without -map flag. You can check the mAP later by using command ./darknet detector map obj.data my.cfg my.weights

vamsiduranc commented 4 years ago

@AlexeyAB One observation please...

Training stopped at 17958 iterations. So when I try to restart training using below below command using xxxx_17000.weight file, the next iteration number in training starts from 19000. Can you please let us know what went wrong or is it expected behavior?

 (next mAP calculation at 17958 iterations)
 17958: 0.781567, 0.846414 avg loss, 0.002000 rate, 3.618000 seconds, 2298624 images, 190.783013 time left
Resizing to initial size: 416 x 416  try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 1245.71 MB
 CUDA allocate done!

 calculation mAP (mean average precision)...
4
 cudaError_t status = cudaDeviceSynchronize() Error in: file: E:/darknet-master/src/convolutional_kernels.cu : forward_convolutional_layer_gpu() : line: 544 : build time: Mar 30 2020 - 14:37:09
CUDA status = cudaDeviceSynchronize() Error: file: ..\..\src\dark_cuda.c : cuda_get_device() : line: 46 : build time: Mar 30 2020 - 14:37:16

 CUDA Error: an illegal memory access was encountered

CUDA Error: an illegal memory access was encountered: No error
Assertion failed: 0, file ..\..\src\utils.c, line 325

Command used:

darknet.exe detector train cfg/duranc_all_tiny_3l_Az_10_1.data cfg/duranc_all_tiny_3l_Az_10_1.cfg backup/duranc_all_tiny_3l_Az_10_1_17000.weights -gpus 0,1 -cuda_debug_sync

Training Output is here: Darknet_Training_Output_31Mar2020.txt

You can see the backup folder screenshot below.... It missed few iterations whenever training is restarted after error message

AlexeyAB commented 4 years ago

@vamsiduranc

Do you use the latest Darknet version?
Do you use random=1 and/or dynamic_minibatch=1 ?
Did you change batch= value?
It seems seen value was saved incorrectly to the weights-file. I added fix: https://github.com/AlexeyAB/darknet/commit/2f9f4a40d34bb344e5c6667fa34dae4caa8e2aa3

vamsiduranc commented 4 years ago

@AlexeyAB

Do you use the latest Darknet version?

YES. Downloaded latest version and compiled.

Do you use random=1 and/or dynamic_minibatch=1 ?

random=1

Did you change batch= value?

learning_rate=0.001
burn_in=1000
max_batches = 200000
policy=steps
steps=180000,190000
scales=.1,.1

Regarding: https://github.com/AlexeyAB/darknet/issues/5075#issuecomment-606674922 I have started training without -map flag and training seems to be continuous. Started at 17000 iteration yesterday and now it crossed 59000 iteration without any error.

I would like to thank you for your patience and effort in making this happen. Thank You!

vamsiduranc commented 4 years ago

@AlexeyAB

One quick clarification please... We are observing that Yolo doesn't detect objects unless there is movement. If they are standing there for some time and not moving it stops detection. For example, In a traffic scenario, our neural network detects vehicle when they are crossing the junction. Suppose if any vehicle stops at the signal point, then detection of the vehicle stops (bounding box around the vehicle disappears). It once again starts detection when the vehicle starts moving from that place.

Can you please let us know what has to be done to keep the bounding boxes appear all the time. Thanks in Advance!

AlexeyAB commented 4 years ago

@vamsiduranc

What cfg-file do you use?
Do you use ./darknet detector demo ... or your own code?
Can you show video with example, preferably with several examples?

vamsiduranc commented 4 years ago

@AlexeyAB Sorry for the delay! Below are the responses...

What cfg-file do you use?

Here is the current CFG file in use: CFG.txt

Do you use ./darknet detector demo ... or your own code?

We see this behavior in darknet detector demo

Can you show video with example, preferably with several examples?

Below is the link for recorded video of the detection. You can see clearly when the cars stop at the lane (towards left of the video window where there is a white line) , the detection of the vehicle stops (bounding box disappears) and when the vehicle starts moving, detection happens.

https://1drv.ms/u/s!Ak40GK_JLROjgvsWnSK0HJWCnozWAA?e=D7AJgT

Can you please let us know what has to be done to improve training so that detection happens all the time? Thank You!

AlexeyAB commented 4 years ago

@vamsiduranc

The reason is not what car stop, but because the car in the background at this moment is not suitable. The models isn't trained well.

Did you train it by yourself and what dataset did you use?
Do you use separate train and valid datrasets?
What mAP do you get?
Try to use default yolov3-tiny.cfg/weights model that is already trained on MS COCO with this videofile

vamsiduranc commented 4 years ago

@AlexeyAB

Did you train it by yourself and what dataset did you use?

We have our own data set and train custom objects. We are not using COCO data set.

Do you use separate train and valid datrasets?

We have separate train and valid data sets with below counts. Validation set will not contain any of the training set images (both are different set of images). Training Set: 158488 images, Validation Set: 17561 images

What mAP do you get?

Training got completed till 145K iterations and video was tested with 145K weight file. Below are the mAP statistics.

_145000
 for conf_thresh = 0.25, precision = 0.81, recall = 0.79, F1-score = 0.80
 for conf_thresh = 0.25, TP = 49838, FP = 11617, FN = 13504, average IoU = 63.32 %

 IoU threshold = 50 %, used Area-Under-Curve for each unique Recall
 mean average precision (mAP@0.50) = 0.747307, or 74.73 %
Total Detection Time: 1646 Seconds

Try to use default yolov3-tiny.cfg/weights model that is already trained on MS COCO with this videofile

Command used: darknet.exe detector demo cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights data/videos/bosch-traffic-1.mp4

Here is the recorded video clip using COCO. In the recorded video we see lot of double detection and false classes displayed on the vehicles. We see same type of behavior with COCO set as well. https://1drv.ms/u/s!Ak40GK_JLROjgvsat76Mf41RQMLyqw?e=9qCb0x

Below are the annotations count for each class we used in this training.

Based on the above data, what would be your suggestions or how would you train the data?

vamsiduranc commented 4 years ago

@AlexeyAB Can you please let us know if there is any update on this?

AlexeyAB commented 4 years ago

for conf_thresh = 0.25, precision = 0.81, recall = 0.79, F1-score = 0.80 for conf_thresh = 0.25, TP = 49838, FP = 11617, FN = 13504, average IoU = 63.32 %

IoU threshold = 50 %, used Area-Under-Curve for each unique Recall mean average precision (mAP@0.50) = 0.747307, or 74.73 % Total Detection Time: 1646 Seconds

This is normal mAP
What AP for car do you get?
Why did you chose yolov3-tiny.cfg instead of yolov3.cfg or yolov3-spp.cfg?
I would recommend you test this video by using this model that is trained on COCO:
- cfg: https://drive.google.com/open?id=15WhN7W8UZo7-4a0iLkx11Z7_sDVHU4l1
- MS COCO weights: https://drive.google.com/open?id=1ULnPnamS5A6lOgidlBXD24IdxoDAFaaV
If speed and accuracy are good enough - then train this model on your dataset with -clear flag: ./darknet detector train obj.data cd53paspp-gamma.cfg cd53paspp-gamma_final.weights -clear

vamsiduranc commented 4 years ago

@AlexeyAB

Why did you chose yolov3-tiny.cfg instead of yolov3.cfg or yolov3-spp.cfg?

We chose yolov3-tiny.cfg as it was providing more fps than yolov3.cfg. We have not tested/used yolov3-spp.cfg.

I would recommend you test this video by using this model that is trained on COCO:

I have tested the video using CFG and Weight file you provided. Here is the link for recorded video: https://1drv.ms/v/s!Ak40GK_JLROjgvs7n_snsfsAeTQ5wA?e=m0f52z

I have observed while testing the video is that FPS was only 7 (Please see image below). With the cfg you gave, recording took 20 to 25 minutes to process 15 minutes video. We get 3 to 4 times more fps when we run yolov3-tiny.cfg. We need more FPS as we run analytics on live feed but not on recorded video.

One questions please: In your documentation, under section How to improve object detection you have mentioned below point.

Only if you are an expert in neural detection networks - recalculate anchors for your dataset for width and height from cfg-file: darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416 then set the same 9 anchors in each of 3 [yolo]-layers in your cfg-file. But you should change indexes of anchors masks= for each [yolo]-layer, so that 1st-[yolo]-layer has anchors larger than 60x60, 2nd larger than 30x30, 3rd remaining. Also you should change the filters=(classes + 5)* before each [yolo]-layer. If many of the calculated anchors do not fit under the appropriate layers - then just try using all the default anchors.

Based on this point, we have adjusted anchors in masks in CFG file. Please see here. Can you please look into it and confirm that our approach in tweaking the masks is correct? or if you think we have done a mistake, kindly adjust the masks for this CFG and provide us so we can understand the correct way of adjusting the masks. Thank You!

AlexeyAB commented 4 years ago

Based on this point, we have adjusted anchors in masks in CFG file.

I would recommend you to train with:

mask = 3,4,5,6,7,8
mask = 1,2
mask = 0

I have observed while testing the video is that FPS was only 7 (Please see image below). With the cfg you gave, recording took 20 to 25 minutes to process 15 minutes video. We get 3 to 4 times more fps when we run yolov3-tiny.cfg. We need more FPS as we run analytics on live feed but not on recorded video.

So you should chose

or use yolov3-tiny.cfg with the current bad accuracy and good FPS
or buy GPU RTX2070 / Jetson Xavier and run yolov3-spp.cfg with good accuracy and good FPS

vamsiduranc commented 4 years ago

Hello @AlexeyAB

Hope everything is fine at your end.

We have observed that, with the darknet.exe we compiled using the code (downloaded April 4, 2020) we are not getting xxx_best.weights file. We used to get this _best.weights file in our previous training. Can you please let us know if we need to tweak code to get this _best.weight file generated? .

AlexeyAB commented 4 years ago

@vamsiduranc I don't have such issue. Do you train with -map flag? Can you show chart.png image with Loss & mAP chart?

vamsiduranc commented 4 years ago

@AlexeyAB

Do you train with -map flag?

As suggested by you in the link https://github.com/AlexeyAB/darknet/issues/5075#issuecomment-606674922 we have started training without -map option. Without -map option training is working otherwise it was throwing error message after few iterations... please see here: https://github.com/AlexeyAB/darknet/issues/5075#issuecomment-606509846.

Can you show chart.png image with Loss & mAP chart?

As we have removed -map option, graph is only plotting average loss but not mAP.

So how do we get _best.weight file without using -map command?

AlexeyAB commented 4 years ago

So how do we get _best.weight file without using -map command?

You can't get _best.weights. Just use the _last.weights file, in the most cases it is the best.

Also there were some fixes for training process, you can try to train with -map flag with the latest code, may be it will solve your issue.

vamsiduranc commented 4 years ago

@AlexeyAB

I have downloaded latest code and complied darknet.exe. When we start training using -map flag, we received below error message.

cudaError_t status=cudaDeviceSynchronize() Error in: file: E:/darnet-master/src/convolutional_kernels.cu : forward_convolutional_layer_gpu():line 544:build time 21 2020

CUDA Error:an illegal memory access was encountered

Error Screenshot: error message

Also there were some fixes for training process, you can try to train with -map flag with the latest code, may be it will solve your issue.

This did not work. So I have started training without -map flag. We will see if we receive any error message.

AlexeyAB / darknet

CUDA Error Prev: an illegal memory access was encountered #5075