dkubatin opened this issue 4 years ago
Try to use new commit: https://github.com/AlexeyAB/darknet/commit/6878ecc2e2b26d383cf65811b9d9e17375ca14ed
It is also noted that yolo-5l does not use the GPU memory at full power for this particular model. This problem is observed on several PCs.
It doesn't matter.
Great, the problem is gone, thanks!
Hi @dkubatin and @AlexeyAB. Now it gives an error when it calculates the mAP during training. I'm training yolov3-tiny_3l. The output is:
calculation mAP (mean average precision)...
4CUDA Error Prev: an illegal memory access was encountered
CUDA Error Prev: an illegal memory access was encountered: Success
darknet: ./src/utils.c:297: error: Assertion '0' failed.
Aborted (core dumped)
@canyilmaz90
nvcc --version
nvidia-smi
@AlexeyAB , before your questions: I think the problem is that I was working over a remote connection (I used ssh -X user@ip). I had a previous version of the repo stored on my own PC and I moved it to the remote computer, but it gave the same error too. However, both repos (newest and older) work on my own computer. On the remote computer, it also works until the mAP calculation. I think the problem may be the ssh connection. If so, do you have any suggestions for it?
Can you calculate the mAP on the same dataset with command ./darknet detector map ... ?
- Yes, I can.

Attach your cfg-file in zip.
- I think it's not about the cfg-file, because I didn't change much: only the classes, the filter sizes before the yolo layers, the number of iterations, etc.

Do you use the latest version of Darknet?
- Yes, I use the latest version. I cloned it yesterday.

What params do you use in Makefile?
- GPU=1, CUDNN=1, OPENCV=1, the rest is the same.

nvcc --version
nvidia-smi
Show first 10 lines when you run any darknet command
- CUDA-version: 10000 (10010) Warning: CUDA-version is lower than Driver-version!, cuDNN: 7.6.4, GPU count: 4
OpenCV version: 4.9.1
net.optimized_memory = 0
batch = 1, time_steps = 1, train = 0
layer filters size/strd(dil) input output
0 conv 16 3 x 3/ 1 448 x 448 x 3 -> 448 x 448 x 16 0.173 BF
1 max 2x 2/ 2 448 x 448 x 16 -> 224 x 224 x 16 0.003 BF
2 conv 32 3 x 3/ 1 224 x 224 x 16 -> 224 x 224 x 32 0.462 BF
OpenCV version: 4.9.1
This is very strange, since the latest OpenCV 4.2.0: https://opencv.org/releases/
I think the problem may be about the ssh connection. If so, do you have any suggestions for it?
I think you are doing something wrong: incorrect paths, dataset, or cfg-file, or you didn't recompile darknet on the new computer, ... Try to download darknet again.
This is very strange, since the latest OpenCV 4.2.0: https://opencv.org/releases/
Indeed! But only darknet shows it as 4.9.1, and the training without mAP calculation goes very well, so I didn't mind it. By the way, the real version of OpenCV is 3.4.8-pre.
I think you are doing something wrong: incorrect paths, dataset, cfg-file, or didn't recompile darknet on the new computer, ...
Actually, I cloned it via the terminal yesterday and then built it; I mean I didn't copy it to the remote computer. I also re-cloned and rebuilt it after getting this error. In fact, it works until the first mAP calculation (in this case the 2000th iteration, which is my burn-in number).
But only darknet shows it as 4.9.1
Can you show screenshot?
Also can you show screenshot of the error?
Also attach your cfg-file in zip
Hi @AlexeyAB , sorry that I could not look at this issue for a while because I was very busy at work last week. Here is a screenshot from error:
This screenshot is from ./darknet detector map ... :
And here is the .zip file of my cfg file: global.cfg.zip
An addition! The problem is not the ssh connection. I did another remote connection to another computer and it worked well. Maybe I should rebuild cuda, cudnn, opencv on this pc?
@canyilmaz90
But only darknet shows it as 4.9.1
Can you show screenshot?
Show content of obj.data file.
Do you get this message if you use ./darknet detector map ... ?
saturation = 0
exposure = 0
hue=0
Maybe I should rebuild cuda, cudnn, opencv on this pc?
Try to do this.
Do you get this issue if you train by using only 1 GPU?
Yes, I get this error any time I run any ./darknet command.
Show screenshot that "darknet shows it as 4.9.1"
Show content of obj.data file
classes=1
train=/media/arge/4TB_64GVNY0/Plate/trainings/Global-v3tiny/train.list
valid=/media/arge/4TB_64GVNY0/Plate/trainings/Global-v3tiny/valid.list
names=/media/arge/4TB_64GVNY0/Plate/trainings/Global-v3tiny/global.labels
backup=/media/arge/4TB_64GVNY0/Plate/trainings/Global-v3tiny/weights
Did you change anything in the source code?
I changed, in detector.c, the multi-GPU synchronization frequency from 4 iterations to 5 iterations:
train_networks(nets, ngpus, train, 4)
==>> train_networks(nets, ngpus, train, 5)
and also map calculation rate from 4 epochs to 1 epoch:
int calc_map_for_each = 4 * train_images_num / (net.batch * net.subdivisions);
==>>
int calc_map_for_each = train_images_num / (net.batch * net.subdivisions);
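For context, the edited expression sets how many training iterations pass between mAP calculations. In Darknet, net.batch after cfg parsing is the mini-batch (batch / subdivisions), so net.batch * net.subdivisions equals the full cfg batch, and one epoch is train_images_num / batch iterations. A small Python sketch of that arithmetic (the function name is mine):

```python
def calc_map_interval(train_images_num, batch, subdivisions, epochs=4):
    """Mirror of the calc_map_for_each expression in detector.c (a sketch).

    net.batch in the C code is the mini-batch (batch / subdivisions),
    so net.batch * net.subdivisions is the full cfg batch; the result is
    the number of iterations per `epochs` epochs.
    """
    mini_batch = batch // subdivisions
    return epochs * train_images_num // (mini_batch * subdivisions)

# e.g. 64000 training images, batch=64:
# default (epochs=4) -> mAP every 4000 iterations,
# the change above (epochs=1) -> mAP every 1000 iterations.
```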
in utils.c, I uncommented this line:
find_replace(output_path, "/images/", "/labels/", output_path);
but with the same changes, same configurations, and same dataset, the code works well on another pc.
Do you get this message if you use
./darknet detector map ...
?
No. Actually, I don't get this message when training either. Only this time I forgot to add the validation list to obj.data, but it gives the same error (without this message) when I train with a validation list.
Comment these lines instead of setting 0 in cfg file
Ok, I'll do it in the next training.
Ok, I'll try to reinstall Cuda and OpenCV at an appropriate time. thanks a lot for your patience :)
@canyilmaz90 Also try to download the latest Darknet and try to train without your changes in source code, will be there this error?
@AlexeyAB I think I found it! When I increase the number of subdivisions (i.e. decrease the mini-batch size), it works. I think it's about the mini-batch size, even though there is quite enough free space in GPU RAM. To test it, I tried a classification training, which requires much less GPU memory. With mini-batch size = 64, while calculating the top-k score during training, it also threw a similar but different error: CUDA Error: an illegal memory access was encountered: Resource temporarily unavailable. But both detector and classifier worked well with a smaller mini-batch size.
@canyilmaz90
It seems that cuDNN library may allocate some array (~100 MB) on GPU-0 even if you use GPU-1. So it’s better if 10% of the GPU-memory remains free.
Also very strange that this line shows OpenCV 4.9.1: https://github.com/AlexeyAB/darknet/blob/2a9fe045f3fd385ec61a38c8225945482d0ad7c7/src/image_opencv.cpp#L1338
Can you show a screenshot of the content of the OpenCV version.hpp file, like this? https://github.com/opencv/opencv/blob/89d3f95a8eea50acbfb4b8db380d5a4dc8a98173/modules/core/include/opencv2/core/version.hpp#L8-L11
* in opencv/modules/core/include/opencv2/core/version.hpp
* or in opencv/build/include/opencv2/core/version.hpp
* or in opencv/bin/install/include/opencv2/core/version.hpp
in utils.c, I uncommented this line: find_replace(output_path, "/images/", "/labels/", output_path);
Yes, you can do either that, or you can just put the txt-label-files into the /images/ directory.
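The uncommented find_replace call rewrites each image path into the corresponding label path. A minimal Python equivalent of that mapping (the helper name and example paths are mine; Darknet also swaps the image extension for .txt separately):

```python
import os

def image_to_label_path(image_path):
    """Sketch of how Darknet locates a label file for an image:
    replace the '/images/' directory component with '/labels/'
    and swap the file extension for '.txt'."""
    path = image_path.replace("/images/", "/labels/")
    root, _ext = os.path.splitext(path)
    return root + ".txt"

# "/data/images/car.jpg" -> "/data/labels/car.txt"
```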
@AlexeyAB
* Do you use one GPU for training?
* Do you run several instances of Darknet on 1 PC?
* Do you run training the Detector with random=1 in cfg-file?
It seems that cuDNN library may allocate some array (~100 MB) on GPU-0 even if you use GPU-1. So it’s better if 10% of the GPU-memory remains free.
Also very strange that this line shows OpenCV 4.9.1.
Can you show screenshot of content of OpenCV version.hpp file, like this? https://github.com/opencv/opencv/blob/89d3f95a8eea50acbfb4b8db380d5a4dc8a98173/modules/core/include/opencv2/core/version.hpp#L8-L11
* in `opencv/modules/core/include/opencv2/core/version.hpp`
* or in `opencv/build/include/opencv2/core/version.hpp`
* or in `opencv/bin/install/include/opencv2/core/version.hpp`
Here is version.hpp file in '/usr/include/opencv2/core/': version.hpp.txt
It's really strange, it shows
#define CV_VERSION_EPOCH 2
#define CV_VERSION_MAJOR 4
#define CV_VERSION_MINOR 9
#define CV_VERSION_REVISION 1
But I have an idea about it. When I started working at this company, an OpenCV version like 2.4.9.1 was installed on this PC. Then I installed OpenCV 3.4.8. So maybe something remains from that time.
@AlexeyAB I think I found it! When I increase the number of subdivisions (i.e. decrease the mini-batch size), it works. I think it's about the mini-batch size, even though there is quite enough free space in GPU RAM. To test it, I tried a classification training, which requires much less GPU memory. With mini-batch size = 64, while calculating the top-k score during training, it also threw a similar but different error: CUDA Error: an illegal memory access was encountered: Resource temporarily unavailable. But both detector and classifier worked well with a smaller mini-batch size.
@AlexeyAB Can you try training with -map flag and also big mini-batch size like 64,128 etc.?
@AlexeyAB I've just noticed that there is another version.hpp in /bin/local/include/opencv2/core/
and it shows:
#define CV_VERSION_MAJOR 3
#define CV_VERSION_MINOR 4
#define CV_VERSION_REVISION 8
#define CV_VERSION_STATUS "-pre"
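The "4.9.1" mystery is consistent with the 2.x header shown earlier: old OpenCV 2.x numbers itself EPOCH.MAJOR.MINOR.REVISION, so a program that prints only MAJOR.MINOR.REVISION from a 2.4.9.1 header reports "4.9.1". A small sketch (helper name is mine) that parses the defines from a header's text:

```python
import re

def parse_cv_version(header_text):
    """Extract OpenCV version numbers from version.hpp content.

    OpenCV 2.x defines CV_VERSION_EPOCH/MAJOR/MINOR/REVISION
    (e.g. 2 / 4 / 9 / 1), so reading only MAJOR.MINOR.REVISION
    yields '4.9.1' -- plausibly the confusion seen in this thread."""
    defines = dict(re.findall(r"#define\s+CV_VERSION_(\w+)\s+(\d+)", header_text))
    keys = ("EPOCH", "MAJOR", "MINOR", "REVISION")
    return ".".join(defines[k] for k in keys if k in defines)

sample = """
#define CV_VERSION_EPOCH 2
#define CV_VERSION_MAJOR 4
#define CV_VERSION_MINOR 9
#define CV_VERSION_REVISION 1
"""
```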
@canyilmaz90
It seems that Darknet uses the old OpenCV 2.4.9. Several different versions of OpenCV can interfere with each other if the wrong paths are set: for example, Darknet can use the hpp-file from 2.4.9 and the SO-library from the new 3.x. If the old OpenCV 2.4.9 isn't required, try to delete it and leave only one version 3.x.
Can you try training with -map flag and also big mini-batch size like 64,128 etc.?
For some models, yes, I can. It depends on the GPU, model, random-param and network size. On a Quadro RTX8000 you can train a small model with mini-batch 1024 and the -map flag.
Hi all,
Just had the same error while training a yolov4-tiny-custom with the following .cfg values:
batch=64
subdivisions=2
width=800
height=640
When I changed the values to:
batch=64
subdivisions=4
width=800
height=640
the issue no longer appeared. The issue appears at the point where it calculates the mAP values while training (after 1000 iterations), and only appears when running with the -map flag. It really seems to be related to setting subdivisions=2. I tried to calculate the mAP manually with ./darknet detector map on the weights from iteration 900 (as iteration 1000 is not saved yet at the point of crashing).
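The subdivisions change matters because Darknet processes batch/subdivisions images per forward pass, so activation memory scales roughly with that mini-batch. A tiny illustrative sketch (function name is mine):

```python
def mini_batch(batch, subdivisions):
    """Darknet loads batch/subdivisions images onto the GPU per forward
    pass; activation memory grows roughly with this value, which is why
    raising subdivisions can rescue a run that hits a memory error."""
    assert batch % subdivisions == 0, "batch must be divisible by subdivisions"
    return batch // subdivisions

# batch=64: subdivisions=2 -> 32 images resident per pass;
# subdivisions=4 -> 16, roughly halving activation memory.
```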
Error I got:
4CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:331: error: Assertion `0' failed.
Makefile
GPU=1
CUDNN=1
CUDNN_HALF=0
OPENCV=1
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0
ZED_CAMERA_v2_8=0
USE_CPP=0
DEBUG=0
ARCH= -gencode arch=compute_60,code=sm_60 \
-gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_50,code=[sm_50,compute_50] \
-gencode arch=compute_52,code=[sm_52,compute_52] \
-gencode arch=compute_61,code=[sm_61,compute_61]
Videocard: Tesla P100
CUDA Version: 10.1
OpenCV version: 2.4.9
cuDNN: 7.6.5
I think the error message is wrong for some reason. The error message should be Out of memory. If you use -map or small subdivisions, then it requires more memory.
Thanks for the quick reply @AlexeyAB . I'm currently running the training job with subdivisions=4; once this is done, I'll reproduce the error again and give you a more complete error output. Memory should normally not be an issue, as only around 12,000 MiB of the 16,280 MiB GPU memory was in use during training.
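AlexeyAB's earlier rule of thumb was to keep roughly 10% of GPU memory free, since cuDNN may allocate an extra ~100 MB even on a GPU you didn't select. A tiny helper (name and threshold handling are mine) to sanity-check numbers like the ones quoted here:

```python
def has_headroom(total_mib, used_mib, reserve_frac=0.10):
    """True if at least reserve_frac of total GPU memory is still free,
    per the rule of thumb that cuDNN may grab extra memory beyond what
    the training job itself reports."""
    return (total_mib - used_mib) >= reserve_frac * total_mib

# 12000 MiB used of 16280 MiB leaves ~26% free -> passes the 10% rule.
```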
Hi @AlexeyAB ! I ran it again to get the full output. Here you go:
(next mAP calculation at 1000 iterations)
1000: 0.211191, 0.166555 avg loss, 0.002610 rate, 1.027363 seconds, 64000 images, 5.460727 hours left
4CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:331: error: Assertion `0' failed.
calculation mAP (mean average precision)...
Detection layer: 30 - type = 28
Detection layer: 37 - type = 28
CUDA status Error: file: ./src/network_kernels.cu : () : line: 720 : build time: Jul 1 2021 - 11:00:33
CUDA Error: an illegal memory access was encountered
Seems like @stephanecharette has the same issue in #7850 .
This issue has so many things dating from over a year ago that I didn't want to add to it. The issue I ran into is very specific and only started happening with a commit from a few days ago. Issue #7850 documents the exact commit where this problem started happening, but yes, it looks to be the same as what @JeremyKeusters reported, even down to the extra leading 4 in the error message: 4CUDA Error: an illegal memory access was encountered.
@stephanecharette Hi, I fixed this bug. Try the latest commit ( https://github.com/AlexeyAB/darknet/commit/9c9232d1c3f0f80e40bf347643a542903d6703ca and https://github.com/AlexeyAB/darknet/commit/b2cb64dffbcf706ac9f1d12d7fe699c40eacc40b )
Hi @AlexeyAB , thanks for the fix. I will train again sometime this week with the latest commit to verify that the bug was fixed on my end too.
Hi @AlexeyAB ! I'm on 2418fa7 and I still have this issue. The error message is slightly different (note the line number):
(next mAP calculation at 1000 iterations)
1000: 0.213937, 0.263404 avg loss, 0.002610 rate, 0.817044 seconds, 64000 images, 5.043064 hours left
4CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:331: error: Assertion `0' failed.
calculation mAP (mean average precision)...
Detection layer: 30 - type = 28
Detection layer: 37 - type = 28
CUDA status Error: file: ./src/network_kernels.cu : () : line: 735 : build time: Jul 12 2021 - 14:35:49
CUDA Error: an illegal memory access was encountered
@JeremyKeusters Hi,
./darknet detector test cfg/coco.data cfg/yolov4.cfg yolov4.weights data/dog.jpg
CUDA-version: 10000 (10000), cuDNN: 7.4.2, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV version: 4.2.0
0 : compute_capability = 750, cudnn_half = 1, GPU: GeForce RTX 2070
net.optimized_memory = 0
mini_batch = 1, batch = 8, time_steps = 1, train = 0
layer filters size/strd(dil) input output
Hi @AlexeyAB ,
Here's the information you requested, let me know if you need any additional information. As I already said, when I set the subdivisions to 4, the issue disappears.
- Did you recompile Darknet?
Yes. I did however make 2 changes to the existing code:
Save the weights every 100 iterations, by commenting out these 3 lines: https://github.com/AlexeyAB/darknet/blob/d669680879f72e58a5bc4d8de98c2e3c0aab0b62/src/detector.c#L385-L387 and by uncommenting this line, replacing i with iteration: https://github.com/AlexeyAB/darknet/blob/d669680879f72e58a5bc4d8de98c2e3c0aab0b62/src/detector.c#L384
Do the validation every 100 iterations, by commenting out this line: https://github.com/AlexeyAB/darknet/blob/d669680879f72e58a5bc4d8de98c2e3c0aab0b62/src/detector.c#L302 and adding:
// Allow validation every 100 iterations
calc_map_for_each = 100;
- Can you share cfg-file?
Sure: yolov4-tiny-issue-4655.cfg.zip
- What command do you use?
./darknet detector train data/issue-4655.data cfg/yolov4-tiny-issue-4655.cfg yolov4-tiny.conv.29 -map -dont_show &> logs/output.log &
- Can you show such lines?
CUDA-version: 10010 (10010), cuDNN: 7.6.5, GPU count: 1
OpenCV version: 4.9.1
0 : compute_capability = 600, cudnn_half = 0, GPU: Tesla P100-PCIE-16GB
layer filters size/strd(dil) input output
0 conv 32 3 x 3/ 2 640 x 640 x 3 -> 320 x 320 x 32 0.177 BF
1 conv 64 3 x 3/ 2 320 x 320 x 32 -> 160 x 160 x 64 0.944 BF
2 conv 64 3 x 3/ 1 160 x 160 x 64 -> 160 x 160 x 64 1.887 BF
3 route 2 1/2 -> 160 x 160 x 32
4 conv 32 3 x 3/ 1 160 x 160 x 32 -> 160 x 160 x 32 0.472 BF
5 conv 32 3 x 3/ 1 160 x 160 x 32 -> 160 x 160 x 32 0.472 BF
6 route 5 4 -> 160 x 160 x 64
7 conv 64 1 x 1/ 1 160 x 160 x 64 -> 160 x 160 x 64 0.210 BF
8 route 2 7 -> 160 x 160 x 128
9 max 2x 2/ 2 160 x 160 x 128 -> 80 x 80 x 128 0.003 BF
10 conv 128 3 x 3/ 1 80 x 80 x 128 -> 80 x 80 x 128 1.887 BF
11 route 10 1/2 -> 80 x 80 x 64
12 conv 64 3 x 3/ 1 80 x 80 x 64 -> 80 x 80 x 64 0.472 BF
13 conv 64 3 x 3/ 1 80 x 80 x 64 -> 80 x 80 x 64 0.472 BF
14 route 13 12 -> 80 x 80 x 128
15 conv 128 1 x 1/ 1 80 x 80 x 128 -> 80 x 80 x 128 0.210 BF
16 route 10 15 -> 80 x 80 x 256
17 max 2x 2/ 2 80 x 80 x 256 -> 40 x 40 x 256 0.002 BF
18 conv 256 3 x 3/ 1 40 x 40 x 256 -> 40 x 40 x 256 1.887 BF
19 route 18 1/2 -> 40 x 40 x 128
20 conv 128 3 x 3/ 1 40 x 40 x 128 -> 40 x 40 x 128 0.472 BF
21 conv 128 3 x 3/ 1 40 x 40 x 128 -> 40 x 40 x 128 0.472 BF
22 route 21 20 -> 40 x 40 x 256
23 conv 256 1 x 1/ 1 40 x 40 x 256 -> 40 x 40 x 256 0.210 BF
24 route 18 23 -> 40 x 40 x 512
25 max 2x 2/ 2 40 x 40 x 512 -> 20 x 20 x 512 0.001 BF
26 conv 512 3 x 3/ 1 20 x 20 x 512 -> 20 x 20 x 512 1.887 BF
27 conv 256 1 x 1/ 1 20 x 20 x 512 -> 20 x 20 x 256 0.105 BF
28 conv 512 3 x 3/ 1 20 x 20 x 256 -> 20 x 20 x 512 0.944 BF
29 conv 45 1 x 1/ 1 20 x 20 x 512 -> 20 x 20 x 45 0.018 BF
30 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.05
31 route 27 -> 20 x 20 x 256
32 conv 128 1 x 1/ 1 20 x 20 x 256 -> 20 x 20 x 128 0.026 BF
33 upsample 2x 20 x 20 x 128 -> 40 x 40 x 128
34 route 33 23 -> 40 x 40 x 384
35 conv 256 3 x 3/ 1 40 x 40 x 384 -> 40 x 40 x 256 2.831 BF
36 conv 45 1 x 1/ 1 40 x 40 x 256 -> 40 x 40 x 45 0.037 BF
37 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.05
Total BFLOPS 16.098
avg_outputs = 712105
Allocate additional workspace_size = 26.22 MB
0 : compute_capability = 600, cudnn_half = 0, GPU: Tesla P100-PCIE-16GB
layer filters size/strd(dil) input output
0 conv 32 3 x 3/ 2 640 x 640 x 3 -> 320 x 320 x 32 0.177 BF
1 conv 64 3 x 3/ 2 320 x 320 x 32 -> 160 x 160 x 64 0.944 BF
2 conv 64 3 x 3/ 1 160 x 160 x 64 -> 160 x 160 x 64 1.887 BF
3 route 2 1/2 -> 160 x 160 x 32
4 conv 32 3 x 3/ 1 160 x 160 x 32 -> 160 x 160 x 32 0.472 BF
5 conv 32 3 x 3/ 1 160 x 160 x 32 -> 160 x 160 x 32 0.472 BF
6 route 5 4 -> 160 x 160 x 64
7 conv 64 1 x 1/ 1 160 x 160 x 64 -> 160 x 160 x 64 0.210 BF
8 route 2 7 -> 160 x 160 x 128
9 max 2x 2/ 2 160 x 160 x 128 -> 80 x 80 x 128 0.003 BF
10 conv 128 3 x 3/ 1 80 x 80 x 128 -> 80 x 80 x 128 1.887 BF
11 route 10 1/2 -> 80 x 80 x 64
12 conv 64 3 x 3/ 1 80 x 80 x 64 -> 80 x 80 x 64 0.472 BF
13 conv 64 3 x 3/ 1 80 x 80 x 64 -> 80 x 80 x 64 0.472 BF
14 route 13 12 -> 80 x 80 x 128
15 conv 128 1 x 1/ 1 80 x 80 x 128 -> 80 x 80 x 128 0.210 BF
16 route 10 15 -> 80 x 80 x 256
17 max 2x 2/ 2 80 x 80 x 256 -> 40 x 40 x 256 0.002 BF
18 conv 256 3 x 3/ 1 40 x 40 x 256 -> 40 x 40 x 256 1.887 BF
19 route 18 1/2 -> 40 x 40 x 128
20 conv 128 3 x 3/ 1 40 x 40 x 128 -> 40 x 40 x 128 0.472 BF
21 conv 128 3 x 3/ 1 40 x 40 x 128 -> 40 x 40 x 128 0.472 BF
22 route 21 20 -> 40 x 40 x 256
23 conv 256 1 x 1/ 1 40 x 40 x 256 -> 40 x 40 x 256 0.210 BF
24 route 18 23 -> 40 x 40 x 512
25 max 2x 2/ 2 40 x 40 x 512 -> 20 x 20 x 512 0.001 BF
26 conv 512 3 x 3/ 1 20 x 20 x 512 -> 20 x 20 x 512 1.887 BF
27 conv 256 1 x 1/ 1 20 x 20 x 512 -> 20 x 20 x 256 0.105 BF
28 conv 512 3 x 3/ 1 20 x 20 x 256 -> 20 x 20 x 512 0.944 BF
29 conv 45 1 x 1/ 1 20 x 20 x 512 -> 20 x 20 x 45 0.018 BF
30 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.05
31 route 27 -> 20 x 20 x 256
32 conv 128 1 x 1/ 1 20 x 20 x 256 -> 20 x 20 x 128 0.026 BF
33 upsample 2x 20 x 20 x 128 -> 40 x 40 x 128
34 route 33 23 -> 40 x 40 x 384
35 conv 256 3 x 3/ 1 40 x 40 x 384 -> 40 x 40 x 256 2.831 BF
36 conv 45 1 x 1/ 1 40 x 40 x 256 -> 40 x 40 x 45 0.037 BF
37 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.05
Total BFLOPS 16.098
avg_outputs = 712105
Allocate additional workspace_size = 606.13 MB
Loading weights from yolov4-tiny.conv.29... Prepare additional network for mAP calculation...
net.optimized_memory = 0
mini_batch = 1, batch = 2, time_steps = 1, train = 0
Create CUDA-stream - 0
Create cudnn-handle 0
nms_kind: greedynms (1), beta = 0.600000
nms_kind: greedynms (1), beta = 0.600000
yolov4-tiny-issue-4655
net.optimized_memory = 0
mini_batch = 32, batch = 64, time_steps = 1, train = 1
nms_kind: greedynms (1), beta = 0.600000
nms_kind: greedynms (1), beta = 0.600000
Done! Loaded 29 layers from weights-file
Create 6 permanent cpu-threads
Hi @JeremyKeusters, I had the same problem when using my model for inference:
CUDA status Error: file: ./src/network_kernels.cu : () : line: 735 : build time: Aug 19 2021 - 09:48:00
CUDA Error: an illegal memory access was encountered
/home/gise-2/anaconda3/envs/platformtest/bin/python: check_error: Unknown error 1513545619
It turns out I was forcing the model to be loaded and used on a specific GPU with tf.device('/device:GPU:1'):
I commented out this line and now it's working as expected.
@AlexeyAB, weirdly enough, the illegal memory access error was raised when I forced the model onto a GPU. Even forcing it onto GPU:0 raises the error, while the model naturally loads and runs on the first GPU. Any idea why?
@GeoffSion
with tf.device('/device:GPU:1'):
What framework do you use for YOLOv4, is it Darknet or TensorFlow?
with tf.device('/device:GPU:1'): is suitable for TensorFlow.
set_gpu(1) or darknet.set_gpu(1) is suitable for Darknet: https://github.com/AlexeyAB/darknet/blob/b8dceb7ed055b1ab2094bdbd0756b61473db3ef6/darknet.py#L191
@AlexeyAB Good point, I'm using TensorFlow. I tried darknet.set_gpu(1) and it worked. Thanks for your answer! I hope it will help others with this issue.
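The fix boils down to calling Darknet's own GPU selector before loading the network, since tf.device() only steers TensorFlow's allocator. A minimal sketch using the darknet.py wrapper (file paths are placeholders):

```python
def load_on_gpu(gpu_index, cfg_path, data_path, weights_path):
    """Pin Darknet to a specific GPU *before* load_network; tf.device()
    controls only TensorFlow's allocations and does nothing for Darknet."""
    import darknet  # AlexeyAB/darknet python wrapper (darknet.py)
    darknet.set_gpu(gpu_index)          # must run before load_network
    return darknet.load_network(cfg_path, data_path, weights_path)

# e.g. load_on_gpu(1, "cfg/yolov4.cfg", "cfg/coco.data", "yolov4.weights")
```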
@AlexeyAB, hi! After the last repository update during training, the following error appears:
CUDA Error Prev: an illegal memory access was encountered
CUDA Error Prev: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:297: error: Assertion 0 failed.
Makefile:
Videocard: RTX 2080Ti
CUDA Version: 10.1
OpenCV version: 3.4.6
cuDNN: 7.6.0
An error appears when training yolov3-5l and yolov3. I did not check on other configs.
It is also noted that yolo-5l does not use the GPU memory at full power for this particular model. This problem is observed on several PCs.