Megvii-BaseDetection / YOLOX

YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/
Apache License 2.0

Training yolox-s raises RuntimeError: CUDA error: device-side assert triggered #1161

Open GuoXu-booo opened 2 years ago

GuoXu-booo commented 2 years ago

I tried reducing the batch size and input_size, but it made no difference. Training used to run fine; now it fails with:

/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [26,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [27,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [28,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [29,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [30,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [31,0,0] Assertion input_val >= zero && input_val <= one failed.
2022-03-03 14:35:41 | ERROR | yolox.models.yolo_head:328 - OOM RuntimeError is raised due to the huge memory cost during label assignment. CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
2022-03-03 14:35:41 | INFO | yolox.core.trainer:196 - Training of experiment is done and the best AP is 0.00
2022-03-03 14:35:41 | ERROR | yolox.core.launch:98 - An error has been caught in function 'launch', process 'MainProcess' (32), thread 'MainThread' (140133772625728):
Traceback (most recent call last):

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 322, in get_losses imgs, └

File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context return func(*args, **kwargs) │ │ └ {} │ └ └ <function YOLOXHead.get_assignments at 0x7f731c7157b8>

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 505, in get_assignments ) = self.dynamic_k_matching(cost, pair_wise_ious, gt_classes, num_gt, fg_mask) │ │ │ │ │ │ └ │ │ │ │ │ └ 2 │ │ │ │ └ │ │ │ └ │ │ └ │ └ <function YOLOXHead.dynamic_k_matching at 0x7f731c715950> └ YOLOXHead( (cls_convs): ModuleList( (0): Sequential( (0): BaseConv( (conv): Conv2d(128, 128, kernel_size=...

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 616, in dynamic_k_matching dynamic_ks = dynamic_ks.tolist() │ └ <method 'tolist' of 'torch._C._TensorBase' objects> └

RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "train.py", line 146, in args=(exp, args), │ └ Namespace(batch_size=8, cache=False, ckpt='/project/train/models/weight/yolox_s.pth', devices=0, dist_backend='nccl', dist_ur... └ ╒═══════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════...

File "/project/train/src_repo/YOLOX/tools/../yolox/core/launch.py", line 98, in launch main_func(*args) │ └ (╒═══════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════... └ <function main at 0x7f731c747ae8>

File "train.py", line 124, in main trainer.train() │ └ <function Trainer.train at 0x7f72d48b2bf8> └ <yolox.core.trainer.Trainer object at 0x7f731c7589e8>

File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 74, in train self.train_in_epoch() │ └ <function Trainer.train_in_epoch at 0x7f72d48d3f28> └ <yolox.core.trainer.Trainer object at 0x7f731c7589e8>

File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 83, in train_in_epoch self.train_in_iter() │ └ <function Trainer.train_in_iter at 0x7f731c745950> └ <yolox.core.trainer.Trainer object at 0x7f731c7589e8>

File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 89, in train_in_iter self.train_one_iter() │ └ <function Trainer.train_one_iter at 0x7f731c7459d8> └ <yolox.core.trainer.Trainer object at 0x7f731c7589e8>

File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 103, in train_one_iter outputs = self.model(inps, targets) │ │ │ └ │ │ └ │ └ YOLOX( │ (backbone): YOLOPAFPN( │ (backbone): CSPDarknet( │ (stem): Focus( │ (conv): BaseConv( │ (conv): ... └ <yolox.core.trainer.Trainer object at 0x7f731c7589e8>

File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) │ │ │ └ {} │ │ └ │ └ <function YOLOX.forward at 0x7f731c715c80> └ YOLOX( (backbone): YOLOPAFPN( (backbone): CSPDarknet( (stem): Focus( (conv): BaseConv( (conv): ...

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolox.py", line 35, in forward fpn_outs, targets, x │ │ └ │ └

File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) │ │ │ └ {} │ │ └ │ └ <function YOLOXHead.forward at 0x7f731c715510> └ YOLOXHead( (cls_convs): ModuleList( (0): Sequential( (0): BaseConv( (conv): Conv2d(128, 128, kernel_size=...

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 203, in forward dtype=xin[0].dtype, └

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 352, in get_losses "cpu",

File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context return func(*args, **kwargs) │ │ └ {} │ └ └ <function YOLOXHead.get_assignments at 0x7f731c7157b8>

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 446, in get_assignments gt_bboxes_per_image = gt_bboxes_per_image.cpu().float() │ └ <method 'cpu' of 'torch._C._TensorBase' objects> └

RuntimeError: CUDA error: device-side assert triggered terminate called after throwing an instance of 'c10::Error' what(): CUDA error: device-side assert triggered
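For context, the Loss.cu assertion means binary_cross_entropy received values outside [0, 1] on the GPU (typically NaN coming out of the head); the later frames are just fallout from the already-poisoned CUDA context. A minimal, self-contained repro sketch (not YOLOX code) that triggers the same device-side assert:

```python
# Minimal repro sketch (not YOLOX code): BCE on CUDA asserts when its input is
# outside [0, 1], which is exactly what a NaN prediction looks like.
# Warning: running this poisons the CUDA context until the process exits.
import torch
import torch.nn.functional as F

pred = torch.full((4,), float("nan"), device="cuda")   # stands in for NaN cls/obj predictions
target = torch.zeros(4, device="cuda")
F.binary_cross_entropy(pred, target)  # -> "Assertion input_val >= zero && input_val <= one failed"
```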

GuoXu-booo commented 2 years ago

Why did this error occur?

wangyirui commented 2 years ago

I have the same issue when training ByteTrack with YOLOX-X

FateScript commented 2 years ago

Did you modify any code?

GuoXu-booo commented 2 years ago

Did you modify any code?

The code was not modified. Through debugging I traced the problem to the reg_conv module.
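A hedged debugging sketch (names like model.head follow the standard YOLOX layout; adapt to your setup) that reports the first module whose output contains NaN, which is how a problem in reg_convs can be localized:

```python
# Debugging sketch: print the first module whose forward output is non-finite.
import torch

def add_nan_hooks(module_root):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"non-finite output in {name}: {type(module).__name__}")
        return hook
    for name, sub in module_root.named_modules():
        sub.register_forward_hook(make_hook(name))

# e.g. add_nan_hooks(model.head) right after building the model, then train a few iterations.
```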

GuoXu-booo commented 2 years ago

Did you modify any code?

(screenshot attached)

GuoXu-booo commented 2 years ago

Did you modify any code?

Is there something wrong with my graphics card? Or with float16? It used to run fine on the server.

ELongking commented 2 years ago

Tesla V100, same problem even after changing the batch size. It appeared suddenly; there was no such error before. I also found that reg_feat is a NaN tensor, like the previous poster.
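A small sketch of another way to localize where the NaN first appears; anomaly mode is slow, so use it only while debugging:

```python
# Sketch: anomaly mode makes autograd point at the forward op whose backward
# produced NaN/Inf, instead of failing later with an opaque CUDA assert.
import torch

torch.autograd.set_detect_anomaly(True)
# ... run a few training iterations, then turn it off again:
# torch.autograd.set_detect_anomaly(False)
```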

ilmoney commented 2 years ago

Has anyone solved this problem?

Did you modify any code?

Is there something wrong with my graphics card? Or with float16? It used to run fine on the server.

Have you solved this problem? I met the same problem on a V100, but I don't know the reason.

ilmoney commented 2 years ago

Tesla V100, same problem even after changing the batch size. It appeared suddenly; there was no such error before. I also found that reg_feat is a NaN tensor, like the previous poster.

I met the same problem on a V100, but I don't know the reason. Can you provide some solutions? I am at a loss.

ilmoney commented 2 years ago

Did you modify any code?

I modified my code and hit this problem. Can you provide any solutions? I really don't know the reason.

bitzyz commented 2 years ago

I'm facing the same problem. Can anyone help? Log:

/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [50,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [51,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [52,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [53,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [54,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [55,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [56,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [57,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [58,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [59,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [60,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [61,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [62,0,0] Assertion input_val >= zero && input_val <= one failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [63,0,0] Assertion input_val >= zero && input_val <= one failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
2022-04-13 16:20:52 | ERROR | yolox.models.yolo_head:330 - OOM RuntimeError is raised due to the huge memory cost during label assignment. CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
2022-04-13 16:20:52 | INFO | yolox.core.trainer:189 - Training of experiment is done and the best AP is 0.00
2022-04-13 16:20:52 | ERROR | yolox.core.launch:98 - An error has been caught in function 'launch', process 'MainProcess' (24745), thread 'MainThread' (139972345780032):
Traceback (most recent call last):

File "/home/zyz/Documents/YOLOX/yolox/models/yolo_head.py", line 324, in get_losses
    imgs,

File "/home/zyz/.conda/envs/py38/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)

File "/home/zyz/Documents/YOLOX/yolox/models/yolo_head.py", line 519, in get_assignments
    cls_preds_.sqrt_(), gt_cls_per_image, reduction="none"

File "/home/zyz/.conda/envs/py38/lib/python3.7/site-packages/torch/nn/functional.py", line 2759, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)

RuntimeError: CUDA error: device-side assert triggered

@FateScript

ilmoney commented 2 years ago

Maybe you can check your environment. Is your Python version 3.8?

kuazhangxiaoai commented 2 years ago

I'm confused about this problem too

wangyirui commented 2 years ago

I ran into the same problem at first, on a Tesla V100. After lowering the batch size step by step from 48 down to 16, the problem stopped appearing. GPU memory should have been sufficient, so I don't know why the batch size triggers this error. Could a large batch cause bad gradients under mixed precision? I observed NaN losses in some iterations right before the error.
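A self-contained toy sketch of that suspicion (this is not YOLOX's actual trainer; it only illustrates the AMP mechanics): with torch.cuda.amp, GradScaler silently skips steps whose gradients overflow, so a run can look healthy until a NaN loss finally shows up.

```python
# Toy AMP loop: GradScaler skips optimizer steps whose unscaled grads are inf/NaN.
import torch

model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda")
y = torch.randn(4, 1, device="cuda")

for _ in range(3):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(x), y)
    if not torch.isfinite(loss):
        print("non-finite loss, skipping this iteration")
        continue
    scaler.scale(loss).backward()
    scaler.step(optimizer)      # silently skipped if the unscaled grads are inf/NaN
    scaler.update()
```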

Yuanyang-Zhu commented 2 years ago

Setting the learning rate a little lower can solve this problem; it worked for me.
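A hedged sketch of how that might look in a custom Exp file (basic_lr_per_img is the field used by the stock yolox_base Exp; verify against your YOLOX version):

```python
# Sketch of a custom experiment that halves the default learning rate.
from yolox.exp import Exp as BaseExp

class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.depth = 0.33          # yolox-s
        self.width = 0.50
        # stock default is 0.01 / 64.0 per image; lower it if losses go NaN
        self.basic_lr_per_img = 0.005 / 64.0
```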

ChiefGodMan commented 2 years ago

Setting the learning rate a little lower can solve this problem; it worked for me.

So why would a large lr cause this problem?

gjd2017 commented 2 years ago

I ran into the same problem. How did you solve it?

GuoXu-booo commented 2 years ago

About this problem: I finally found that my GPU was broken. Switch to the CPU to run the YOLOX code first, to determine whether the problem is in your code.
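A hedged sketch of such a CPU check (it assumes the standard get_exp API and the [class, cx, cy, w, h] target layout used by YOLOX's train transform; adjust names and paths to your setup):

```python
# CPU smoke test sketch: one training-mode forward pass with dummy data, no GPU involved.
import torch
from yolox.exp import get_exp

exp = get_exp(exp_name="yolox-s")          # or get_exp(exp_file="path/to/your_exp.py")
model = exp.get_model().train()            # stays on the CPU

imgs = torch.randn(2, 3, 640, 640)
targets = torch.zeros(2, 50, 5)            # (batch, max_objects, [class, cx, cy, w, h]); zero rows = padding
targets[:, 0] = torch.tensor([0.0, 320.0, 320.0, 100.0, 100.0])

outputs = model(imgs, targets)             # training mode returns the loss dict
print(outputs)
```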

lmw0320 commented 2 years ago

Has anyone fixed it? I met the same problem. I have tried reducing the lr and batch size, but it's still useless. My GPU device is shown below: (screenshot) I run my code in a Docker container, and I have run other code on this GPU without any problem. Also, I previously ran the same code with the same batch size and lr settings in a similar Docker container successfully.

GuoXu-booo commented 2 years ago

Has anyone fixed it? I met the same problem. I have tried reducing the lr and batch size, but it's still useless. My GPU device is shown below: (screenshot) I run my code in a Docker container, and I have run other code on this GPU without any problem. Also, I previously ran the same code with the same batch size and lr settings in a similar Docker container successfully.

First switch to the CPU to run YOLOX, to confirm that it is not a problem with the code.

lmw0320 commented 2 years ago

How do I switch to the CPU? Is there a simple parameter setting for it?

iodncookie commented 2 years ago

I also hit this issue, after I added a SIoU loss and started using it...

nanhai78 commented 2 years ago

I met the same problem. I could train normally after turning off fp16.
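With the stock training script this amounts to dropping the --fp16 flag from the tools/train.py command, so the run stays in full FP32 instead of mixed precision.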

lawrencekiba commented 2 years ago

I met the same problem. I could train normally after turning off fp16.

This worked for me, thank you

gachiemchiep commented 2 years ago

Has anyone solved this problem? I tried everything suggested here: reducing the lr, turning off fp16, reducing the batch size, turning off the cache. I even switched YOLOX back to 0.2.0 and switched the Python version. But nothing worked.

flyingfish7777 commented 2 years ago

I met the same problem. I could train normally after turning off fp16.

Thank you, this worked for me!

chairc commented 1 year ago

Has anyone solved this problem? I tried everything suggested here: reducing the lr, turning off fp16, reducing the batch size, turning off the cache. I even switched YOLOX back to 0.2.0 and switched the Python version. But nothing worked.

Have you ever tried changing the Focus stem to a normal 3x3 convolution?
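A hedged sketch of what that swap could look like (this is not shipped YOLOX code, and it invalidates pretrained stem weights; BaseConv and the backbone.backbone.stem path follow the standard model layout):

```python
# Sketch: replace the Focus stem with a plain stride-2 3x3 convolution.
from yolox.models.network_blocks import BaseConv

def plain_stem(out_channels, act="silu"):
    # Focus does a 2x2 space-to-depth then a conv; a stride-2 conv gives the
    # same 2x downsampling without the slicing.
    return BaseConv(3, out_channels, ksize=3, stride=2, act=act)

# e.g. for yolox-s (width 0.50 -> 32 stem channels), assuming the usual layout:
# model.backbone.backbone.stem = plain_stem(32)
```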

nuaaaaa commented 1 year ago

You can try loading a pretrained model, or changing BN to GN.
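A hedged sketch of the BN-to-GN swap (it assumes channel counts divisible by the group count, which holds for the standard widths; pretrained BN statistics are lost):

```python
# Sketch: recursively replace every BatchNorm2d with GroupNorm.
import torch.nn as nn

def bn_to_gn(module, num_groups=32):
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            bn_to_gn(child, num_groups)

# usage: bn_to_gn(model) right after exp.get_model(), before building the optimizer.
```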


QAQEthan commented 1 year ago

I also hit this issue, after I added a SIoU loss and started using it...

What did you do to fix it?

QAQEthan commented 1 year ago

@gachiemchiep Have you solved this problem?

fengkh commented 1 year ago

I think it is caused by mosaic augmentation. When mosaic is too aggressive, the ground-truth boxes of the randomly combined images can end up outside the image bounds. I now train with mosaic turned off and almost never see this problem anymore.
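A hedged sketch of turning mosaic off in a custom Exp (field names follow the current yolox_base Exp; older versions may differ):

```python
# Sketch: disable mosaic (and mixup, which is applied on top of the mosaic sample).
from yolox.exp import Exp as BaseExp

class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.mosaic_prob = 0.0
        self.enable_mixup = False
```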

fengkh commented 1 year ago

I ran into the same problem. How did you solve it?

Reducing the lr solves it, but I don't know why it happens.