Open iodncookie opened 2 years ago
Have you managed to solve this?

Adding `module` before `get_losses` lets me train on multiple GPUs, but training hangs after 10 epochs.
It may be caused by my filtering of empty images. Try setting the allow-empty-images option in the config file to true.
Hi, I checked the config file and found an `empty_ignore` option that is set to True. Is that the one you mean? With it set to True, training on the DOTA dataset still hangs after 10 epochs, right after the precision and elapsed time are computed. Checking the GPUs shows one at 100% utilization while the other has stopped.
[image] https://user-images.githubusercontent.com/77568152/269705766-a63aa1a8-3cf9-4683-9ac9-2571794c85c1.png
[image] https://user-images.githubusercontent.com/77568152/269706606-0e307bef-e208-49d9-8ad0-90c1caf07889.png
Sorry about that; you can set it to false.
After setting `empty_ignore` to False, the problem is still the same.
Training on the same two GPUs with other codebases works fine, so the environment should be OK. I also went through every part of trainer.py and didn't find anything wrong. I'm stumped.
Sorry, I don't have a lead on this issue yet either, but I suspect a bug in the data-loading code. I'm job hunting at the moment and don't have much time, so the fix may have to wait a while.
cmd: `python tools/train.py -expn yolox_tiny_plate -f exps/example/yolox_obb/yolox_tiny_my_plate.py -d 2 -b 8`

```
2022-10-12 11:34:10 | INFO | yolox.core.trainer:194 - ---> start train epoch1
2022-10-12 11:34:10 | INFO | yolox.core.trainer:189 - Training of experiment is done and the best AP is 0.00
2022-10-12 11:34:10 | ERROR | yolox.core.launch:147 - An error has been caught in function '_distributed_worker', process 'SpawnProcess-1' (184), thread 'MainThread' (140063261951680):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               │     │   └ 3
               │     └ 18
               └ <function _main at 0x7f6304972dc0>
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/multiprocessing/spawn.py", line 129, in _main
    return self._bootstrap(parent_sentinel)
           │    │          └ 3
           │    └ <function BaseProcess._bootstrap at 0x7f6304a96f70>
           └
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x7f6304a965e0>
    └
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │       │    └ {}
    │    │        │    │       └
    │    │        │    └ (<function _distributed_worker at 0x7f6140b5b820>, 0, (<function main at 0x7f614038f9d0>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:4...
    │    │        └
    │    └ <function _wrap at 0x7f620ddbeb80>
    └
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
    │  │   └ (<function main at 0x7f614038f9d0>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:40600', (╒══════════════════╤══════════════════════════...
    │  └ 0
    └ <function _distributed_worker at 0x7f6140b5b820>
  File "/home/a-bamboo/repositories/YOLOX_OBB/tools/train.py", line 108, in main
    trainer.train()
    │       └ <function Trainer.train at 0x7f6140899310>
    └ <yolox.core.trainer.Trainer object at 0x7f61403a7430>
  File "/home/a-bamboo/repositories/YOLOX_OBB/yolox/core/trainer.py", line 74, in train
    self.train_in_epoch()
    │    └ <function Trainer.train_in_epoch at 0x7f61408999d0>
    └ <yolox.core.trainer.Trainer object at 0x7f61403a7430>
  File "/home/a-bamboo/repositories/YOLOX_OBB/yolox/core/trainer.py", line 83, in train_in_epoch
    self.train_in_iter()
    │    └ <function Trainer.train_in_iter at 0x7f61403da4c0>
    └ <yolox.core.trainer.Trainer object at 0x7f61403a7430>
  File "/home/a-bamboo/repositories/YOLOX_OBB/yolox/core/trainer.py", line 89, in train_in_iter
    self.train_one_iter()
    │    └ <function Trainer.train_one_iter at 0x7f61403da550>
    └ <yolox.core.trainer.Trainer object at 0x7f61403a7430>
  File "/home/a-bamboo/repositories/YOLOX_OBB/yolox/core/trainer.py", line 104, in train_one_iter
    outputs = self.model.get_losses(targets, inps)
              │    │                │        └ tensor([[[[114., 114., 114., ..., 160., 160., 155.], ...
              │    │                └ tensor([[[ 0.0000e+00,  3.2303e+02,  1.9024e+02,  1.4952e+02,  4.9818e+01, -3.7145e-02], [ 0.0000e+00,  6....
              │    └ DistributedDataParallel( (module): Model( (model): Sequential( (0): Conv( (conv): Conv2d(3, 16, kernel_si...
              └ <yolox.core.trainer.Trainer object at 0x7f61403a7430>
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(

AttributeError: 'DistributedDataParallel' object has no attribute 'get_losses'
```
PS: with single-GPU training everything works fine, but training on two GPUs raises this error.
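The `.module` workaround discussed above can be sketched as follows. This is a minimal, hypothetical helper (`unwrap_model` is not part of the YOLOX_OBB repo): `DistributedDataParallel` wraps the original model and exposes it via its `.module` attribute, so custom methods such as `get_losses` have to be called on the inner model, not on the DDP wrapper. The `hasattr` check here is a torch-free simplification; an `isinstance` check against `torch.nn.parallel.DistributedDataParallel` would be more precise.

```python
def unwrap_model(model):
    """Return the underlying model whether or not it is DDP-wrapped.

    DistributedDataParallel stores the original model in its `.module`
    attribute; custom methods like `get_losses` exist only on that inner
    model, so attribute lookups on the wrapper raise AttributeError.
    """
    return model.module if hasattr(model, "module") else model


# In trainer.py, the failing call could then be written as (sketch):
#     outputs = unwrap_model(self.model).get_losses(targets, inps)
```

Note that this only fixes the `AttributeError` at launch; the hang after 10 epochs that the commenters report is a separate issue, suspected to be in the data-loading code.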