Open iodncookie opened 2 years ago
Have you managed to solve this?

Adding `module` before `get_losses` lets me train on multiple GPUs, but training hangs after 10 epochs.
It may be caused by my filtering of empty images. Try setting the allow-empty-images option in the config file to true.
Hi, I checked the config file and found an `empty_ignore` option that is set to True. Is that the one you mean? With it set to True, training on the DOTA dataset still hangs after 10 epochs, right after the precision and elapsed time are computed. Checking the GPUs shows one at 100% utilization while the other has stopped.
[image] https://user-images.githubusercontent.com/77568152/269705766-a63aa1a8-3cf9-4683-9ac9-2571794c85c1.png
[image] https://user-images.githubusercontent.com/77568152/269706606-0e307bef-e208-49d9-8ad0-90c1caf07889.png
Sorry about that; you can set it to false.
After setting `empty_ignore` to False, the problem is still the same.
Training on the same two GPUs with other codebases works fine, so the environment should be OK. I also went through every part of trainer.py and didn't find anything wrong. I'm stumped.
Sorry, I don't have a lead on this issue yet either, but I suspect a bug in the data-loading code. I'm job hunting at the moment and don't have much time, so the fix may have to wait a while.
cmd: `python tools/train.py -expn yolox_tiny_plate -f exps/example/yolox_obb/yolox_tiny_my_plate.py -d 2 -b 8`

```
2022-10-12 11:34:10 | INFO | yolox.core.trainer:194 - ---> start train epoch1
2022-10-12 11:34:10 | INFO | yolox.core.trainer:189 - Training of experiment is done and the best AP is 0.00
2022-10-12 11:34:10 | ERROR | yolox.core.launch:147 - An error has been caught in function '_distributed_worker', process 'SpawnProcess-1' (184), thread 'MainThread' (140063261951680):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               │     │   └ 3
               │     └ 18
               └ <function _main at 0x7f6304972dc0>
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/multiprocessing/spawn.py", line 129, in _main
    return self._bootstrap(parent_sentinel)
           │    │          └ 3
           │    └ <function BaseProcess._bootstrap at 0x7f6304a96f70>
           └
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x7f6304a965e0>
    └
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │       │    └ {}
    │    │        │    │       └
    │    │        │    └ (<function _distributed_worker at 0x7f6140b5b820>, 0, (<function main at 0x7f614038f9d0>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:4...
    │    │        └
    │    └ <function _wrap at 0x7f620ddbeb80>
    └
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
    │  │   └ (<function main at 0x7f614038f9d0>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:40600', (╒══════════════════╤══════════════════════════...
    │  └ 0
    └ <function _distributed_worker at 0x7f6140b5b820>
  File "/home/a-bamboo/repositories/YOLOX_OBB/tools/train.py", line 108, in main
    trainer.train()
    │       └ <function Trainer.train at 0x7f6140899310>
    └ <yolox.core.trainer.Trainer object at 0x7f61403a7430>
  File "/home/a-bamboo/repositories/YOLOX_OBB/yolox/core/trainer.py", line 74, in train
    self.train_in_epoch()
    │    └ <function Trainer.train_in_epoch at 0x7f61408999d0>
    └ <yolox.core.trainer.Trainer object at 0x7f61403a7430>
  File "/home/a-bamboo/repositories/YOLOX_OBB/yolox/core/trainer.py", line 83, in train_in_epoch
    self.train_in_iter()
    │    └ <function Trainer.train_in_iter at 0x7f61403da4c0>
    └ <yolox.core.trainer.Trainer object at 0x7f61403a7430>
  File "/home/a-bamboo/repositories/YOLOX_OBB/yolox/core/trainer.py", line 89, in train_in_iter
    self.train_one_iter()
    │    └ <function Trainer.train_one_iter at 0x7f61403da550>
    └ <yolox.core.trainer.Trainer object at 0x7f61403a7430>
  File "/home/a-bamboo/repositories/YOLOX_OBB/yolox/core/trainer.py", line 104, in train_one_iter
    outputs = self.model.get_losses(targets, inps)
              │    │                │        └ tensor([[[[114., 114., 114., ..., 160., 160., 155.], ...
              │    │                └ tensor([[[ 0.0000e+00,  3.2303e+02,  1.9024e+02,  1.4952e+02,  4.9818e+01, -3.7145e-02], [ 0.0000e+00,  6....
              │    └ DistributedDataParallel( (module): Model( (model): Sequential( (0): Conv( (conv): Conv2d(3, 16, kernel_si...
              └ <yolox.core.trainer.Trainer object at 0x7f61403a7430>
  File "/data/anaconda3/envs/plate_yolox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(

AttributeError: 'DistributedDataParallel' object has no attribute 'get_losses'
```
PS: with single-GPU training everything works fine, but training on two GPUs raises this error.
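The `.module` workaround discussed above can be sketched as follows. This is a minimal, hypothetical helper (`unwrap_model` is not part of the YOLOX_OBB repo): `DistributedDataParallel` wraps the original model and exposes it via its `.module` attribute, so custom methods such as `get_losses` have to be called on the inner model, not on the DDP wrapper. The `hasattr` check here is a torch-free simplification; an `isinstance` check against `torch.nn.parallel.DistributedDataParallel` would be more precise.

```python
def unwrap_model(model):
    """Return the underlying model whether or not it is DDP-wrapped.

    DistributedDataParallel stores the original model in its `.module`
    attribute; custom methods like `get_losses` exist only on that inner
    model, so attribute lookups on the wrapper raise AttributeError.
    """
    return model.module if hasattr(model, "module") else model


# In trainer.py, the failing call could then be written as (sketch):
#     outputs = unwrap_model(self.model).get_losses(targets, inps)
```

Note that this only fixes the `AttributeError` at launch; the hang after 10 epochs that the commenters report is a separate issue, suspected to be in the data-loading code.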