Bin-ze / BEVFormer_segmentation_detection

Implemented BEVFormer support for BEV segmentation
Apache License 2.0

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) #18

Closed admyxs closed 6 months ago

admyxs commented 6 months ago

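When I train with multiple GPUs on the server I get this error, but multi-GPU training on the mini dataset on my local machine does not. What could be the cause?

The error occurs after the first epoch finishes training, during the test step, apparently while executing `with torch.no_grad():`

```
dataset = data_loader.dataset
prog_bar = mmcv.ProgressBar(len(dataset))
for i, data in enumerate(data_loader):
    with torch.no_grad():
        in_data = {i: j for i, j in data.items() if 'img' in i}
        result = model(return_loss=False, rescale=True, **in_data)
```

Thanks for any answer, much appreciated!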

2024-01-05 13:21:28,371 - mmdet - INFO - Saving checkpoint at 1 epochs
[ ] 0/6019, elapsed: 0s, ETA:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 1801442) of binary: /home/lide/anaconda3/envs/bevformer/bin/python
Traceback (most recent call last):
  File "/home/lide/anaconda3/envs/bevformer/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lide/anaconda3/envs/bevformer/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lide/anaconda3/envs/bevformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/lide/anaconda3/envs/bevformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/lide/anaconda3/envs/bevformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/lide/anaconda3/envs/bevformer/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/lide/anaconda3/envs/bevformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lide/anaconda3/envs/bevformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

          ./tools/train.py FAILED
===================================================
Root Cause:
[0]:
  time: 2024-01-05_13:22:03
  rank: 0 (local_rank: 0)
  exitcode: -9 (pid: 1801442)
  error_file: <N/A>
  msg: "Signal 9 (SIGKILL) received by PID 1801442"

Other Failures:

***************************************************

Bin-ze commented 6 months ago

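The information you posted above is incomplete. Could you provide a more detailed error log? It should include which config file you are using and the full error message.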

admyxs commented 6 months ago

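That is all the error output there is. However, I think I have found the cause: host memory (RAM, not GPU memory) usage is probably too high. The mini dataset runs without error both locally and on the server; the error only appears when the server runs the full dataset.

I also found that plain BEVFormer on the full dataset uses about 51 GB of RAM during the early batches of the first epoch, but with seg added, the same full-dataset run uses about 192 GB during the early batches of the first epoch.

I probably need to optimize this memory usage. Do you have any suggestions on where to start?

Thanks!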
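An exit code of -9 means the worker was terminated by SIGKILL, which on Linux is often (though not always) the kernel's out-of-memory killer; that would fit the host-RAM figures above. The following is a minimal standard-library sketch for confirming this kind of diagnosis. The helper names `describe_exitcode` and `peak_rss_gib` are hypothetical and are not part of this repository, mmcv, or PyTorch.

```python
import resource
import signal
import sys


def describe_exitcode(exitcode: int) -> str:
    """Name the signal behind a negative exit code (hypothetical helper)."""
    if exitcode < 0:
        # A negative exit code -N means the process was killed by signal N.
        return signal.Signals(-exitcode).name
    return f"exit status {exitcode}"


def peak_rss_gib() -> float:
    """Peak resident memory of the current process, in GiB (hypothetical helper)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    return peak_bytes / 1024 ** 3


print(describe_exitcode(-9))
print(f"peak RSS so far: {peak_rss_gib():.2f} GiB")
```

Calling `peak_rss_gib()` every few iterations, both in the training process and inside the dataset's loading code, would show whether memory grows in the main process or in the data-loading workers. Each process only reports its own usage, so the totals across workers have to be added up separately.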

admyxs commented 6 months ago

Hi, I noticed something else.

1. In your projects/mmdet3d_plugin/datasets/utils/rasterize.py:

```
def mask_for_lines(lines, mask, thickness, idx, type='index', angle_class=36):
    coords = np.asarray(list(lines.coords), np.int32)
    coords = coords.reshape((-1, 2))
    if len(coords) < 2:
        return mask, idx
    if type == 'backward':
        coords = np.flip(coords, 0)

    if type == 'index':
        cv2.polylines(mask, [coords], False, color=idx, thickness=thickness)
        idx += 1
    else:
        for i in range(len(coords) - 1):
            cv2.polylines(mask, [coords[i:]], False, color=get_discrete_degree(
                coords[i + 1] - coords[i], angle_class=angle_class), thickness=thickness)
    return mask, idx
```

When the mask is generated, everything is drawn as lines (see the `cv2.polylines` call in the `'index'` branch). Isn't that poorly suited to regions, such as drivable_area and other non-line areas? Thanks for your answer.

2. What should I do if I want to generate filled areas instead?
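To illustrate the difference between drawing an outline and filling a region, here is a minimal dependency-free sketch of an even-odd scanline fill. It is not code from this repository: `fill_polygon` and its arguments are hypothetical, the mask is a plain list of lists, and polygon vertices are assumed to already be in pixel coordinates as `(col, row)` pairs.

```python
def fill_polygon(mask, polygon, value=1):
    """Set mask[row][col] = value for every cell whose centre lies inside polygon.

    mask:    list of rows, each a list of ints (modified in place)
    polygon: list of (col, row) vertices in pixel coordinates
    """
    n = len(polygon)
    for row in range(len(mask)):
        y = row + 0.5  # sample at the centre of the pixel row
        # Collect the x positions where this scanline crosses a polygon edge.
        crossings = []
        for k in range(n):
            (x0, y0), (x1, y1) = polygon[k], polygon[(k + 1) % n]
            if (y0 <= y) != (y1 <= y):  # the edge straddles the scanline
                crossings.append(x0 + (y - y0) * (x1 - x0) / (y1 - y0))
        crossings.sort()
        # Even-odd rule: the interior lies between consecutive pairs of crossings.
        for left, right in zip(crossings[0::2], crossings[1::2]):
            for col in range(len(mask[row])):
                if left <= col + 0.5 < right:
                    mask[row][col] = value
    return mask


mask = [[0] * 8 for _ in range(6)]
fill_polygon(mask, [(1, 1), (5, 1), (5, 4), (1, 4)])
for line in mask:
    print("".join(str(v) for v in line))
```

With OpenCV and a NumPy mask, `cv2.fillPoly(mask, [pts], color=1)`, where `pts` is an int32 array of shape (N, 2), fills the same kind of region in one call, although its handling of boundary pixels differs from this centre-sampling sketch.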

Bin-ze commented 6 months ago

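As far as I know, the segmentation annotations in the nuScenes dataset are vector annotations, i.e. they are stored as line segments, so I think this approach is reasonable. It should be able to handle the drivable_area you describe, because that is how HDMapNet does it, and my implementation follows HDMapNet. Also, if you want to segment more object classes, you should use the annotations from the LSS algorithm. I implemented that in this repository with reference to several other codebases, but their implementations seemed to have some problems, so I did not put the experimental results in the README. You are welcome to try it, though.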

admyxs commented 6 months ago

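Thanks again for your work and your careful answers. I am now trying to take the polygons read from nuScenes, transform them, and draw them directly into the mask with `cv2.fillPoly`. There is one more thing I can't figure out. In your config file:

```
map_grid_conf = {
    'xbound': [-30.0, 30.0, 0.15],
    'ybound': [-15.0, 15.0, 0.15],
    'zbound': [-10.0, 10.0, 20.0],
    'dbound': [1.0, 60.0, 1.0],
}
```

Does xbound refer to the horizontal axis or the vertical axis? If it is the horizontal axis, doesn't that mean 30 m to the left and right of the ego vehicle but only 15 m in front of and behind it?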
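As a rough illustration of what these bounds imply for the mask, the sketch below converts each `[min, max, step]` triple into a cell count and maps a metric point to a pixel index. It assumes, as in HDMapNet-style rasterization, that mask rows follow `ybound` and mask columns follow `xbound`; that layout is an assumption and is not confirmed in this thread. The helpers `grid_cells`, `canvas_size`, and `to_pixel` are hypothetical names.

```python
map_grid_conf = {
    'xbound': [-30.0, 30.0, 0.15],
    'ybound': [-15.0, 15.0, 0.15],
}


def grid_cells(bound):
    """Number of cells along one axis for a [min, max, step] triple."""
    lo, hi, step = bound
    return round((hi - lo) / step)


def canvas_size(conf):
    """(rows, cols) of the mask. Assumption: rows follow ybound, cols follow xbound."""
    return grid_cells(conf['ybound']), grid_cells(conf['xbound'])


def to_pixel(x, y, conf):
    """Map a metric point (x, y) to a (row, col) index under the same assumption."""
    col = int((x - conf['xbound'][0]) / conf['xbound'][2])
    row = int((y - conf['ybound'][0]) / conf['ybound'][2])
    return row, col


print(canvas_size(map_grid_conf))         # (200, 400)
print(to_pixel(0.0, 0.0, map_grid_conf))  # (100, 200)
```

Under this assumption the mask covers 60 m along x and 30 m along y at 0.15 m per cell, giving a 200 x 400 array with the origin at its centre. Which of the two axes points along the vehicle heading depends on the coordinate frame the annotations are transformed into.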

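In other words, which of the figures below does the map_grid_conf above correspond to (the arrow indicates the heading of the ego vehicle)? [image]

Thanks again for your answer!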

Bin-ze commented 6 months ago

map_grid_conf corresponds to figure B. You can see this from the visualization results published on my repository home page.

admyxs commented 6 months ago

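Yes, thank you very much. Two minutes ago I visualized each mask and found that the content appears rotated by 90° in the mask. [image] Done this way, the regions should now be labeled correctly. Next I will fix the overlap conflicts, and then it will meet my needs. I have just seen the result and I'm excited. Thanks again.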

Bin-ze commented 6 months ago

If you have further questions, feel free to open a discussion.