fundamentalvision / Deformable-DETR

Deformable DETR: Deformable Transformers for End-to-End Object Detection.
Apache License 2.0

Training Error Epoch: [0] [20610/82783] #61

Open SunshineJZJ opened 3 years ago

SunshineJZJ commented 3 years ago

Epoch: [0] [20610/82783] eta: 14:20:51 lr: 0.000200 class_error: 66.67 grad_norm: 702.07 loss: 25.4857 (25.6894) loss_ce: 1.7844 (1.7913) loss_bbox: 1.0630 (1.2122) loss_giou: 1.1202 (1.2633) loss_ce_0: 1.8659 (1.7882) loss_bbox_0: 1.1029 (1.1211) loss_giou_0: 1.1555 (1.2235) loss_ce_1: 1.8321 (1.7994) loss_bbox_1: 1.2103 (1.2626) loss_giou_1: 1.2332 (1.2770) loss_ce_2: 1.8767 (1.7962) loss_bbox_2: 1.3065 (1.2480) loss_giou_2: 1.2509 (1.2752) loss_ce_3: 1.8755 (1.7954) loss_bbox_3: 1.2252 (1.2533) loss_giou_3: 1.1658 (1.2724) loss_ce_4: 1.8533 (1.7937) loss_bbox_4: 1.1594 (1.2468) loss_giou_4: 1.2157 (1.2699) loss_ce_unscaled: 0.8922 (0.8956) class_error_unscaled: 100.0000 (76.1125) loss_bbox_unscaled: 0.2126 (0.2424) loss_giou_unscaled: 0.5601 (0.6317) cardinality_error_unscaled: 297.0000 (293.3949) loss_ce_0_unscaled: 0.9329 (0.8941) loss_bbox_0_unscaled: 0.2206 (0.2242) loss_giou_0_unscaled: 0.5778 (0.6118) cardinality_error_0_unscaled: 297.0000 (293.3869) loss_ce_1_unscaled: 0.9161 (0.8997) loss_bbox_1_unscaled: 0.2421 (0.2525) loss_giou_1_unscaled: 0.6166 (0.6385) cardinality_error_1_unscaled: 297.0000 (293.3922) loss_ce_2_unscaled: 0.9384 (0.8981) loss_bbox_2_unscaled: 0.2613 (0.2496) loss_giou_2_unscaled: 0.6255 (0.6376) cardinality_error_2_unscaled: 297.0000 (293.3960) loss_ce_3_unscaled: 0.9378 (0.8977) loss_bbox_3_unscaled: 0.2450 (0.2507) loss_giou_3_unscaled: 0.5829 (0.6362) cardinality_error_3_unscaled: 297.0000 (293.3968) loss_ce_4_unscaled: 0.9267 (0.8968) loss_bbox_4_unscaled: 0.2319 (0.2494) loss_giou_4_unscaled: 0.6079 (0.6350) cardinality_error_4_unscaled: 297.0000 (293.3965) time: 6.1045 data: 0.0000 max mem: 4087
Epoch: [0] [20620/82783] eta: 14:36:03 lr: 0.000200 class_error: 100.00 grad_norm: 587.85 loss: 24.7180 (25.6888) loss_ce: 1.7637 (1.7913) loss_bbox: 1.0630 (1.2122) loss_giou: 1.0010 (1.2633) loss_ce_0: 1.7789 (1.7881) loss_bbox_0: 1.0888 (1.1212) loss_giou_0: 1.1599 (1.2235) loss_ce_1: 1.7677 (1.7993) loss_bbox_1: 1.1348 (1.2626) loss_giou_1: 1.1588 (1.2769) loss_ce_2: 1.8210 (1.7961) loss_bbox_2: 1.1953 (1.2480) loss_giou_2: 1.0368 (1.2751) loss_ce_3: 1.7686 (1.7954) loss_bbox_3: 1.1945 (1.2532) loss_giou_3: 1.1658 (1.2723) loss_ce_4: 1.7806 (1.7937) loss_bbox_4: 1.1594 (1.2468) loss_giou_4: 1.0990 (1.2699) loss_ce_unscaled: 0.8819 (0.8956) class_error_unscaled: 100.0000 (76.1136) loss_bbox_unscaled: 0.2126 (0.2424) loss_giou_unscaled: 0.5005 (0.6316) cardinality_error_unscaled: 297.0000 (293.3941) loss_ce_0_unscaled: 0.8894 (0.8941) loss_bbox_0_unscaled: 0.2178 (0.2242) loss_giou_0_unscaled: 0.5799 (0.6117) cardinality_error_0_unscaled: 297.0000 (293.3861) loss_ce_1_unscaled: 0.8839 (0.8997) loss_bbox_1_unscaled: 0.2270 (0.2525) loss_giou_1_unscaled: 0.5794 (0.6385) cardinality_error_1_unscaled: 297.0000 (293.3914) loss_ce_2_unscaled: 0.9105 (0.8981) loss_bbox_2_unscaled: 0.2391 (0.2496) loss_giou_2_unscaled: 0.5184 (0.6375) cardinality_error_2_unscaled: 297.0000 (293.3952) loss_ce_3_unscaled: 0.8843 (0.8977) loss_bbox_3_unscaled: 0.2389 (0.2506) loss_giou_3_unscaled: 0.5829 (0.6362) cardinality_error_3_unscaled: 297.0000 (293.3960) loss_ce_4_unscaled: 0.8903 (0.8968) loss_bbox_4_unscaled: 0.2319 (0.2494) loss_giou_4_unscaled: 0.5495 (0.6349) cardinality_error_4_unscaled: 297.0000 (293.3957) time: 18.9128 data: 0.0000 max mem: 4087
Traceback (most recent call last):
  File "/home/ailab/miniconda3/envs/deformable_detr/lib/python3.7/multiprocessing/resource_sharer.py", line 142, in _serve
    with self._listener.accept() as conn:
  File "/home/ailab/miniconda3/envs/deformable_detr/lib/python3.7/multiprocessing/connection.py", line 456, in accept
    answer_challenge(c, self._authkey)
  File "/home/ailab/miniconda3/envs/deformable_detr/lib/python3.7/multiprocessing/connection.py", line 742, in answer_challenge
    message = connection.recv_bytes(256)  # reject large message
  File "/home/ailab/miniconda3/envs/deformable_detr/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/ailab/miniconda3/envs/deformable_detr/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/ailab/miniconda3/envs/deformable_detr/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
./configs/r50_deformable_detr.sh: line 10: 166959 Killed python -u main.py --output_dir ${EXP_DIR} ${PY_ARGS}
Traceback (most recent call last):
  File "./tools/launch.py", line 192, in <module>
    main()
  File "./tools/launch.py", line 188, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['./configs/r50_deformable_detr.sh']' returned non-zero exit status 137.

Thank you for your excellent contribution. My GPU is a GTX 1070.
Training fails at Epoch: [0] [20610/82783] with the error above. What should I do?

Lg955 commented 3 years ago

Hi, the main error is not "subprocess.CalledProcessError: Command '['./configs/r50_deformable_detr.sh']' returned non-zero exit status 137." but "ConnectionResetError: [Errno 104] Connection reset by peer". Please check here

SunshineJZJ commented 3 years ago

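Thanks for the answer. I am training on a single GPU, a GTX 1070 with 8 GB of VRAM, and the machine also has 8 GB of RAM. I am not using distributed training; I run main.py directly and have changed the batch size to 1. The error above now appears when training reaches the validation stage. I suspect there is not enough memory.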
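
For what it's worth, the exit status 137 in the log is consistent with that suspicion. A POSIX shell reports 128 plus the signal number when a process is killed by a signal, and 137 - 128 = 9, which is SIGKILL. The shell's "Killed" message points the same way. SIGKILL is what the Linux out-of-memory killer sends, although the log alone does not prove that is where it came from. A minimal sketch of the arithmetic, assuming a POSIX system (the helper name `describe_exit_status` is made up for illustration and is not part of this repository):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status the way a POSIX shell reports it.

    Hypothetical helper for illustration only.
    """
    if status > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


# Exit status taken from the CalledProcessError in the log above.
print(describe_exit_status(137))
```

If the out-of-memory killer was responsible, the kernel log usually records it, so checking the output of `dmesg` after a crash is one way to confirm.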