PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

2.0.0: Training on Jetson fails; judging by the device status, it looks like it is running out of memory #4120

Closed helloyan closed 2 years ago

helloyan commented 3 years ago

nvidia@nvidia-desktop:~/AI/PaddleDetection$ python3 -u tools/train.py -c ./configs/ppyolo/ppyolov2_r101vd_dcn_365e_coco.yml --eval -o use_gpu=true
WARNING: AVX is not support on your machine. Hence, no_avx core will be imported, It has much worse preformance than avx core.
W0906 12:40:56.992522 17361 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.2, Driver API Version: 10.2, Runtime API Version: 10.2
W0906 12:40:57.036255 17361 device_context.cc:372] device: 0, cuDNN Version: 8.0.
loading annotations into memory...
Done (t=0.35s)
creating index...
index created!
[09/06 12:41:09] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/nvidia/.cache/paddle/weights/ResNet101_vd_ssld_pretrained.pdparams
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
[09/06 12:41:17] ppdet.engine INFO: Epoch: [0] [ 0/694] learning_rate: 0.000000 loss_xy: 1.946000 loss_wh: 3.534867 loss_iou: 6.349059 loss_iou_aware: 1.315743 loss_obj: 10502.538086 loss_cls: 3.964794 loss: 10519.648438 eta: 21 days, 7:42:57 batch_cost: 7.2724 data_cost: 0.0140 ips: 0.2750 images/s
[09/06 12:41:56] ppdet.engine INFO: Epoch: [0] [ 20/694] learning_rate: 0.000025 loss_xy: 1.624876 loss_wh: 4.526747 loss_iou: 5.998347 loss_iou_aware: 1.248172 loss_obj: 10548.246094 loss_cls: 2.083054 loss: 10586.427734 eta: 5 days, 15:25:34 batch_cost: 1.6574 data_cost: 0.0055 ips: 1.2067 images/s
[09/06 12:42:40] ppdet.engine INFO: Epoch: [0] [ 40/694] learning_rate: 0.000050 loss_xy: 2.779396 loss_wh: 9.391282 loss_iou: 10.605574 loss_iou_aware: 1.655790 loss_obj: 18122.472656 loss_cls: 3.091898 loss: 18147.253906 eta: 5 days, 14:54:15 batch_cost: 1.9099 data_cost: 0.0049 ips: 1.0472 images/s
[09/06 12:43:22] ppdet.engine INFO: Epoch: [0] [ 60/694] learning_rate: 0.000075 loss_xy: 1.897629 loss_wh: 6.297718 loss_iou: 7.236743 loss_iou_aware: 1.489766 loss_obj: 18725.390625 loss_cls: 2.662554 loss: 18727.998047 eta: 5 days, 13:16:19 batch_cost: 1.8472 data_cost: 0.0049 ips: 1.0827 images/s
[09/06 12:44:08] ppdet.engine INFO: Epoch: [0] [ 80/694] learning_rate: 0.000100 loss_xy: 2.896009 loss_wh: 9.597404 loss_iou: 11.019215 loss_iou_aware: 2.083671 loss_obj: 19690.007812 loss_cls: 3.771315 loss: 19731.484375 eta: 5 days, 15:27:58 batch_cost: 2.0214 data_cost: 0.0043 ips: 0.9894 images/s
[09/06 12:44:52] ppdet.engine INFO: Epoch: [0] [100/694] learning_rate: 0.000125 loss_xy: 1.866703 loss_wh: 6.104226 loss_iou: 7.545844 loss_iou_aware: 1.289795 loss_obj: 16324.710938 loss_cls: 2.501549 loss: 16345.630859 eta: 5 days, 15:46:55 batch_cost: 1.9493 data_cost: 0.0044 ips: 1.0260 images/s
[09/06 12:45:36] ppdet.engine INFO: Epoch: [0] [120/694] learning_rate: 0.000150 loss_xy: 2.586228 loss_wh: 8.971266 loss_iou: 10.000286 loss_iou_aware: 1.792044 loss_obj: 13035.096680 loss_cls: 3.907194 loss: 13063.264648 eta: 5 days, 15:19:31 batch_cost: 1.8921 data_cost: 0.0054 ips: 1.0570 images/s
[09/06 12:46:18] ppdet.engine INFO: Epoch: [0] [140/694] learning_rate: 0.000175 loss_xy: 2.207499 loss_wh: 6.565575 loss_iou: 7.611697 loss_iou_aware: 1.629661 loss_obj: 10496.160156 loss_cls: 2.791006 loss: 10524.423828 eta: 5 days, 14:17:52 batch_cost: 1.8222 data_cost: 0.0045 ips: 1.0976 images/s
[09/06 12:47:03] ppdet.engine INFO: Epoch: [0] [160/694] learning_rate: 0.000200 loss_xy: 2.788241 loss_wh: 7.735551 loss_iou: 9.352686 loss_iou_aware: 1.638629 loss_obj: 12725.828125 loss_cls: 3.272127 loss: 12745.859375 eta: 5 days, 14:27:31 batch_cost: 1.9293 data_cost: 0.0042 ips: 1.0366 images/s
[09/06 12:47:46] ppdet.engine INFO: Epoch: [0] [180/694] learning_rate: 0.000225 loss_xy: 1.641495 loss_wh: 5.247956 loss_iou: 6.212466 loss_iou_aware: 1.257844 loss_obj: 10775.449219 loss_cls: 2.172436 loss: 10799.252930 eta: 5 days, 14:21:58 batch_cost: 1.9016 data_cost: 0.0044 ips: 1.0518 images/s
[09/06 12:48:27] ppdet.engine INFO: Epoch: [0] [200/694] learning_rate: 0.000250 loss_xy: 2.180027 loss_wh: 8.158401 loss_iou: 8.442002 loss_iou_aware: 1.384743 loss_obj: 7205.412109 loss_cls: 2.919476 loss: 7235.311523 eta: 5 days, 13:19:02 batch_cost: 1.7625 data_cost: 0.0048 ips: 1.1347 images/s
[09/06 12:49:08] ppdet.engine INFO: Epoch: [0] [220/694] learning_rate: 0.000275 loss_xy: 1.477540 loss_wh: 4.268363 loss_iou: 5.836354 loss_iou_aware: 1.186530 loss_obj: 7730.943848 loss_cls: 2.394650 loss: 7739.220703 eta: 5 days, 12:26:31 batch_cost: 1.7603 data_cost: 0.0042 ips: 1.1362 images/s
[09/06 12:49:53] ppdet.engine INFO: Epoch: [0] [240/694] learning_rate: 0.000300 loss_xy: 2.590953 loss_wh: 7.593391 loss_iou: 8.896871 loss_iou_aware: 1.569753 loss_obj: 8725.173828 loss_cls: 3.015694 loss: 8748.078125 eta: 5 days, 12:47:54 batch_cost: 1.9468 data_cost: 0.0037 ips: 1.0273 images/s
[09/06 12:50:34] ppdet.engine INFO: Epoch: [0] [260/694] learning_rate: 0.000325 loss_xy: 1.451493 loss_wh: 3.573921 loss_iou: 4.793954 loss_iou_aware: 0.891251 loss_obj: 5172.697266 loss_cls: 2.357608 loss: 5193.014648 eta: 5 days, 11:58:31 batch_cost: 1.7383 data_cost: 0.0047 ips: 1.1506 images/s
[09/06 12:51:20] ppdet.engine INFO: Epoch: [0] [280/694] learning_rate: 0.000350 loss_xy: 2.495422 loss_wh: 7.380842 loss_iou: 8.571900 loss_iou_aware: 1.489140 loss_obj: 7458.538574 loss_cls: 3.211772 loss: 7475.631348 eta: 5 days, 12:33:20 batch_cost: 1.9956 data_cost: 0.0046 ips: 1.0022 images/s
[09/06 12:52:06] ppdet.engine INFO: Epoch: [0] [300/694] learning_rate: 0.000375 loss_xy: 2.667129 loss_wh: 7.736473 loss_iou: 9.646427 loss_iou_aware: 1.848844 loss_obj: 6203.896973 loss_cls: 3.637753 loss: 6225.518555 eta: 5 days, 13:19:41 batch_cost: 2.0536 data_cost: 0.0045 ips: 0.9739 images/s
[09/06 12:52:50] ppdet.engine INFO: Epoch: [0] [320/694] learning_rate: 0.000400 loss_xy: 1.905770 loss_wh: 6.750490 loss_iou: 8.367316 loss_iou_aware: 1.303708 loss_obj: 4016.862549 loss_cls: 3.358168 loss: 4045.501953 eta: 5 days, 13:13:09 batch_cost: 1.8746 data_cost: 0.0045 ips: 1.0669 images/s
[09/06 12:53:36] ppdet.engine INFO: Epoch: [0] [340/694] learning_rate: 0.000425 loss_xy: 2.464134 loss_wh: 6.947218 loss_iou: 9.279171 loss_iou_aware: 1.710197 loss_obj: 3692.114990 loss_cls: 2.891335 loss: 3709.433594 eta: 5 days, 13:41:21 batch_cost: 2.0123 data_cost: 0.0044 ips: 0.9939 images/s
[09/06 12:54:21] ppdet.engine INFO: Epoch: [0] [360/694] learning_rate: 0.000450 loss_xy: 2.449516 loss_wh: 7.250659 loss_iou: 8.426804 loss_iou_aware: 1.686657 loss_obj: 3332.286621 loss_cls: 3.388892 loss: 3359.478516 eta: 5 days, 13:54:13 batch_cost: 1.9603 data_cost: 0.0041 ips: 1.0202 images/s
[09/06 12:55:03] ppdet.engine INFO: Epoch: [0] [380/694] learning_rate: 0.000475 loss_xy: 2.405788 loss_wh: 7.137566 loss_iou: 10.130041 loss_iou_aware: 1.706897 loss_obj: 2022.962646 loss_cls: 3.690556 loss: 2088.600586 eta: 5 days, 13:38:09 batch_cost: 1.8360 data_cost: 0.0051 ips: 1.0893 images/s
[09/06 12:55:47] ppdet.engine INFO: Epoch: [0] [400/694] learning_rate: 0.000500 loss_xy: 2.386738 loss_wh: 6.854684 loss_iou: 7.736039 loss_iou_aware: 1.336422 loss_obj: 1716.076050 loss_cls: 2.962488 loss: 1755.665527 eta: 5 days, 13:36:14 batch_cost: 1.8959 data_cost: 0.0042 ips: 1.0549 images/s
[09/06 12:56:29] ppdet.engine INFO: Epoch: [0] [420/694] learning_rate: 0.000525 loss_xy: 1.952707 loss_wh: 6.033942 loss_iou: 7.721992 loss_iou_aware: 1.239112 loss_obj: 1176.099609 loss_cls: 2.753075 loss: 1204.309326 eta: 5 days, 13:17:56 batch_cost: 1.8135 data_cost: 0.0036 ips: 1.1028 images/s
[09/06 12:57:15] ppdet.engine INFO: Epoch: [0] [440/694] learning_rate: 0.000550 loss_xy: 2.112188 loss_wh: 5.578756 loss_iou: 7.561496 loss_iou_aware: 1.409463 loss_obj: 1257.679688 loss_cls: 2.669676 loss: 1268.143677 eta: 5 days, 13:38:53 batch_cost: 2.0105 data_cost: 0.0052 ips: 0.9948 images/s
[09/06 12:58:01] ppdet.engine INFO: Epoch: [0] [460/694] learning_rate: 0.000575 loss_xy: 2.165624 loss_wh: 7.141729 loss_iou: 8.751046 loss_iou_aware: 1.758721 loss_obj: 1063.531982 loss_cls: 2.770040 loss: 1095.287598 eta: 5 days, 14:01:01 batch_cost: 2.0272 data_cost: 0.0043 ips: 0.9866 images/s
[09/06 12:58:39] ppdet.engine INFO: Epoch: [0] [480/694] learning_rate: 0.000600 loss_xy: 2.724758 loss_wh: 8.343337 loss_iou: 11.915663 loss_iou_aware: 2.314242 loss_obj: 406.304474 loss_cls: 3.728558 loss: 443.010071 eta: 5 days, 13:09:29 batch_cost: 1.6177 data_cost: 0.0053 ips: 1.2364 images/s
[09/06 12:59:22] ppdet.engine INFO: Epoch: [0] [500/694] learning_rate: 0.000625 loss_xy: 2.459690 loss_wh: 6.862215 loss_iou: 9.921156 loss_iou_aware: 1.754778 loss_obj: 480.398041 loss_cls: 3.176941 loss: 508.846100 eta: 5 days, 13:02:19 batch_cost: 1.8571 data_cost: 0.0050 ips: 1.0769 images/s
[09/06 13:00:09] ppdet.engine INFO: Epoch: [0] [520/694] learning_rate: 0.000650 loss_xy: 2.200006 loss_wh: 6.274585 loss_iou: 7.192174 loss_iou_aware: 1.604848 loss_obj: 403.658112 loss_cls: 2.637300 loss: 426.816742 eta: 5 days, 13:24:29 batch_cost: 2.0355 data_cost: 0.0043 ips: 0.9826 images/s
[09/06 13:00:48] ppdet.engine INFO: Epoch: [0] [540/694] learning_rate: 0.000675 loss_xy: 2.002530 loss_wh: 5.911060 loss_iou: 7.535842 loss_iou_aware: 1.361517 loss_obj: 217.963593 loss_cls: 3.270000 loss: 233.659866 eta: 5 days, 12:46:24 batch_cost: 1.6594 data_cost: 0.0046 ips: 1.2053 images/s
[09/06 13:01:28] ppdet.engine INFO: Epoch: [0] [560/694] learning_rate: 0.000700 loss_xy: 4.110962 loss_wh: 11.765327 loss_iou: 16.115477 loss_iou_aware: 3.087610 loss_obj: 163.877090 loss_cls: 4.898827 loss: 187.926117 eta: 5 days, 12:21:20 batch_cost: 1.7283 data_cost: 0.0040 ips: 1.1572 images/s
[09/06 13:02:10] ppdet.engine INFO: Epoch: [0] [580/694] learning_rate: 0.000725 loss_xy: 1.868880 loss_wh: 6.276283 loss_iou: 7.577869 loss_iou_aware: 1.505556 loss_obj: 174.427719 loss_cls: 2.514143 loss: 203.450104 eta: 5 days, 12:16:28 batch_cost: 1.8559 data_cost: 0.0047 ips: 1.0776 images/s
[09/06 13:02:48] ppdet.engine INFO: Epoch: [0] [600/694] learning_rate: 0.000750 loss_xy: 2.299885 loss_wh: 5.491686 loss_iou: 7.798112 loss_iou_aware: 1.453382 loss_obj: 113.781784 loss_cls: 2.972398 loss: 130.447342 eta: 5 days, 11:38:17 batch_cost: 1.6162 data_cost: 0.0041 ips: 1.2375 images/s
[09/06 13:03:30] ppdet.engine INFO: Epoch: [0] [620/694] learning_rate: 0.000775 loss_xy: 2.541269 loss_wh: 5.870627 loss_iou: 8.712786 loss_iou_aware: 1.443157 loss_obj: 111.860023 loss_cls: 2.848953 loss: 135.892090 eta: 5 days, 11:24:39 batch_cost: 1.7793 data_cost: 0.0044 ips: 1.1240 images/s
[09/06 13:04:11] ppdet.engine INFO: Epoch: [0] [640/694] learning_rate: 0.000800 loss_xy: 1.991383 loss_wh: 5.638469 loss_iou: 7.793421 loss_iou_aware: 1.286494 loss_obj: 86.394112 loss_cls: 2.925562 loss: 108.461365 eta: 5 days, 11:13:11 batch_cost: 1.7896 data_cost: 0.0050 ips: 1.1176 images/s
[09/06 13:04:54] ppdet.engine INFO: Epoch: [0] [660/694] learning_rate: 0.000825 loss_xy: 2.387271 loss_wh: 7.555526 loss_iou: 10.004704 loss_iou_aware: 1.658684 loss_obj: 100.845337 loss_cls: 3.399056 loss: 124.710709 eta: 5 days, 11:10:27 batch_cost: 1.8531 data_cost: 0.0047 ips: 1.0793 images/s
[09/06 13:05:39] ppdet.engine INFO: Epoch: [0] [680/694] learning_rate: 0.000850 loss_xy: 2.441079 loss_wh: 6.026644 loss_iou: 7.837439 loss_iou_aware: 1.349389 loss_obj: 83.090820 loss_cls: 2.801446 loss: 111.697220 eta: 5 days, 11:24:20 batch_cost: 1.9865 data_cost: 0.0044 ips: 1.0068 images/s
[09/06 13:06:34] ppdet.utils.checkpoint INFO: Save checkpoint: output/ppyolov2_r101vd_dcn_365e_coco
loading annotations into memory...
Done (t=0.10s)
creating index...
index created!
/usr/lib/python3/dist-packages/apport/report.py:13: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import fnmatch, glob, traceback, errno, sys, atexit, locale, imp, stat
Traceback (most recent call last):
  File "tools/train.py", line 140, in <module>
    main()
  File "tools/train.py", line 136, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 111, in run
    trainer.train(FLAGS.eval)
  File "/home/nvidia/AI/PaddleDetection/ppdet/engine/trainer.py", line 317, in train
    self._eval_with_loader(self._eval_loader)
  File "/home/nvidia/AI/PaddleDetection/ppdet/engine/trainer.py", line 333, in _eval_with_loader
    outs = self.model(data)
  File "/usr/local/lib/python3.6/dist-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/home/nvidia/AI/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 29, in forward
    out = self.get_pred()
  File "/home/nvidia/AI/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 75, in get_pred
    bbox_pred, bbox_num = self._forward()
  File "/home/nvidia/AI/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 68, in _forward
    self.inputs['im_shape'], self.inputs['scale_factor'])
  File "/home/nvidia/AI/PaddleDetection/ppdet/modeling/post_process.py", line 59, in __call__
    bbox_pred, bbox_num, _ = self.nms(bboxes, score, self.num_classes)
  File "/home/nvidia/AI/PaddleDetection/ppdet/modeling/layers.py", line 474, in __call__
    normalized=self.normalized)
  File "/home/nvidia/AI/PaddleDetection/ppdet/modeling/ops.py", line 1112, in matrix_nms
    out, index, rois_num = core.ops.matrix_nms(bboxes, scores, attrs)
SystemError:


C++ Traceback (most recent call last):

0 paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&)
1 paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap, paddle::platform::Place const&, bool, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&)
2 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
3 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()


Error Message Summary:

FatalError: Operator matrix_nms raises an std::bad_alloc exception. The exception content is :std::bad_alloc. (at /home/paddle/data/wangye19/Paddle/paddle/fluid/imperative/tracer.cc:172)

Device status during training: [image]

helloyan commented 3 years ago

worker_num=4, batch_size=4. Another config, r50vd, behaves the same way. The input images are 1920*1080.

qingqing01 commented 3 years ago

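Which version of PaddleDetection are you using? It looks like the error occurred during evaluation, after the first training epoch finished; matrix_nms is what failed.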

helloyan commented 3 years ago

2.0.0, installed from the officially packaged whl.

qingqing01 commented 3 years ago

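You could update to the latest 2.2 release and try again. With 2.0, some users reported that evaluation after the first training epoch could hang because the model was still poorly trained. This was fixed later so that no evaluation is run after the first epoch.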
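
The behaviour described above (no evaluation after the first epoch) can be pictured as a small guard in the training loop. This is only an illustrative sketch: `should_evaluate` and its parameters are hypothetical names chosen for this example, and the actual change in PaddleDetection may be implemented differently.

```python
def should_evaluate(epoch_id, snapshot_epoch, end_epoch, skip_first_epoch=True):
    """Decide whether to run evaluation after the 0-based epoch `epoch_id`.

    Hypothetical helper, not PaddleDetection's actual code.
    """
    # After epoch 0 the model is barely trained, so evaluation is skipped.
    if skip_first_epoch and epoch_id == 0:
        return False
    # Otherwise evaluate at every snapshot interval and after the final epoch.
    is_snapshot = (epoch_id + 1) % snapshot_epoch == 0
    is_last = epoch_id == end_epoch - 1
    return is_snapshot or is_last


if __name__ == "__main__":
    # With a snapshot every epoch, every epoch except the first is evaluated.
    print([e for e in range(4) if should_evaluate(e, snapshot_epoch=1, end_epoch=4)])
```

A guard like this avoids running post-processing such as NMS on the output of a model that has seen only one epoch of data, which is the situation in which the error above was raised.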

helloyan commented 3 years ago

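There is no official 2.2 whl for the aarch64 platform, and compiling it myself has produced a pile of problems that I have not resolved yet.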

qingqing01 commented 3 years ago

PaddleDetection does not need a whl package either; you can use the source code directly.

helloyan commented 3 years ago

PaddleDetection 2.2 requires Paddle 2.1, there is no Paddle 2.1 whl, and building Paddle from source has been a real headache.
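
Before upgrading, it can help to check whether the installed Paddle already satisfies the minimum mentioned above. The following is a minimal sketch: `version_tuple` and `meets_minimum` are hypothetical helpers written for this example, and the "2.1" threshold is simply the figure quoted in this comment, not a verified requirement.

```python
import re


def version_tuple(version):
    """Turn a version string such as '2.1.2' into a comparable tuple (2, 1, 2)."""
    # Keep at most three leading numeric components; local tags after '+' are dropped.
    parts = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(p) for p in parts[:3])


def meets_minimum(installed, required):
    """Return True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


if __name__ == "__main__":
    try:
        import paddle  # only importable where PaddlePaddle is installed
        installed = paddle.__version__
    except ImportError:
        installed = "0.0.0"
    # "2.1" is the minimum quoted in the thread, used here as an assumption.
    status = "OK" if meets_minimum(installed, "2.1") else "older than 2.1"
    print(installed, status)
```

With the 2.0.0 whl mentioned earlier in the thread, this check would report the installation as older than 2.1.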

helloyan commented 3 years ago

Could an engineer build the latest version for Jetson?

qingqing01 commented 3 years ago

Where did you find the whl that you installed on the Jetson? The link below has version 2.1; you could try that.

https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html#pytho

helloyan commented 3 years ago

https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html
I downloaded it from here; only version 2.0 has a whl. I don't quite understand how Paddle Inference relates to the link I posted. Doesn't PaddleDetection depend on paddlepaddle? [image]

qingqing01 commented 3 years ago

The whl packages listed under Paddle Inference are built for the Jetson GPU.

qingqing01 commented 3 years ago

Also, you could upgrade PaddleDetection first and see whether it runs. If it does, you can keep using the PaddlePaddle you already have installed for now.

helloyan commented 3 years ago

/usr/bin/ld: cannot find -lcudadevrt
/usr/bin/ld: cannot find -lcudart_static
I tried the fixes I found online and really could not get past this. Could an engineer build a Paddle 2.1.2 that works on Jetson, for JetPack 4.5/4.4?

qingqing01 commented 3 years ago

Here are the 2.1.2 download links for several Jetson variants:

nv-jetson-jetpack4.4-all: https://paddle-inference-lib.bj.bcebos.com/2.1.2/nv-jetson-jetpack4.4-all/paddlepaddle_gpu-2.1.2-cp36-cp36m-linux_aarch64.whl

nv-jetson-jetpack4.4-nano: https://paddle-inference-lib.bj.bcebos.com/2.1.2/nv-jetson-jetpack4.4-nano/paddlepaddle_gpu-2.1.2-cp36-cp36m-linux_aarch64.whl

nv-jetson-jetpack4.4-tx2: https://paddle-inference-lib.bj.bcebos.com/2.1.2/nv-jetson-jetpack4.4-tx2/paddlepaddle_gpu-2.1.2-cp36-cp36m-linux_aarch64.whl

nv-jetson-jetpack4.4-xavier: https://paddle-inference-lib.bj.bcebos.com/2.1.2/nv-jetson-jetpack4.4-xavier/paddlepaddle_gpu-2.1.2-cp36-cp36m-linux_aarch64.whl

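You can give these a try. Please note, though, that we have not tested training on Jetson extensively; Jetson is generally used only for inference.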
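
The four links above share one URL pattern, so the address for a given board can be assembled programmatically. The following is a minimal sketch under that assumption: `jetson_wheel_url` is a hypothetical helper, not part of Paddle, and only the 2.1.2 / JetPack 4.4 / cp36 combinations listed above are confirmed by this thread. Other versions or variants may not exist at the corresponding address.

```python
# Hypothetical helper: builds a wheel URL from the pattern seen in the links above.
BASE = "https://paddle-inference-lib.bj.bcebos.com"

# Board variants that appear in the links posted in this thread.
KNOWN_VARIANTS = ("all", "nano", "tx2", "xavier")


def jetson_wheel_url(variant, version="2.1.2", jetpack="4.4", py_tag="cp36"):
    """Return the wheel URL for a Jetson variant, following the thread's link pattern."""
    if variant not in KNOWN_VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {KNOWN_VARIANTS}")
    folder = f"nv-jetson-jetpack{jetpack}-{variant}"
    wheel = f"paddlepaddle_gpu-{version}-{py_tag}-{py_tag}m-linux_aarch64.whl"
    return f"{BASE}/{version}/{folder}/{wheel}"


if __name__ == "__main__":
    # pip accepts a wheel URL directly as the install target.
    print("python3 -m pip install", jetson_wheel_url("xavier"))
```

The log at the top of the issue reports GPU Compute Capability 7.2, which corresponds to the Xavier family, so the `xavier` (or `all`) wheel is the likely match for that device. After installation, `paddle.utils.run_check()` can be used to confirm that the GPU build loads correctly.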

helloyan commented 3 years ago

It runs now, thanks. I'll keep testing and post the results here.

xuatpham commented 2 years ago

https://paddle-inference-lib.bj.bcebos.com/2.1.2

Thank you! Rebuilding takes far too long every time the Jetson version changes. I did not know I could also get the wheel from here. Thanks.