MegEngine / Models

Mainstream deep learning models implemented with MegEngine

Help needed: training semantic segmentation with a custom data input pipeline intermittently fails with 'mgb::CudaError' #125

Open Thzny opened 1 year ago

Thzny commented 1 year ago

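Background

I am test-training a semantic segmentation model. Because my data is in a different format, I train with a dataset generator that I wrote myself.

Task description

Training intermittently crashes with the following error.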

terminate called after throwing an instance of 'mgb::CudaError'
  what(): failed to query event: 700: an illegal memory access was encountered

backtrace:
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb9CudaErrorC1ERKSs+0x54) [0x7f30e95ae164]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb12CudaCompNode9EventImpl11do_finishedEv+0xbc) [0x7f30e956358c]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb18CompNodeImplHelper15EventImplHelper8finishedEv+0x57) [0x7f30e957bf47]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x3d7c3d) [0x7f31433bac3d]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x4fcdff) [0x7f31434dfdff]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f3143bb6b43]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f3143c47bb4]
(last_err=700(an illegal memory access was encountered) device=0 mem_free=0.000MiB mem_tot=0.000MiB)
Aborted (core dumped)

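My guess is that GPU memory is leaking, but I can't tell where the leak is. I monitored the GPU throughout training, and GPU memory was never fully used; usage stayed at about half.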
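CUDA error 700 is the illegal-address error, which is a different code from the out-of-memory error, so a leak is not the only candidate. With a custom dataset generator, one possibility worth ruling out is an occasional batch with label values outside the range the loss expects, or with mismatched shapes. The sketch below is not taken from this issue: `check_batch`, `NUM_CLASSES`, and `IGNORE_INDEX` are hypothetical names and values, and the NCHW image / NHW label layout is an assumption about the pipeline. It checks a batch on the CPU with NumPy before it is sent to the GPU.

```python
import numpy as np

NUM_CLASSES = 21    # hypothetical: number of classes the model predicts
IGNORE_INDEX = 255  # hypothetical: label value the loss is configured to skip


def check_batch(images, labels, num_classes=NUM_CLASSES, ignore_index=IGNORE_INDEX):
    """Return a list of problems found in one batch (an empty list means OK)."""
    problems = []
    images = np.asarray(images)
    labels = np.asarray(labels)

    # Assumed layouts: images are NCHW, labels are NHW.
    if images.ndim != 4:
        problems.append(f"images should be 4-D (NCHW), got shape {images.shape}")
    if labels.ndim != 3:
        problems.append(f"labels should be 3-D (NHW), got shape {labels.shape}")
    if images.ndim == 4 and labels.ndim == 3:
        same_n = images.shape[0] == labels.shape[0]
        same_hw = images.shape[2:] == labels.shape[1:]
        if not (same_n and same_hw):
            problems.append(
                f"shape mismatch: images {images.shape}, labels {labels.shape}"
            )

    # NaN or Inf pixels usually point at a decoding or normalization bug.
    if np.issubdtype(images.dtype, np.floating) and not np.isfinite(images).all():
        problems.append("images contain NaN or Inf")

    # Every label must be a class index in [0, num_classes) or the ignore value.
    if not np.issubdtype(labels.dtype, np.integer):
        problems.append(f"labels should be an integer dtype, got {labels.dtype}")
    else:
        valid = (labels == ignore_index) | ((labels >= 0) & (labels < num_classes))
        if not valid.all():
            bad = np.unique(labels[~valid]).tolist()
            problems.append(f"label values outside [0, {num_classes}): {bad}")

    return problems
```

Calling `check_batch` on every batch yielded by the generator, and logging any non-empty result together with the sample identifiers, would show whether the crash coincides with a malformed batch.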

FateScript commented 1 year ago

Is there any more error information?
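One general way to get a more precise report for asynchronous CUDA failures is to make kernel launches synchronous with the `CUDA_LAUNCH_BLOCKING` environment variable, so that the error tends to surface at the call that caused it rather than at a later event query. The minimal sketch below assumes the variable is set at the very top of the training script, before anything initializes CUDA. It slows training down, and whether it narrows down this particular crash is not guaranteed.

```python
import os

# Must run before the first import of megengine (or any other library that
# initializes CUDA), because the variable is read at CUDA initialization.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import megengine  # import the framework only after the variable is set
```

Setting the variable in the shell that launches the training script has the same effect.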