Open Thzny opened 1 year ago
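While testing the training of a semantic segmentation model, I used a dataset generator I wrote myself, because my data is in a different format.
Training fails intermittently with the following error.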
terminate called after throwing an instance of 'mgb::CudaError' what(): failed to query event: 700: an illegal memory access was encountered
backtrace:
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb9CudaErrorC1ERKSs+0x54) [0x7f30e95ae164]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb12CudaCompNode9EventImpl11do_finishedEv+0xbc) [0x7f30e956358c]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb18CompNodeImplHelper15EventImplHelper8finishedEv+0x57) [0x7f30e957bf47]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x3d7c3d) [0x7f31433bac3d]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x4fcdff) [0x7f31434dfdff]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f3143bb6b43]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f3143c47bb4]
(last_err=700(an illegal memory access was encountered) device=0 mem_free=0.000MiB mem_tot=0.000MiB)
Aborted (core dumped)
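My guess is a GPU memory leak, but I cannot tell where the leak would be. I monitored the GPU throughout training, and GPU memory was never fully used; usage stayed at about half.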
Is there any more error information?
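One thing that may help narrow this down: CUDA error 700 is an illegal address error, which is a different code from out-of-memory, so steady half-full GPU memory is consistent with it. With a hand-written dataset generator, a common (but here unconfirmed) cause is a label value outside the class range reaching an indexing kernel. The sketch below is only an assumption about how one might check for that on the CPU before each batch goes to the GPU. `check_batch`, `num_classes`, `ignore_index` and the `(N, C, H, W)` / `(N, H, W)` layouts are hypothetical and not taken from the original post. `CUDA_LAUNCH_BLOCKING` is a standard CUDA environment variable that makes kernel launches synchronous, which may make the backtrace point closer to the failing operation.

```python
import os

# Make CUDA kernel launches synchronous so an error surfaces at the launch
# that caused it. Assumption: this must be set before the framework
# initialises CUDA, so it comes before importing megengine.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import numpy as np


def check_batch(images, labels, num_classes, ignore_index=255):
    """Return a list of problems found in one batch (empty list means OK).

    Hypothetical helper. Assumes images are (N, C, H, W) and labels are
    (N, H, W) integer class maps, with ignore_index marking unlabeled pixels.
    """
    problems = []
    images = np.asarray(images)
    labels = np.asarray(labels)

    if not np.issubdtype(labels.dtype, np.integer):
        problems.append("labels have non-integer dtype %s" % labels.dtype)

    if not np.isfinite(images).all():
        problems.append("images contain NaN or Inf")

    # Out-of-range class ids are a typical source of out-of-bounds indexing.
    valid = labels[labels != ignore_index]
    if valid.size and (valid.min() < 0 or valid.max() >= num_classes):
        problems.append(
            "label values span [%d, %d], expected [0, %d]"
            % (int(valid.min()), int(valid.max()), num_classes - 1)
        )

    if images.shape[0] != labels.shape[0] or images.shape[-2:] != labels.shape[-2:]:
        problems.append(
            "shape mismatch: images %s vs labels %s" % (images.shape, labels.shape)
        )

    return problems
```

Calling `check_batch(images, labels, num_classes)` on every batch the generator yields, and logging any non-empty result together with the sample index, should show whether the intermittent crash lines up with specific samples.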