Open Pull256 opened 2 years ago
This may be caused by insufficient GPU memory: running needs roughly 12 GB, so a GPU with 12 GB or more is recommended. You can try changing batch_size to 1.
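Thank you for your reply. After setting batch_size=1 in configs/s2anet/s2anet_r50_fpn_1x_fair1m_1_5.py, the error persists. Following suggestions from the forum, I also ran a few tests:
python -m jittor.test.test_cuda: passes
python -m jittor.test.test_array: fails with code=700( cudaErrorIllegalAddress ), code=4( CUDNN_STATUS_INTERNAL_ERROR )
python -m jittor.test.test_resnet: trains normally
After setting the environment variable use_cuda_managed_allocator=0, the error persists.
It does seem to be a problem with my machine, because python -m jittor.test.test_resnet works but fails as soon as I switch to a different dataset. I'm not sure whether my guess is right.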
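The checks described above can be scripted so that they always run in the same order and with the same environment. This is a minimal sketch, assuming the `jittor.test.*` modules named in this thread are installed and that `use_cuda_managed_allocator` is read from the environment the way the poster used it. `build_commands` and `run_checks` are hypothetical helper names, not part of Jittor.

```python
import os
import subprocess
import sys

# Test modules mentioned in this thread.
TEST_MODULES = [
    "jittor.test.test_cuda",
    "jittor.test.test_array",
    "jittor.test.test_resnet",
]


def build_commands(python=sys.executable, modules=TEST_MODULES):
    """Return one `python -m <module>` command per test module."""
    return [[python, "-m", module] for module in modules]


def run_checks(disable_managed_allocator=False):
    """Run each test module and return {module: exit code}."""
    env = dict(os.environ)
    if disable_managed_allocator:
        # Same variable the poster set; assumed to be honored by Jittor.
        env["use_cuda_managed_allocator"] = "0"
    results = {}
    for module, cmd in zip(TEST_MODULES, build_commands()):
        # A non-zero exit code means the module reported a failure.
        results[module] = subprocess.run(cmd, env=env).returncode
    return results
```

Calling `run_checks()` and then `run_checks(disable_managed_allocator=True)` reproduces the two rounds of testing described in the post and makes it easy to compare which modules fail in each.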
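How much GPU memory does your machine have?
It's a laptop with a 1060 (6 GB), far short of the 12 GB you mentioned 😂
Then you could try a lighter model, or rent a server for training.
It doesn't run on my 2080 Ti with 11 GB of GPU memory either; same problem.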
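Since the replies hinge on how much GPU memory is actually free, it can help to check that before launching training. This is a sketch, assuming `nvidia-smi` is on the PATH. The 12 GB threshold is the rough figure quoted in this thread, not a measured requirement, and `parse_memory_mib`, `gpu_memory_mib`, and `report` are hypothetical helper names.

```python
import subprocess

REQUIRED_MIB = 12 * 1024  # rough figure quoted in this thread


def parse_memory_mib(csv_text):
    """Parse 'total, used' lines (MiB, no header or units) into tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        total, used = (int(field.strip()) for field in line.split(","))
        rows.append((total, used))
    return rows


def gpu_memory_mib():
    """Query nvidia-smi for total and used memory of each GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_memory_mib(out)


def report(rows, required_mib=REQUIRED_MIB):
    """Return one human-readable line per GPU."""
    lines = []
    for index, (total, used) in enumerate(rows):
        free = total - used
        verdict = "enough" if free >= required_mib else "likely too little"
        lines.append(f"GPU {index}: {free} MiB free of {total} MiB ({verdict})")
    return lines
```

Running `print("\n".join(report(gpu_memory_mib())))` on the training machine shows at a glance whether the card is in the range the maintainers describe.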
Test environment
Windows 10 21H2
WSL2 Ubuntu 22.04 LTS (4.19.128-microsoft-standard)
miniconda3 + Python 3.7
CUDA 11.7
GPU model: 1060
Error
With CUDA
Loading config from: configs/s2anet/s2anet_r50_fpn_1x_fair1m_1_5.py
[w 0809 11:21:49.947748 32 __init__.py:1344] load parameter fc.weight failed ...
[w 0809 11:21:49.947908 32 __init__.py:1344] load parameter fc.bias failed ...
[w 0809 11:21:50.017176 32 __init__.py:1363] load total 267 params, 2 failed
Tue Aug 9 11:21:50 2022 Start running
Traceback (most recent call last):
  File "tools/run_net.py", line 56, in <module>
    main()
  File "tools/run_net.py", line 47, in main
    runner.run()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 84, in run
    self.train()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 126, in train
    losses = self.model(images,targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/networks/s2anet.py", line 35, in execute
    outputs = self.bbox_head(features, targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 627, in execute
    return self.loss(*outs,**self.parse_targets(targets))
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 360, in loss
    sampling=self.sampling)
  File "/home/la/JT/JDet/python/jdet/models/boxes/anchor_target.py", line 74, in anchor_target
    unmap_outputs=unmap_outputs)
  File "/home/la/JT/JDet/python/jdet/utils/general.py", line 53, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/la/JT/JDet/python/jdet/models/boxes/anchor_target.py", line 127, in anchor_target_single
    if not inside_flags.any(0):
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 1735, in to_bool
    return ori_bool(v.item())
RuntimeError: [f 0809 11:21:58.030555 32 executor.cc:665]
Execute fused operator(26/2009) failed.
[OP TYPE]: fused_op:( broadcast_to, reindex, binary.multiply, reduce.add,)
[Input]: float32[64,64,3,3,]backbone.layer1.0.conv2.weight, float32[2,64,256,256,],
[Reason]: [f 0809 11:21:58.030132 32 helper_cuda.h:128] CUDA error at /home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/src/mem/allocator/cuda_managed_allocator.cc:23 code=2( cudaErrorMemoryAllocation ) cudaMallocManaged(&ptr, size)
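The shapes on the `[Input]` line give a sense of how large the individual buffers of the failing operator are. Here is a small worked example, assuming 4 bytes per float32 element. `tensor_bytes` is a hypothetical helper, and the log does not say which allocation actually failed.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (float32 by default)."""
    return reduce(mul, shape, 1) * bytes_per_element


weight = tensor_bytes((64, 64, 3, 3))         # backbone.layer1.0.conv2.weight
activation = tensor_bytes((2, 64, 256, 256))  # second input of the fused op

print(weight / 2**10, "KiB")      # 144.0 KiB
print(activation / 2**20, "MiB")  # 32.0 MiB
```

Neither input is large by itself. If a request of this order is what failed, the device was probably close to full already by operator 26 of 2009, which would be consistent with the memory explanation given in the replies.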
With the --no_cuda argument added
Tue Aug 9 11:15:30 2022 Start running
Traceback (most recent call last):
  File "tools/run_net.py", line 56, in <module>
    main()
  File "tools/run_net.py", line 47, in main
    runner.run()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 84, in run
    self.train()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 126, in train
    losses = self.model(images,targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/networks/s2anet.py", line 35, in execute
    outputs = self.bbox_head(features, targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 625, in execute
    outs = multi_apply(self.forward_single, feats, self.anchor_strides)
  File "/home/la/JT/JDet/python/jdet/utils/general.py", line 53, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 236, in forward_single
    align_feat = self.align_conv(x, refine_anchor.clone(), stride)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 722, in execute
    x = self.relu(self.deform_conv(x, offset_tensor))
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/ops/dcn_v1.py", line 696, in execute
    self.dilation, self.groups, self.deformable_groups)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 1603, in apply
    return func(*args, **kw)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 1559, in __call__
    ori_res = self.execute(*args)
  File "/home/la/JT/JDet/python/jdet/ops/dcn_v1.py", line 589, in execute
    raise NotImplementedError
NotImplementedError
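What I have tried
I searched a bit: code=2( cudaErrorMemoryAllocation ) seems to be memory-related and is raised when the program cannot get the memory it needs. Searching this repo, I found an issue with a similar error code, but I don't know how to resolve it. I also don't really understand the error raised after adding the argument that disables CUDA; all I know is that a subclass has not implemented an interface that its parent class requires it to implement.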
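The reading of `NotImplementedError` given above can be illustrated with a generic pattern: a base class declares a method, and only some subclasses or code paths provide it. This is a plain-Python sketch with hypothetical class names (`BaseOp`, `CudaOnlyOp`). It is not the code in `dcn_v1.py`, and whether JDet's deformable convolution fails under --no_cuda for this reason is an assumption.

```python
class BaseOp:
    """Base class: subclasses are expected to override execute()."""

    def execute(self, x):
        raise NotImplementedError


class CudaOnlyOp(BaseOp):
    """Hypothetical op that only has a GPU implementation."""

    def __init__(self, use_cuda):
        self.use_cuda = use_cuda

    def execute(self, x):
        if not self.use_cuda:
            # No CPU fallback: defer to the base class, which raises.
            return super().execute(x)
        return [2 * v for v in x]  # stand-in for the real GPU kernel


print(CudaOnlyOp(use_cuda=True).execute([1, 2]))  # [2, 4]
try:
    CudaOnlyOp(use_cuda=False).execute([1, 2])
except NotImplementedError:
    print("no CPU implementation")
```

If the operator follows a pattern like this, disabling CUDA would not work around the memory error, because the model would then reach an operator with no CPU path at all.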