Jittor / JDet

JDet is an object detection benchmark based on Jittor. It mainly focuses on aerial image object detection (oriented object detection).
Apache License 2.0

fair1m_1_5 baseline fails to run #49

Open Pull256 opened 2 years ago

Pull256 commented 2 years ago

Test environment

Windows 10 21H2, WSL2 Ubuntu 22.04 LTS (kernel 4.19.128-microsoft-standard), miniconda3 + Python 3.7, CUDA 11.7, GPU model: 1060

Errors

With CUDA

Loading config from: configs/s2anet/s2anet_r50_fpn_1x_fair1m_1_5.py
[w 0809 11:21:49.947748 32 __init__.py:1344] load parameter fc.weight failed ...
[w 0809 11:21:49.947908 32 __init__.py:1344] load parameter fc.bias failed ...
[w 0809 11:21:50.017176 32 __init__.py:1363] load total 267 params, 2 failed
Tue Aug 9 11:21:50 2022 Start running
Traceback (most recent call last):
  File "tools/run_net.py", line 56, in <module>
    main()
  File "tools/run_net.py", line 47, in main
    runner.run()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 84, in run
    self.train()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 126, in train
    losses = self.model(images,targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/networks/s2anet.py", line 35, in execute
    outputs = self.bbox_head(features, targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 627, in execute
    return self.loss(outs,self.parse_targets(targets))
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 360, in loss
    sampling=self.sampling)
  File "/home/la/JT/JDet/python/jdet/models/boxes/anchor_target.py", line 74, in anchor_target
    unmap_outputs=unmap_outputs)
  File "/home/la/JT/JDet/python/jdet/utils/general.py", line 53, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/la/JT/JDet/python/jdet/models/boxes/anchor_target.py", line 127, in anchor_target_single
    if not inside_flags.any(0):
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 1735, in to_bool
    return ori_bool(v.item())
RuntimeError: [f 0809 11:21:58.030555 32 executor.cc:665] Execute fused operator(26/2009) failed.

[OP TYPE]: fused_op:( broadcast_to, reindex, binary.multiply, reduce.add,)
[Input]: float32[64,64,3,3,]backbone.layer1.0.conv2.weight, float32[2,64,256,256,],

 tools/run_net.py:56 <<module>>
 tools/run_net.py:47 <main>
 /home/la/JT/JDet/python/jdet/runner/runner.py:84 <run>
 /home/la/JT/JDet/python/jdet/runner/runner.py:126 <train>
 /home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py:950 <__call__>
 /home/la/JT/JDet/python/jdet/models/networks/s2anet.py:30 <execute>
 /home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py:950 <__call__>
 /home/la/JT/JDet/python/jdet/models/backbones/resnet.py:166 <execute>
 /home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py:950 <__call__>
 /home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/nn.py:2054 <execute>
 /home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py:950 <__call__>
 /home/la/JT/JDet/python/jdet/models/backbones/resnet.py:84 <execute>
 /home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py:950 <__call__>
 /home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/nn.py:847 <execute>

[Reason]: [f 0809 11:21:58.030132 32 helper_cuda.h:128] CUDA error at /home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/src/mem/allocator/cuda_managed_allocator.cc:23 code=2( cudaErrorMemoryAllocation ) cudaMallocManaged(&ptr, size)

With the --no_cuda argument added

Tue Aug 9 11:15:30 2022 Start running
Traceback (most recent call last):
  File "tools/run_net.py", line 56, in <module>
    main()
  File "tools/run_net.py", line 47, in main
    runner.run()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 84, in run
    self.train()
  File "/home/la/JT/JDet/python/jdet/runner/runner.py", line 126, in train
    losses = self.model(images,targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/networks/s2anet.py", line 35, in execute
    outputs = self.bbox_head(features, targets)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 625, in execute
    outs = multi_apply(self.forward_single, feats, self.anchor_strides)
  File "/home/la/JT/JDet/python/jdet/utils/general.py", line 53, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 236, in forward_single
    align_feat = self.align_conv(x, refine_anchor.clone(), stride)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/models/roi_heads/s2anet_head.py", line 722, in execute
    x = self.relu(self.deform_conv(x, offset_tensor))
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 950, in __call__
    return self.execute(*args, **kw)
  File "/home/la/JT/JDet/python/jdet/ops/dcn_v1.py", line 696, in execute
    self.dilation, self.groups, self.deformable_groups)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 1603, in apply
    return func(*args, **kw)
  File "/home/la/miniconda3/envs/JT/lib/python3.7/site-packages/jittor/__init__.py", line 1559, in __call__
    ori_res = self.execute(*args)
  File "/home/la/JT/JDet/python/jdet/ops/dcn_v1.py", line 589, in execute
    raise NotImplementedError
NotImplementedError

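What I have tried

I searched around: code=2( cudaErrorMemoryAllocation ) seems to be memory-related and is raised when the program cannot get the memory it needs. Searching this repo, I found an issue with a similar error code, but I don't know how to resolve it. As for the error after adding the no-CUDA argument, I don't really understand it either; I only know that it means a subclass has not implemented an interface that the parent class requires.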
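The second traceback ends in an explicit `raise NotImplementedError` inside `dcn_v1.py`, reached only when running with `--no_cuda`. One plausible reading, not confirmed anywhere in this thread, is that the deformable convolution op has no CPU code path. The sketch below illustrates that pattern in plain Python; `DeformConvSketch` and its `use_cuda` flag are hypothetical stand-ins and are not JDet's actual code.

```python
class DeformConvSketch:
    """Hypothetical op that only provides an accelerated (CUDA) path."""

    def __init__(self, use_cuda):
        self.use_cuda = use_cuda

    def execute(self, values):
        if not self.use_cuda:
            # No CPU fallback exists in this sketch, so the call fails here,
            # similar in spirit to the bare `raise NotImplementedError` in the log.
            raise NotImplementedError("no CPU implementation in this sketch")
        # Placeholder for the real kernel: just double every value.
        return [v * 2 for v in values]


def try_run(use_cuda):
    """Return the op's result, or the string 'unsupported' if there is no code path."""
    try:
        return DeformConvSketch(use_cuda).execute([1, 2, 3])
    except NotImplementedError:
        return "unsupported"


cuda_result = try_run(use_cuda=True)
cpu_result = try_run(use_cuda=False)
```

If that reading is right, disabling CUDA cannot work around the out-of-memory error for this model, because the model depends on an op that has no non-CUDA implementation.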

514flowey commented 2 years ago

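This may be caused by insufficient GPU memory; running this config needs roughly 12 GB of GPU memory, so a GPU with more than 12 GB is recommended. You can try changing batch_size to 1.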
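To see why batch size matters, here is a rough back-of-the-envelope calculation for the single float32 tensor reported in the failing fused operator, `float32[2,64,256,256,]`. It assumes the leading dimension is the batch size (a guess based on the usual N, C, H, W layout) and it covers only this one activation, not the whole model, so it shows the scaling rather than the total requirement.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; float32 uses 4 bytes per element."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shape taken from the [Input] line of the error log above.
logged_shape = (2, 64, 256, 256)

# Assumption: the first dimension is the batch size.
bytes_at_batch_2 = tensor_bytes(logged_shape)
bytes_at_batch_1 = tensor_bytes((1,) + logged_shape[1:])

mib_at_batch_2 = bytes_at_batch_2 / 2**20
mib_at_batch_1 = bytes_at_batch_1 / 2**20
print(mib_at_batch_2, "MiB at batch size 2")
print(mib_at_batch_1, "MiB at batch size 1")
```

This one activation takes 32 MiB at batch size 2 and 16 MiB at batch size 1. Activations scale with batch size, but weights and other fixed allocations do not, so halving the batch does not halve the total footprint.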

Pull256 commented 2 years ago

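(Quoting 514flowey) This may be caused by insufficient GPU memory; running this config needs roughly 12 GB of GPU memory, so a GPU with more than 12 GB is recommended. You can try changing batch_size to 1.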

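Thanks for your reply. After setting batch_size=1 in configs/s2anet/s2anet_r50_fpn_1x_fair1m_1_5.py, the error is unchanged. Following the forum, I also ran a few tests:

python -m jittor.test.test_cuda: passes
python -m jittor.test.test_array: fails with code=700( cudaErrorIllegalAddress ) and code=4( CUDNN_STATUS_INTERNAL_ERROR )
python -m jittor.test.test_resnet: training runs normally

After setting the environment variable use_cuda_managed_allocator=0, the error is still the same. It does look like a problem with my machine, because python -m jittor.test.test_resnet works but switching to a different dataset fails. I'm not sure whether my guess is right.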
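For anyone repeating these checks, the following is a small sketch of running the same three test modules from Python with the environment variable applied. The module names and the variable name come from the post above; the helper names `build_command` and `run_diagnostics` are made up for this example, and it assumes Jittor picks up the variable from the process environment, as it did when set in the shell.

```python
import os
import subprocess
import sys

# Test modules named in the post above.
DIAGNOSTICS = [
    "jittor.test.test_cuda",
    "jittor.test.test_array",
    "jittor.test.test_resnet",
]


def build_command(module, extra_env=None):
    """Return (argv, env) for running `python -m module` with extra variables set."""
    env = dict(os.environ)
    env.update(extra_env or {})
    return [sys.executable, "-m", module], env


def run_diagnostics(extra_env=None):
    """Run every diagnostic module and collect its exit code (requires Jittor)."""
    results = {}
    for module in DIAGNOSTICS:
        cmd, env = build_command(module, extra_env)
        results[module] = subprocess.run(cmd, env=env).returncode
    return results


# Only build the command here; call run_diagnostics(...) on a machine with Jittor.
cmd, env = build_command(
    "jittor.test.test_array", {"use_cuda_managed_allocator": "0"}
)
```

A non-zero exit code from `run_diagnostics` marks a failing module, which makes it easy to compare results with and without the variable set.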

514flowey commented 2 years ago

Roughly how much GPU memory does your machine have?
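One way to answer this is to query `nvidia-smi`. The sketch below uses its `--query-gpu` and `--format=csv,noheader,nounits` options and parses the result; the helper names are invented for this example, and the sample line is a made-up illustration of the output format, not a measurement from this thread.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'name, total, used' lines (values in MiB) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name would not break parsing.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return the parsed memory report (needs an NVIDIA driver)."""
    output = subprocess.run(
        QUERY, capture_output=True, text=True, check=True
    ).stdout
    return parse_gpu_memory(output)


# Illustrative sample only; real values come from query_gpu_memory().
sample_output = "NVIDIA GeForce GTX 1060, 6144, 512\n"
sample_report = parse_gpu_memory(sample_output)
```

Comparing `total_mib` with the roughly 12 GB mentioned earlier shows immediately whether the card is large enough for this config.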

Pull256 commented 2 years ago

I'm on a laptop with a 1060 6 GB, which is far short of the 12 GB you mentioned 😂

cxjyxxme commented 2 years ago

In that case you could try a lighter model, or rent a server for training.

vicdxxx commented 2 years ago

My 2080 Ti with 11 GB of GPU memory can't run it either; I get the same problem.