aim-uofa / AdelaiDet

AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.
https://git.io/AdelaiDet

I cannot train SOLOv2 but I can train BlendMask #403

Open Sakurayx opened 3 years ago

Sakurayx commented 3 years ago

I have trained BlendMask successfully over the past few weeks. However, when I try to train SOLOv2, I get the following error:

[07/05 11:08:10 adet.trainer]: Starting training from iteration 0

/home/mima_1/anaconda3/envs/detectron2/lib/python3.6/site-packages/torch/nn/functional.py:3063: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.

/home/mima_1/anaconda3/envs/detectron2/lib/python3.6/site-packages/torch/nn/functional.py:3103: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.

/home/mima_1/AdelaiDet/adet/modeling/solov2/solov2.py:188: UserWarning: This overload of nonzero is deprecated: nonzero() Consider using one of the following signatures instead: nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
  hit_indices = ((gt_areas >= lower_bound) & (gt_areas <= upper_bound)).nonzero().flatten()

double free or corruption (!prev)
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

How can I solve this problem?
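(As an aside, the nonzero deprecation warning in the log is unrelated to the crash; it only flags the call at solov2.py:188. A minimal, self-contained sketch of the non-deprecated spelling, with hypothetical stand-in values for gt_areas and the bounds:)

import torch

# Hypothetical stand-ins for the tensors used at solov2.py:188.
gt_areas = torch.tensor([10., 50., 200.])
lower_bound, upper_bound = 20., 100.

# Same result as `.nonzero().flatten()`, but spelling out as_tuple
# silences the deprecation warning on PyTorch >= 1.5.
hit_indices = torch.nonzero((gt_areas >= lower_bound) & (gt_areas <= upper_bound),
                            as_tuple=False).flatten()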

tjqansthd commented 3 years ago

@Sakurayx

Hi, I ran into the same problem. In my case, it seems to arise at lines 237-241 of solov2.py:

gt_masks = gt_masks.permute(1, 2, 0).to(dtype=torch.uint8).cpu().numpy()
gt_masks = imrescale(gt_masks, scale=1./output_stride)
if len(gt_masks.shape) == 2:
    gt_masks = gt_masks[..., None]
gt_masks = torch.from_numpy(gt_masks).to(dtype=torch.uint8, device=device).permute(2, 0, 1)

I think the imrescale function is what causes the double-free problem.

I solved it by replacing that code as follows:

#gt_masks = gt_masks.permute(1, 2, 0).to(dtype=torch.uint8).cpu().numpy()
#gt_masks = imrescale(gt_masks, scale=1./output_stride)
#if len(gt_masks.shape) == 2:
#    gt_masks = gt_masks[..., None]
#gt_masks = torch.from_numpy(gt_masks).to(dtype=torch.uint8, device=device).permute(2, 0, 1)

gt_masks = F.interpolate(gt_masks.float().unsqueeze(0), scale_factor=0.25, mode='bilinear', align_corners=False).squeeze(0)
gt_masks = gt_masks.to(dtype=torch.uint8, device=device)

I hope this helps.
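For reference, here is a self-contained sketch of the same idea. It assumes F is torch.nn.functional (as in solov2.py) and uses 1./output_stride instead of the hard-coded 0.25 (which corresponds to output_stride == 4), so it matches the original imrescale call for any stride; the function name rescale_gt_masks is just for illustration.

import torch
import torch.nn.functional as F

def rescale_gt_masks(gt_masks, output_stride, device):
    # Downsample the (N, H, W) uint8 ground-truth masks on the tensor side,
    # avoiding the numpy round-trip through imrescale that triggers the crash.
    masks = gt_masks.float().unsqueeze(0)  # (1, N, H, W) so interpolate treats N as channels
    masks = F.interpolate(masks, scale_factor=1. / output_stride,
                          mode='bilinear', align_corners=False)
    return masks.squeeze(0).to(dtype=torch.uint8, device=device)

Note that bilinear resampling of binary masks produces fractional border values that the uint8 cast truncates to 0, so mask borders shrink slightly at the lower resolution; mode='nearest' avoids that if it matters for your data.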

Chengyuelan commented 3 years ago

I encountered the following error when training BlendMask on my own dataset. How can I solve it?

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

km1562 commented 2 years ago

Add the code below when you train on a single GPU. I use ABCNet, and a single GPU produces the same error because the batch norm is SyncBatchNorm:

import torch.distributed as dist
dist.init_process_group(backend='nccl', init_method='tcp://localhost:23456', rank=0, world_size=1)
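For example, a minimal sketch of where this workaround could go in a training script, guarded so it is a no-op when a process group already exists (the localhost address and port are arbitrary placeholders):

import torch.distributed as dist

# Single-GPU workaround: SyncBatchNorm layers expect a default process group,
# so create a trivial one-process group before the model is built.
if not dist.is_initialized():
    dist.init_process_group(backend='nccl',
                            init_method='tcp://localhost:23456',
                            rank=0, world_size=1)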