dbolya / yolact

A simple, fully convolutional model for real-time instance segmentation.
MIT License
Warning: Moving average ignored a value of inf #359

Open sdimantsd opened 4 years ago

sdimantsd commented 4 years ago

Hi, I'm trying to train YOLACT to detect cars using images from COCO. I took all of the images that contain cars and made a dataset from them. My config looks like this:

` only_cars_coco2017_dataset = dataset_base.copy({
'name': 'cars COCO 2017',

# Training images and annotations
'train_info': '/home/ws/data/COCO/only_cars_train.json',
'train_images':   '/home/ws/data/COCO/train/train2017/',

# Validation images and annotations.
'valid_info': '/home/ws/data/COCO/only_cars_val.json',
'valid_images':   '/home/ws/data/COCO/val/val2017/',

'class_names': ('car'),
'label_map': {1: 1}

})

yolact_im200_coco_cars_config = yolact_base_config.copy({
'name': 'yolact_im200_coco_cars',

# Dataset stuff
'dataset': only_cars_coco2017_dataset,
'num_classes': len(only_cars_coco2017_dataset.class_names) + 1,

'masks_to_train': 20,
'max_num_detections': 20,
'max_size': 200,
'backbone': yolact_base_config.backbone.copy({
    'pred_scales': [[int(x[0] / yolact_base_config.max_size * 200)] for x in yolact_base_config.backbone.pred_scales],
}),

}) `
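As a worked example of the pred_scales expression in the config above, here is the arithmetic on its own. The base values (a max_size of 550 and pred_scales of 24 through 384) are assumptions about yolact_base_config and should be checked against your copy of config.py.

```python
# Assumed values from yolact_base_config; verify them in your config.py.
base_max_size = 550
base_pred_scales = [[24], [48], [96], [192], [384]]
new_max_size = 200

# Same expression as in the config: rescale each anchor size, truncating to int.
pred_scales = [[int(x[0] / base_max_size * new_max_size)] for x in base_pred_scales]

print(pred_scales)  # [[8], [17], [34], [69], [139]]
```

Note that int() truncates rather than rounds, so 8.73 becomes 8, and that only the first scale of each layer (x[0]) is carried over.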

After a few iterations, my loss becomes very high...

Can someone help me with this?

Update: If I train with the full COCO dataset, I get the same error...
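One detail in the config above that may be worth checking: in Python, ('car') is a plain string, not a one-element tuple, so len() counts its characters. The thread does not confirm that this causes the diverging loss (the full COCO run reportedly fails as well), so treat the following only as a sketch of the pitfall.

```python
class_names_str = ('car')      # parentheses alone do not create a tuple
class_names_tuple = ('car',)   # the trailing comma creates a one-element tuple

print(type(class_names_str).__name__, len(class_names_str))      # str 3
print(type(class_names_tuple).__name__, len(class_names_tuple))  # tuple 1

# The config computes num_classes as len(class_names) + 1 (background).
num_classes_from_str = len(class_names_str) + 1      # 4
num_classes_from_tuple = len(class_names_tuple) + 1  # 2
```

With the string form, num_classes would come out as 4 instead of the 2 expected for a single class plus background.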

jasonkena commented 4 years ago

I'm not sure, it may be that the Mask-Rescoring network has fully converged (but this is unlikely).

But usually, I just disable the Mask-Rescoring loss.
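For context on the warning in the issue title: the message suggests that the loss logger skips non-finite values instead of averaging them in. Below is a minimal sketch of that idea. The class and method names are hypothetical and this is not YOLACT's actual code.

```python
import math
from collections import deque


class FiniteMovingAverage:
    """Windowed mean that skips inf/NaN values (hypothetical helper)."""

    def __init__(self, max_window_size=1000):
        # deque with maxlen drops the oldest entry once the window is full.
        self.window = deque(maxlen=max_window_size)

    def add(self, value):
        # Refuse non-finite values so a single inf does not poison the average.
        if not math.isfinite(value):
            print(f"Warning: Moving average ignored a value of {value}")
            return False
        self.window.append(value)
        return True

    def get_avg(self):
        return sum(self.window) / max(len(self.window), 1)


avg = FiniteMovingAverage(max_window_size=3)
for loss in [2.0, float("inf"), 4.0]:
    avg.add(loss)
print(avg.get_avg())  # 3.0
```

Under this reading, the warning is a symptom: some loss term evaluated to inf on that iteration, and the logger only kept it out of the displayed average.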

Auth0rM0rgan commented 4 years ago

@jasonkena Also, I'm getting a Keywords error during training when 'use_amp' is True, from this line: https://github.com/dbolya/yolact/blob/092554ad707c2749631dfe545c8a953b2b3f4a68/train.py#L168 I can get rid of it with a try-except.

What would be the impact of disabling the Mask-Rescoring loss on the model's performance? Is it going to hurt performance?

jasonkena commented 4 years ago

Can you try cloning my branch into a completely new directory? @sdimantsd and I didn't get any of your errors running it out of the box.

According to the YOLACT++ paper, the Mask-Rescoring loss improves the performance by 1 mAP.

Auth0rM0rgan commented 4 years ago

Hey @jasonkena, I'm getting this error during testing with eval.py. It looks like AMP needs to be set up inside eval.py as well.

RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

jasonkena commented 4 years ago

Again, please clone my branch from scratch. Neither I nor sdimantsd can reproduce your problem.

You need to install conda for this. Here are the complete instructions. Follow all of them:

  1. git clone -b amp https://github.com/jasonkena/yolact/
  2. git clone https://github.com/NVIDIA/apex
  3. cd apex and pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ and cd ../yolact
  4. Change the name field within environment.yml, then run conda env create -f environment.yml (to create a new, clean environment)
  5. cd external/DCNv2 and python setup.py build develop
  6. Change use_amp to True in config.py
  7. Set up the rest of the config

Rm1n90 commented 4 years ago

Hey @jasonkena, I've trained my model with AMP without any problem, but like @Auth0rM0rgan, when I try to evaluate the model on a webcam I get RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same. I've followed the instructions you wrote. Do you know how to solve the problem? Thanks!

jasonkena commented 4 years ago

Can you give the whole traceback?

Rm1n90 commented 4 years ago

Loading model... Done.
Initializing model... Traceback (most recent call last):
  File "eval.py", line 1456, in <module>
    evaluate(net, dataset)
  File "eval.py", line 1191, in evaluate
    evalvideo(net, args.video)
  File "eval.py", line 1079, in evalvideo
    first_batch = eval_network(transform_frame(get_next_frame(vid)))
  File "eval.py", line 961, in eval_network
    out = net(imgs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/instanceSegmentation/yolact_amp/yolact.py", line 725, in forward
    outs = self.backbone(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/instanceSegmentation/yolact_amp/backbone.py", line 219, in forward
    x = layer(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/instanceSegmentation/yolact_amp/backbone.py", line 80, in forward
    out = self.conv3(out)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

I don't get this error if I set use_amp=False during eval.
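The traceback ends in a convolution that receives a half-precision input while its weights are still float32. A minimal sketch of that class of mismatch, run on CPU with a standalone Conv2d rather than the YOLACT network, is shown below. The function name is hypothetical, and the exact error text differs between PyTorch versions and devices.

```python
try:
    import torch
except ImportError:  # keep the sketch importable where PyTorch is not installed
    torch = None


def conv_dtype_check():
    """Return (mismatch_raises, matched_ok), or None if PyTorch is unavailable."""
    if torch is None:
        return None
    conv = torch.nn.Conv2d(3, 4, kernel_size=3)  # weights are float32 by default
    x = torch.randn(1, 3, 8, 8)                  # float32 input

    # Half input against float32 weights: PyTorch rejects the dtype mismatch.
    try:
        conv(x.half())
        mismatch_raises = False
    except RuntimeError:
        mismatch_raises = True

    # Matching dtypes: the same layer runs normally.
    out = conv(x)
    return mismatch_raises, out.dtype == torch.float32


print(conv_dtype_check())
```

This only illustrates why the input and weight dtypes must agree. It does not show how AMP arranges that agreement during inference.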

jasonkena commented 4 years ago

Sorry @Auth0rM0rgan, I believe you were right. I did not initialize amp within eval.py, which is why the problem only showed up during inference.

@Rm1n90, to fix it I believe you have to add

if args.cuda:
    net = net.cuda()
if cfg.use_amp:
    from apex import amp

    if not args.cuda:
        raise ValueError("amp must be used with CUDA")
    net = amp.initialize(net, opt_level="O1")

before net = CustomDataParallel(net).cuda() (https://github.com/jasonkena/yolact/blob/e1a949445dc0c57eb7c8f10470630faff0ce22e2/eval.py#L913)

I haven't tested it, can you tell me how it turns out?

Rm1n90 commented 4 years ago

@jasonkena, thanks, eval is now working with AMP.