dbolya / yolact

A simple, fully convolutional model for real-time instance segmentation.

Warning: Moving average ignored a value of inf #359

Open sdimantsd opened 4 years ago

sdimantsd commented 4 years ago

Hi, I'm trying to train YOLACT to detect cars using images from COCO. I took all the images containing cars and built a dataset from them. My config looks like this:

```python
only_cars_coco2017_dataset = dataset_base.copy({
    'name': 'cars COCO 2017',

    # Training images and annotations
    'train_info': '/home/ws/data/COCO/only_cars_train.json',
    'train_images': '/home/ws/data/COCO/train/train2017/',

    # Validation images and annotations.
    'valid_info': '/home/ws/data/COCO/only_cars_val.json',
    'valid_images': '/home/ws/data/COCO/val/val2017/',

    'class_names': ('car'),
    'label_map': {1: 1}
})

yolact_im200_coco_cars_config = yolact_base_config.copy({
    'name': 'yolact_im200_coco_cars',

    # Dataset stuff
    'dataset': only_cars_coco2017_dataset,
    'num_classes': len(only_cars_coco2017_dataset.class_names) + 1,

    'masks_to_train': 20,
    'max_num_detections': 20,
    'max_size': 200,
    'backbone': yolact_base_config.backbone.copy({
        'pred_scales': [[int(x[0] / yolact_base_config.max_size * 200)] for x in yolact_base_config.backbone.pred_scales],
    }),
})
```

After a few iterations, my loss goes very high...

Can someone help me with this?

Update: I also get the same error if I train with the full COCO dataset...

sdimantsd commented 4 years ago

Update: This happens with the YOLACT++ version. The same config and dataset work fine with the YOLACT (non-++) version.

jasonkena commented 4 years ago

I just made a pull request that should fix the inf errors, you can try merging it locally.

sdimantsd commented 4 years ago

Thanks! I will try this next week :)

sdimantsd commented 4 years ago

@jasonkena What is the difference between @dbolya's repo and your repo?

jasonkena commented 4 years ago

For the most part, I added support for Apex's AMP. One of its features is dynamic loss scaling, so your losses will never overflow. Apex also supports 16-bit precision, so that's a plus.

To enable it, change `use_amp` in `config.py` to `True`.
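For reference, the usual apex.amp wiring looks roughly like this (a minimal sketch of the standard Apex API, not the exact code in the branch; `build_model` and `data_loader` are placeholders):

```python
import torch
from apex import amp

model = build_model().cuda()                       # placeholder model constructor
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# "O1" enables mixed precision with dynamic loss scaling by default.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for images, targets in data_loader:                # placeholder data loader
    loss = model(images, targets)
    optimizer.zero_grad()
    # scale_loss multiplies the loss so fp16 gradients don't underflow;
    # on overflow the scaler skips the step and lowers the scale.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```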

sdimantsd commented 4 years ago

OK, thanks. Why doesn't @dbolya use it? @jasonkena, are you on dbolya's team?

jasonkena commented 4 years ago

Yeah, I just added the pull request about 40 minutes ago, so he might not have read it.

Hahaha, I'm not part of his team, I'm just doing it in my spare time.

sdimantsd commented 4 years ago

Hahaha, I hope I can do it one day... Thanks

sdimantsd commented 4 years ago

@jasonkena one more question. If I started training with YOLACT (not YOLACT++), should I merge from your repo and continue training?

jasonkena commented 4 years ago

Yes, that should work, but just back up your weights in case anything happens.

sdimantsd commented 4 years ago

OK, but will it HELP my training?

jasonkena commented 4 years ago

Yes, your weights shouldn't explode

sdimantsd commented 4 years ago

lol thx

sdimantsd commented 4 years ago

@jasonkena, I trained with your fork, but I always get this error/warning: Gradient overflow...

jasonkena commented 4 years ago

Did you set use_amp to True in config.py? The Gradient Overflow warning is ok, as long as the loss scaler doesn't become 0. The warning means that it is scaling the loss, so it doesn't become infinite.
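Conceptually, the dynamic loss scaler behaves roughly like this (a simplified sketch of the idea, not Apex's actual implementation):

```python
class DynamicLossScaler:
    """Toy version of dynamic loss scaling: shrink on overflow, grow after
    a run of clean steps."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step_allowed(self, grads_overflowed: bool) -> bool:
        if grads_overflowed:
            # This is the "Gradient overflow. Skipping step, ... reducing
            # loss scale" case: halve the scale and skip the optimizer step.
            self.scale /= 2.0
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2.0  # cautiously grow the scale back
        return True
```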

sdimantsd commented 4 years ago

Yes

jasonkena commented 4 years ago

Does it work? Can you send me a screenshot?

sdimantsd commented 4 years ago

```
[ 1] 1440 || B: 5.495 | C: 3.668 | M: 6.919 | S: 1.564 | T: 17.647 || ETA: 197 days, 23:50:39 || timer: 0.298
[ 1] 1450 || B: 5.514 | C: 3.519 | M: 6.986 | S: 1.640 | T: 17.658 || ETA: 197 days, 3:40:56 || timer: 0.291
[ 1] 1460 || B: 5.534 | C: 3.423 | M: 7.028 | S: 1.677 | T: 17.662 || ETA: 197 days, 22:16:54 || timer: 1.842
[ 1] 1470 || B: 5.511 | C: 3.289 | M: 7.060 | S: 1.603 | T: 17.464 || ETA: 197 days, 13:59:05 || timer: 0.328
[ 1] 1480 || B: 5.505 | C: 3.218 | M: 7.148 | S: 1.514 | T: 17.384 || ETA: 197 days, 15:42:25 || timer: 1.852
[ 1] 1490 || B: 5.494 | C: 3.176 | M: 7.190 | S: 1.386 | T: 17.245 || ETA: 197 days, 1:47:05 || timer: 0.303
[ 1] 1500 || B: 5.505 | C: 3.123 | M: 7.184 | S: 1.254 | T: 17.066 || ETA: 197 days, 1:48:12 || timer: 1.197
[ 1] 1510 || B: 5.515 | C: 3.088 | M: 7.223 | S: 1.129 | T: 16.955 || ETA: 196 days, 13:00:18 || timer: 0.310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
[ 1] 1520 || B: 5.535 | C: 3.031 | M: 8.000 | S: 1.054 | T: 17.619 || ETA: 196 days, 23:30:26 || timer: 2.222
[ 1] 1530 || B: 5.557 | C: 2.994 | M: 8.920 | S: 0.971 | T: 18.442 || ETA: 196 days, 14:56:04 || timer: 0.321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9073486328125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.76837158203125e-07
[ 1] 1540 || B: 5.597 | C: 2.949 | M: 21.387 | S: 0.881 | T: 30.813 || ETA: 196 days, 14:42:04 || timer: 0.286
[ 1] 1550 || B: 5.622 | C: 2.922 | M: 37.121 | S: 0.804 | T: 46.469 || ETA: 196 days, 12:09:21 || timer: 0.292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
[ 1] 1560 || B: 5.623 | C: 2.873 | M: 55.813 | S: 0.749 | T: 65.058 || ETA: 196 days, 9:30:19 || timer: 0.550
[ 1] 1570 || B: 5.619 | C: 2.851 | M: 73.614 | S: 0.724 | T: 82.808 || ETA: 196 days, 9:33:39 || timer: 0.296
[ 1] 1580 || B: 5.629 | C: 2.835 | M: 92.922 | S: 0.704 | T: 102.090 || ETA: 196 days, 6:45:42 || timer: 2.162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
[ 1] 1590 || B: 5.653 | C: 2.812 | M: 110.329 | S: 0.695 | T: 119.489 || ETA: 195 days, 21:04:57 || timer: 0.323
[ 1] 1600 || B: 5.685 | C: 2.799 | M: 127.568 | S: 0.679 | T: 136.732 || ETA: 195 days, 23:16:05 || timer: 0.306
[ 1] 1610 || B: 5.674 | C: 2.791 | M: 143.285 | S: 0.651 | T: 152.400 || ETA: 196 days, 6:41:42 || timer: 0.301
[ 1] 1620 || B: 5.635 | C: 2.779 | M: 159.871 | S: 0.625 | T: 168.909 || ETA: 195 days, 18:32:50 || timer: 0.281
[ 1] 1630 || B: 5.642 | C: 2.775 | M: 176.858 | S: 0.627 | T: 185.902 || ETA: 196 days, 1:03:51 || timer: 0.313
[ 1] 1640 || B: 5.665 | C: 2.775 | M: 184.345 | S: 0.645 | T: 193.430 || ETA: 195 days, 12:20:02 || timer: 1.755
[ 1] 1650 || B: 5.683 | C: 2.774 | M: 188.442 | S: 0.637 | T: 197.537 || ETA: 195 days, 20:48:08 || timer: 0.954
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
[ 1] 1660 || B: 5.685 | C: 2.775 | M: 189.106 | S: 0.636 | T: 198.203 || ETA: 195 days, 21:43:24 || timer: 0.859
[ 1] 1670 || B: 5.712 | C: 2.774 | M: 188.970 | S: 0.629 | T: 198.086 || ETA: 196 days, 0:34:48 || timer: 2.194
[ 1] 1680 || B: 5.740 | C: 2.775 | M: 189.250 | S: 0.629 | T: 198.395 || ETA: 195 days, 14:22:12 || timer: 0.339
[ 1] 1690 || B: 5.714 | C: 2.777 | M: 187.945 | S: 0.631 | T: 197.067 || ETA: 195 days, 8:43:53 || timer: 2.327
[ 1] 1700 || B: 5.663 | C: 2.778 | M: 185.333 | S: 0.638 | T: 194.412 || ETA: 194 days, 19:34:24 || timer: 0.311
[ 1] 1710 || B: 5.686 | C: 2.779 | M: 187.430 | S: 0.638 | T: 196.533 || ETA: 194 days, 11:48:10 || timer: 1.437
[ 1] 1720 || B: 5.697 | C: 2.779 | M: 186.670 | S: 0.641 | T: 195.788 || ETA: 194 days, 6:37:38 || timer: 0.316
[ 1] 1730 || B: 5.682 | C: 2.779 | M: 186.148 | S: 0.650 | T: 195.259 || ETA: 194 days, 7:36:17 || timer: 0.339
[ 1] 1740 || B: 5.645 | C: 2.779 | M: 183.526 | S: 0.637 | T: 192.587 || ETA: 194 days, 8:39:09 || timer: 1.945
[ 1] 1750 || B: 5.640 | C: 2.780 | M: 182.646 | S: 0.656 | T: 191.722 || ETA: 194 days, 7:13:58 || timer: 0.335
[ 1] 1760 || B: 5.663 | C: 2.779 | M: 179.543 | S: 0.653 | T: 188.637 || ETA: 194 days, 17:29:50 || timer: 1.747
[ 1] 1770 || B: 5.640 | C: 2.779 | M: 179.012 | S: 0.656 | T: 188.087 || ETA: 194 days, 8:49:16 || timer: 0.316
[ 1] 1780 || B: 5.643 | C: 2.779 | M: 180.269 | S: 0.654 | T: 189.346 || ETA: 194 days, 11:36:07 || timer: 0.297
[ 1] 1790 || B: 5.672 | C: 2.778 | M: 182.306 | S: 0.652 | T: 191.408 || ETA: 194 days, 7:13:17 || timer: 0.299
[ 1] 1800 || B: 5.698 | C: 2.776 | M: 185.035 | S: 0.637 | T: 194.145 || ETA: 194 days, 6:00:46 || timer: 0.285
```

jasonkena commented 4 years ago

Should work fine, unless the loss scaler becomes something ridiculous like 1e-40. If that happens, I'm afraid you'll have to follow the solutions from the other issues: https://github.com/dbolya/yolact/issues/318#issuecomment-583012166

sdimantsd commented 4 years ago

OK, thx

sdimantsd commented 4 years ago

Currently the loss is increasing (it started at ~7 and is now ~180).

jasonkena commented 4 years ago

The total loss, right?

sdimantsd commented 4 years ago

Nope, it's ~200

sdimantsd commented 4 years ago

The 'T' loss, right?

jasonkena commented 4 years ago

Yes

sdimantsd commented 4 years ago

It's ~200

jasonkena commented 4 years ago

Yeah, you shouldn't be surprised. Unfortunately, the loss scaler makes the loss readings inaccurate because it multiplies the loss by a factor, so you shouldn't compare losses across different "Gradient Overflow" warnings. If it still doesn't converge, I'm guessing it's either your batch size or learning rate.

sdimantsd commented 4 years ago

I didn't change the learning rate, and I'm using a batch size of 32 with 2 GPUs. Is that OK?

jasonkena commented 4 years ago

I think you should override the learning rate. https://github.com/dbolya/yolact/issues/318#issuecomment-582632559 I'm pretty sure that YOLACT multiplies the base learning rate by your batch size.
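If it helps, an override could look something like this (illustrative values only; it assumes the base config exposes an `lr` key the way `yolact_base_config` does):

```python
# Illustrative only: pin the learning rate in the custom config instead of
# relying on any batch-size autoscaling (assumes the base config has 'lr').
yolact_im200_coco_cars_config = yolact_base_config.copy({
    # ... dataset/backbone settings as before ...
    'lr': 1e-3,   # example value; lower it if the loss still diverges
})
```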

sdimantsd commented 4 years ago

OK, I will try with a batch size of 8.

sdimantsd commented 4 years ago

@jasonkena @dbolya This is not working... I'm training with a batch size of 8 on one GPU and the loss keeps growing. If I use the same configuration with YOLACT (not ++), it works fine.

jasonkena commented 4 years ago

Sorry, I can't help.

sdimantsd commented 4 years ago

Thx.

sdimantsd commented 4 years ago

@dbolya anything new?

Auth0rM0rgan commented 4 years ago

Hey @jasonkena, @sdimantsd

I wanted to know the performance after training with Apex's AMP. Did you gain better performance or did it speed up your training process?

Also, I'm curious whether it will impact the inference time if I train the model with 16-bit precision. (I mean, if I train with 16-bit precision, will I achieve higher FPS? I currently get ~25 FPS on 1080p video with 32-bit precision.)

Thanks

jasonkena commented 4 years ago

@Auth0rM0rgan to be honest, I haven't done any performance/accuracy benchmarks, so I can't say anything for sure. But theoretically, it should improve training time since 16-bit computation is faster. As for memory consumption, using 16-bit precision saves 1 GB of VRAM with a batch size of 4.

The benchmark should be pretty straightforward since the AMP branch is compatible with master: just add use_amp in config.py, then run the tests in eval.py just as you did in 32-bit.

Auth0rM0rgan commented 4 years ago

Hey @jasonkena,

I'm going to train the model with 16-bit precision and will let you know the performance. I hope to see an improvement in inference time as well.

Auth0rM0rgan commented 4 years ago

Hey @jasonkena, I have tried to use your code but I'm getting this error:

```
Traceback (most recent call last):
  File "train.py", line 696, in <module>
    train()
  File "train.py", line 281, in train
    yolact_net = Yolact()
  File "/home/yolact-amp/yolact.py", line 530, in __init__
    self.backbone = construct_backbone(cfg.backbone, cfg.use_amp)
  File "/home/yolact-amp/backbone.py", line 548, in construct_backbone
    set_amp(use_amp)
NameError: name 'set_amp' is not defined
```

I fixed the error by importing set_amp like this: `from external.DCNv2.dcn_v2 import set_amp`

After fixing the error, the model starts to train, but sometimes during training I'm getting Gradient overflow. Is that normal when using AMP?

```
[ 0] 0 || B: 5.955 | C: 24.126 | M: 5.992 | S: 65.320 | T: 101.393 || ETA: 11 days, 18:19:45 || timer: 2.382
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
[ 0] 10 || B: 6.035 | C: 21.918 | M: 5.816 | S: 57.595 | T: 91.365 || ETA: 4 days, 11:15:02 || timer: 0.747
[ 0] 20 || B: 5.654 | C: 19.253 | M: 5.818 | S: 41.033 | T: 71.758 || ETA: 4 days, 2:41:50 || timer: 0.728
[ 0] 30 || B: 5.576 | C: 17.321 | M: 5.953 | S: 28.553 | T: 57.403 || ETA: 3 days, 23:35:57 || timer: 0.755
[ 0] 40 || B: 5.473 | C: 15.529 | M: 5.935 | S: 22.000 | T: 48.938 || ETA: 3 days, 22:01:17 || timer: 0.748
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
[ 0] 50 || B: 5.403 | C: 14.210 | M: 5.927 | S: 18.069 | T: 43.609 || ETA: 3 days, 21:10:25 || timer: 0.745
[ 0] 60 || B: 5.399 | C: 13.137 | M: 5.981 | S: 15.367 | T: 39.884 || ETA: 3 days, 20:37:17 || timer: 0.776
```

Thanks

jasonkena commented 4 years ago

Nice catch!

The Gradient Overflow warning is ok, as long as the loss scaler doesn't become 0. The warning means that it is scaling the loss, so it doesn't become infinite.

Yup, it's perfectly normal; it's Apex AMP's dynamic loss scaling doing its magic.

Auth0rM0rgan commented 4 years ago

@jasonkena Have you tried your code with Yolact++? It seems the code works fine with Yolact but not with Yolact++. I'm getting this error when using the yolact++ config file. No idea how to fix it :|

```
Traceback (most recent call last):
  File "train.py", line 696, in <module>
    train()
  File "train.py", line 347, in train
    yolact_net(torch.zeros(1, 3, cfg.max_size, cfg.max_size).cuda())
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yolact-amp/yolact.py", line 725, in forward
    outs = self.backbone(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/yolact-amp/backbone.py", line 221, in forward
    x = layer(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home//anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yolact-amp/backbone.py", line 78, in forward
    out = self.conv2(out)
  File "/home/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yolact/external/DCNv2/dcn_v2.py", line 128, in forward
    self.deformable_groups)
  File "/home/yolact/external/DCNv2/dcn_v2.py", line 31, in forward
    ctx.deformable_groups)
RuntimeError: expected scalar type Float but found Half (data_ptr at /home/anaconda3/lib/python3.7/site-packages/torch/include/ATen/core/TensorMethods.h:6321)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7fdd22d32627 in /home/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: float* at::Tensor::data_ptr<float>() const + 0xf6 (0x7fdd07657b8a in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #2: float* at::Tensor::data<float>() const + 0x18 (0x7fdd0765b26a in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #3: dcn_v2_cuda_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, int, int, int, int, int, int, int, int, int) + 0xd48 (0x7fdd076518fd in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #4: dcn_v2_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, int, int, int, int, int, int, int, int, int) + 0x91 (0x7fdd0763f721 in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x36cdb (0x7fdd0764ccdb in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x3351c (0x7fdd0764951c in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #11: THPFunction_apply(_object*, _object*) + 0xa0f (0x7fdd559a7a3f in /home/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
```

Thanks

jasonkena commented 4 years ago

Hmm, it seems like you haven't recompiled the DCNv2 module since you switched to my branch.

Auth0rM0rgan commented 4 years ago

I recompiled the DCNv2 module when I switched to your branch, and when I do it again, it says DCNv2 is already installed.

```
arous@DeepLearning:~/yolact-amp/external/DCNv2$ python setup.py build develop
running build
running build_ext
running develop
running egg_info
writing DCNv2.egg-info/PKG-INFO
writing dependency_links to DCNv2.egg-info/dependency_links.txt
writing top-level names to DCNv2.egg-info/top_level.txt
reading manifest file 'DCNv2.egg-info/SOURCES.txt'
writing manifest file 'DCNv2.egg-info/SOURCES.txt'
running build_ext
copying build/lib.linux-x86_64-3.7/_ext.cpython-37m-x86_64-linux-gnu.so ->
Creating /home/arous/anaconda3/lib/python3.7/site-packages/DCNv2.egg-link (link to .)
DCNv2 0.1 is already the active version in easy-install.pth
Installed /home/arous/yolact-amp/external/DCNv2
Processing dependencies for DCNv2==0.1
Finished processing dependencies for DCNv2==0.1
```

jasonkena commented 4 years ago

You have to delete all the build files before you compile: _ext.cpython*, DCNv2.egg-info/, and build/.

Auth0rM0rgan commented 4 years ago

I did it but still getting the same error :|

jasonkena commented 4 years ago

Sorry, I just realized something in the error you mentioned here. Can you try removing the line you added (`from external.DCNv2.dcn_v2 import set_amp`), and at the beginning, replace

try:
    from dcn_v2 import DCN, set_amp
except ImportError:

    def DCN(*args, **kwdargs):
        raise Exception(
            "DCN could not be imported. If you want to use YOLACT++ models, compile DCN. Check the README for instructions."
        )

with `from dcn_v2 import DCN, set_amp`, so that if the import fails, it raises an error?

Auth0rM0rgan commented 4 years ago

If I replace the line that I added (`from external.DCNv2.dcn_v2 import set_amp`) with

try:
    from dcn_v2 import DCN, set_amp
except ImportError:

    def DCN(*args, **kwdargs):
        raise Exception(
            "DCN could not be imported. If you want to use YOLACT++ models, compile DCN. Check the README for instructions."
        )

I'm getting the error `name 'set_amp' is not defined`. However, if I import DCNv2 like this: `from external.DCNv2.dcn_v2 import DCN, set_amp`, the code works with Yolact++ as well :)

Thanks

jasonkena commented 4 years ago

The reason it works with YOLACT, even though the import fails, is that YOLACT doesn't use DCNv2 at all. The NameError shows up because the ImportError is swallowed by the try-except block, so can you remove the try-except block and replace `from external.DCNv2.dcn_v2 import DCN, set_amp` with just `from dcn_v2 import DCN, set_amp`, so that the ImportError shows up?
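In other words, the failure mode is just this (a tiny self-contained illustration, not code from the repo):

```python
try:
    from dcn_v2 import DCN, set_amp   # the real ImportError is raised here...
except ImportError:
    # ...but it is swallowed, and only DCN gets a fallback definition.
    def DCN(*args, **kwdargs):
        raise Exception("DCN could not be imported.")

set_amp(True)   # NameError: name 'set_amp' is not defined
```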

I cannot reproduce your error running fresh code on the branch. Can you push all your code to GitHub, so I can diff the changes?

Auth0rM0rgan commented 4 years ago

Yes, `ImportError: cannot import name 'set_amp' from 'dcn_v2'` shows up when I remove the try-except block and the `from external.DCNv2.dcn_v2 import DCN, set_amp` line. I have to import it like this: `from external.DCNv2.dcn_v2 import DCN, set_amp` to be able to run the code.

jasonkena commented 4 years ago

Sorry, I don't know where the problem is.

Auth0rM0rgan commented 4 years ago

Hey @jasonkena,

> as long as the loss scaler doesn't become 0.

Sometimes the I (Mask IoU) loss becomes 0 when 'use_amp' is True. Is that OK?

```
[  0]    1840 || B: 4.008 | C: 6.122 | M: 5.581 | S: 1.371 | I: 0.000 | T: 17.082 || ETA: 6 days, 16:00:06 || timer: 0.439
[  0]    1850 || B: 4.008 | C: 6.092 | M: 5.572 | S: 1.333 | I: 0.000 | T: 17.005 || ETA: 6 days, 16:01:49
```