NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to #635

Open zsun1029 opened 4 years ago

zsun1029 commented 4 years ago

Hi, I ran

$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

but it failed with:

```
Cleaning up...
Removing source in /tmp/pip-req-build-v0deounv
Removed build tracker '/tmp/pip-req-tracker-3n3fyj4o'
ERROR: Command errored out with exit status 1: /users4/zsun/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-rce1cb4d/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
    status = self.run(options, args)
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 455, in run
    use_user_site=options.use_user_site,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 62, in install_given_reqs
    **kwargs
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 888, in install
    cwd=self.unpacked_source_directory,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 275, in runner
    spinner=spinner,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
    raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /users4/zsun/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-rce1cb4d/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
1 location(s) to search for versions of pip:
```

then I use

$ pip install -v --no-cache-dir ./

and the install succeeded. But when I run my program, I get this:

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00048828125

```
#Params: 73.7M
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext.  Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00390625
```
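For context: this cascade is amp's dynamic loss scaling at work. Every time the scaled gradients contain Inf/NaN the optimizer step is skipped and the scale is halved; if the model keeps producing non-finite gradients, the scale keeps halving toward 0, which is the pattern in the log above. A minimal pure-Python sketch of that policy (class name, constants, and growth rule are illustrative assumptions, not apex's actual API):

```python
class DynamicLossScaler:
    """Illustrative sketch of dynamic loss scaling: halve the scale on
    overflow, grow it again after a run of clean steps."""

    def __init__(self, init_scale=65536.0, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow):
        """Return True if the optimizer step should run this iteration."""
        if found_overflow:
            # Skip this step and back off, matching the
            # "Skipping step, loss scaler 0 reducing loss scale to ..." message.
            self.scale /= 2.0
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # After enough clean steps, try a larger scale again.
            self.scale *= 2.0
            self._good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=65536.0)
for _ in range(17):               # 17 consecutive overflows, as in the log
    scaler.update(found_overflow=True)
print(scaler.scale)               # 65536 / 2**17 = 0.5
```

The takeaway is that a few isolated overflow messages are normal (the scaler is probing its upper bound), but a monotonic slide toward 0, as above, means the gradients themselves are non-finite on every step.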

Can you help me with this, please?

ptrblck commented 4 years ago

@zsun1029 Could you check your output and see if it contains invalid values (NaN or Inf)?
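(In PyTorch that check is essentially `torch.isfinite(output).all()`; a framework-free sketch of the same idea, with a hypothetical helper name:)

```python
import math

def find_nonfinite(values):
    """Return indices of NaN/Inf entries -- the kind of values that make
    the dynamic loss scaler report 'Gradient overflow'."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]

# Example: two bad entries hiding in otherwise normal outputs.
logits = [0.3, float("inf"), -1.2, float("nan")]
print(find_nonfinite(logits))   # [1, 3]
```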

panmyuan commented 4 years ago

I have the same problem. Could you help me? (Screenshots attached.) I don't know why the loss scale keeps dropping toward 0.

kumasento commented 4 years ago

@panmyuan Hey, could you please share the details of your training script? It looks like an interesting case. Thanks!

panmyuan commented 4 years ago

@kumasento Hey, I set the opt_level to O0. It can run without errors. (Screenshot attached.)

kumasento commented 4 years ago

Hi there, I am thinking of the full script, including the model :)


CodeMonkZy commented 4 years ago

I have the same problem, but training still works. What is the cause of the "Gradient overflow." message? Thanks!

```
2020-04-24 19:24:35,018-INFO-Training Epoch:[1][7100/10630] Loss:140.3778(1.9497) Top1:4913.281(68.240%) Top5:6062.500(84.201%), lr=0.01
2020-04-24 19:25:46,816-INFO-Training Epoch:[1][7200/10630] Loss:141.6934(1.9410) Top1:4993.750(68.408%) Top5:6154.688(84.311%), lr=0.01
2020-04-24 19:26:59,096-INFO-Training Epoch:[1][7300/10630] Loss:143.0693(1.9334) Top1:5069.531(68.507%) Top5:6246.094(84.407%), lr=0.01
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
2020-04-24 19:28:11,251-INFO-Training Epoch:[1][7400/10630] Loss:144.4009(1.9253) Top1:5147.656(68.635%) Top5:6336.719(84.490%), lr=0.01
2020-04-24 19:29:23,725-INFO-Training Epoch:[1][7500/10630] Loss:146.0222(1.9213) Top1:5221.094(68.699%) Top5:6425.000(84.539%), lr=0.01
2020-04-24 19:30:36,325-INFO-Training Epoch:[1][7600/10630] Loss:147.7780(1.9192) Top1:5292.969(68.740%) Top5:6513.281(84.588%), lr=0.01
2020-04-24 19:31:48,458-INFO-Training Epoch:[1][7700/10630] Loss:148.9388(1.9095) Top1:5376.562(68.930%) Top5:6607.031(84.706%), lr=0.01
2020-04-24 19:33:00,342-INFO-Training Epoch:[1][7800/10630] Loss:150.5014(1.9051) Top1:5447.656(68.958%) Top5:6699.219(84.800%), lr=0.01
2020-04-24 19:34:12,780-INFO-Training Epoch:[1][7900/10630] Loss:152.1041(1.9013) Top1:5524.219(69.053%) Top5:6785.156(84.814%), lr=0.01
2020-04-24 19:35:24,719-INFO-Training Epoch:[1][8000/10630] Loss:153.5825(1.8961) Top1:5599.219(69.126%) Top5:6874.219(84.867%), lr=0.01
2020-04-24 19:36:36,565-INFO-Training Epoch:[1][8100/10630] Loss:154.6542(1.8860) Top1:5685.938(69.341%) Top5:6969.531(84.994%), lr=0.01
2020-04-24 19:37:49,421-INFO-Training Epoch:[1][8200/10630] Loss:156.1511(1.8813) Top1:5764.062(69.447%) Top5:7059.375(85.053%), lr=0.01
2020-04-24 19:39:01,807-INFO-Training Epoch:[1][8300/10630] Loss:157.8599(1.8793) Top1:5841.406(69.541%) Top5:7148.438(85.100%), lr=0.01
2020-04-24 19:40:14,186-INFO-Training Epoch:[1][8400/10630] Loss:159.1624(1.8725) Top1:5921.875(69.669%) Top5:7241.406(85.193%), lr=0.01
2020-04-24 19:41:26,444-INFO-Training Epoch:[1][8500/10630] Loss:161.0183(1.8723) Top1:5995.312(69.713%) Top5:7324.219(85.165%), lr=0.01
2020-04-24 19:42:38,645-INFO-Training Epoch:[1][8600/10630] Loss:162.6841(1.8699) Top1:6067.969(69.747%) Top5:7410.156(85.174%), lr=0.01
2020-04-24 19:43:50,608-INFO-Training Epoch:[1][8700/10630] Loss:164.1074(1.8649) Top1:6150.781(69.895%) Top5:7499.219(85.218%), lr=0.01
2020-04-24 19:45:03,007-INFO-Training Epoch:[1][8800/10630] Loss:165.5138(1.8597) Top1:6229.688(69.996%) Top5:7590.625(85.288%), lr=0.01
2020-04-24 19:46:15,317-INFO-Training Epoch:[1][8900/10630] Loss:167.1985(1.8578) Top1:6304.688(70.052%) Top5:7677.344(85.304%), lr=0.01
2020-04-24 19:47:28,869-INFO-Training Epoch:[1][9000/10630] Loss:168.7838(1.8548) Top1:6384.375(70.158%) Top5:7765.625(85.337%), lr=0.01
2020-04-24 19:48:40,865-INFO-Training Epoch:[1][9100/10630] Loss:170.1391(1.8493) Top1:6469.531(70.321%) Top5:7857.031(85.403%), lr=0.01
2020-04-24 19:49:53,109-INFO-Training Epoch:[1][9200/10630] Loss:171.8144(1.8475) Top1:6546.094(70.388%) Top5:7945.312(85.433%), lr=0.01
2020-04-24 19:51:05,193-INFO-Training Epoch:[1][9300/10630] Loss:173.3300(1.8439) Top1:6623.438(70.462%) Top5:8032.031(85.447%), lr=0.01
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
2020-04-24 19:52:17,262-INFO-Training Epoch:[1][9400/10630] Loss:174.8439(1.8405) Top1:6701.562(70.543%) Top5:8121.875(85.493%), lr=0.01
2020-04-24 19:53:29,469-INFO-Training Epoch:[1][9500/10630] Loss:176.2916(1.8364) Top1:6776.562(70.589%) Top5:8213.281(85.555%), lr=0.01
2020-04-24 19:54:41,296-INFO-Training Epoch:[1][9600/10630] Loss:177.7239(1.8322) Top1:6857.031(70.691%) Top5:8302.344(85.591%), lr=0.01
2020-04-24 19:55:53,635-INFO-Training Epoch:[1][9700/10630] Loss:178.8308(1.8248) Top1:6941.406(70.831%) Top5:8397.656(85.690%), lr=0.01
```

sky186 commented 4 years ago


@panmyuan Hi, I hit the same "Gradient overflow" problem when training with apex. My model is resnet50 + IBN + SE, the loss is CircleLoss + TripletLoss, and the optimizer is Adam. How did you solve it? Do different model structures, custom layers, optimizers, or loss functions all affect how apex behaves?

buzhangjiuzhou commented 4 years ago

> @kumasento Hey, I set the opt_level to O0. It can run without errors.

Yes, this works for me too, but I still want to know whether there are any other solutions.

bugbugKiller commented 3 years ago

> @kumasento Hey, I set the opt_level to O0. It can run without errors.

Recognized opt_levels are "O0", "O1", "O2", and "O3".

O0 and O3 are not true mixed precision, but they are useful for establishing accuracy and speed baselines, respectively.

O1 and O2 are different implementations of mixed precision. Try both, and see what gives the best speedup and accuracy for your model.
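To make the comparison concrete, here is a sketch of the per-level defaults that `amp.initialize` prints (the O1 row matches the log earlier in this thread; the other rows are paraphrased from the apex docs and may differ across versions):

```python
# Per-opt_level amp defaults, transcribed for side-by-side comparison.
# "dynamic" means dynamic loss scaling; type names are given as strings.
AMP_DEFAULTS = {
    "O0": {"cast_model_type": "torch.float32", "patch_torch_functions": False,
           "keep_batchnorm_fp32": None, "master_weights": False, "loss_scale": 1.0},
    "O1": {"cast_model_type": None, "patch_torch_functions": True,
           "keep_batchnorm_fp32": None, "master_weights": None, "loss_scale": "dynamic"},
    "O2": {"cast_model_type": "torch.float16", "patch_torch_functions": False,
           "keep_batchnorm_fp32": True, "master_weights": True, "loss_scale": "dynamic"},
    "O3": {"cast_model_type": "torch.float16", "patch_torch_functions": False,
           "keep_batchnorm_fp32": False, "master_weights": False, "loss_scale": 1.0},
}

# Only the levels with dynamic loss scaling can emit the
# "Gradient overflow" message -- which is why O0 silences it.
dynamic_levels = [k for k, v in AMP_DEFAULTS.items() if v["loss_scale"] == "dynamic"]
print(dynamic_levels)   # ['O1', 'O2']
```

So switching to O0 does not fix an overflow; it just runs everything in FP32, where no loss scaling is needed.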

HangJie720 commented 2 years ago

https://github.com/NVIDIA/apex/issues/695 also found that `test_loss_scale_decrease` fails with some random seeds when opt_level = O1. I wonder if it is related?

751K commented 1 year ago

O0 doesn't seem to have any acceleration effect, does it?