Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to

zsun1029 commented 4 years ago

Hi, I used

$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

but I got

`Cleaning up...
Removing source in /tmp/pip-req-build-v0deounv
Removed build tracker '/tmp/pip-req-tracker-3n3fyj4o'
ERROR: Command errored out with exit status 1: /users4/zsun/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-rce1cb4d/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
    status = self.run(options, args)
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 455, in run
    use_user_site=options.use_user_site,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 62, in install_given_reqs
    **kwargs
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 888, in install
    cwd=self.unpacked_source_directory,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 275, in runner
    spinner=spinner,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
    raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /users4/zsun/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-rce1cb4d/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
1 location(s) to search for versions of pip:
http://mirrors.aliyun.com/pypi/simple/pip/
Getting page http://mirrors.aliyun.com/pypi/simple/pip/
Found index url http://mirrors.aliyun.com/pypi/simple/
Starting new HTTP connection (1): mirrors.aliyun.com:80
http://mirrors.aliyun.com:80 "GET /pypi/simple/pip/ HTTP/1.1" 200 12139
Analyzing links from page http://mirrors.aliyun.com/pypi/simple/pip/
Found link http://mirrors.aliyun.com/pypi/packages/18/ad/c0fe6cdfe1643a19ef027c7168572dac6283b80a384ddf21b75b921877da/pip-0.2.1.tar.gz#sha256=83522005c1266cc2de97e65072ff7554ac0f30ad369c3b02ff3a764b962048da (from http://mirrors.aliyun.com/pypi/simple/pip/), version: 0.2.1
Found link http://mirrors.aliyun.com/pypi/packages/3d/9d/1e313763bdfb6a48977b65829c6ce2a43eaae29ea2f907c8bbef024a7219/pip-0.2.tar.gz#sha256=88bb8d029e1bf4acd0e04d300104b7440086f94cc1ce1c5c3c31e3293aee1f81 (from http://mirrors.aliyun.com/pypi/simple/pip/), version: 0.2
Found link http://mirrors.aliyun.com/pypi/packages/0a/bb/d087c9a1415f8726e683791c0b2943c53f2b76e69f527f2e2b2e9f9e7b5c/pip-0.3.1.tar.gz#sha256=34ce534f17065c78f980702928e988a6b6b2d8a9851aae5f1571a1feb9bb58d8 (from http://mirrors.aliyun.com/pypi/simple/pip/), version: 0.3.1
Found link http://mirrors.aliyun.com/pypi/packages/17/05/f66144ef69b436d07f8eeeb28b7f77137f80de4bf60349ec6f0f9509e801/pip-0.3.tar.gz#sha256=183c72455cb7f8860ac1376f8c4f14d7f545aeab8ee7c22cd4caf79f35a2ed47 (from http://mirrors.aliyun.com/pypi/simple/pip/), version: 0.3
Found link http://mirrors.aliyun.com/pypi/packages/cf/c3/153571aaac6cf999f4bb09c019b1ff379b7b599ea833813a41c784eec995/pip-0.4.tar.gz#sha256=28fc67558874f71fddda7168f73595f1650523dce3bc5bf189713ecdfc1e456e (from http://mirrors.aliyun.com/pypi/simple/pip/), version: 0.4
Found link ....... ....... ....... .......
Found link http://mirrors.aliyun.com/pypi/packages/ac/95/a05b56bb975efa78d3557efa36acaf9cf5d2fd0ee0062060493687432e03/pip-9.0.3-py2.py3-none-any.whl#sha256=c3ede34530e0e0b2381e7363aded78e0c33291654937e7373032fda04e8803e5 (from http://mirrors.aliyun.com/pypi/simple/pip/), version: 9.0.3
Found link http://mirrors.aliyun.com/pypi/packages/c4/44/e6b8056b6c8f2bfd1445cc9990f478930d8e3459e9dbf5b8e2d2922d64d3/pip-9.0.3.tar.gz#sha256=7bf48f9a693be1d58f49f7af7e0ae9fe29fd671cde8a55e6edca3581c4ef5796 (from http://mirrors.aliyun.com/pypi/simple/pip/), version: 9.0.3
Given no hashes to check 131 links for project 'pip': discarding no candidates`

Then I used

$ pip install -v --no-cache-dir ./

and the install succeeded. But when I run my program, I get this:

Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00048828125

`#Params: 73.7M [57/350]
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625`

Could you help me with this, please?
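For readers following the log, here is a minimal sketch of how a dynamic loss scaler produces this halving pattern. It is a toy model, not Apex's code: the class name `DynamicLossScaler`, the helper `simulate`, and the default values (initial scale 2^16, factor 2, growth window of 2000 clean steps) are assumptions chosen to match the numbers in the log.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler (illustrative only, not Apex's implementation)."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0, scale_window=2000):
        # All three defaults are assumptions made for this sketch.
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self._clean_steps = 0

    def update(self, overflow):
        """Adjust the scale; return True if the optimizer step should be skipped."""
        if overflow:
            # Inf/NaN found in the gradients: shrink the scale and skip the step.
            self.scale /= self.scale_factor
            self._clean_steps = 0
            return True
        self._clean_steps += 1
        if self._clean_steps == self.scale_window:
            # A full window without overflow: try a larger scale again.
            self.scale *= self.scale_factor
            self._clean_steps = 0
        return False


def simulate(overflow_flags, **kwargs):
    """Return the loss scale after each step for a sequence of overflow flags."""
    scaler = DynamicLossScaler(**kwargs)
    history = []
    for overflow in overflow_flags:
        scaler.update(overflow)
        history.append(scaler.scale)
    return history


if __name__ == "__main__":
    # Every step overflows, as in the log: the scale halves each time.
    for scale in simulate([True] * 5):
        print(f"Gradient overflow. Skipping step, reducing loss scale to {scale}")
```

Under these assumptions, 27 consecutive overflows take the scale from 65536 down to 0.00048828125, the value quoted earlier in this comment. A scale that keeps shrinking below 1 suggests the gradients are non-finite at any scale, rather than merely too large for fp16.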

ptrblck commented 4 years ago

@zsun1029 Could you check your model output and see whether it contains invalid values (NaN or Inf)?
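One way to run such a check, sketched here on plain Python floats so it stays framework-independent. The helper name `find_invalid` and the sample values are hypothetical.

```python
import math


def find_invalid(values):
    """Return (index, value) pairs for every NaN or +/-Inf entry."""
    return [(i, v) for i, v in enumerate(values) if not math.isfinite(v)]


if __name__ == "__main__":
    # Hypothetical flattened model outputs, for illustration only.
    outputs = [0.12, float("nan"), 3.4, float("inf")]
    bad = find_invalid(outputs)
    print(bad if bad else "all values finite")
```

On PyTorch tensors the analogous check is `torch.isfinite(output).all()`. Running it on the forward output and on the loss before the backward pass can help tell whether the invalid values originate in the forward pass or only in the gradients.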

panmyuan commented 4 years ago

I have the same problem. Could you help me? I don't know why the loss scale goes to 0.

kumasento commented 4 years ago

@panmyuan Hey, could you please share the details of your training script? It looks like an interesting case. Thanks!

panmyuan commented 4 years ago

@kumasento Hey, I set option level to O0. It can run without errors.

kumasento commented 4 years ago

Hi there, I am thinking of the full script, including the model :)

On Fri, 27 Dec 2019 at 02:01, panmyuan notifications@github.com wrote:

@kumasento https://github.com/kumasento Hey, I set option level to O0. It can run without errors. [image: image] https://user-images.githubusercontent.com/48192787/71496733-c80dbf80-288f-11ea-9a10-3ca0f00ca5ad.png


CodeMonkZy commented 4 years ago

I have the same problem, but training still works. What is the reason for "Gradient overflow"? Thanks!

2020-04-24 19:24:35,018-INFO-Training Epoch:[1][7100/10630] Loss:140.3778(1.9497) Top1:4913.281(68.240%) Top5:6062.500(84.201%), lr=0.01
2020-04-24 19:25:46,816-INFO-Training Epoch:[1][7200/10630] Loss:141.6934(1.9410) Top1:4993.750(68.408%) Top5:6154.688(84.311%), lr=0.01
2020-04-24 19:26:59,096-INFO-Training Epoch:[1][7300/10630] Loss:143.0693(1.9334) Top1:5069.531(68.507%) Top5:6246.094(84.407%), lr=0.01
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
2020-04-24 19:28:11,251-INFO-Training Epoch:[1][7400/10630] Loss:144.4009(1.9253) Top1:5147.656(68.635%) Top5:6336.719(84.490%), lr=0.01
2020-04-24 19:29:23,725-INFO-Training Epoch:[1][7500/10630] Loss:146.0222(1.9213) Top1:5221.094(68.699%) Top5:6425.000(84.539%), lr=0.01
2020-04-24 19:30:36,325-INFO-Training Epoch:[1][7600/10630] Loss:147.7780(1.9192) Top1:5292.969(68.740%) Top5:6513.281(84.588%), lr=0.01
2020-04-24 19:31:48,458-INFO-Training Epoch:[1][7700/10630] Loss:148.9388(1.9095) Top1:5376.562(68.930%) Top5:6607.031(84.706%), lr=0.01
2020-04-24 19:33:00,342-INFO-Training Epoch:[1][7800/10630] Loss:150.5014(1.9051) Top1:5447.656(68.958%) Top5:6699.219(84.800%), lr=0.01
2020-04-24 19:34:12,780-INFO-Training Epoch:[1][7900/10630] Loss:152.1041(1.9013) Top1:5524.219(69.053%) Top5:6785.156(84.814%), lr=0.01
2020-04-24 19:35:24,719-INFO-Training Epoch:[1][8000/10630] Loss:153.5825(1.8961) Top1:5599.219(69.126%) Top5:6874.219(84.867%), lr=0.01
2020-04-24 19:36:36,565-INFO-Training Epoch:[1][8100/10630] Loss:154.6542(1.8860) Top1:5685.938(69.341%) Top5:6969.531(84.994%), lr=0.01
2020-04-24 19:37:49,421-INFO-Training Epoch:[1][8200/10630] Loss:156.1511(1.8813) Top1:5764.062(69.447%) Top5:7059.375(85.053%), lr=0.01
2020-04-24 19:39:01,807-INFO-Training Epoch:[1][8300/10630] Loss:157.8599(1.8793) Top1:5841.406(69.541%) Top5:7148.438(85.100%), lr=0.01
2020-04-24 19:40:14,186-INFO-Training Epoch:[1][8400/10630] Loss:159.1624(1.8725) Top1:5921.875(69.669%) Top5:7241.406(85.193%), lr=0.01
2020-04-24 19:41:26,444-INFO-Training Epoch:[1][8500/10630] Loss:161.0183(1.8723) Top1:5995.312(69.713%) Top5:7324.219(85.165%), lr=0.01
2020-04-24 19:42:38,645-INFO-Training Epoch:[1][8600/10630] Loss:162.6841(1.8699) Top1:6067.969(69.747%) Top5:7410.156(85.174%), lr=0.01
2020-04-24 19:43:50,608-INFO-Training Epoch:[1][8700/10630] Loss:164.1074(1.8649) Top1:6150.781(69.895%) Top5:7499.219(85.218%), lr=0.01
2020-04-24 19:45:03,007-INFO-Training Epoch:[1][8800/10630] Loss:165.5138(1.8597) Top1:6229.688(69.996%) Top5:7590.625(85.288%), lr=0.01
2020-04-24 19:46:15,317-INFO-Training Epoch:[1][8900/10630] Loss:167.1985(1.8578) Top1:6304.688(70.052%) Top5:7677.344(85.304%), lr=0.01
2020-04-24 19:47:28,869-INFO-Training Epoch:[1][9000/10630] Loss:168.7838(1.8548) Top1:6384.375(70.158%) Top5:7765.625(85.337%), lr=0.01
2020-04-24 19:48:40,865-INFO-Training Epoch:[1][9100/10630] Loss:170.1391(1.8493) Top1:6469.531(70.321%) Top5:7857.031(85.403%), lr=0.01
2020-04-24 19:49:53,109-INFO-Training Epoch:[1][9200/10630] Loss:171.8144(1.8475) Top1:6546.094(70.388%) Top5:7945.312(85.433%), lr=0.01
2020-04-24 19:51:05,193-INFO-Training Epoch:[1][9300/10630] Loss:173.3300(1.8439) Top1:6623.438(70.462%) Top5:8032.031(85.447%), lr=0.01
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
2020-04-24 19:52:17,262-INFO-Training Epoch:[1][9400/10630] Loss:174.8439(1.8405) Top1:6701.562(70.543%) Top5:8121.875(85.493%), lr=0.01
2020-04-24 19:53:29,469-INFO-Training Epoch:[1][9500/10630] Loss:176.2916(1.8364) Top1:6776.562(70.589%) Top5:8213.281(85.555%), lr=0.01
2020-04-24 19:54:41,296-INFO-Training Epoch:[1][9600/10630] Loss:177.7239(1.8322) Top1:6857.031(70.691%) Top5:8302.344(85.591%), lr=0.01
2020-04-24 19:55:53,635-INFO-Training Epoch:[1][9700/10630] Loss:178.8308(1.8248) Top1:6941.406(70.831%) Top5:8397.656(85.690%), lr=0.01
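A likely explanation for an occasional overflow like this, sketched with the standard library only: half precision (fp16) cannot hold finite values above 65504, so a gradient multiplied by a large loss scale can exceed that range. The helper names and the gradient magnitude 0.3 are hypothetical, chosen only so the numbers line up with the scale of 131072 in the log.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(value):
    """Return True if `value` can be stored as a finite half-precision float."""
    if not math.isfinite(value):
        return False
    try:
        # The "e" format packs an IEEE 754 binary16 float and raises
        # OverflowError when the value is too large to represent.
        struct.pack("<e", value)
    except OverflowError:
        return False
    return True


def scaled_grad_overflows(grad, loss_scale):
    """Return True if `grad * loss_scale` no longer fits in fp16."""
    return not fits_in_fp16(grad * loss_scale)


if __name__ == "__main__":
    grad = 0.3  # hypothetical gradient magnitude
    for scale in (262144.0, 131072.0):
        status = "overflow" if scaled_grad_overflows(grad, scale) else "ok"
        print(f"scale={scale}: scaled grad={grad * scale} -> {status}")
```

In this reading, the scaler periodically tries a larger scale, one step overflows and is skipped, and the scale drops back to 131072. The two bursts in the log are about 2000 iterations apart, which is consistent with that cycle, and skipping one step per cycle would explain why training keeps working.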

sky186 commented 4 years ago


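@panmyuan Hi, I ran into the same Gradient overflow problem as you when training with apex. Model architecture: resnet50 + ibn + se; loss: circleloss + tripletloss; optimizer: Adam. How did you solve it? Do different model architectures, architectures with custom layers, optimizers, or loss functions all affect the use of apex?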

buzhangjiuzhou commented 4 years ago

@kumasento Hey, I set option level to O0. It can run without errors.

Yes, this works for me, but I would still like to know whether there are any other solutions.

bugbugKiller commented 3 years ago

@kumasento Hey, I set option level to O0. It can run without errors.

Recognized opt_levels are "O0", "O1", "O2", and "O3".

O0 and O3 are not true mixed precision, but they are useful for establishing accuracy and speed baselines, respectively.

O1 and O2 are different implementations of mixed precision. Try both, and see what gives the best speedup and accuracy for your model.

HangJie720 commented 2 years ago

https://github.com/NVIDIA/apex/issues/695 That issue also found that test_loss_scale_decrease fails with some random seeds when opt_level = O1. I wonder if it is related?

751K commented 1 year ago

O0 doesn't seem to have any acceleration effect (?)

NVIDIA / apex

Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to #635