Open zsun1029 opened 4 years ago
@zsun1029 Could you check your output and see, if it contains invalid values (NaN or Inf)?
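For anyone else debugging this: a minimal, framework-agnostic sketch of such a check (the helper name is hypothetical; in PyTorch you could feed it `output.detach().flatten().tolist()`):

```python
import math

def count_nonfinite(values):
    """Count NaN and Inf entries in a flat list of floats
    (e.g. output.detach().flatten().tolist() from a PyTorch tensor)."""
    n_nan = sum(1 for v in values if math.isnan(v))
    n_inf = sum(1 for v in values if math.isinf(v))
    return n_nan, n_inf

# Example: a "model output" containing invalid values.
out = [1.0, float("nan"), float("inf"), -2.5]
print(count_nonfinite(out))  # (1, 1)
```

If either count is nonzero on the model output or the loss, the overflow is coming from the model itself rather than from amp's loss scaling.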
I have the same problem. Could you help me? I don't know why the loss scale drops to 0.
@panmyuan Hey could you please reveal details about your training script? It looks like an interesting case. Thanks!
@kumasento Hey, I set option level to O0. It can run without errors.
Hi there, I meant the full script, including the model :)
On Fri, 27 Dec 2019 at 02:01, panmyuan notifications@github.com wrote:
@kumasento Hey, I set option level to O0. It can run without errors. [image] https://user-images.githubusercontent.com/48192787/71496733-c80dbf80-288f-11ea-9a10-3ca0f00ca5ad.png
I have the same problem, but training still works. What is the cause of the "Gradient overflow." messages? Thanks!
```
2020-04-24 19:24:35,018-INFO-Training Epoch:[1][7100/10630] Loss:140.3778(1.9497) Top1:4913.281(68.240%) Top5:6062.500(84.201%), lr=0.01
2020-04-24 19:25:46,816-INFO-Training Epoch:[1][7200/10630] Loss:141.6934(1.9410) Top1:4993.750(68.408%) Top5:6154.688(84.311%), lr=0.01
2020-04-24 19:26:59,096-INFO-Training Epoch:[1][7300/10630] Loss:143.0693(1.9334) Top1:5069.531(68.507%) Top5:6246.094(84.407%), lr=0.01
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
2020-04-24 19:28:11,251-INFO-Training Epoch:[1][7400/10630] Loss:144.4009(1.9253) Top1:5147.656(68.635%) Top5:6336.719(84.490%), lr=0.01
2020-04-24 19:29:23,725-INFO-Training Epoch:[1][7500/10630] Loss:146.0222(1.9213) Top1:5221.094(68.699%) Top5:6425.000(84.539%), lr=0.01
2020-04-24 19:30:36,325-INFO-Training Epoch:[1][7600/10630] Loss:147.7780(1.9192) Top1:5292.969(68.740%) Top5:6513.281(84.588%), lr=0.01
2020-04-24 19:31:48,458-INFO-Training Epoch:[1][7700/10630] Loss:148.9388(1.9095) Top1:5376.562(68.930%) Top5:6607.031(84.706%), lr=0.01
2020-04-24 19:33:00,342-INFO-Training Epoch:[1][7800/10630] Loss:150.5014(1.9051) Top1:5447.656(68.958%) Top5:6699.219(84.800%), lr=0.01
2020-04-24 19:34:12,780-INFO-Training Epoch:[1][7900/10630] Loss:152.1041(1.9013) Top1:5524.219(69.053%) Top5:6785.156(84.814%), lr=0.01
2020-04-24 19:35:24,719-INFO-Training Epoch:[1][8000/10630] Loss:153.5825(1.8961) Top1:5599.219(69.126%) Top5:6874.219(84.867%), lr=0.01
2020-04-24 19:36:36,565-INFO-Training Epoch:[1][8100/10630] Loss:154.6542(1.8860) Top1:5685.938(69.341%) Top5:6969.531(84.994%), lr=0.01
2020-04-24 19:37:49,421-INFO-Training Epoch:[1][8200/10630] Loss:156.1511(1.8813) Top1:5764.062(69.447%) Top5:7059.375(85.053%), lr=0.01
2020-04-24 19:39:01,807-INFO-Training Epoch:[1][8300/10630] Loss:157.8599(1.8793) Top1:5841.406(69.541%) Top5:7148.438(85.100%), lr=0.01
2020-04-24 19:40:14,186-INFO-Training Epoch:[1][8400/10630] Loss:159.1624(1.8725) Top1:5921.875(69.669%) Top5:7241.406(85.193%), lr=0.01
2020-04-24 19:41:26,444-INFO-Training Epoch:[1][8500/10630] Loss:161.0183(1.8723) Top1:5995.312(69.713%) Top5:7324.219(85.165%), lr=0.01
2020-04-24 19:42:38,645-INFO-Training Epoch:[1][8600/10630] Loss:162.6841(1.8699) Top1:6067.969(69.747%) Top5:7410.156(85.174%), lr=0.01
2020-04-24 19:43:50,608-INFO-Training Epoch:[1][8700/10630] Loss:164.1074(1.8649) Top1:6150.781(69.895%) Top5:7499.219(85.218%), lr=0.01
2020-04-24 19:45:03,007-INFO-Training Epoch:[1][8800/10630] Loss:165.5138(1.8597) Top1:6229.688(69.996%) Top5:7590.625(85.288%), lr=0.01
2020-04-24 19:46:15,317-INFO-Training Epoch:[1][8900/10630] Loss:167.1985(1.8578) Top1:6304.688(70.052%) Top5:7677.344(85.304%), lr=0.01
2020-04-24 19:47:28,869-INFO-Training Epoch:[1][9000/10630] Loss:168.7838(1.8548) Top1:6384.375(70.158%) Top5:7765.625(85.337%), lr=0.01
2020-04-24 19:48:40,865-INFO-Training Epoch:[1][9100/10630] Loss:170.1391(1.8493) Top1:6469.531(70.321%) Top5:7857.031(85.403%), lr=0.01
2020-04-24 19:49:53,109-INFO-Training Epoch:[1][9200/10630] Loss:171.8144(1.8475) Top1:6546.094(70.388%) Top5:7945.312(85.433%), lr=0.01
2020-04-24 19:51:05,193-INFO-Training Epoch:[1][9300/10630] Loss:173.3300(1.8439) Top1:6623.438(70.462%) Top5:8032.031(85.447%), lr=0.01
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 131072.0
2020-04-24 19:52:17,262-INFO-Training Epoch:[1][9400/10630] Loss:174.8439(1.8405) Top1:6701.562(70.543%) Top5:8121.875(85.493%), lr=0.01
2020-04-24 19:53:29,469-INFO-Training Epoch:[1][9500/10630] Loss:176.2916(1.8364) Top1:6776.562(70.589%) Top5:8213.281(85.555%), lr=0.01
2020-04-24 19:54:41,296-INFO-Training Epoch:[1][9600/10630] Loss:177.7239(1.8322) Top1:6857.031(70.691%) Top5:8302.344(85.591%), lr=0.01
2020-04-24 19:55:53,635-INFO-Training Epoch:[1][9700/10630] Loss:178.8308(1.8248) Top1:6941.406(70.831%) Top5:8397.656(85.690%), lr=0.01
```
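For context, occasional messages like the ones above are expected behaviour of dynamic loss scaling, not an error: on overflow the optimizer step is skipped and the scale is halved, and after a long enough run of successful steps the scale is doubled again. A minimal sketch of that logic (the real implementation is apex's `LossScaler`; the class name and constants here are illustrative):

```python
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval  # good steps before doubling
        self._good_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            # Halve the scale and tell the caller to skip optimizer.step()
            # (this matches the "reducing loss scale" log messages).
            self.scale /= 2.0
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2.0
            self._good_steps = 0
        return True
```

The problem is only when *every* step overflows, so the scale keeps halving toward zero, as in the original report.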
@panmyuan Hi, I'm hitting the same Gradient overflow problem with apex training. My model is resnet50+ibn+se, the loss is circleloss+tripletloss, and the optimizer is Adam. How did you solve it? Do different model structures, custom layers, optimizers, or loss functions all affect whether apex works?
@kumasento Hey, I set option level to O0. It can run without errors.
Yes, this works for me, but I'd still like to know whether there are any other solutions.
@kumasento Hey, I set option level to O0. It can run without errors.
Recognized opt_levels are "O0", "O1", "O2", and "O3".
O0 and O3 are not true mixed precision, but they are useful for establishing accuracy and speed baselines, respectively.
O1 and O2 are different implementations of mixed precision. Try both, and see what gives the best speedup and accuracy for your model.
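As an illustration, switching opt_level is a one-argument change to `amp.initialize` (a sketch with placeholder model/optimizer; it requires a CUDA-capable machine and an apex install, so treat it as a template rather than a runnable test):

```python
import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Try "O1" and "O2"; "O0" (pure FP32) and "O3" (pure FP16) are baselines.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(4, 10).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # backward runs on the scaled loss
optimizer.step()
```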
https://github.com/NVIDIA/apex/issues/695 this issue also found that test_loss_scale_decrease fails with some random seeds when opt_level = O1. I wonder if it is related?
O0 doesn't seem to give any speedup, though. (That is expected: O0 runs everything in FP32, so it is the accuracy baseline rather than an accelerated mode.)
Hi, I ran
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
but I got
```
Cleaning up...
Removing source in /tmp/pip-req-build-v0deounv
Removed build tracker '/tmp/pip-req-tracker-3n3fyj4o'
ERROR: Command errored out with exit status 1: /users4/zsun/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-rce1cb4d/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
    status = self.run(options, args)
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 455, in run
    use_user_site=options.use_user_site,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 62, in install_given_reqs
    **kwargs
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 888, in install
    cwd=self.unpacked_source_directory,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 275, in runner
    spinner=spinner,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
    raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /users4/zsun/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-rce1cb4d/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
1 location(s) to search for versions of pip:
```
Then I ran
$ pip install -v --no-cache-dir ./
and the install succeeded. But when I run my program, I get this:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00048828125
```
#Params: 73.7M
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext.  Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00390625
```
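Note that the "No module named 'amp_C'" warning means apex was installed without the compiled extensions, so amp fell back to the slower pure-Python unscaler. A quick sketch to check whether the fused CUDA extension was actually built (the variable name is illustrative):

```python
# amp_C is only importable when apex was built with --cpp_ext --cuda_ext;
# if it is missing, reinstall apex with those build options.
try:
    import amp_C
    fused_available = True
except ImportError:
    fused_available = False

print("fused unscale kernels available:", fused_available)
```

The fallback by itself only affects speed; the overflow spiral above points to non-finite gradients in the model rather than the missing extension.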
Could you please help me with this?