NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

raise RuntimeError("Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized. " #398

Open ayst123 opened 5 years ago

ayst123 commented 5 years ago

Bug

I get the error below. The same error occurs with both pytorch-nightly and pytorch-1.1.0.

I have reinstalled apex and maskrcnn-benchmark many times. Can anyone help? Thanks!

Traceback (most recent call last):
  File "tools/train_net.py", line 171, in <module>
    main()
  File "tools/train_net.py", line 164, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/home/ubuntu/msrcnn/maskscoring_rcnn/maskrcnn_benchmark/engine/trainer.py", line 83, in do_train
    with amp.scale_loss(losses, optimizer) as scaled_losses:
  File "/home/ubuntu/miniconda3/envs/msrcnn/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/ubuntu/miniconda3/envs/msrcnn/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/handle.py", line 81, in scale_loss
    raise RuntimeError("Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized.  "
RuntimeError: Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized.  model, optimizer = amp.initialize(model, optimizer, opt_level=...) must be called before `with amp.scale_loss`.

Environment

Here is my environment info. The same error occurs with both pytorch-nightly and pytorch-1.1.0.

Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.13.3

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 418.40.04
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.15.4
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.4                      243  
[conda] mkl_fft                   1.0.12           py36ha843d7b_0  
[conda] mkl_random                1.0.2            py36hd81dba3_0  
[conda] pytorch                   1.1.0           py3.6_cuda10.0.130_cudnn7.5.1_0    pytorch
[conda] torchvision               0.3.0           py36_cu10.0.130_1    pytorch

ptrblck commented 5 years ago

Hi @ayst123,

could you post a (small) reproducible code snippet? Did you properly initialize the model before calling the loss scaler as suggested in the error message?
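
For reference, the call order the error message asks for looks roughly like the sketch below. This is only an illustration, not the maskrcnn-benchmark trainer: `setup_amp` and `train_step` are hypothetical helper names, `opt_level="O1"` is an assumed choice, and the `amp` module is passed in as an argument so the snippet does not depend on how apex was installed. In real use you would pass the module obtained from `from apex import amp`.

```python
def setup_amp(amp, model, optimizer, opt_level="O1"):
    """Hypothetical helper: initialize Amp once, before any training step.

    `amp` is expected to be the module from `from apex import amp`.
    The opt_level default here is an assumption; pick what your setup needs.
    """
    # This call has to happen after the model and optimizer are built and
    # before the first `with amp.scale_loss(...)`; skipping it produces the
    # "internal Amp state has not been initialized" RuntimeError.
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
    return model, optimizer


def train_step(amp, model, optimizer, loss_fn, inputs, targets):
    """Hypothetical helper: one optimization step with loss scaling."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    # Only valid once setup_amp() has been called on this model/optimizer.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    return loss
```

If the training script only calls `amp.scale_loss` (as in `engine/trainer.py` in the traceback) but never reaches an `amp.initialize` call beforehand, that would be consistent with the error above.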