Open 152334H opened 9 months ago
Can you share the detailed reproduce steps? It seems PyTorch 2.2 needs a higher version of NCCL, and currently we only support PyTorch 2.1 and 1.4.
I met the same problem. My torch version is 2.4.0 with CUDA 12.1:
```
  File "/home/yatorho/doc/projs/MS-AMP/examples/mnist.py", line 182, in <module>
    main()
  File "/home/yatorho/doc/projs/MS-AMP/examples/mnist.py", line 173, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/yatorho/doc/projs/MS-AMP/examples/mnist.py", line 73, in train
    scaler.step(optimizer)
  File "/home/yatorho/anaconda3/envs/t24/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 448, in step
    self.unscale_(optimizer)
  File "/home/yatorho/anaconda3/envs/t24/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/yatorho/anaconda3/envs/t24/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 256, in _unscale_grads_
    assert isinstance(param, torch.Tensor), f"param is not a Tensor: {type(param)}"
AssertionError: param is not a Tensor: <class 'msamp.nn.parameter.ScalingParameter'>
```
The `param`'s type is `ScalingParameter`.
Hi @yatorho, PyTorch added a new assertion to check whether `param` is a `torch.Tensor`, but `ScalingTensor` in MS-AMP is not a `torch.Tensor`. A temporary solution is to comment out line 256 in `torch/amp/grad_scaler.py`: `assert isinstance(param, torch.Tensor), f"param is not a Tensor: {type(param)}"`.
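If editing a file inside `site-packages` feels too invasive, the same effect can be had at runtime: in CPython, a module-level name shadows the builtin of the same name for every function defined in that module, so injecting a relaxed `isinstance` into `torch.amp.grad_scaler`'s namespace would make the assertion accept `ScalingParameter`. Below is a minimal sketch of that monkey-patching pattern using stand-ins (`grad_scaler`, `Tensor`, and `ScalingParameter` here are toy substitutes, not the real torch/MS-AMP objects); whether to apply it to the real module is a judgment call.

```python
import builtins
import types

# Stand-in for torch/amp/grad_scaler.py: a module whose function contains
# the same kind of strict isinstance assertion as in the traceback.
grad_scaler = types.ModuleType("grad_scaler")
src = """
class Tensor:            # stand-in for torch.Tensor
    pass

def unscale(param):
    assert isinstance(param, Tensor), f"param is not a Tensor: {type(param)}"
    return True
"""
exec(src, grad_scaler.__dict__)

class ScalingParameter:  # stand-in: does NOT subclass Tensor
    pass

# The patch: a module-global named `isinstance` shadows the builtin for
# functions defined in that module, so unscale() now calls this instead.
def relaxed_isinstance(obj, cls):
    if isinstance(obj, ScalingParameter):   # builtin isinstance here
        return True
    return builtins.isinstance(obj, cls)

grad_scaler.isinstance = relaxed_isinstance

print(grad_scaler.unscale(ScalingParameter()))  # no AssertionError
```

The upside over editing `grad_scaler.py` is that the change lives in your own training script and survives reinstalling or upgrading the package; the downside is that it silently relaxes the check for every caller of that module.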
Thanks! It works for me.
What's the issue, what's expected?:
`python mnist.py --enable-msamp --opt-level=O2` should work with the versions pinned in `pyproject.toml`. Specifically, it should work with `torch==2.2.1`, given that torch is unpinned.

How to reproduce it?:
Build MS-AMP with `torch==2.2.1`.

Log message or snapshot?:
Additional information:
This occurs because `optimizer.param_groups[:,'params']` contains `ScalingParameter`s. `ScalingParameter` subclasses `ScalingTensor`, which subclasses nothing, so the `isinstance` check fails. Commenting out the assertion line manually fixes the issue. I do not know how to reasonably fix this without resorting to that.