facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Failure when not using FSDP mixed precision #266

Open schmidt-ai opened 1 year ago

schmidt-ai commented 1 year ago

When training without providing the mixed_precision argument to FSDP, there is an error related to dtype mismatch in dinov2/layers/block.py. Is this expected?

Full stacktrace:

  File "/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 748, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
    return forward_call(*args, **kwargs)
    ret = self.forward_features(*args, **kwargs)
  File "/.venv/lib/python3.10/site-packages/dinov2/models/vision_transformer.py", line 207, in forward_features_list
    x = blk(x)
  File "/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  File "/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 748, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
  File "/.venv/lib/python3.10/site-packages/dinov2/layers/block.py", line 258, in forward
    return self.forward_nested(x_or_x_list)
  File "/.venv/lib/python3.10/site-packages/dinov2/layers/block.py", line 226, in forward_nested
    x_list = drop_add_residual_stochastic_depth_list(
  File "/.venv/lib/python3.10/site-packages/dinov2/layers/block.py", line 200, in drop_add_residual_stochastic_depth_list
    attn_bias, x_cat = get_attn_bias_and_cat(x_list, branges)
  File "/.venv/lib/python3.10/site-packages/dinov2/layers/block.py", line 180, in get_attn_bias_and_cat
    cat_tensors = index_select_cat([x.flatten(1) for x in x_list], branges).view(1, -1, x_list[0].shape[-1])
    return _IndexSelectCat.apply(*sources, *indices)
    IndexSelect.OPERATOR(
RuntimeError: Expected output.scalar_type() == at::ScalarType::Half to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

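
For anyone reading the trace: the call that fails is `index_select_cat(...)` in `get_attn_bias_and_cat`, and the error message suggests the fused operator only accepts half-precision tensors. Below is a minimal sketch of what that call appears to compute, written in plain PyTorch so it runs in float32. It assumes `index_select_cat` gathers the indexed rows from each source and concatenates the flattened results. The helper name `index_select_cat_reference` is hypothetical, and this is not the contents of qasfb-patch-1.

```python
import torch


def index_select_cat_reference(sources, indices):
    """Hypothetical dtype-agnostic stand-in for the fused index_select_cat op.

    Assumption: for each (source, index) pair, select the indexed rows along
    dim 0, flatten them, and concatenate everything into one 1-D tensor.
    """
    return torch.cat([src[idx.long()].flatten() for src, idx in zip(sources, indices)])


# Two float32 inputs shaped (batch, tokens, dim), as when FSDP runs without mixed precision.
x_list = [torch.randn(4, 3, 8), torch.randn(2, 5, 8)]
# Per-input batch indices kept by stochastic depth (values chosen for illustration).
branges = [torch.tensor([0, 2]), torch.tensor([1])]

# Same expression as block.py line 180 in the trace, with the stand-in swapped in.
cat_tensors = index_select_cat_reference(
    [x.flatten(1) for x in x_list], branges
).view(1, -1, x_list[0].shape[-1])

print(cat_tensors.shape, cat_tensors.dtype)
```

The result keeps the dtype of the inputs, so the same expression works for float32 and float16 alike, though it will likely be slower than the fused kernel.
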
qasfb commented 1 year ago

Can you try with this? qasfb-patch-1