Hey @SeanNaren,
Do you think we should re-create the optimizers for DeepSpeed?
Best, T.C
Hi,
This might not be super related, but when trying to run the example above, I get:
You have not specified an optimizer or scheduler within the DeepSpeed config. Using `configure_optimizers` to define optimizer and scheduler.
Using /home/My_NAME/.cache/torch_extensions as PyTorch extensions root...
File "/home/My_NAME/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1430, in verify_ninja_availability raise RuntimeError("Ninja is required to load C++ extensions") RuntimeError: Ninja is required to load C++ extensions `
Am I missing something?
Many thanks @SeanNaren
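(For reference: that `RuntimeError` is raised by `torch.utils.cpp_extension` when the Ninja build tool cannot be found. Below is a quick check, assuming that installing Ninja, e.g. `pip install ninja`, and having it on `PATH` is the fix.)

# Minimal check (an assumption, not from this thread): verify that PyTorch can see Ninja
# before DeepSpeed tries to JIT-compile its C++/CUDA utils.
from torch.utils.cpp_extension import is_ninja_available

if not is_ninja_available():
    raise SystemExit("Ninja not found; try `pip install ninja` and make sure it is on PATH")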
Hi,
I am facing the same issue while trying to freeze some backbone layers of a multi-head model. Has anyone been able to solve the problem?
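For concreteness, here is a minimal sketch of the kind of callback involved; the `backbone` attribute and the unfreeze epoch are placeholders, not taken from this thread:

# Sketch of a typical BaseFinetuning subclass: freeze a backbone at the start of
# training and unfreeze it at a given epoch.
from pytorch_lightning.callbacks import BaseFinetuning

class FreezeBackbone(BaseFinetuning):
    def __init__(self, unfreeze_at_epoch: int = 10):
        super().__init__()
        self._unfreeze_at_epoch = unfreeze_at_epoch

    def freeze_before_training(self, pl_module):
        # `pl_module.backbone` is assumed here; freeze whatever submodule applies.
        self.freeze(pl_module.backbone)

    def finetune_function(self, pl_module, current_epoch, optimizer, opt_idx):
        if current_epoch == self._unfreeze_at_epoch:
            # With DeepSpeed, the callback's bookkeeping of these param groups
            # is what breaks (see the discussion below).
            self.unfreeze_and_add_param_group(modules=pl_module.backbone, optimizer=optimizer)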
Hi there. I investigated this issue and found that the cause is that DeepSpeed flattens the model's weight tensors, so unflattening them is a solution. The following modification is an example that uses the `unflatten` method DeepSpeed adds to the optimizer.
diff --git a/pytorch_lightning/callbacks/finetuning.py b/pytorch_lightning/callbacks/finetuning.py
index 26ef742ee..2fa3a7880 100644
--- a/pytorch_lightning/callbacks/finetuning.py
+++ b/pytorch_lightning/callbacks/finetuning.py
@@ -258,12 +258,11 @@ class BaseFinetuning(Callback):
     def _store(
         self,
-        pl_module: "pl.LightningModule",
+        mapping: dict,
         opt_idx: int,
         num_param_groups: int,
         current_param_groups: List[Dict[str, Any]],
     ) -> None:
-        mapping = {p: n for n, p in pl_module.named_parameters()}
         if opt_idx not in self._internal_optimizer_metadata:
             self._internal_optimizer_metadata[opt_idx] = self._apply_mapping_to_param_groups(
                 current_param_groups, mapping
@@ -283,7 +282,22 @@ class BaseFinetuning(Callback):
             num_param_groups = len(optimizer.param_groups)
             self.finetune_function(pl_module, trainer.current_epoch, optimizer, opt_idx)
             current_param_groups = optimizer.param_groups
-            self._store(pl_module, opt_idx, num_param_groups, current_param_groups)
+            mapping = {p: n for n, p in pl_module.named_parameters()}
+
+            # DeepSpeed flattens the optimizer's tensors and adds an `unflatten` method to it.
+            if len(current_param_groups[0]["params"]) == 1 and hasattr(optimizer, "unflatten"):
+                current_param_groups = [
+                    {
+                        "params": [
+                            tuple(p.flatten().tolist())
+                            for p in optimizer.unflatten(
+                                current_param_groups[0]["params"][0], optimizer.round_robin_bit16_groups[0]
+                            )
+                        ]
+                    }
+                ]
+                mapping = {tuple(p.flatten().tolist()): n for p, n in mapping.items()}
+            self._store(mapping, opt_idx, num_param_groups, current_param_groups)
As a result, we get the following output.
/workspaces/pytorch/pytorch-lightning/pytorch_lightning/plugins/training_type/deepspeed.py:20: LightningDeprecationWarning: The `pl.plugins.training_type.deepspeed.DeepSpeedPlugin` is deprecated in v1.6 and will be removed in v1.8. Use `pl.strategies.deepspeed.DeepSpeedStrategy` instead.
rank_zero_deprecation(
/workspaces/pytorch/pytorch-lightning/pytorch_lightning/trainer/connectors/accelerator_connector.py:424: LightningDeprecationWarning: Setting `Trainer(gpus=1)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=1)` instead.
rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Running in fast_dev_run mode: will run the requested loop using 1 batch(es).
`Trainer(limit_train_batches=1)` was configured so 1 batch per epoch will be used.
`Trainer(limit_val_batches=1)` was configured so 1 batch will be used.
`Trainer(limit_test_batches=1)` was configured so 1 batch will be used.
`Trainer(limit_predict_batches=1)` was configured so 1 batch will be used.
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using /home/vscode/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /home/vscode/.cache/torch_extensions/py38_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.408236026763916 seconds
Rank: 0 partition count [1] and sizes[(66, False)]
Using /home/vscode/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0002238750457763672 seconds
| Name | Type | Params
----------------------------------
0 | layer1 | Linear | 1.1 K
1 | layer2 | Linear | 66
----------------------------------
66 Trainable params
1.1 K Non-trainable params
1.1 K Total params
0.004 Total estimated model params size (MB)
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 160.42it/s, loss=0.00583, v_num=]
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
You have not specified an optimizer or scheduler within the DeepSpeed config. Using `configure_optimizers` to define optimizer and scheduler.
Using /home/vscode/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00019049644470214844 seconds
Testing DataLoader 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1256.16it/s]
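For context, here is a standalone toy example (using the `torch._utils` helpers that this kind of flattening builds on, as an assumption about the mechanism) of why the original `{p: n for n, p in pl_module.named_parameters()}` mapping no longer matches what the DeepSpeed optimizer holds:

import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

params = [torch.randn(3, 4), torch.randn(5)]

# DeepSpeed-style flattening: the optimizer ends up holding one flat 1-D buffer
# per group instead of the module's individual Parameter objects, so an
# identity-based mapping of Parameters can no longer find them.
flat = _flatten_dense_tensors(params)
assert flat.numel() == sum(p.numel() for p in params)

# Recovering per-parameter views, analogous to what `optimizer.unflatten(...)`
# does in the workaround above.
views = _unflatten_dense_tensors(flat, params)
assert [v.shape for v in views] == [p.shape for p in params]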
The workaround above is messy and only covers storing the callback's internal state, so we also have to add some code to handle loading from a checkpoint correctly. In `BaseFinetuning`, `load_state_dict` and `on_fit_start` are expected to implement this: concretely, `load_state_dict` gets `_internal_optimizer_metadata` from the given `state_dict`, and `on_fit_start` applies it to the optimizers. But this cannot work with DeepSpeed, because `DeepSpeedStrategy` always returns `True` from `restore_checkpoint_after_setup`. As a result, `on_fit_start` is called before `load_state_dict`.
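Schematically (a sketch of the hook pair in question, not the actual Lightning source; the state-dict key and the re-apply logic are placeholders):

from typing import Any, Dict
from pytorch_lightning.callbacks import Callback

class FinetuningStateSketch(Callback):
    def __init__(self):
        self._internal_optimizer_metadata: Dict[int, Any] = {}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Restores the stored param-group metadata from the checkpoint.
        self._internal_optimizer_metadata = state_dict["internal_optimizer_metadata"]

    def on_fit_start(self, trainer, pl_module) -> None:
        # Expects the metadata restored above. With DeepSpeed
        # (restore_checkpoint_after_setup == True) this hook currently runs first,
        # so the dict is still empty at this point.
        for opt_idx, groups in self._internal_optimizer_metadata.items():
            ...  # re-add the previously stored param groups to trainer.optimizers[opt_idx]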
I don't have any solutions for now. Do you have any good ideas?
To call `load_state_dict` before `on_fit_start`, is it possible to move the call to `self._restore_modules_and_callbacks(ckpt_path)` before `self._call_callback_hooks("on_fit_start")`, as follows?
pytorch_lightning/trainer/trainer.py | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/pytorch_lightning/trainer/trainer.py b/pytorch_lightning/trainer/trainer.py
index 71cb47b13..1e7c91838 100644
--- a/pytorch_lightning/trainer/trainer.py
+++ b/pytorch_lightning/trainer/trainer.py
@@ -1156,6 +1156,10 @@ class Trainer(
         # strategy will configure model and move it to the device
         self.strategy.setup(self)
+        if self.strategy.restore_checkpoint_after_setup:
+            log.detail(f"{self.__class__.__name__}: restoring module and callbacks from checkpoint path: {ckpt_path}")
+            self._restore_modules_and_callbacks(ckpt_path)
+
         # hook
         if self.state.fn == TrainerFn.FITTING:
             self._call_callback_hooks("on_fit_start")
@@ -1163,9 +1167,6 @@ class Trainer(
             self._log_hyperparams()
-        if self.strategy.restore_checkpoint_after_setup:
-            log.detail(f"{self.__class__.__name__}: restoring module and callbacks from checkpoint path: {ckpt_path}")
-            self._restore_modules_and_callbacks(ckpt_path)
         # restore optimizers, etc.
         log.detail(f"{self.__class__.__name__}: restoring training state")
🐛 Bug
Port from https://github.com/microsoft/DeepSpeed/issues/1426
The `BaseFinetuning` callback from PyTorch Lightning crashes when using the DeepSpeed plugin.
To Reproduce