Closed: caiqi closed this 9 months ago
Can you share the complete error stacktrace?
@carmocca Thanks! Here is code that reproduces the error:
import lightning as L
import torch.nn as nn
import torch.optim


class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 128, 3, 3)


def main():
    fabric = L.Fabric(
        strategy="deepspeed"
    )
    fabric.launch()
    net_a = Network()
    net_b = Network()
    optim_a = torch.optim.Adam(net_a.parameters())
    net_a, optim_a = fabric.setup(net_a, optim_a)
    net_b = fabric.setup(net_b)
    return

if __name__ == '__main__':
    main()
and this is the full stacktrace:
Traceback (most recent call last):
File "/mnt/afs_xxxxx/project/dreamdata/Model/dreamodel/utils/debug_deepspeed.py", line 25, in <module>
main()
File "/mnt/afs_xxxxx/project/dreamdata/Model/dreamodel/utils/debug_deepspeed.py", line 21, in main
net_b = fabric.setup(net_b)
File "/mnt/afs_xxxxx/anaconda3/envs/pytorch_latest/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 242, in setup
module = self._strategy.setup_module(module)
File "/mnt/afs_xxxxx/anaconda3/envs/pytorch_latest/lib/python3.10/site-packages/lightning/fabric/strategies/deepspeed.py", line 337, in setup_module
self._deepspeed_engine, _ = self._initialize_engine(module)
File "/mnt/afs_xxxxx/anaconda3/envs/pytorch_latest/lib/python3.10/site-packages/lightning/fabric/strategies/deepspeed.py", line 584, in _initialize_engine
deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
File "/mnt/afs_xxxxx/anaconda3/envs/pytorch_latest/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/afs_xxxxx/anaconda3/envs/pytorch_latest/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
self.optimizer = self._configure_zero_optimizer(optimizer=None)
File "/mnt/afs_xxxxx/anaconda3/envs/pytorch_latest/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1465, in _configure_zero_optimizer
assert not isinstance(optimizer, DummyOptim), "zero stage {} requires an optimizer".format(zero_stage)
AssertionError: zero stage 2 requires an optimizer
My scenario is that net_b is frozen and only used to extract features for net_a, like the VAE in text-to-image diffusion.
Hey @caiqi
So the default DeepSpeed stage is ZeRO 2. ZeRO stage 2 shards optimizer states and gradients, so it is specific to training and therefore requires an optimizer. Setting up a model without one would not make sense. As the error says: zero stage 2 requires an optimizer.
You didn't mention what your intention is in your post. Maybe you just want to run inference and your model is too large. In that case, only stage 3 makes sense:
from lightning.fabric import Fabric
from lightning.fabric.strategies import DeepSpeedStrategy

fabric = Fabric(strategy="deepspeed_stage_3")  # plus your other Fabric arguments
# or
fabric = Fabric(strategy=DeepSpeedStrategy(stage=3))  # plus your other Fabric arguments
There are of course other ways to make inference more memory-efficient, for example 16-bit precision, quantization, etc., which you should try first before reaching for DeepSpeed or multi-GPU.
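To make the 16-bit and quantization point concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone take at different precisions. The parameter count is that of the `nn.Conv2d(3, 128, 3, 3)` layer from the reproduction code; the helper names are made up for illustration, and activations, buffers, quantization metadata and framework overhead are all ignored.

```python
# Bytes needed to store one parameter at each precision
# (int8 ignores the scale/zero-point metadata a real quantizer adds).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def conv2d_param_count(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of a square-kernel Conv2d layer (groups=1)."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    biases = out_channels
    return weights + biases


def weight_bytes(num_params: int, precision: str) -> int:
    """Memory taken by the weights alone at the given precision."""
    return num_params * BYTES_PER_PARAM[precision]


# Same shape as nn.Conv2d(3, 128, 3, 3) in the reproduction code.
num_params = conv2d_param_count(3, 128, 3)
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_bytes(num_params, precision)} bytes")
```

For a real model the same arithmetic applies with a much larger parameter count, which is why halving the bytes per parameter is usually the first thing to try.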
Please let me know if this helps.
@carmocca Thanks! My intention is to train a text-to-image diffusion model using PyTorch Lightning. A typical text-to-image diffusion model contains a text encoder and a VAE, which are frozen during training, and a UNet, which is the part being trained. So my focus is on training, not inference. Judging from the code, do I need to set up several Fabric instances: one for the text encoder and VAE using DeepSpeed stage 3, and one for the UNet using DeepSpeed stage 2? Does launch need to be called for each Fabric instance? Thanks!
In theory, if you want to use multiple models with different deepspeed stages, then yes, you'll need one Fabric for each. However, I think that wouldn't be necessary in your case, and you could just start by using stage 2. Here is the recipe:
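import torch
from lightning.fabric import Fabric

fabric = Fabric(strategy="deepspeed_stage_2")  # plus any other Fabric arguments
fabric.launch()

# TextEncoder, VAE and Unet stand for your own model classes
text_encoder = TextEncoder()
vae = VAE()
unet = Unet()

# You only want to train unet
optimizer = torch.optim.Adam(unet.parameters())  # any optimizer over unet's parameters
unet, optimizer = fabric.setup(unet, optimizer)

# Freeze the others (note the trailing underscore: requires_grad_ is the nn.Module method)
text_encoder.eval()
text_encoder.requires_grad_(False)
vae.eval()
vae.requires_grad_(False)

# Move the frozen ones to GPU, and use them as-is
text_encoder.to(fabric.device)
vae.to(fabric.device)

# Train
...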
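For illustration, the "Train" step of this recipe could look roughly like the sketch below. It runs on CPU with plain PyTorch and uses two tiny convolution layers as hypothetical stand-ins for the real VAE and UNet; the noise-prediction loss is also only a placeholder, not the loss of any particular diffusion model. The point it shows is that the frozen model is called under `torch.no_grad()` and never receives gradients, while the trainable one does.

```python
import torch
import torch.nn as nn

# Toy stand-ins (hypothetical) for the real VAE and UNet.
vae = nn.Conv2d(3, 4, kernel_size=3, padding=1)
unet = nn.Conv2d(4, 4, kernel_size=3, padding=1)

# Freeze the feature extractor: eval mode and no gradients.
vae.eval()
vae.requires_grad_(False)

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-3)

images = torch.randn(2, 3, 8, 8)

# The frozen model only produces inputs for the trainable one,
# so no autograd graph is needed for it.
with torch.no_grad():
    latents = vae(images)

# Placeholder objective: predict the noise that was added to the latents.
noise = torch.randn_like(latents)
prediction = unet(latents + noise)
loss = nn.functional.mse_loss(prediction, noise)

optimizer.zero_grad()
loss.backward()  # with Fabric this would be fabric.backward(loss)
optimizer.step()
```

With Fabric, only `unet` and `optimizer` would go through `fabric.setup`, and the input batch would need to be on `fabric.device`.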
Thanks. This works fine.
class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.unet = Unet()
        self.unet.requires_grad_(True)
        self.vae = Vae()
        self.vae.requires_grad_(False)
        self.vae.eval()
PyTorch Lightning could then check whether, for a given model, `p.requires_grad == False` holds for every parameter `p`. If so, it would not wrap that model with DeepSpeed; otherwise it would.
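A minimal sketch of such a check is below. This is not something Lightning does today as far as this thread shows; `is_fully_frozen` is a hypothetical helper name, and the printed "actions" only describe what the proposal would do with each model.

```python
import torch.nn as nn


def is_fully_frozen(module: nn.Module) -> bool:
    """True if no parameter of `module` requires gradients.

    A module without any parameters also counts as frozen.
    """
    return not any(p.requires_grad for p in module.parameters())


trainable = nn.Linear(4, 4)
frozen = nn.Linear(4, 4).requires_grad_(False)  # requires_grad_ returns the module

for name, module in {"trainable": trainable, "frozen": frozen}.items():
    if is_fully_frozen(module):
        action = "move to device only"
    else:
        action = "wrap with the strategy"
    print(f"{name}: {action}")
```

One design question the sketch leaves open is a model that is frozen at setup time and unfrozen later; a check like this would only see the state at the moment it runs.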
Bug description
I want to train a text-to-image diffusion model using Lightning. There are three main models: a text encoder, a VAE and a UNet.
I set them up with
It works fine with the auto strategy. However, when I switch to the DeepSpeed strategy, it reports AssertionError: zero stage 2 requires an optimizer
Is there a solution for this setting, where there are multiple models that do not need training at all?
What version are you seeing the problem on?
v2.1
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
More info
No response
cc @awaelchli @carmocca @justusschock