File "/home/olivia/experiments/cot_reliability/trlx_minimal.py", line 73, in <module>
trainer = trlx.train(
File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/trlx/trlx.py", line 92, in train
trainer = get_trainer(config.train.trainer)(
File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/trlx/trainer/accelerate_ppo_trainer.py", line 74, in __init__
if not hasattr(self.model, "frozen_head") and not self.model.peft_type:
File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'peft_type'
The error comes from these lines in accelerate_ppo_trainer.py:
self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
    self.model, self.opt, self.scheduler, rollout_loader
)

self.store.clear_history()  # Clear the rollout store

if not hasattr(self.model, "frozen_head") and not self.model.peft_type:
    self.ref_model = self.get_arch(self.config)
self.model originally has a peft_type attribute (set to None here, since peft is not used), but in multi-GPU mode the self.accelerator.prepare call appears to wrap the model in DistributedDataParallel, which does not expose this attribute, so the check above fails with the AttributeError.
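As far as I can tell, the underlying model is still reachable through the wrapper, so one way to read the attribute is to unwrap first. A minimal sketch (inside the trainer, after accelerator.prepare; unwrap_model is accelerate's standard helper for this):

# Sketch: read peft_type from the unwrapped model rather than from the DDP wrapper.
unwrapped_model = self.accelerator.unwrap_model(self.model)
peft_type = getattr(unwrapped_model, "peft_type", None)  # None when peft is not configured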
We can work around this by saving the peft_type attribute before the accelerator.prepare call and setting it on the wrapped model afterwards. With this change the code runs correctly.
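Roughly, the change looks like this (a sketch against the lines quoted above, not a polished patch):

# Save peft_type before accelerate (possibly) wraps the model in DDP ...
peft_type = getattr(self.model, "peft_type", None)

self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
    self.model, self.opt, self.scheduler, rollout_loader
)

# ... and restore it on the wrapped model so the check below still works.
if not hasattr(self.model, "peft_type"):
    self.model.peft_type = peft_type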
However, even with this change, multi-GPU training does not work when using peft to implement LoRA.
If I uncomment the peft_config lines in the example script above and change num_layers_unfrozen to 1, this works correctly with single-GPU training. However, when I add a second GPU, the script fails with an error saying that DistributedDataParallel has no attribute forward_hydra, presumably for the same reason: the DDP wrapper hides attributes of the underlying model.
This problem can be fixed by removing all references to peft_type in accelerate_ppo_trainer.py (which also makes the fix above unnecessary). When I do this, training seems to run correctly with LoRA on both GPUs. However, I am not familiar enough with this codebase to know whether this change introduces other, less obvious errors.
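For the particular check quoted at the top, that amounts to dropping the peft_type half of the condition, e.g.:

# With the peft_type reference removed, only the frozen_head check remains:
if not hasattr(self.model, "frozen_head"):
    self.ref_model = self.get_arch(self.config)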
Which trlX version are you using?
trlx==0.7.0
Additional system and package information
python 3.9, transformers 4.35.0, accelerate 0.24.1, Ubuntu