achew010 opened this issue 1 week ago
@achew010 @wynterl I made some progress with this. If we comment out the original model-preparation lines and replace them with:
```python
import fms_acceleration_peft
from fms_acceleration_peft.framework_plugin_bnb import _prepare_model_for_kbit_training
from peft import get_peft_model

model = _prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=True,
    gradient_checkpointing_kwargs=args.gradient_checkpointing_kwargs,
)
model = get_peft_model(model, peft_config)
```
This suggests that one of the commented-out lines is causing the issue.
Update: problem 1) occurs because, with the new fix, the assumption at https://github.com/huggingface/trl/blob/c3143832cb305139b2551af2e00f008b4d64a981/trl/trainer/sft_trainer.py#L231 no longer holds.
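I won't reproduce the exact TRL condition here, but the assumption that breaks is that model parameters are materialized tensors. On ranks > 0 they are now `meta` placeholders, which a check like the following (hypothetical helper, not TRL code) would flag:

```python
import torch

def has_meta_params(model: torch.nn.Module) -> bool:
    # True if any parameter is still a placeholder on the meta device,
    # i.e. it has a shape and dtype but no real storage behind it. With
    # the new transformers loading path this is True on ranks > 0 until
    # the weights are materialized and sharded.
    return any(p.device.type == "meta" for p in model.parameters())
```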
## Root Cause
The root cause is a recent transformers update to resolve high CPU usage when loading large quantized models. The update now loads weights on the `meta` device on ranks > 0 and then shards the weights during the preparation step. This introduces `torch.distributed` memory calls during `Accelerator.prepare_model`, which we observe getting stuck for QLoRA `Mistral 7B`.
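As a rough sketch of the new loading flow (illustrative only, not the transformers implementation):

```python
import torch
import torch.nn as nn

# Rank 0 loads the real weights; every other rank creates placeholders on
# the meta device (shape/dtype only, no storage), keeping CPU RAM flat.
rank = 0  # stand-in for torch.distributed.get_rank()

if rank == 0:
    layer = nn.Linear(4096, 4096)      # real weights on CPU
else:
    with torch.device("meta"):
        layer = nn.Linear(4096, 4096)  # placeholder weights, no data

# During preparation, the placeholder ranks must first allocate storage...
if rank != 0:
    layer = layer.to_empty(device="cpu")
    # ...and then receive the actual values from rank 0 via
    # torch.distributed collectives inside Accelerator.prepare_model,
    # which is the step we observe getting stuck.
```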
## What was observed
Running experiments to test the new Granite models (e.g. `ibm/PowerLM-3b`) available on `transformers==4.45.0.dev0`, we encountered the following issues:

1. Hanging inside `trainer.train()`, leading to an eventual distributed timeout error for FSDP-QLoRA experiments, despite only using standard HF libraries in our baseline experiments (see the sketch under Issue 1 below).
2. Failure when installing the FOAK plugin for FSDP-QLoRA. During registration of the DDP gradient-reduction hooks for the LoRA adapters, weights cannot be cast to `cuda` on non-zero-ranked devices because there are no actual weights on `meta`; this is due to the efficient-cpu-ram-mode fix that now puts all weights on non-zero-ranked devices on the `meta` device (see the sketch under Issue 2 below).

## Reproduce
### Issue 1
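A minimal sketch of the kind of baseline that hangs; the model id, quantization config, and LoRA config below are illustrative stand-ins, not the original experiment settings:

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Standard HF QLoRA setup of the kind used in our baselines.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # QLoRA Mistral 7B, as noted above
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Passing model and peft_config to SFTTrainer and launching with FSDP
# (e.g. `accelerate launch --config_file fsdp_config.yaml train.py`) on
# transformers==4.45.0.dev0 hangs inside trainer.train() until the
# distributed timeout fires.
```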
### Issue 2
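The failing cast can be demonstrated with a stand-in adapter module (this is not the FOAK plugin code, and it assumes a CUDA-enabled build):

```python
import torch
import torch.nn as nn

# Stand-in for a LoRA adapter weight as it appears on a rank > 0 process:
with torch.device("meta"):
    lora_A = nn.Linear(4096, 16, bias=False)

# Hook registration needs real weights on cuda, but a meta tensor has no
# data to move, so the cast itself fails:
try:
    lora_A.weight.data = lora_A.weight.data.to("cuda")
    lora_A.weight.register_hook(lambda grad: grad)  # never reached
except NotImplementedError as err:
    print(f"cannot register grad-reduction hook: {err}")
    # -> NotImplementedError: Cannot copy out of meta tensor; no data!
```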
## Dependencies