dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

Zero-3 offload support #60

Open XenonLamb opened 4 months ago

XenonLamb commented 4 months ago

Is there a way to enable zero3-offload for LLaMA-VID?

I'm trying to integrate an LLM with higher GPU memory usage into LLaMA-VID, which means I can't run it without offloading to CPU RAM, even at batch_size=1. However, running with zero2-offload does not seem to work: the GPU still goes OOM and nothing appears to be offloaded to CPU RAM. Moreover, if I run DeepSpeed with zero3-offload, e.g.

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 1e5,
    "wall_clock_breakdown": false
}

the training gets stuck after the model is initialized. Is there a way to make LLaMA-VID zero-offload compatible? Thank you!
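By the way, to confirm whether ZeRO-3 actually partitions the model, I've been checking the ds_* metadata DeepSpeed attaches to each parameter. Just a rough sketch, not part of LLaMA-VID:

def report_partitioning(model):
    # DeepSpeed ZeRO-3 replaces each parameter's local storage with an empty
    # tensor (shape torch.Size([0])) and records the full size in ds_numel /
    # ds_shape; if those attributes are missing, the parameter was never
    # partitioned and still occupies full GPU memory on every rank.
    for name, param in model.named_parameters():
        ds_numel = getattr(param, "ds_numel", None)
        if ds_numel is not None and param.numel() == 0:
            print(f"{name}: partitioned, full numel = {ds_numel}")
        else:
            print(f"{name}: not partitioned, local numel = {param.numel()}")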

yanwei-li commented 4 months ago

Hi, thanks for the suggestion! We don't currently have enough resources to validate Zero3-offload, but we will try to support it later.

xxtars commented 3 months ago

@XenonLamb Hello, I'd like to ask whether you have successfully trained with zero3 or zero3_offload. I used the zero3.json provided by LLaVA, but I ran into some problems when loading the Q-Former. First, LLaMA-VID loads "bert-base-uncased" through transformers:

mm_model = BertLMHeadModelQF.from_pretrained(
    "bert-base-uncased", config=encoder_config
)

I'm not very familiar with DeepSpeed, and I'm not sure if zero3 handles this part of the loading process. Later, when loading the pretrained qformer, an error occurs:

self.vlm_att_projector.load_state_dict(get_w(att_projector_weights, 'vlm_att_projector'))

Error: bert.encoder.layer.0.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([0]). Do you know how to handle this? Any help would be greatly appreciated.
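For reference, the usual DeepSpeed pattern for this kind of torch.Size([0]) mismatch is to gather the partitioned parameters before copying the checkpoint weights and let ZeRO-3 re-partition them afterwards. A rough sketch, not verified end-to-end (on older transformers the helper lives in transformers.deepspeed instead of transformers.integrations):

import deepspeed
import torch.distributed as dist
from transformers.integrations import is_deepspeed_zero3_enabled

def load_state_dict_zero3(module, state_dict):
    # Under ZeRO-3 each rank holds only a shard of every parameter, so the
    # local tensors report shape torch.Size([0]) and load_state_dict() fails
    # with a shape mismatch. Gather the full parameters, copy the checkpoint
    # weights on rank 0, and DeepSpeed re-partitions (and broadcasts) them
    # when the context exits.
    if is_deepspeed_zero3_enabled():
        params = list(module.parameters())
        with deepspeed.zero.GatheredParameters(params, modifier_rank=0):
            if dist.get_rank() == 0:
                module.load_state_dict(state_dict)
    else:
        module.load_state_dict(state_dict)

It could then replace the direct call, e.g. load_state_dict_zero3(self.vlm_att_projector, get_w(att_projector_weights, 'vlm_att_projector')).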

XenonLamb commented 3 months ago

I encountered similar issues and haven't fully resolved them yet. The shape mismatch error can be "circumvented" by running in zero-2 first and manually saving a checkpoint of the BERT model. However, even with that workaround, my zero-3 training crashes after ~10 iterations with:

RuntimeError: still have inflight params [
    {'id': 1143, 'status': 'AVAILABLE', 'numel': 1982464, 'ds_numel': 1982464, 'shape': (1408, 1408), 'ds_shape': (1408, 1408), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([247808])},
    {'id': 1145, 'status': 'AVAILABLE', 'numel': 5767168, 'ds_numel': 5767168, 'shape': (4096, 1408), 'ds_shape': (4096, 1408), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([720896])}
]

I suspect it has something to do with the tensor shapes produced by BERT.
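For completeness, this is roughly what I mean by saving a checkpoint of the BERT model from a zero-2 run, so the zero-3 run can load plain tensors instead of calling from_pretrained again. Just a sketch; the 'vlm_att' prefix follows the attribute names above (e.g. vlm_att_projector), adjust as needed:

import torch

def dump_vlm_att_weights(model, out_path="vlm_att_weights.bin"):
    # Under ZeRO-2 the parameters are still fully replicated on every GPU
    # (only optimizer states and gradients are partitioned/offloaded), so
    # state_dict() returns real tensors. Keep only the 'vlm_att_*' modules
    # so a later run can restore them with load_state_dict.
    weights = {k: v.detach().cpu() for k, v in model.state_dict().items()
               if "vlm_att" in k}
    torch.save(weights, out_path)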

xxtars commented 3 months ago

@XenonLamb I asked this question in the DeepSpeed repo. The suggestion there seems to work initially (tested for several iterations), but I haven't verified it across the entire stage. You might give it a try.

BlueBlueFF commented 2 months ago

Did you manage to fix this error? RuntimeError: still have inflight params