Vision-CAIR / LongVU

https://vision-cair.github.io/LongVU

Did you ever encounter issues with `Trainer`'s `_save_checkpoint` and FSDP? #21

Open geomlyd opened 2 weeks ago

geomlyd commented 2 weeks ago

This may be system-dependent and a bit of a long shot, but I'm having issues running training. Everything seems to go well until `_save_checkpoint` is called (if it matters, I'm running a toy training session with a dataset of 10 samples), at which point I get some kind of PyTorch synchronization error (originating from a gather op) indicating that, e.g., rank 0's optimizer state has a large number of parameters (presumably as many as the backbone's) while rank 1's is empty.

I'm using `srun` on a machine with 4 A40s. Have you ever encountered anything similar?

xiaoqian-shen commented 2 weeks ago

We did not encounter this problem when running on H100s or A100s.

IceFlameWorm commented 1 day ago

I've hit the same issue; did you ever solve it? @geomlyd

geomlyd commented 1 day ago

Hello @IceFlameWorm, not exactly: I never managed to get FSDP to run properly. What I did instead, and what seems to work so far, is switch to DeepSpeed. I removed all FSDP-related arguments from the launch script and added the `--deepspeed $my_deepspeed_cfg_file` argument, which is passed through to the base class of HuggingFace's `Trainer`. In case it's helpful, I'm also attaching the .json config I used for DeepSpeed.

Note that I say *seems* to work because: a) I've been able to run some training experiments that did what I expected, but since my aim so far was not to reproduce the paper's results, I can't confirm that everything runs exactly as it should; b) I've noticed that at the very end of training, if a checkpoint resumption happened in between, the program crashes with some sort of Triton-related error message. Nevertheless, this seems to happen after the model has been saved to disk, so it doesn't appear to have serious negative effects. zero2.json
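For anyone who can't open the attachment: the file itself isn't reproduced in this thread, but a minimal ZeRO-2 config of the kind HuggingFace's `Trainer` accepts (with `"auto"` values filled in from the training arguments) typically looks like the sketch below. This is an illustrative example, not the exact attached zero2.json:

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The `"auto"` entries let the HF integration keep the DeepSpeed config consistent with the `TrainingArguments` instead of duplicating values in two places.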

IceFlameWorm commented 1 day ago

Thanks for your reply @geomlyd. I also noticed that although the training script crashes at the end, some checkpoint files do get saved. I tried to load the model from these files to run the quick inference code, but I got the following error: `ValueError: Trying to set a tensor of shape torch.Size([143251328]) in "weight" (which has shape torch.Size([152064, 3584])), this looks incorrect.` Do you have any solutions or suggestions? Thanks in advance.

IceFlameWorm commented 1 day ago

@xiaoqian-shen Neither of us has solved this issue yet; could you share your dev environment settings?

geomlyd commented 1 day ago

In my experience, your best bet is probably to use DeepSpeed; inference worked fine for me after training with it (and also whenever I ran inference with the published pretrained checkpoints, without any training).

IceFlameWorm commented 1 day ago

All right, your suggestion may be my only option at this point.

IceFlameWorm commented 23 hours ago

Hi @xiaoqian-shen, although an error occurred at line 1109 (`trainer.train()`) in `train.py` after finishing a finetune, some files were still saved, as shown here: screenshot-20241203-112530

But when I tried to load this finetuned model using the quick inference code, the following errors were thrown out:

Traceback (most recent call last):
  File "/data/home/agent_ln/projects/LongVU/infer.py", line 20, in <module>
    tokenizer, model, image_processor, context_len = load_pretrained_model(
  File "/data/home/agent_ln/projects/LongVU/longvu/builder.py", line 159, in load_pretrained_model
    model = CambrianQwenForCausalLM.from_pretrained(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3838, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4298, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 895, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 373, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([143251328]) in "weight" (which has shape torch.Size([152064, 3584])), this looks incorrect.
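For what it's worth, the shapes in that error are telling: the checkpoint holds a 1-D tensor of 143,251,328 elements where a [152064, 3584] matrix is expected, which is what you'd see if FSDP flat-parameter shards were written to disk instead of real per-parameter tensors. If the checkpoint was written as `.safetensors`, you can inspect shapes without loading any weights, because the format starts with an 8-byte little-endian header length followed by a plain JSON header. A stdlib-only sketch (the file path and tensor names below are placeholders, not taken from this repo):

```python
import json
import struct

def safetensors_shapes(path):
    """Return {tensor_name: shape} from a .safetensors file without loading weights.

    The format begins with an 8-byte little-endian header length, then a JSON
    header mapping tensor names to their dtype, shape, and data offsets.
    """
    with open(path, "rb") as f:
        (hdr_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(hdr_len))
    header.pop("__metadata__", None)  # optional metadata entry, not a tensor
    return {name: meta["shape"] for name, meta in header.items()}

def suspicious(shapes):
    """Flag large weights stored as 1-D tensors: a matrix-like weight saved flat
    (one long vector) suggests FSDP flat-param shards were dumped rather than
    the real per-parameter tensors that from_pretrained expects."""
    return [name for name, shape in shapes.items()
            if name.endswith("weight") and len(shape) == 1]

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        shapes = safetensors_shapes(sys.argv[1])
        for name in suspicious(shapes):
            print("possibly flattened:", name, shapes[name])
```

If the suspicious list is non-empty, the checkpoint on disk was never consolidated into full parameters, and no loader tweak will fix it; re-saving from a consolidated state dict (or training with DeepSpeed, as suggested above) is the way out.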