geomlyd opened this issue 2 weeks ago
We did not encounter this problem when running on H100s or A100s.
I ran into the same issue. Have you managed to solve it? @geomlyd
Hello @IceFlameWorm, not exactly: I never managed to get FSDP to run properly. What I did instead, and what seems to work so far, is switch to DeepSpeed. I did this by removing all FSDP-related arguments from the launching script and adding the --deepspeed $my_deepspeed_cfg_file argument, which is passed on to the base class of HuggingFace's Trainer. In case it's helpful, I will also upload the .json config I used for DeepSpeed here.
Note that I say "seems to work" because a) I've been able to run some training experiments that did what I expected, but my aim so far was not to reproduce the paper's results, so I can't confirm that everything runs exactly as it should, and b) I've noticed that at the very end of training, if a checkpoint resumption has happened in between, the program crashes with some sort of Triton-related error message. Nevertheless, this seems to happen after the model has been saved to disk, so it doesn't appear to have serious negative effects.
zero2.json
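For orientation only (this is not the attached file), a minimal sketch of what a ZeRO-2 config along these lines might look like; the keys are standard DeepSpeed options, and the "auto" values are placeholders that HuggingFace's DeepSpeed integration fills in from the Trainer arguments:

```python
# Illustrative only: a minimal ZeRO-2 DeepSpeed config, written out as JSON so it
# can be passed to the launcher via --deepspeed zero2.json. This is NOT the
# attached zero2.json, just a plausible stand-in.
import json

zero2_cfg = {
    "bf16": {"enabled": "auto"},          # let the Trainer decide bf16 vs fp32
    "zero_optimization": {
        "stage": 2,                        # shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

with open("zero2.json", "w") as f:
    json.dump(zero2_cfg, f, indent=2)
```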
Thanks for your reply @geomlyd. I also noticed that although the training script crashes at the end, some checkpoint files do get saved. I tried to load the model from these files to run the quick inference code, but I hit the following error: ValueError: Trying to set a tensor of shape torch.Size([143251328]) in "weight" (which has shape torch.Size([152064, 3584])), this looks incorrect. Do you have any solutions or suggestions? Thank you in advance.
@xiaoqian-shen Neither of us has solved this issue yet. Could you share your dev environment settings?
From my experience, your best bet is probably to try DeepSpeed; inference worked fine for me after training with it (and also whenever I ran inference with the published pretrained checkpoints without any training).
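For reference, a minimal sketch of the quick-inference load path (the function and its return values match the traceback further down; the checkpoint path and model name arguments below are assumptions on my part, not confirmed values):

```python
# Minimal sketch of the quick-inference load path via longvu/builder.py.
# The path and model-name arguments are hypothetical placeholders.
from longvu.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    "./checkpoints/longvu_qwen",  # hypothetical: directory with the saved/finetuned weights
    None,                         # assumed: no separate base model
    "cambrian_qwen",              # hypothetical model name
)
model.eval()
```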
All right, your suggestion may be the only solution left for me.
@xiaoqian-shen Hi, although an error occurred at line 1109 (trainer.train()) in train.py after the finetune finished, some files were still saved, like the following:
But when I tried to load this finetuned model using the quick inference code, the following errors were thrown:
Traceback (most recent call last):
  File "/data/home/agent_ln/projects/LongVU/infer.py", line 20, in <module>
    tokenizer, model, image_processor, context_len = load_pretrained_model(
  File "/data/home/agent_ln/projects/LongVU/longvu/builder.py", line 159, in load_pretrained_model
    model = CambrianQwenForCausalLM.from_pretrained(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3838, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4298, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/transformers/modeling_utils.py", line 895, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/data/home/agent_ln/miniconda3/envs/longvu/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 373, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([143251328]) in "weight" (which has shape torch.Size([152064, 3584])), this looks incorrect.
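Not a confirmed diagnosis, but a flattened 1-D tensor where a 2-D weight is expected is the kind of mismatch one can see when DeepSpeed ZeRO shard files are loaded directly instead of a consolidated state dict. If that is the case here, DeepSpeed's zero_to_fp32 utility can consolidate the shards into full-shape fp32 weights; a sketch, with a hypothetical checkpoint directory:

```python
# Illustrative only: consolidate DeepSpeed ZeRO shards into a full-shape fp32
# state dict before loading. "checkpoint-XXXX" is a hypothetical placeholder for
# the Trainer output directory containing the global_step*/ shard folder.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("./checkpoints/checkpoint-XXXX")

# Quick sanity check: weights should now have their full 2-D shapes.
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```

If DeepSpeed was used for training, there is usually also a zero_to_fp32.py script inside the checkpoint directory that does the same consolidation from the command line.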
This may be system-dependent and a bit of a long shot, but I'm having some issues running training. Everything goes seemingly well until _save_checkpoint is called (if it matters, I'm running a toy training session with a dataset of 10 samples), at which point I receive some kind of PyTorch synchronization error (originating from a gather op) indicating that, e.g., rank 0's optimizer state has a large number of parameters (presumably as many as the backbone's) while rank 1's is empty. I'm using srun on a machine with 4 A40s. Did you ever encounter anything similar?