RuntimeError: Error(s) in loading state_dict for DistributedDataParallel

BAAI-DCAI / SegVol

The official code for "SegVol: Universal and Interactive Volumetric Medical Image Segmentation".

MIT License


RuntimeError: Error(s) in loading state_dict for DistributedDataParallel #7

Open Wanyidon opened 11 months ago

Wanyidon commented 11 months ago

Thank you very much for open-sourcing this great work. I ran into the following error when training the model with the training steps, dataset, and weights you provided: RuntimeError: Error(s) in loading state_dict for DistributedDataParallel: Unexpected key(s) in state_dict: "module.text_encoder.clip_text_model.text_model.embeddings.position_ids". I sincerely hope to receive your reply.

Yuxin-Du-Lab commented 11 months ago

Could you provide more details about the version of your 'transformers' package? We recommend running SegVol with 'transformers==4.18.0'.
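For readers who want to answer the version question quickly, here is a minimal sketch that reports the installed 'transformers' version using only the Python standard library. The helper names `installed_version` and `matches_pin` are hypothetical and not part of SegVol; 4.18.0 is the version recommended in this thread.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(package: str, expected: str) -> bool:
    """True only if `package` is installed at exactly the `expected` version."""
    return installed_version(package) == expected


if __name__ == "__main__":
    # 4.18.0 is the version recommended by the maintainer in this thread.
    print("transformers:", installed_version("transformers"))
    print("matches 4.18.0:", matches_pin("transformers", "4.18.0"))
```

If the check reports a different version, reinstalling with `pip install transformers==4.18.0` is the fix suggested here.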

Wanyidon commented 11 months ago

Thank you very much for your reply. I have resolved the issue as you suggested, and I also found that passing strict=False to load_state_dict resolves it as well. However, while the code is running, only gpu7 is heavily utilized, with 60G of GPU memory, while gpu0-6 take up only 5G of GPU memory.

Yuxin-Du-Lab commented 11 months ago

I strongly recommend against passing 'strict=False' to load_state_dict, as this can leave some parameters randomly initialized. As for the unbalanced GPU memory usage, please double-check whether other programs are running or whether there are unkilled zombie processes in the background.
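The recommended fix in this thread is to pin the 'transformers' version. As a sketch of a narrower alternative to 'strict=False', and assuming the `position_ids` key from the error message is the only mismatch, one could drop just that key from the checkpoint and still load strictly, so that any other mismatch keeps raising an error. The helper `drop_keys_with_suffix` and the toy checkpoint below are hypothetical illustrations, not SegVol code.

```python
from typing import Any, Dict, Iterable, List, Tuple


def drop_keys_with_suffix(
    state_dict: Dict[str, Any], suffixes: Iterable[str]
) -> Tuple[Dict[str, Any], List[str]]:
    """Return a copy of state_dict without keys ending in any suffix, plus the dropped keys."""
    suffixes = tuple(suffixes)
    kept = {k: v for k, v in state_dict.items() if not k.endswith(suffixes)}
    dropped = [k for k in state_dict if k.endswith(suffixes)]
    return kept, dropped


# Toy stand-in for a checkpoint; in a real checkpoint the values are tensors.
checkpoint = {
    "module.text_encoder.clip_text_model.text_model.embeddings.position_ids": [0, 1, 2],
    "module.text_encoder.clip_text_model.text_model.embeddings.token_embedding.weight": [0.1],
}
filtered, dropped = drop_keys_with_suffix(checkpoint, ["embeddings.position_ids"])
print("dropped:", dropped)

# With PyTorch, the filtered dict could then be loaded strictly (assumption:
# `model` is the already constructed DistributedDataParallel-wrapped model):
#   model.load_state_dict(filtered, strict=True)
```

Printing the dropped keys makes it visible exactly what was skipped, which 'strict=False' on its own does not enforce.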

Wanyidon commented 11 months ago

Thank you for your advice. I have checked my program, and my current GPU running status is as shown in the image. I wonder if this is correct. [Image: screenshot of GPU status]

Yuxin-Du-Lab commented 10 months ago

Have you fixed the bug yet? I'm sorry I can't reproduce it. I don't know if anyone else has been in a similar situation.🤦

kennyWJB commented 5 months ago

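Hello, I am trying to reproduce the demo and ran into the same error. I am on Windows with cuda==12.2, pytorch==2.0.1, and monai==1.3.1 (because installing 0.9.0 failed); everything else matches the recommended versions. For now I am also working around the problem with strict=False.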

Yuxin-Du-Lab commented 5 months ago

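> Hello, I am trying to reproduce the demo and ran into the same error. I am on Windows with cuda==12.2, pytorch==2.0.1, and monai==1.3.1 (because installing 0.9.0 failed); everything else matches the recommended versions. For now I am also working around the problem with strict=False.

Please make sure you are using 'transformers==4.18.0' and that the checkpoint being loaded is SegVol_v1.pth.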

mrokuss commented 5 months ago

transformers==4.25.1 also works and does not fail with the pip installation ;)

Yuxin-Du-Lab commented 3 months ago

> Thank you for your advice. I have checked my program, and my current GPU running status is as shown in the image. I wonder if this is correct.

The problem may be that the batch_size is too large for the limited dataset size. Try reducing the batch_size.