Training CUDA Out Of Memory error

Kev1MSL commented 5 months ago

Hi! I am trying to train the InstantMesh model, but I get a CUDA out of memory error just before backpropagation. Did you face a similar issue when training, and if so, how did you solve it? I am training on 8 GPUs with the same memory as the H800, as described in the paper. Thanks!

sumanttyagi commented 5 months ago

Please check your CUDA devices.

gaodalii commented 5 months ago

I am using a single A800 (80G), but I can only train with batch_size=1. If I set batch_size=2, I also get a CUDA out of memory error.

Kev1MSL commented 5 months ago

Yes, same here: it works with batch_size=1 but not with batch_size=2. However, I am only short by a few GB (~2GB), so I was wondering whether there is a way to optimize this. Also, what happens if I distribute the training across multiple GPUs? If I set batch_size=1, will each GPU get a batch of 1, or will that single batch be split across the GPUs?

Because if the batch size is 1, wouldn't we have trouble converging?
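
Note on the two questions above: in typical data-parallel training (for example PyTorch DistributedDataParallel), batch_size is counted per GPU, so each GPU processes its own batch and the gradients are averaged across GPUs. Whether this repository's config interprets batch_size that way has not been confirmed in this thread, so treat it as an assumption. When memory only allows a per-GPU batch of 1, gradient accumulation is a common way to get a larger effective batch without more memory. The sketch below is plain Python on a toy 1-D model, not the InstantMesh training code; all function names are hypothetical. It shows that averaging the gradients of equally sized micro-batches gives the same gradient as one large batch.

```python
def effective_batch_size(per_gpu_batch, num_gpus, accum_steps=1):
    """Samples contributing to one optimizer step under data parallelism.

    Assumes batch_size is counted per GPU (hypothetical helper).
    """
    return per_gpu_batch * num_gpus * accum_steps


def grad_mse(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Average the gradients of equally sized micro-batches.

    This mirrors gradient accumulation: each micro-batch contributes its
    gradient scaled by 1 / number_of_micro_batches, and the optimizer
    would step only once at the end. The equality with the full-batch
    gradient holds when len(batch) is a multiple of micro_batch_size.
    """
    chunks = [batch[i:i + micro_batch_size]
              for i in range(0, len(batch), micro_batch_size)]
    return sum(grad_mse(w, chunk) for chunk in chunks) / len(chunks)


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
full = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=1)

print(effective_batch_size(per_gpu_batch=1, num_gpus=8))  # 8
print(abs(full - accumulated) < 1e-12)
```

Under that assumption, batch_size=1 on 8 GPUs already gives 8 samples per optimizer step, and accumulating over several steps raises that further at the cost of fewer optimizer updates per epoch.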

Mrguanglei commented 5 months ago

@Kev1MSL Hello, I ran into several problems during training. My dataset is structured as shown in the pictures, but I do not know how to write the training config file. Could you help me? Thank you very much for your reply.

[Attached images: 微信图片_20240531213352, 微信图片_20240531213146]

throb081 commented 5 months ago

@Kev1MSL Hello, I am trying to run the training process, but I don't know how to construct the dataset. Could I have a look at your dataset structure? Thank you very much for your reply.

fffh1 commented 4 months ago

@Kev1MSL Hello, may I ask whether you made any changes to the code? I am training the model on an A100 GPU and cannot even train with batch_size=1.

ustbzgn commented 4 months ago

@fffh1 Hello, did you solve it? I have the same problem.

fffh1 commented 4 months ago

Hi, check the dimension of your depth images: they should have one channel rather than RGB or RGBA. Regards, Feng
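
A minimal sketch of the check described above, using NumPy. The function name is hypothetical and this is not code from the InstantMesh repository. It also assumes that a depth map saved as RGB or RGBA carries the depth value in its first channel, which should be verified against your own data before relying on it.

```python
import numpy as np


def to_single_channel_depth(depth):
    """Return an (H, W) depth map from an (H, W) or (H, W, C) array, C in {1, 3, 4}.

    Assumption: when the file was saved as RGB/RGBA, the depth value is
    stored in the first channel, so only that channel is kept.
    """
    depth = np.asarray(depth)
    if depth.ndim == 2:
        return depth  # already single channel
    if depth.ndim == 3 and depth.shape[2] in (1, 3, 4):
        return depth[..., 0]
    raise ValueError(f"unexpected depth shape: {depth.shape}")


# Toy RGBA-style depth map: the same value repeated in all 4 channels.
rgba_depth = np.repeat(np.arange(6, dtype=np.float32).reshape(2, 3, 1), 4, axis=2)
single = to_single_channel_depth(rgba_depth)
print(rgba_depth.shape, "->", single.shape)  # (2, 3, 4) -> (2, 3)
```

Printing the shape of a few loaded depth images before training is a quick way to confirm whether they are single channel.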

ustbzgn commented 4 months ago

Thank you very much.

TencentARC / InstantMesh

Training CUDA Out Of Memory error #98