Open Kev1MSL opened 5 months ago
Please check your CUDA devices.
I am using a single A800 (80 GB), but I can only train with batch_size=1. If I set batch_size=2, I also get a CUDA out-of-memory error.
Yes, same here: batch_size=1 works, but batch_size=2 does not. I am only short by a few GB (~2 GB), though, so I was wondering whether there is a way to optimize memory usage. Also, what happens if I distribute the training across multiple GPUs with batch_size=1: does each GPU get its own batch of 1, or is the single batch split across the GPUs?
I ask because with a batch size of 1, wouldn't we have trouble converging?
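On the two questions above, a rough sketch may help. It assumes the usual PyTorch data-parallel (DDP) setup, in which every process builds its own DataLoader, so batch_size is counted per GPU; I have not verified that this repository's config follows that convention. The function names below are made up for illustration. The second half checks, on a toy one-parameter model, that averaging the gradients of equal-sized micro-batches gives the same gradient as one large batch, which is the idea behind gradient accumulation.

```python
def effective_batch_size(per_gpu_batch, num_gpus, accumulate_steps=1):
    """Samples contributing to one optimizer step (per-GPU batch assumed)."""
    return per_gpu_batch * num_gpus * accumulate_steps


def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, micro_batches):
    """Average the micro-batch gradients, as gradient accumulation does."""
    total = 0.0
    for micro_batch in micro_batches:
        # Scale each contribution so the sum equals the mean over all steps.
        total += grad_mse(w, micro_batch) / len(micro_batches)
    return total


# Toy data: (x, y) pairs, hypothetical values for illustration only.
data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]

full = grad_mse(0.5, data)                            # one batch of 4
accum = accumulated_grad(0.5, [[s] for s in data])    # four batches of 1

print(effective_batch_size(1, 8))       # 8 GPUs, batch_size=1 each
print(effective_batch_size(1, 8, 4))    # same, accumulating over 4 steps
print(abs(full - accum) < 1e-9)         # gradients agree up to rounding
```

If the training script is built on PyTorch Lightning, the Trainer accepts an accumulate_grad_batches argument that does this accumulation without increasing peak memory per step. The equivalence holds for per-sample losses like the one above; layers that compute statistics over the batch (for example BatchNorm) would still see only the micro-batch.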
@Kev1MSL Hello, I ran into several problems during training. The structure of my dataset is as shown in the picture, but I do not know how to write the training configuration file. Could you help me? Thank you very much for your reply.
@Kev1MSL Hello, I am trying to run the training process, but I don't know how to construct the dataset. Could I have a look at the structure of your dataset? Thank you very much for your reply.
@Kev1MSL Hello, may I ask whether you made any changes to the code? I am training the model on an A100 GPU and cannot even train with batch size = 1.
@fffh1 Hello, did you solve it? I ran into the same problem.
Hi, check the channel dimension of your depth images: it should be one channel rather than RGB or RGBA. Regards, Feng
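A quick way to act on this suggestion is to look at the array shape of each depth image before training. The helper names below are hypothetical and not part of the repository; the shape would come from something like `numpy.asarray(PIL.Image.open(path)).shape`, which is (H, W) for a single-channel image and (H, W, C) otherwise.

```python
def depth_channels(shape):
    """Return the channel count for an image shape (H, W) or (H, W, C)."""
    if len(shape) == 2:
        return 1          # plain 2-D array: one channel
    if len(shape) == 3:
        return shape[2]   # trailing axis holds the channels
    raise ValueError(f"unexpected image shape: {shape}")


def check_depth_shape(shape):
    """Raise if a depth map is not single-channel (e.g. saved as RGB/RGBA)."""
    channels = depth_channels(shape)
    if channels != 1:
        raise ValueError(
            f"depth map has {channels} channels, expected 1 (shape={shape})"
        )
    return True


print(depth_channels((512, 512)))       # single-channel depth map
print(depth_channels((512, 512, 4)))    # RGBA image: four channels
print(check_depth_shape((512, 512, 1)))
```

An RGBA depth map holds four values per pixel instead of one, so a mismatch here is worth ruling out before looking at other causes.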
(Sent by email in reply to ustbzgn's message of Saturday, July 13, 2024, 6:26 PM, on TencentARC/InstantMesh issue #98, "Training CUDA Out Of Memory error": https://github.com/TencentARC/InstantMesh/issues/98#issuecomment-2226820606)
Thank you very much.
Hi! I am trying to train the InstantMesh model, but I am getting a CUDA out-of-memory error just before backpropagation. Have you faced a similar issue when training, and if so, how did you solve it? I am training on 8 GPUs with the same memory as the H800, as described in the paper. Thanks!