Skype-line / X-Avatar

X-Avatar: Expressive Human Avatars (CVPR2023)
https://skype-line.github.io/projects/X-Avatar/

CUDA out of memory when training an epoch after val #4

Closed: 1211186431 closed this issue 1 year ago

1211186431 commented 1 year ago

Hello, I have successfully run the demo, and the results are good. But I encountered an issue while training on a 3090, and I hope you can help me resolve it. I have modified the value of datamodule.dataloader.batch_size, and I am able to train for one epoch successfully. But when the check_val_every_n_epoch epoch is reached, GPU memory usage jumps from 3 GB to 22 GB or even higher, which prevents me from saving the model and continuing the training.

Skype-line commented 1 year ago

You can reduce the “res_up” parameter here.

1211186431 commented 1 year ago

Thank you, now I can start training. I found that, at test time, the model trained with SMPL requires more GPU memory than the one trained with SMPL-X. Is there any way to reduce GPU memory usage when running test.py? Apart from that, in demo.py only the SMPL-X model can be bound to actions. If you could provide a demo for binding actions to the SMPL model, I would greatly appreciate it.

Skype-line commented 1 year ago
  1. May I ask how long you trained the model with SMPL before testing? I think the problem is that our SMPL-X version uses a pretrained template model which already has a roughly human shape, while the SMPL version is trained from scratch. If the SMPL version is not fully trained, the isosurface extraction will query more points and therefore use more GPU memory. If the model is trained enough, SMPL should not use more GPU memory than SMPL-X during testing. Of course, you can also try to reduce batch_points or the resolution (see the sketch after this list).
  2. The reason why we don't provide an SMPL version of demo.py is that we don't have SMPL actions for the demo sequence. But you can either convert from SMPL-X to SMPL with the official model transfer code, or directly use PyMAF to obtain the SMPL parameters. Then you can easily modify demo.py to run with SMPL parameters by taking test.py as a reference.
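For what the batch_points idea looks like in practice, here is a minimal, hypothetical sketch of querying the implicit network in fixed-size chunks so that peak GPU memory stays bounded no matter how many points the isosurface extraction requests (the function name, model interface, and chunk size are illustrative, not the repository's actual code):

```python
import torch

def query_in_chunks(model, points, batch_points=100_000):
    """Evaluate `model` on `points` (N, 3) in chunks of `batch_points`.

    Smaller `batch_points` lowers peak GPU memory at the cost of more
    forward passes; `model` here is a stand-in for the occupancy/SDF net.
    """
    outputs = []
    with torch.no_grad():  # no gradients are needed for mesh extraction
        for chunk in torch.split(points, batch_points, dim=0):
            outputs.append(model(chunk))
    return torch.cat(outputs, dim=0)
```

Roughly speaking, halving batch_points halves the peak activation memory of this loop.
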
1211186431 commented 1 year ago

Thank you for your reply. I know how to solve the problems I encountered during training and testing.

AryamanSharma17 commented 1 year ago

Hello @1211186431, I am facing the same issue despite reducing batch_size=1 and points_per_frame=30. Can you let me know how you completed the training and the validation?

1211186431 commented 1 year ago

> Hello @1211186431, I am facing the same issue despite reducing batch_size=1 and points_per_frame=30. Can you let me know how you completed the training and the validation?

Hope my experience is useful to you. First of all, which GPU do you use for training? I use a 3090 for training and testing. I suggest that you do not set points_per_frame too small; if you use scans as input, it is recommended to keep it at 6000, the same as the initial value. If points_per_frame is too small, the training result may be much worse, so more mesh points are generated later and more GPU memory is consumed. In addition, you can comment out the plot in val, which will also reduce the memory consumption of val.

felixshing commented 1 year ago

> Hope my experience is useful to you. First of all, which GPU do you use for training? I use a 3090 for training and testing. I suggest that you do not set points_per_frame too small; if you use scans as input, it is recommended to keep it at 6000, the same as the initial value. If points_per_frame is too small, the training result may be much worse, so more mesh points are generated later and more GPU memory is consumed. In addition, you can comment out the plot in val, which will also reduce the memory consumption of val.

Hello, thank you for your suggestions! I am using a single A40 GPU to train the RGB-D-based model and also hit the out-of-memory problem, even after two epochs, as shown below:

*(screenshot: CUDA out-of-memory error)*

Somehow it seems to take 20 GB of GPU memory per epoch and does not release that memory after each epoch... Did you also meet this problem? Regarding the plot in val, do you mean


```python
plot_res = self.plot(data, res=128)
img_all = plot_res['img_all']
self.logger.experiment.log({"vis": [wandb.Image(img_all)]})
```

in xavatar_basic_model.py?

1211186431 commented 1 year ago

> Hello, thank you for your suggestions! I am using a single A40 GPU to train the RGB-D-based model and also hit the out-of-memory problem, even after two epochs, as shown below:
>
> *(screenshot: CUDA out-of-memory error)*
>
> Somehow it seems to take 20 GB of GPU memory per epoch and does not release that memory after each epoch... Did you also meet this problem? Regarding the plot in val, do you mean
>
> ```python
> plot_res = self.plot(data, res=128)
> img_all = plot_res['img_all']
> self.logger.experiment.log({"vis": [wandb.Image(img_all)]})
> ```
>
> in xavatar_basic_model.py?

Yes, the plot I referred to is indeed the part you mentioned. For the first question, I believe that in the val section meshes are generated and saved, which might consume more GPU memory. Similarly, the plot section renders images from those meshes, which also consumes more GPU memory. That's why I commented it out. The code section that generates the meshes is `def extract_mesh(self, smpl_verts, smpl_tfs, smpl_thetas, smpl_exps, canonical=False, with_weights=False, res_up=2, fast_mode=False):`.
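For readers who want to try the same thing, a rough sketch of what disabling the visualization in the validation step could look like; the class and method structure below are assumed, and only the three plot/log lines come from the snippet quoted above:

```python
import torch

class XAvatarModelSketch:  # hypothetical stand-in for the Lightning module
    def validation_step(self, data, batch_idx):
        # ... loss / metric computation left unchanged ...

        # Visualization disabled to save GPU memory: self.plot() extracts a
        # mesh and renders it, which is what drives the validation-time spike.
        # plot_res = self.plot(data, res=128)
        # img_all = plot_res['img_all']
        # self.logger.experiment.log({"vis": [wandb.Image(img_all)]})

        # Optionally release cached allocations once validation work is done.
        torch.cuda.empty_cache()
```
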

felixshing commented 1 year ago

> Yes, the plot I referred to is indeed the part you mentioned. For the first question, I believe that in the val section meshes are generated and saved, which might consume more GPU memory. Similarly, the plot section renders images from those meshes, which also consumes more GPU memory. That's why I commented it out. The code section that generates the meshes is `def extract_mesh(self, smpl_verts, smpl_tfs, smpl_thetas, smpl_exps, canonical=False, with_weights=False, res_up=2, fast_mode=False):`.

Thank you very much! Yes, you are right, the error occurs in the val section. After commenting out the plot, I have successfully trained a third epoch. May I ask you a few more questions?

First, what is the meaning of res_up? It seems it eventually calls `mesh_extractor = mise.MISE(res_init, res_up, level_set)`. MISE should be multiresolution isosurface extraction, so I guess res_up means upsampling the resolution of the generated mesh? The resolution of the original mesh is 64; if res_up = 3, then the generated mesh is 64 * 2 * 2 * 2 = 512. Am I correct? I found that in the original code sometimes res_up = 3 and sometimes res_up = 4. Do you know how we should determine this?

Second, how many epochs will be trained, and how can I tune that? The only parameter that seems related to the training length is max_steps in experiments.trainer, but its value is 45,000. Does that mean we need to train for 45,000 epochs? Or does each step correspond to training on one frame?
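If the trainer config is passed straight through to PyTorch Lightning, max_steps should count optimizer steps (one batch each) rather than epochs; a purely illustrative back-of-envelope, with the dataset size below made up:

```python
# Assuming experiments.trainer.max_steps ends up as pl.Trainer(max_steps=...).
max_steps = 45_000          # optimizer steps (one batch each), not epochs
frames_in_dataset = 500     # hypothetical number of training frames
batch_size = 1              # as in the settings discussed above

steps_per_epoch = frames_in_dataset // batch_size
approx_epochs = max_steps / steps_per_epoch
print(f"~{approx_epochs:.0f} epochs for this example dataset")  # ~90
```
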

Third, how should I determine which checkpoint to use for testing/generating a new avatar? There are a lot of losses and I am kind of lost.

felixshing commented 1 year ago

@1211186431 Hey, sorry to bother you again. May I ask, have you successfully trained X-Avatar on your own dataset? I tried to do that, but somehow the inference results are not even a human body; they look like the following:

*(attached image: xavatar_outputs00)*

1211186431 commented 1 year ago

First, regarding the issue with res_up, I think you are correct: it is used to set the resolution of the output mesh. res_up=4 renders a clearer mesh than res_up=3, but it also consumes more GPU memory. As for how to set it, I believe it is a matter of personal preference, since the visualization results during validation should not affect the network training.
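In other words, with a base grid of 64 the final grid roughly follows res_init * 2**res_up, which is where both the extra detail and the extra memory come from; a tiny illustration (the variable names mirror the call quoted above, the rest is just arithmetic):

```python
res_init = 64  # base resolution of the initial marching-cubes grid
for res_up in (2, 3, 4):
    final_res = res_init * 2 ** res_up
    # memory and time grow roughly with the number of voxels near the surface
    print(f"res_up={res_up} -> final grid about {final_res}^3")
```
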

Next, concerning training on your own dataset, I'm sorry to say that I haven't done that yet. However, I am interested in your results. Could you please let me know what format your own dataset is in? Perhaps I can analyze the reasons behind your results based on that.

felixshing commented 1 year ago

@1211186431 Thanks! I would love to share my data with you. Would you mind sending your email address or WeChat (if you have one) to my email, rcheng4@gmu.edu? My dataset contains some private information, so I am not willing to post it online.

Regarding the error, I have located the reason. After debugging, I found that the issue might stem from the part-type prediction for each point: all points are being predicted as 'body', so the other parts such as the face and hands are omitted. Specifically, this code lies in the "if self.opt.category_sample" branch of XHumans_rgbd.py. But I don't know why my dataset triggers this issue, and I am still debugging it.
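A quick, hypothetical sanity check for that kind of problem is to count how many points end up under each part label before the category-based sampling branch runs; the array name below is a placeholder, not the actual variable in XHumans_rgbd.py:

```python
import numpy as np

def summarize_part_labels(part_labels):
    """Print how many sampled points carry each part label.

    `part_labels` stands in for whatever per-point label array the
    dataloader builds; if every point maps to the 'body' id, the face and
    hand branches of category_sample never receive any points.
    """
    labels, counts = np.unique(np.asarray(part_labels), return_counts=True)
    for label, count in zip(labels, counts):
        print(f"label {label}: {count} points")
```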