Query Regarding Training Time and Default Parameters on RLBench Task with 8 A100 GPUs

NVlabs / RVT

Official Code for RVT-2 and RVT

https://robotic-view-transformer-2.github.io/

Query Regarding Training Time and Default Parameters on RLBench Task with 8 A100 GPUs #8

Closed BeckywithYaoji closed 1 year ago

BeckywithYaoji commented 1 year ago

Hello there, I hope you're doing well. I wanted to share my experience with your project. I've been using 8 A100 GPUs, each with 40 GB of memory, to train on an RLBench task with the default parameters you provide. However, a single epoch takes around 12 hours, so I'm wondering whether there is an issue somewhere in my setup. I'd greatly appreciate your insights and guidance on this matter. Thank you for your time and effort in developing this project. Looking forward to your response.

imankgoyal commented 1 year ago

Hi @BeckywithYaoji,

This is very different from what we observed, so I would guess there is some issue in your setup. For reference, we used 8 V100 GPUs with 16 GB of memory each, and the training time was ~1.5 hours per epoch.

I would suggest first making sure the GPUs are actually being used (you can check with the command nvidia-smi). Another thing to look into is the data loading time, to make sure there is no I/O bottleneck. Let me know what you find out.
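One way to check for a data loading bottleneck is to time how long the loop waits for each batch versus how long the training step itself takes. The following is a minimal sketch, not code from the RVT repository: `profile_loader`, `slow_batches`, and the `step` callback are hypothetical names introduced for illustration, and the sketch assumes your batches come from some Python iterable (for example a PyTorch DataLoader).

```python
import time
from typing import Any, Callable, Iterable, Iterator, Tuple


def profile_loader(
    batches: Iterable[Any],
    step: Callable[[Any], None],
    max_steps: int = 50,
) -> Tuple[float, float]:
    """Return (seconds spent waiting for data, seconds spent in step).

    Hypothetical helper: `batches` is any iterable of batches and `step`
    is whatever runs one training iteration on a batch.
    """
    data_time = 0.0
    step_time = 0.0
    it = iter(batches)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data loading / I/O
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On a GPU, `step` should end with torch.cuda.synchronize(),
        # otherwise asynchronous kernels make the step look too fast.
        step(batch)
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time


def slow_batches(n: int, delay: float) -> Iterator[int]:
    """Toy stand-in for a loader that is limited by slow I/O."""
    for i in range(n):
        time.sleep(delay)
        yield i


if __name__ == "__main__":
    data_s, step_s = profile_loader(slow_batches(5, 0.01), lambda b: None)
    print(f"data wait: {data_s:.3f}s, step: {step_s:.3f}s")
```

If the data wait dominates the step time, the GPUs are mostly idle waiting for input, which points toward storage speed or the number of loader workers rather than the model or the GPUs.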

imankgoyal commented 1 year ago

Closing because of inactivity. Please feel free to reopen.