allenai / embodied-clip

Official codebase for EmbCLIP
https://arxiv.org/abs/2111.09888
Apache License 2.0

Questions about the embclip-habitat object nav training process #7

Closed · xuexidi closed this 2 years ago

xuexidi commented 2 years ago

Thank you very much for your wonderful work. I am very interested in embclip-habitat object nav.

I am trying to use your open-source code to reproduce the results of your paper on embclip-habitat object nav.

Some questions about the training process: I have trained embclip-habitat object nav for about 8.039M steps (22 hours on Ubuntu 18.04, a single GPU, NUM_ENVIRONMENTS: 10), and the reward curve seems to have changed very little. I even doubt whether it will keep rising. I'm sorry, but because of my school's network security policy I can't upload screenshots of my training curves, so I can only describe them in words.

The current metrics (after smoothing, at 8.039M steps / 22 hours, Ubuntu 18.04, single GPU, NUM_ENVIRONMENTS: 10) are:

1. Reward: 0.7912 (growing very slowly; I don't know whether it will keep growing)
2. Success: 0.01428 (growing very slowly; I don't know whether it will keep growing)
3. SPL: 6.0172e-3 (growing very slowly; I don't know whether it will keep growing)
4. SoftSPL: 0.137 (a clear upward trend, still growing)

Does this training progress look normal? Does the code require any additional tricks during training, or can I basically reproduce the results in your paper just by letting it train from start to finish? Could you share screenshots of your curves (reward, success, SPL, SoftSPL) from training embclip-habitat object nav?

apoorvkh commented 2 years ago

Please note that the Habitat models in our paper were trained for 250M steps. You should be able to reproduce the metrics reported in our paper if you train to completion with the code and instructions we've provided.

I would not watch the reward curve too closely. It's promising that SoftSPL is continuing to grow, which indicates that the agent is navigating closer to target objects on average. SR and SPL depend on successful completions of episodes, which will not occur often in the initial stages of training. Please train for a longer duration and observe the trends.
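For intuition, here is a minimal sketch of how SPL and SoftSPL are typically computed per episode (assuming the standard Habitat-style definitions; the function and variable names below are illustrative, not taken from the embclip-habitat code):

```python
# Minimal sketch of per-episode SPL and SoftSPL (standard Habitat-style
# definitions); names are illustrative, not from the embclip-habitat codebase.

def spl(success: bool, shortest_path: float, agent_path: float) -> float:
    """Success weighted by Path Length: zero unless the episode succeeded."""
    return float(success) * shortest_path / max(agent_path, shortest_path)

def soft_spl(start_dist: float, final_dist: float,
             shortest_path: float, agent_path: float) -> float:
    """SoftSPL replaces the binary success term with episode progress, so it
    increases whenever the agent ends closer to the goal, even without a
    successful STOP."""
    progress = max(0.0, 1.0 - final_dist / max(start_dist, 1e-6))
    return progress * shortest_path / max(agent_path, shortest_path)

# Example: an agent that halves its distance to the goal but never succeeds
# gets SPL = 0 but a positive SoftSPL.
print(spl(False, shortest_path=5.0, agent_path=8.0))            # 0.0
print(soft_spl(10.0, 5.0, shortest_path=5.0, agent_path=8.0))   # 0.3125
```

This is why SoftSPL can climb early in training while SR and SPL stay near zero.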

I'm closing this issue and you can re-open if you see unexpected results after a more substantial training period.

YicongHong commented 2 years ago

Hello Apoorv, may I ask in your experiments, what exactly is the hardware used (how many and what GPUs) for training and how long does the training take for ObjNav in Habitat and RoboTHOR? Thanks!

apoorvkh commented 2 years ago

Hi Yicong. We trained RoboTHOR ObjectNav models (200M steps) with 8 TITAN X GPUs for ~3 days. We trained Habitat ObjectNav models (250M steps) with the AWS g4dn.metal instance (8 T4 GPUs) for ~4 days.
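As a rough back-of-envelope sketch using only the numbers quoted in this thread (actual throughput varies with hardware, NUM_ENVIRONMENTS, and scene loading), this is how long a single-GPU run would take to reach 250M steps:

```python
# Back-of-envelope arithmetic from the numbers quoted in this thread: how long
# a single-GPU run (NUM_ENVIRONMENTS: 10) would take to reach 250M steps.

def days_to_target(steps_so_far: float, hours_so_far: float, target_steps: float) -> float:
    """Extrapolate wall-clock days needed to hit a step target at the current rate."""
    steps_per_hour = steps_so_far / hours_so_far
    return target_steps / steps_per_hour / 24.0

# Single GPU: ~8.039M steps in 22 hours -> ~101 steps/sec.
print(round(days_to_target(8.039e6, 22, 250e6), 1))  # ~28.5 days to 250M steps

# For comparison, the 8-GPU runs above: 250M steps in ~4 days ≈ 720 steps/sec,
# and 200M steps in ~3 days ≈ 770 steps/sec.
```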

YicongHong commented 2 years ago

Thank you very much for the info! It will be very helpful for us to choose the right hardware and build on your work! Cheers 😄

xuexidi commented 2 years ago

> Hi Yicong. We trained RoboTHOR ObjectNav models (200M steps) with 8 TITAN X GPUs for ~3 days. We trained Habitat ObjectNav models (250M steps) with the AWS g4dn.metal instance (8 T4 GPUs) for ~4 days.

By the way, I would like to know how much memory was used when training Habitat object navigation. I tried training it on a cloud server (8 Tesla V100 GPUs, 500G RAM), and after about an hour the program exited abnormally because memory was full, so I had to reduce the number of training scenes or the number of GPUs (training works normally with 7 Tesla V100 GPUs, using about 425G of RAM).

YicongHong commented 2 years ago

> Hi Yicong. We trained RoboTHOR ObjectNav models (200M steps) with 8 TITAN X GPUs for ~3 days. We trained Habitat ObjectNav models (250M steps) with the AWS g4dn.metal instance (8 T4 GPUs) for ~4 days.

> By the way, I would like to know how much memory was used when training Habitat object navigation. I tried training it on a cloud server (8 Tesla V100 GPUs, 500G RAM), and after about an hour the program exited abnormally because memory was full, so I had to reduce the number of training scenes or the number of GPUs (training works normally with 7 Tesla V100 GPUs, using about 425G of RAM).

Hi, I am also training it with the default configurations. It takes about 500G RAM.

xuexidi commented 2 years ago

> Hi Yicong. We trained RoboTHOR ObjectNav models (200M steps) with 8 TITAN X GPUs for ~3 days. We trained Habitat ObjectNav models (250M steps) with the AWS g4dn.metal instance (8 T4 GPUs) for ~4 days.

> By the way, I would like to know how much memory was used when training Habitat object navigation. I tried training it on a cloud server (8 Tesla V100 GPUs, 500G RAM), and after about an hour the program exited abnormally because memory was full, so I had to reduce the number of training scenes or the number of GPUs (training works normally with 7 Tesla V100 GPUs, using about 425G of RAM).

> Hi, I am also training it with the default configurations. It takes about 500G RAM.

It seems that needing 500G of RAM really is extreme. Perhaps because I made some adjustments to the original Habitat object navigation model from the paper, my RAM requirement is slightly higher, about 550G.

apoorvkh commented 2 years ago

Hi, yes, 500G is pretty extreme, but it seems like that's the cost of running Habitat with enough processes (environments) to fill each GPU. The g4dn.metal instance only had 384 GB of RAM, so I remember running into this issue. One solution is to reduce the number of environments to fit your RAM constraints. Another is to add swap space (where your server will use disk space as memory after it runs out of RAM); however, this will slow down your training.

xuexidi commented 2 years ago

> Hi, yes, 500G is pretty extreme, but it seems like that's the cost of running Habitat with enough processes (environments) to fill each GPU. The g4dn.metal instance only had 384 GB of RAM, so I remember running into this issue. One solution is to reduce the number of environments to fit your RAM constraints. Another is to add swap space (where your server will use disk space as memory after it runs out of RAM); however, this will slow down your training.

Thanks for your reply! In the end, I used 7 Tesla V100 GPUs with the number of environments set to 20, which uses about 427G of RAM, and everything is running well now ~
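For anyone planning a similar run, here is a rough back-of-envelope sketch of per-environment RAM based on the figures above; the base-overhead constant and the assumption that NUM_ENVIRONMENTS counts environments per GPU are guesses, not values from the codebase:

```python
# Rough per-environment RAM estimate from the figures in this thread, to help
# choose NUM_ENVIRONMENTS under a fixed RAM budget. All numbers are
# approximate; actual usage depends on the scene dataset and sensors, and this
# assumes NUM_ENVIRONMENTS counts simulator processes per GPU (a guess).

BASE_OVERHEAD_GB = 50.0  # assumed fixed overhead (trainer, dataset cache, etc.)

def gb_per_env(total_ram_gb: float, num_gpus: int, envs_per_gpu: int) -> float:
    """Estimate RAM per simulator process from one observed run."""
    return (total_ram_gb - BASE_OVERHEAD_GB) / (num_gpus * envs_per_gpu)

def max_envs_per_gpu(ram_budget_gb: float, num_gpus: int, per_env_gb: float) -> int:
    """How many environments per GPU fit within a RAM budget."""
    usable = ram_budget_gb - BASE_OVERHEAD_GB
    return int(usable // (per_env_gb * num_gpus))

# Observed above: 7 GPUs x 20 environments used ~427 GB.
per_env = gb_per_env(427, num_gpus=7, envs_per_gpu=20)
print(round(per_env, 1))                                      # ~2.7 GB per environment
print(max_envs_per_gpu(384, num_gpus=8, per_env_gb=per_env))  # ~15 per GPU on a g4dn.metal
```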