Closed · jingma-git closed this issue 5 days ago
Here is the visualization of the training loss
@jingma-git The rollout process takes a long time. You can decouple the rollout rate from the checkpoint-saving rate (https://github.com/ARISE-Initiative/robomimic/blob/9273f9cce85809b4f49cb02c6b4d4eeb2fe95abb/robomimic/scripts/train.py#L256) or modify the checkpoint-saving rate (https://github.com/cremebrule/digital-cousins/blob/f1c699705d03ea30f3dabe97ba095eeaafafd1b3/digital_cousins/configs/training/bc_base.json#L15) to shorten training time.
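To illustrate the decoupling, here is a minimal sketch in Python. The `config` dict below is a hypothetical stand-in for the relevant slice of `bc_base.json` (the real file has many more keys), and the key names (`experiment.rollout.rate`, `experiment.save.every_n_epochs`) follow robomimic's config layout as linked above — verify them against your robomimic version before applying.

```python
import json

# Hypothetical stand-in for the relevant portion of bc_base.json;
# the real robomimic training config contains many more fields.
config = {
    "experiment": {
        "rollout": {"enabled": True, "rate": 50},        # rollout every 50 epochs
        "save": {"enabled": True, "every_n_epochs": 50}, # checkpoint every 50 epochs
    },
    "train": {"num_epochs": 3000},
}

# Decouple the two rates: keep checkpoints frequent, but roll out rarely,
# since rollouts dominate the wall-clock time of each training cycle.
config["experiment"]["rollout"]["rate"] = 100        # evaluate every 100 epochs
config["experiment"]["save"]["every_n_epochs"] = 50  # still save every 50 epochs

print(json.dumps(config["experiment"], indent=2))
```

With rollouts at a lower rate than checkpointing, you still keep frequent model snapshots to evaluate later, without paying the rollout cost every save.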
Hi @jingma-git, for our experiments, we train for 3000 epochs with 30 rollouts per 100 epochs. I would estimate around 20 hours on a 4090.
Thanks @andyaloha
@jingma-git The exact convergence speed depends on how many digital cousins you use for training and how much shape/orientation/position/point-cloud randomization you apply.
If you use the default settings with 4 digital cousins whose geometric affordances are similar to the target object's, it will converge within 2000-3000 epochs.
FYI @jingma-git we trained with 10,000 demonstrations total to achieve our results, so you'll probably have much better success by increasing the number of collected demos!
Closing this issue for now as there's been no response for a few weeks. Feel free to re-open if you continue to run into issues!
I ran the model on an RTX 4090, but it takes 10 hours to train 60 epochs. I used the following command
I searched bc_base.json, and it shows `num_epochs: 3000`
And the maximum success rate during these 60 epochs is only 0.16
I think this may be because I only collected 6 demos by running the following command
Can you help me analyze why this is happening? It would be much better if the authors could add the training details to the README.md. Thank you for your brilliant work!