cremebrule / digital-cousins

Codebase for Automated Creation of Digital Cousins for Robust Policy Learning
https://digital-cousins.github.io
Apache License 2.0
137 stars · 15 forks

How long it takes to train the model #14

Closed jingma-git closed 5 days ago

jingma-git commented 3 weeks ago

I ran the model on an RTX 4090, but it takes 10 hours to train 60 epochs. I used the following command:

python 3_train_policy.py \
--config ../digital_cousins/configs/training/bc_base.json \
--dataset test_demos.hdf5 \
--auto-remove-exp

I searched bc_base.json, and it shows num_epochs: 3000.

The maximum success rate during these 60 epochs is only 0.16:

Epoch 46 Rollouts took 48.938840579986575s (avg) with results:
Env: OpenCabinetWrapper
{
    "Exception_Rate": 0.0,
    "Horizon": 97.92,
    "Return": 0.16,
    "Success_Rate": 0.16,
    "Time_Episode": 20.391183574994404,
    "time": 48.938840579986575
}

I think this may be because I only collected 6 demos by running the following command:

python 1_collect_demos.py \
--scene_path ../tests/acdc_output/step_3_output/scene_0/scene_0_info.json \
--target_obj cabinet_4 \
--target_link link_1 \
--cousins bottom_cabinet,bamfsz,link_1 bottom_cabinet_no_top,vdedzt,link_0 \
--dataset_path test_demos.hdf5 \
--n_demos_per_model 3 \
--eval_cousin_id 0 \
--seed 0

Can you help me analyze why this is happening? It would be great if the authors could add the training details to the README.md. Thank you for your brilliant work!

jingma-git commented 3 weeks ago

Here is a visualization of the training loss:

andyaloha commented 3 weeks ago

@jingma-git The rollout process takes a long time. You can decouple the rollout rate from the checkpoint-save rate (https://github.com/ARISE-Initiative/robomimic/blob/9273f9cce85809b4f49cb02c6b4d4eeb2fe95abb/robomimic/scripts/train.py#L256) or modify the checkpoint-save rate (https://github.com/cremebrule/digital-cousins/blob/f1c699705d03ea30f3dabe97ba095eeaafafd1b3/digital_cousins/configs/training/bc_base.json#L15) to shorten training time.
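For reference, robomimic-style configs schedule rollouts and checkpoint saves under the experiment section. The fragment below is a hedged sketch of what reducing rollout frequency might look like; the exact keys and values in digital-cousins' bc_base.json may differ, so verify against the actual file before editing:

```json
{
  "experiment": {
    "save": {
      "enabled": true,
      "every_n_epochs": 500
    },
    "rollout": {
      "enabled": true,
      "n": 30,
      "rate": 500
    }
  }
}
```

Raising rollout "rate" (epochs between evaluation rounds) trades less frequent success-rate feedback for substantially shorter wall-clock training time.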

RogerDAI1217 commented 3 weeks ago

Hi @jingma-git, for our experiments, we train for 3000 epochs with 30 rollouts every 100 epochs. I would estimate about 20 hours on a 4090.

Thanks @andyaloha
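A rough back-of-the-envelope check of that 20-hour estimate, using the ~49 s average rollout time from the Epoch 46 log above and the 30-rollouts-per-100-epochs schedule; the per-epoch gradient-step time is an assumption, not a measured number:

```python
# Rough wall-clock estimate for 3000 training epochs with periodic rollouts.
# Rollout numbers come from this thread; per-epoch train time is an ASSUMPTION.

epochs = 3000
rollout_every = 100          # epochs between rollout rounds
rollouts_per_round = 30      # "30 rollouts per 100 epochs"
secs_per_rollout = 49        # ~48.9 s avg, from the Epoch 46 log above
secs_per_train_epoch = 10    # ASSUMPTION: pure gradient-step time per epoch

rollout_rounds = epochs // rollout_every                          # 30 rounds
rollout_secs = rollout_rounds * rollouts_per_round * secs_per_rollout
train_secs = epochs * secs_per_train_epoch
total_hours = (rollout_secs + train_secs) / 3600

print(f"rollout time: {rollout_secs / 3600:.1f} h")  # ~12.2 h of rollouts alone
print(f"total:        {total_hours:.1f} h")          # ~20.6 h, near the estimate
```

Most of the budget goes to rollouts, which is why decoupling the rollout rate (as suggested above) shortens training so much.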

RogerDAI1217 commented 3 weeks ago

@jingma-git The exact convergence speed depends on how many digital cousins you use for training and how much shape/orientation/position/point-cloud randomization you apply.

If you use the default settings with 4 digital cousins whose geometric affordances are similar to the target object's, it should converge within 2000-3000 epochs.

cremebrule commented 2 weeks ago

FYI @jingma-git we trained with 10,000 demonstrations total to achieve our results, so you'll probably have much better success by increasing the number of collected demos!
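To put the gap in perspective, here is a quick calculation under the assumption that collected demos scale as cousins × n_demos_per_model (flag names taken from the 1_collect_demos.py command earlier in this thread; the even per-cousin split of the 10,000 total is an assumption):

```python
# Demo counts: what was collected in this issue vs. what the authors report.
# Variable names mirror the 1_collect_demos.py CLI flags quoted above.

n_cousins = 2                 # two cousins passed to --cousins
n_demos_per_model = 3         # --n_demos_per_model 3
collected = n_cousins * n_demos_per_model
print(collected)              # 6 demos, matching the issue report

target_total = 10_000         # total demos the authors trained with
# ASSUMPTION: with the 4-cousin default mentioned above and an even split,
# each cousin would need roughly:
per_model_needed = target_total // 4
print(per_model_needed)       # 2500
```

So the collected dataset is three orders of magnitude smaller than what produced the reported results, which is consistent with the low 0.16 success rate.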

cremebrule commented 5 days ago

Closing this issue for now as there's been no response for a few weeks. Feel free to re-open if you continue to run into issues!