UT-Austin-RPL / amago

a simple and scalable agent for training adaptive policies with sequence-based RL
https://ut-austin-rpl.github.io/amago/
MIT License

Training speed #63

Closed · Lan131 closed this 1 day ago

Lan131 commented 1 week ago

I was wondering how fast this is supposed to run. Over 6 hours it has only trained 2k steps according to the wandb output, and I have an A6000.

I want to run it for 1 million env steps. I believe that in Meta-World the max timesteps per episode is 150. What is the parallel_actors argument doing? Maybe that's slowing it down.

python ./examples/07_metaworld.py --run_name metaworld_ml45 --benchmark ml45 --buffer_dir buffer --parallel_actors 40 --memory_size 320 --timesteps_per_epoch 150 --agent_type multitask --max_seq_len 256 --memory_layers 3 --dset_max_size 25000 --epochs 6667 --val_interval 40

jakegrigsby commented 1 week ago

Hi @Lan131, the wandb link I put in the other issue (#62) already has a full history of this command run for you (on an A5000).

https://wandb.ai/jakegrigsby/amago-v3-reference/runs/gq9s8vxs?nw=nwuserjakegrigsby

The default x-axis of wandb "Step" doesn't mean anything useful --- it's the number of times wandb.log has been called. Instead, you want to display results by Wall Time, total_frames, Epoch, or the total frames of any particular task name (just search "total_frames" and you'll see 45 extra options).

The original Meta-World RL^2 results are reported at 400M total timesteps (summed over the 45 tasks). The result in our paper's Figure 4 runs for 100M, and the wandb run above happens to go to 200M before I stopped it. If you're looking for 1M total timesteps with an RL^2-like algo... I wouldn't recommend that... but I assume your paper is specifically about sample efficiency. It would take a totally different set of hparams to give that a reasonable shot.
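To put a 1M-total-timestep budget in perspective, here is the back-of-the-envelope arithmetic, using the 45 tasks and the 500-timestep episodes discussed below:

```python
# Back-of-the-envelope: what a 1M total-timestep budget means on ML45,
# given 45 tasks and 500-timestep episodes.
total_timesteps = 1_000_000
num_tasks = 45
episode_len = 500

steps_per_task = total_timesteps / num_tasks       # ~22,222 env steps per task
episodes_per_task = steps_per_task / episode_len   # ~44 episodes per task

print(f"~{steps_per_task:.0f} steps and ~{episodes_per_task:.0f} episodes per task")
```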

Screen Shot 2024-11-17 at 3 23 50 PM

The data collection process per epoch is described in a new tutorial here: https://github.com/UT-Austin-RPL/amago/blob/refactor-gc/tutorial.md

Screen Shot 2024-11-17 at 3 29 09 PM

The parallel actors (40 in the command you're using) sample randomly from the 45 tasks at each meta-reset. The ratio of experience collected to learning updates is the main factor in sample efficiency, so you'd probably want to use fewer actors and do more gradient steps. You'd also probably want a shorter policy sequence length than the 256 in your command. The "meta" aspect of Meta-World ML45 doesn't actually take much memory, and shorter policy lengths are generally more sample efficient; something like 32 or even 16 might learn faster.
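As a rough sketch of that update-to-data ratio, assuming timesteps_per_epoch counts env steps per actor (and with a placeholder gradient-step count, not an actual amago default):

```python
# Rough sketch of the update-to-data ratio per epoch.
# parallel_actors and timesteps_per_epoch are taken from the example command;
# grad_steps_per_epoch is a hypothetical placeholder, NOT an amago default,
# and the per-actor assumption should be checked against the tutorial.
parallel_actors = 40
timesteps_per_epoch = 150
grad_steps_per_epoch = 1000  # placeholder

frames_per_epoch = parallel_actors * timesteps_per_epoch   # 6,000 env steps collected
update_to_data = grad_steps_per_epoch / frames_per_epoch   # gradient updates per env step

print(f"{frames_per_epoch} frames/epoch, update:data ratio ~ {update_to_data:.3f}")
# Fewer actors (or more gradient steps) raises this ratio, which usually helps
# sample efficiency at the cost of wall-clock collection speed.
```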

Episodes last for 500 timesteps, which is based on garage's max_path_length. Some papers change it, and I just pushed a commit that lets you change it if you want. Shorter would probably be better for sample efficiency, because the way success rates are computed in Meta-World means the agent is just sitting there wasting steps once it's finished with most of the tasks.
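For illustration only (this isn't necessarily how the new commit exposes the option), a generic Gymnasium TimeLimit wrapper shows the idea of capping episodes below 500 steps:

```python
# Illustrative only: a generic way to cap episode length with Gymnasium.
# The new amago commit may expose this differently; check the repo for the
# actual option.
import gymnasium as gym
from gymnasium.wrappers import TimeLimit


def shorten_episodes(env: gym.Env, max_episode_steps: int = 250) -> gym.Env:
    # Truncate each episode after max_episode_steps instead of Meta-World's 500.
    return TimeLimit(env, max_episode_steps=max_episode_steps)
```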

Lan131 commented 1 week ago

Quick question: I thought val/Average Total Successes in peg-unplug-side-v2 was a success rate, but it's 3 on that plot. Is it the average number of successes in an episode?

Thanks,

Michael Lanier


Lan131 commented 1 week ago

Oh never mind I see, there's a separate plot for the success rate.

ML


Lan131 commented 1 week ago

Any chance you could invite me to this workspace? My username is mlan. If I'm a collaborator I can make a custom graph (the average of all the val/Average Trial Success plots) for the following tasks: assembly, basketball, bin picking, box close, button press topdown, button press topdown-wall, button press, button press wall, coffee button, coffee pull, coffee push, dial turn, disassemble, door close, door lock, door open, door unlock, drawer close, drawer open, and faucet close, for the first 1 million frames. But I don't seem to be able to add new graphs. Or if you make a view, maybe I can mess with it.
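Roughly, this is what I'm trying to build (the run path is from your link, but the metric key names are just my guess at the pattern):

```python
# Sketch of the custom graph I want: average a per-task success metric over a
# subset of tasks for the first 1M frames. The metric key format is a guess;
# adjust it to whatever the run actually logs.
import wandb

api = wandb.Api()
run = api.run("jakegrigsby/amago-v3-reference/gq9s8vxs")

tasks = ["assembly", "basketball", "bin-picking", "box-close", "dial-turn"]  # etc.
keys = [f"val/Average Trial Success in {t}-v2" for t in tasks]  # guessed key format

history = run.history(keys=keys + ["total_frames"], samples=2000)
early = history[history["total_frames"] <= 1_000_000]
print(early[keys].mean(axis=1))  # average success across the chosen tasks
```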

Thanks ML


jakegrigsby commented 1 week ago

Yeah, the average total successes is in [0, k], and we use k = 3 by default. The per-episode success rate is in [0, 1].
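As a concrete, made-up example of the difference between the two plots:

```python
# Made-up example for one k=3 trial: the agent succeeds in episodes 1 and 3
# but not episode 2.
episode_successes = [1, 0, 1]

total_successes = sum(episode_successes)                  # in [0, k] -> 2
success_rate = total_successes / len(episode_successes)   # in [0, 1] -> ~0.67

print(total_successes, round(success_rate, 2))
```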

If you're talking about 1M frames/timesteps per task (so 45M total), that's a lot more feasible with some minor tweaks to this command. I'd still recommend cutting --max_seq_len from 256 to 128 or maybe lower.