gunshi opened this issue 2 months ago (Open)
Hi! Yes, I use accelerate for single-machine DDP internally all the time. That said, I quietly rewrote everything to use accelerate over the summer (long after the paper came out), and this repo doesn't have many users outside our lab, so it's definitely possible you've found a portability issue I need to fix.
Just tested a 4-gpu setup now; here's how it should work.
Let's say the regular single-gpu command is:
python 01_basic_gym.py --env Pendulum-v1 --val_interval 10 --horizon 200 --max_seq_len 32 --memory_layers 2 --run_name test_accelerate_agent --buffer_dir buffers
The 4-gpu accelerate version would be:
accelerate launch 01_basic_gym.py --env Pendulum-v1 --val_interval 10 --horizon 200 --max_seq_len 32 --memory_layers 2 --run_name test_accelerate_agent --buffer_dir buffers
And this runs without issue for me.
This seems to work fine using the `accelerate launch --multi_gpu --num_processes=4 ....` method too. I tried accelerate v0.33 and v0.34.
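For context, both launch commands assume the standard Accelerator pattern inside the training script. Here's a minimal, self-contained sketch of that pattern (toy model, optimizer, and data as stand-ins; not the repo's actual training loop):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy stand-ins -- the real script builds its policy and buffer dataloader instead.
model = nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(256, 8), torch.randn(256, 1)), batch_size=32)

# Accelerator picks up the process count / device placement from `accelerate launch`.
accelerator = Accelerator()
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    loss = nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward() so DDP / mixed precision are handled
    optimizer.step()
    optimizer.zero_grad()
```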
Ah I see, thanks!
For others who might run into a similar issue on their cluster: I was able to get around the timeouts by specifying the exact main-process port to use, so my accelerate args are (for a 2-GPU experiment):
accelerate launch --multi_gpu --gpu_ids "0,1" --mixed_precision bf16 --num_processes 2 --main_process_port 29500 ..
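If a fixed port like 29500 ever happens to be taken on a shared node, one way (my own suggestion, not something from this thread) to grab a free port right before launching:

```python
# pick_port.py -- print a free TCP port chosen by the OS
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("127.0.0.1", 0))  # port 0 asks the OS for any unused port
    print(s.getsockname()[1])
```

which can then be passed as `--main_process_port $(python pick_port.py)` (there is a small race window between printing the port and accelerate binding it, but in practice that's rarely a problem).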
Hey! Just wanted to point out a potential issue with the data loader. The Traj dataset is written such that its `__getitem__` doesn't actually use the index passed to it and instead samples a filename at random. When a multi-process run is launched, the random seed is the same across processes, so all processes sample the exact same trajectory files, which defeats the point of running multi-GPU to get larger effective batch sizes. (This wouldn't happen if the function used the input index, because the data loader knows to split the indices equally across processes.)
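To make the concern concrete, here's a self-contained sketch (hypothetical `RandomTrajDataset`/`IndexedTrajDataset`, not the repo's actual Traj dataset) of the difference between ignoring the index and using it:

```python
import random
from torch.utils.data import Dataset

FILENAMES = [f"traj_{i:04d}.npz" for i in range(1000)]  # stand-in for the buffer files

class RandomTrajDataset(Dataset):
    """Mirrors the concern: the index handed in by the sampler is ignored."""
    def __len__(self):
        return len(FILENAMES)

    def __getitem__(self, idx):
        return random.choice(FILENAMES)  # same seed on every rank -> same files on every rank

class IndexedTrajDataset(Dataset):
    """Index-based variant: whichever sampler splits the indices also splits the files."""
    def __len__(self):
        return len(FILENAMES)

    def __getitem__(self, idx):
        return FILENAMES[idx]

# Simulate two DDP ranks that start from the same seed:
for rank in range(2):
    random.seed(0)
    print(rank, [RandomTrajDataset()[i] for i in range(3)])  # identical output for both "ranks"
```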
I appreciate somebody else checking the code in this detail, thanks!
I've been looking into this. I've manually checked that the trajectory filenames loaded across accelerate processes are always different (across different envs, example scripts, dloader workers, and batch sizes), so everything seems to be fine for my examples at least. But we haven't used accelerate for formal results, and it was added long after the dloader... so maybe I'm getting this right by accident.
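(For anyone who wants to repeat that check on their own setup, here's a small standalone sketch, with a toy filename dataset standing in for the repo's real one, that prints which filenames each rank actually receives under `accelerate launch`:)

```python
from accelerate import Accelerator
from torch.utils.data import DataLoader, Dataset

FILENAMES = [f"traj_{i:04d}.npz" for i in range(100)]  # stand-in for the buffer files

class FilenameDataset(Dataset):
    def __len__(self):
        return len(FILENAMES)
    def __getitem__(self, idx):
        return FILENAMES[idx]

accelerator = Accelerator()
loader = accelerator.prepare(DataLoader(FilenameDataset(), batch_size=4))
for step, batch in enumerate(loader):
    # each rank prints the filenames it received this step; they should not overlap
    print(f"rank {accelerator.process_index} step {step}: {batch}")
    if step >= 2:
        break
```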
As far as I understand it, the dloader's rng syncs are handled differently than the other shared modules because the dloaders are replicated on every process. But if:

1. the dataset kept ignoring the index and sampling a filename randomly,
2. we used `torch`'s rng to pick the filename instead of `random`, and
3. we passed `Accelerator(..., rng_types=["torch"])`,

then the sampling would break exactly like you're describing. I tested to confirm that. Luckily the `main` branch doesn't do 2 & 3, and the new version (`refactor-gc` branch #51) also doesn't do 1.
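Concretely, that broken combination would look something like this (a deliberately wrong sketch, not code from either branch; `TorchRandomTrajDataset` is hypothetical):

```python
import torch
from torch.utils.data import DataLoader, Dataset
from accelerate import Accelerator

FILENAMES = [f"traj_{i:04d}.npz" for i in range(1000)]

class TorchRandomTrajDataset(Dataset):
    """Ignores the index AND draws from torch's global rng (items 1 and 2 above)."""
    def __len__(self):
        return len(FILENAMES)

    def __getitem__(self, idx):
        j = torch.randint(len(FILENAMES), (1,)).item()
        return FILENAMES[j]

# Item 3: rng_types=["torch"] makes accelerate re-synchronize torch's rng state across
# processes when iteration over the prepared dataloader starts, so every rank then draws
# the same sequence of j's and loads the same files.
accelerator = Accelerator(rng_types=["torch"])
loader = accelerator.prepare(DataLoader(TorchRandomTrajDataset(), batch_size=8))
for batch in loader:
    print(accelerator.process_index, batch[:2])  # identical across ranks
    break
```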
Are you doing something unique with your seeding or with the accelerate config? I want to make sure this works outside all my test scripts.
Hey, thanks for open-sourcing your code! I wanted to ask whether you've tested the code with accelerate in DDP mode (`--multi_gpu` and `--num_processes > 1`). Since the command provided in the readme doesn't use these args, on my machine I explicitly set `--num_processes` to 4 for a 4-gpu experiment (accelerate errors out otherwise, saying that `--multi_gpu` requires `num_processes > 1`).
When I do this, I get downstream errors in the code related to the handling of multiple processes. For example, env_utils.py creates a directory for logging if it doesn't exist (in the `__init__()` of SequenceWrapper), and when running 4 instances of the code as accelerate does, there is a clash because the 4 processes try to create the directory at the exact same time. I did fix those errors, but there are others related to a client socket timeout ("torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (127.0.0.1, 0)."), and I wanted to post here to check before trying to debug them further. I was wondering if you did something to avoid these errors, or whether you were testing a different setup entirely with accelerate (non-multi-gpu, non-DDP, or something else that doesn't need >1 processes)?
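A sketch of one way to make that directory creation race-free (using a hypothetical `log_dir`; not the actual env_utils.py code) is to either make it idempotent or gate it on the main process:

```python
import os
from accelerate import Accelerator

accelerator = Accelerator()
log_dir = "runs/test_accelerate_agent"  # hypothetical path

# Option 1: idempotent creation -- safe even if every process races to create it.
os.makedirs(log_dir, exist_ok=True)

# Option 2: only the main process creates it; everyone else waits at a barrier.
if accelerator.is_main_process:
    os.makedirs(log_dir, exist_ok=True)
accelerator.wait_for_everyone()
```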
Thanks!