Replicable-MARL / MARLlib

One repository is all that is necessary for Multi-agent Reinforcement Learning (MARL)
https://marllib.readthedocs.io
MIT License

MARLlib seems to never use GPU devices #203

Open wmzfight opened 9 months ago

wmzfight commented 9 months ago

When I use MARLlib to train a model on my custom env, it always shows "(0.0/1.0 accelerator_type:V100S)". I have checked all the possible configs for using GPU devices. Both PyTorch and TensorFlow can detect all the GPUs on my computer.

However, when I use the example provided by MARLlib to train a model on the MPE env, it shows the same info. Do you have any idea how to solve this problem?
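Since the "(0.0/1.0 ...)" line comes from Ray rather than from the frameworks, one way to narrow the problem down is to confirm GPU visibility at the framework level first (a minimal sketch; `gpu_report` is a hypothetical helper, not a MARLlib function):

```python
def gpu_report():
    """Sketch: report whether PyTorch itself can see any GPUs,
    independently of Ray/MARLlib resource accounting."""
    try:
        import torch
        return {
            "torch_installed": True,
            "cuda": torch.cuda.is_available(),   # True if a CUDA device is usable
            "gpus": torch.cuda.device_count(),   # number of visible GPUs
        }
    except ImportError:
        return {"torch_installed": False, "cuda": False, "gpus": 0}

print(gpu_report())
```

If this reports GPUs but Ray still shows 0.0 GPU usage, the issue is in how MARLlib passes resource settings to Ray, not in the driver setup.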

mrvgao commented 9 months ago

Hi, glad to hear from you.

Please check your configuration to make sure the GPU is actually being utilized.
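One place such configuration lives is MARLlib's Ray settings file (a sketch based on the documented `ray.yaml`; the exact path and defaults may differ between versions):

```yaml
# Sketch of MARLlib's Ray settings (typically marllib/marl/ray/ray.yaml).
# Keys and defaults are version-dependent; treat this as an assumption.
local_mode: False       # True runs Ray in a single process; GPU requests are then ignored
num_gpus: 1             # GPUs allocated to the trainer
num_workers: 2          # parallel CPU rollout workers
num_gpus_per_worker: 0  # GPUs per rollout worker
```

Values passed directly to `fit()` generally take precedence over this file, so checking both is worthwhile.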

libin-star commented 8 months ago

I have another question: when I use 2 GPUs to train the MAPPO algorithm, it takes more training time than with one GPU. Why is this? Do you have any solution to this problem? Thank you.

satpreetsingh commented 6 months ago

I have the same issue. I'm using the following code to kick off a job. No increase in GPU utilization is seen, and mostly only one CPU runs at 100% while the others stay idle.


from marllib import marl
env = marl.make_env(environment_name="mpe", map_name="simple_spread", force_coop=True)
mappo = marl.algos.mappo(hyperparam_source='mpe')
model = marl.build_model(env, mappo, {"core_arch": "gru", "encode_layer": "128-256"})

# start training
mappo.fit(env, model, 
  stop={'timesteps_total': 100000}, 
  checkpoint_freq=100,
  share_policy='group', 
  num_gpus=1,
  num_workers=32,
  )

E.g. top output:

top - 17:09:04 up 6 days,  1:03,  5 users,  load average: 1.16, 1.05, 0.88
Tasks: 556 total,   2 running, 554 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.6 us,  0.5 sy,  0.0 ni, 95.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 257572.3 total, 249544.2 free,   3948.5 used,   4079.7 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used. 251594.3 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                       
 186567 satsingh  20   0   78.8g 889732 189528 R 100.7   0.3   1:46.04 python                        
 186582 satsingh  20   0  918108  27648  11264 S   1.0   0.0   0:00.74 gcs_server                    
 186594 satsingh  20   0   77.9g  73984  11776 S   1.0   0.0   0:01.41 raylet                        
 186685 satsingh  20   0   76.1g  76116  35840 S   1.0   0.0   0:00.92 ray::IDLE                     
 186692 satsingh  20   0   76.1g  76604  36096 S   1.0   0.0   0:00.94 ray::IDLE                     
 186695 satsingh  20   0   76.1g  76628  36352 S   1.0   0.0   0:00.91 ray::IDLE                     
 186717 satsingh  20   0   76.1g  75864  35840 S   1.0   0.0   0:00.97 ray::IDLE                     
 186726 satsingh  20   0   76.1g  76100  35840 S   1.0   0.0   0:00.92 ray::IDLE                     
 186727 satsingh  20   0   76.1g  76080  35840 S   1.0   0.0   0:00.95 ray::IDLE                     
 186730 satsingh  20   0   76.1g  76760  36352 S   1.0   0.0   0:00.92 ray::IDLE                     
 186731 satsingh  20   0   76.1g  76384  36096 S   1.0   0.0   0:00.93 ray::IDLE     
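Since the "(0.0/1.0 accelerator_type:V100S)" line is Ray's own resource report, it can also help to ask Ray directly what it detected (a sketch; assumes `ray` is importable, and `ray_gpu_resources` is a hypothetical helper):

```python
def ray_gpu_resources():
    """Sketch: return the number of GPUs Ray registered in its cluster
    resource table, or None if ray is not installed."""
    try:
        import ray
        ray.init(ignore_reinit_error=True, include_dashboard=False,
                 log_to_driver=False)
        resources = ray.cluster_resources()  # e.g. {'CPU': 32.0, 'GPU': 1.0, ...}
        ray.shutdown()
        return resources.get("GPU", 0.0)
    except ImportError:
        return None

print(ray_gpu_resources())
```

If Ray registers the GPU here but the trainer never uses it, the resource request is being dropped somewhere between `fit()` and the Ray trainer.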

SoheilSedghi commented 1 month ago

I had the same problem and found the cause. You should use these configs:

GPU accelerated: algo.fit(env, model, local_mode=False, num_gpus=1)
CPU accelerated: algo.fit(env, model, local_mode=False, num_workers=5)

local_mode must be False. reference link
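Applied to the earlier MPE example, the fix amounts to adding `local_mode=False` alongside the resource arguments (a sketch of the argument set only; per this thread, `fit()` forwards these settings to Ray):

```python
# Sketch: fit() arguments combining the earlier example with the fix above.
# local_mode=False is the key change: in local mode Ray runs everything in
# one process, so the num_gpus / num_workers requests are not honored.
fit_kwargs = {
    "local_mode": False,                  # the fix from this thread
    "num_gpus": 1,                        # GPU for the trainer
    "num_workers": 32,                    # parallel CPU rollout workers
    "stop": {"timesteps_total": 100000},
    "checkpoint_freq": 100,
    "share_policy": "group",
}

# As in the example above (requires marllib and an env/model to run):
# mappo.fit(env, model, **fit_kwargs)
print(fit_kwargs)
```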