Open wmzfight opened 9 months ago
Hi, I'm glad to hear your reply.
Please check your configurations to make sure you have utilized the GPU.
I have another question: when I use 2 GPUs to train the MAPPO algorithm, it takes more training time than with one GPU. Why is this? Do you have any solutions to this problem? Thank you.
I have the same issue. I'm using the following code to kick off a job. No increase in GPU utilization is seen, and mostly only one CPU is used at 100% while the others stay idle:
from marllib import marl

env = marl.make_env(environment_name="mpe", map_name="simple_spread", force_coop=True)
mappo = marl.algos.mappo(hyperparam_source="mpe")
model = marl.build_model(env, mappo, {"core_arch": "gru", "encode_layer": "128-256"})

# start training
mappo.fit(
    env, model,
    stop={"timesteps_total": 100000},
    checkpoint_freq=100,
    share_policy="group",
    num_gpus=1,
    num_workers=32,
)
E.g. `top` output:
top - 17:09:04 up 6 days, 1:03, 5 users, load average: 1.16, 1.05, 0.88
Tasks: 556 total, 2 running, 554 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.6 us, 0.5 sy, 0.0 ni, 95.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 257572.3 total, 249544.2 free, 3948.5 used, 4079.7 buff/cache
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 251594.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
186567 satsingh 20 0 78.8g 889732 189528 R 100.7 0.3 1:46.04 python
186582 satsingh 20 0 918108 27648 11264 S 1.0 0.0 0:00.74 gcs_server
186594 satsingh 20 0 77.9g 73984 11776 S 1.0 0.0 0:01.41 raylet
186685 satsingh 20 0 76.1g 76116 35840 S 1.0 0.0 0:00.92 ray::IDLE
186692 satsingh 20 0 76.1g 76604 36096 S 1.0 0.0 0:00.94 ray::IDLE
186695 satsingh 20 0 76.1g 76628 36352 S 1.0 0.0 0:00.91 ray::IDLE
186717 satsingh 20 0 76.1g 75864 35840 S 1.0 0.0 0:00.97 ray::IDLE
186726 satsingh 20 0 76.1g 76100 35840 S 1.0 0.0 0:00.92 ray::IDLE
186727 satsingh 20 0 76.1g 76080 35840 S 1.0 0.0 0:00.95 ray::IDLE
186730 satsingh 20 0 76.1g 76760 36352 S 1.0 0.0 0:00.92 ray::IDLE
186731 satsingh 20 0 76.1g 76384 36096 S 1.0 0.0 0:00.93 ray::IDLE
I have the same problem and I understand the cause. You should use this config:

GPU acceleration: algo.fit(env, model, local_mode=False, num_gpus=1)
CPU acceleration: algo.fit(env, model, local_mode=False, num_workers=5)

local_mode must be False. reference link
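To make the fix above concrete, here is a minimal sketch of how the earlier training call could be adjusted. This is an illustration based on the snippet posted above, not verified against the MARLlib source; the exact keyword names are assumed to match that snippet, and the worker count is an arbitrary example value.

```python
# Hypothetical sketch: collect the fit() arguments in one place so the
# critical setting (local_mode=False) is easy to spot and check.
fit_kwargs = dict(
    stop={"timesteps_total": 100000},
    checkpoint_freq=100,
    share_policy="group",
    local_mode=False,  # must be False, otherwise Ray runs everything in a single process
    num_gpus=1,        # enables GPU acceleration for the trainer
    num_workers=5,     # CPU rollout workers (example value)
)

# Then, as in the snippet posted above (assumed API):
# mappo.fit(env, model, **fit_kwargs)
```

With `local_mode=True`, Ray executes all tasks serially in the driver process, which would explain seeing a single CPU at 100% and no GPU use.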
When I use MARLlib to train a model on my custom env, it always shows "(0.0/1.0 accelerator_type:V100S)". I have checked all the possible configs for using GPU devices, and both PyTorch and TensorFlow can detect all the GPUs on my machine.
However, when I use the example provided by MARLlib to train a model on the mpe env, it shows the same info. Do you have any idea how to solve this problem?
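As a sanity check before digging into MARLlib itself, it can help to confirm that the backend really sees the GPU from the same Python environment the job runs in. A minimal sketch (guarded so it also runs where PyTorch is not installed):

```python
# Quick diagnostic: does PyTorch see a CUDA device in this environment?
try:
    import torch
    has_cuda = torch.cuda.is_available()   # True if a CUDA device is visible
    gpu_count = torch.cuda.device_count() if has_cuda else 0
except ImportError:
    has_cuda = None  # PyTorch not installed in this environment
    gpu_count = 0

print("CUDA available:", has_cuda, "| GPU count:", gpu_count)
```

If this prints `False` inside the environment that launches the job (e.g. because `CUDA_VISIBLE_DEVICES` is unset or a CPU-only PyTorch build is installed), Ray will report 0.0 GPU usage regardless of the `num_gpus` setting.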