Open geekyutao opened 2 years ago
It seems that only one GPU is allocated. Have you set --num_gpus 4 --num_cpus 28? By the way, if you find CPU resources are not enough, you can set @ray.remote(num_cpus=0.5).
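For reference, here is a minimal sketch of how fractional CPU reservations behave in Ray (the function name and task count are illustrative, not taken from EfficientZero):

import ray

ray.init(num_cpus=28)

# Reserving 0.5 CPU per task lets Ray schedule two such tasks per core.
@ray.remote(num_cpus=0.5)
def light_task(i):
    return i * i

# Eight tasks only reserve 4 CPUs' worth of resources in total.
print(ray.get([light_task.remote(i) for i in range(8)]))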
Thank you for your reply. I set --num_gpus 4 --num_cpus 28 and @ray.remote(num_cpus=0.5), but the problem is still there. Have you ever trained EfficientZero on other servers? Is the GPU memory distribution normal? I still cannot figure out why all the memory is on the first GPU. Thanks.
I changed my machine to a V100-16G server, but the problem is still there. It's really weird; I have never encountered this before. I can share some screenshots.
btw, I modified this ("@ray.remote(num_cpus=0.5)") in https://github.com/YeWR/EfficientZero/blob/a0c094818d750237d5aa14263a65a1e1e4f2bbcb/core/reanalyze_worker.py#L14. Is this the right place?
I noticed that you set --gpu_actor 4 here. That's why only one GPU is in use: each reanalyze GPU actor takes 0.125 GPU, and 4 x 0.125 = 0.5, which fits on a single GPU. Could you use more actors and share the full screenshot of nvidia-smi?
Like this: [example nvidia-smi screenshot]
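To make the packing arithmetic concrete, here is a minimal sketch (the actor name is a made-up stand-in; the 0.125 fraction matches the explanation above):

import ray

ray.init(num_gpus=4)

# Each actor reserves 1/8 of a GPU, so Ray packs up to 8 of them
# onto one GPU before spilling onto the next.
@ray.remote(num_gpus=0.125)
class ReanalyzeActor:
    def gpu_ids(self):
        return ray.get_gpu_ids()

# 4 actors x 0.125 GPU = 0.5 GPU: all four fit on the first GPU.
actors = [ReanalyzeActor.remote() for _ in range(4)]
print(ray.get([a.gpu_ids.remote() for a in actors]))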
Furthermore, I am wondering whether your Ray version is 1.0.0. If you are using the latest version of Ray, the main process will share the GPU with the remote processes.
To figure out the GPU usage, you can refer to https://docs.ray.io/en/releases-1.0.0/using-ray-with-gpus.html. Here is one simple code demo in Python:
import os
import ray

ray.init(num_gpus=4)

@ray.remote(num_gpus=1)
def use_gpu():
    # Ray assigns each task one GPU and sets CUDA_VISIBLE_DEVICES accordingly.
    print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
    print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))

ray.get([use_gpu.remote() for _ in range(4)])
You will find that you are able to use multiple GPUs in Ray.
Hope this helps :)
That's OK, and you can also modify line 266 in reanalyze_worker.py to @ray.remote(num_gpus=0.125, num_cpus=0.5). But it seems that your issue is not attributable to this.
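For illustration, the modified decorator would look roughly like this (the class name is a guess at the worker defined around that line; check the actual definition in reanalyze_worker.py):

# core/reanalyze_worker.py, around line 266 (class name assumed)
@ray.remote(num_gpus=0.125, num_cpus=0.5)
class BatchWorker_GPU:
    ...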
Thank you for your detailed reply. I really appreciate it. Here are some observations/facts:
Many thanks! It must have taken you a lot of time.
I would say this is magic. Perhaps it's because I used Docker on my server; in that case, some detection functions in Ray may not work. For example:
That is possible when the remote functions finish quickly: a task releases its GPU as soon as it returns, so later tasks can reuse the same device. Maybe you can try the remote class (an actor) instead.
import os
import time
import ray

ray.init(num_gpus=4)

@ray.remote(num_gpus=1)
class Test:
    def __init__(self):
        pass

    def use_gpu(self):
        print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
        print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
        # Keep the actor busy so the GPU assignment is observable.
        time.sleep(1)

testers = [Test.remote() for _ in range(4)]
ray.get([tester.use_gpu.remote() for tester in testers])
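Unlike short-lived tasks, each actor holds its GPU reservation for its whole lifetime, so the four Test instances should print four different GPU IDs and all four GPUs should appear in nvidia-smi while they are alive.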
Hi, I found something weird when training EfficientZero. I trained the agent on a P40 server which has 4 24G GPUs and 28 CPUs, but all the GPU memory was allocated on the first GPU even though I set CUDA_VISIBLE_DEVICES=0,1,2,3. I tried to change @ray.remote(num_gpus), but the problem was still there. Do you have any suggestions? Thank you!