YeWR / EfficientZero

Open-source codebase for EfficientZero, from "Mastering Atari Games with Limited Data" at NeurIPS 2021.
GNU General Public License v3.0

All memory seems to be on the first GPU #9

Open · geekyutao opened this issue 2 years ago

geekyutao commented 2 years ago

Hi, I found something weird when training EfficientZero. I trained the agent on a P40 server with four 24 GB GPUs and 28 CPUs, but all the allocated memory was on the first GPU even though I had set CUDA_VISIBLE_DEVICES=0,1,2,3. I tried changing @ray.remote(num_gpus), but the problem was still there. Do you have any suggestions? Thank you! [screenshots]

YeWR commented 2 years ago

It seems that only one GPU is allocated. Have you set --num_gpus 4 --num_cpus 28?

By the way, if you find CPU resources are not enough, you can set @ray.remote(num_cpus=0.5).
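
For context, here is a minimal sketch of how those numbers relate, assuming the --num_gpus/--num_cpus flags end up as the ray.init budget; the WorkerSketch class below is purely illustrative and not part of the repo:

import ray

# The resource budget that --num_gpus 4 --num_cpus 28 should translate into.
ray.init(num_gpus=4, num_cpus=28)

# A fractional request such as num_cpus=0.5 lets twice as many workers be
# scheduled against the same CPU budget. WorkerSketch is a placeholder class.
@ray.remote(num_cpus=0.5)
class WorkerSketch:
    def ping(self):
        return "ok"

workers = [WorkerSketch.remote() for _ in range(8)]
print(ray.get([w.ping.remote() for w in workers]))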

geekyutao commented 2 years ago

> It seems that only one GPU is allocated. Have you set --num_gpus 4 --num_cpus 28?
>
> By the way, if you find CPU resources are not enough, you can set @ray.remote(num_cpus=0.5).

Thank you for your reply. I set '--num_gpus 4 --num_cpus 28' and @ray.remote(num_cpus=0.5), but the problem is still there. Have you ever trained EfficientZero on other servers? Is the GPU memory distribution normal there? I still cannot figure out why all the memory is on the first GPU. Thanks.

geekyutao commented 2 years ago

I switched to a V100-16G server, but the problem is still there. It's really weird; I have never seen this before. I can share some screenshots. [screenshots]

geekyutao commented 2 years ago

Btw, I modified this ("@ray.remote(num_cpus=0.5)") in https://github.com/YeWR/EfficientZero/blob/a0c094818d750237d5aa14263a65a1e1e4f2bbcb/core/reanalyze_worker.py#L14. Is this the right place?

YeWR commented 2 years ago

I noticed that you set --gpu_actor 4 here. That's why only one GPU is in use (each reanalyze GPU actor takes 0.125 GPU, and 4 x 0.125 = 0.5). Could you use more actors and share the full output of nvidia-smi? Like this: [screenshot]
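
To make that arithmetic concrete, here is a small sketch (not code from the repo) showing that four actors requesting 0.125 GPU each fit on a single device, which is why only the first GPU shows any memory:

import ray

ray.init(num_gpus=4)

# Four actors at 0.125 GPU each only need 0.5 GPU in total.
@ray.remote(num_gpus=0.125)
class FractionalGpuActor:
    def which_gpu(self):
        return ray.get_gpu_ids()

actors = [FractionalGpuActor.remote() for _ in range(4)]
# ray packs fractional requests, so all four typically report the same
# device, e.g. [[0], [0], [0], [0]].
print(ray.get([a.which_gpu.remote() for a in actors]))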

Furthermore, I am wondering whether your ray version is 1.0.0. If you are using the latest version of ray, the main process will share the GPU with the remote processes.
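
A quick way to confirm which ray version the environment is actually picking up (the repo's requirements.txt asks for 1.0.0):

import ray

# Should print 1.0.0 to match the version listed in requirements.txt.
print(ray.__version__)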

To figure out the GPU usage, you can refer to https://docs.ray.io/en/releases-1.0.0/using-ray-with-gpus.html. Here is a simple demo in Python:

import os
import ray

# Make 4 GPUs visible to ray.
ray.init(num_gpus=4)

# Each call reserves one whole GPU; ray sets CUDA_VISIBLE_DEVICES inside the worker.
@ray.remote(num_gpus=1)
def use_gpu():
    print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
    print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))

ray.get([use_gpu.remote() for _ in range(4)])

You will find that you are able to use multiple GPUs with ray. [screenshot]

Hope this will help you :)

YeWR commented 2 years ago

> Btw, I modified this ("@ray.remote(num_cpus=0.5)") in https://github.com/YeWR/EfficientZero/blob/a0c094818d750237d5aa14263a65a1e1e4f2bbcb/core/reanalyze_worker.py#L14. Is this the right place?

That's fine, and you can also change line 266 in reanalyze_worker.py to @ray.remote(num_gpus=0.125, num_cpus=0.5). But it seems your issue is not caused by this.
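
For illustration only, the kind of change described above would look roughly like this; the class name below is a placeholder, not the actual class defined at line 266 of reanalyze_worker.py:

import ray

ray.init(num_gpus=1, num_cpus=4)

# Placeholder stand-in for the reanalyze GPU worker: each instance reserves
# 1/8 of a GPU and half a CPU, so eight of them fill exactly one GPU.
@ray.remote(num_gpus=0.125, num_cpus=0.5)
class ReanalyzeGpuWorkerSketch:
    def gpu_ids(self):
        return ray.get_gpu_ids()

workers = [ReanalyzeGpuWorkerSketch.remote() for _ in range(8)]
print(ray.get([w.gpu_ids.remote() for w in workers]))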

geekyutao commented 2 years ago

Thank you for your detailed reply. I really appreciate it. Here are some observations/facts:

  1. I use ray 1.0.0, as specified in your requirements.txt: https://github.com/YeWR/EfficientZero/blob/a0c094818d750237d5aa14263a65a1e1e4f2bbcb/requirements.txt#L2
  2. I also upgraded ray to version 1.9, but ran into new errors such as the following: [screenshot]
  3. ray 1.0.0 itself actually works well. I ran the code with ray 1.0.0 on my local machine (2 x 2080 Ti) and the results looked normal, apart from running out of memory. [screenshots]
  4. Unfortunately, the servers in my GPU cluster (4 x 24 GB P40, 4 x 16 GB V100, etc.) cannot show the full nvidia-smi output due to some unknown mechanism.
  5. It's weird that my local machine can allocate memory/workers across GPUs while the servers cannot. I'm still confused about how ray schedules resources.
  6. I'll try the demo in your reply.

Many thanks! This must have taken a lot of your time.

geekyutao commented 2 years ago

I would say this is magic. It is perhaps because I use docker on my server; in that case, some of ray's detection functions may not work. For example: [screenshot]
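
As a sanity check inside the container, one can ask ray what it has actually detected; passing num_gpus/num_cpus explicitly to ray.init (which is presumably what the --num_gpus/--num_cpus flags do) overrides the autodetection. A small sketch:

import ray

# Explicit values override whatever ray would autodetect inside the docker container.
ray.init(num_gpus=4, num_cpus=28)

# What the node advertises vs. what is still unclaimed right now.
print(ray.cluster_resources())
print(ray.available_resources())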

YeWR commented 2 years ago

That can happen when the remote functions finish quickly. Maybe you can try a remote class instead.

import os
import ray
import time

# Make 4 GPUs visible to ray.
ray.init(num_gpus=4)

# Each actor reserves one whole GPU for its entire lifetime.
@ray.remote(num_gpus=1)
class Test:
    def __init__(self):
        pass

    def use_gpu(self):
        print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
        print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
        # Keep the actor busy briefly so all four run at the same time.
        time.sleep(1)

testers = [Test.remote() for _ in range(4)]
ray.get([tester.use_gpu.remote() for tester in testers])
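
Unlike the short-lived tasks in the earlier demo, each actor here holds its GPU for its whole lifetime, so the four actors should report four different GPU ids.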