ShannonAI / service-streamer

Boosting your Web Services of Deep Learning Applications.
Apache License 2.0

Streamer workers are not correctly assigned across multiple GPUs #74

Closed rubby33 closed 4 years ago

rubby33 commented 4 years ago

First of all, the machine has two GPUs. I set worker_num=3 and cuda_devices=(0, 1); the code is as follows:

streamer = Streamer(SentenceManagedBertModel, batch_size=64, max_latency=0.1, worker_num=3, cuda_devices=(0, 1))

Problem description: after the service starts, all of the Python worker processes are placed on gpu 0 and none on gpu 1, which is strange.
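For reference, my understanding (an assumption about the implementation, but consistent with the gpu_id values printed in the log) is that service-streamer hands out cuda_devices round-robin, so worker i gets cuda_devices[i % len(cuda_devices)]. A small sketch of that expected assignment:

```python
# Sketch (assumption, not service-streamer's actual code) of the
# expected round-robin mapping from workers to GPUs.
def assign_gpus(worker_num, cuda_devices):
    return [cuda_devices[i % len(cuda_devices)] for i in range(worker_num)]

print(assign_gpus(3, (0, 1)))  # -> [0, 1, 0]
```

So with worker_num=3 over two cards, the expected layout is two workers on gpu 0 and one on gpu 1, not all three on gpu 0.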

Thu Jun 18 14:25:51 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 40%   48C    P8    12W / 250W |   5222MiB / 11018MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:81:00.0 Off |                  N/A |
| 32%   44C    P8    24W / 250W |     10MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2980      C   python                                     1303MiB  |
|    0      3132      C   ...iangwei/anaconda3/envs/py3.7/bin/python 1303MiB  |
|    0      3133      C   ...iangwei/anaconda3/envs/py3.7/bin/python 1303MiB  |
|    0      3134      C   ...iangwei/anaconda3/envs/py3.7/bin/python 1303MiB  |
+-----------------------------------------------------------------------------+

rubby33 commented 4 years ago

In addition, I printed the corresponding log output:

run_forever begin... gpu_id: 1
run_forever begin... gpu_id: 0
run_forever begin... gpu_id: 0
ManagedModel gpu_id: 1
ManagedModel gpu_id: 0
ManagedModel gpu_id: 0
CUDA_VISIBLE_DEVICES: 1
[gpu worker:  3133  init model on gpu: 1
CUDA_VISIBLE_DEVICES: 0
CUDA_VISIBLE_DEVICES: 0
[gpu worker:  3134  init model on gpu: 0
[gpu worker:  3132  init model on gpu: 0

These are print statements I added to the original ManagedModel. Each worker does receive a different gpu_id through set_gpu_id, but it appears to have no effect!
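A likely reason a correctly set gpu_id can still be ignored: CUDA_VISIBLE_DEVICES is only consulted when CUDA is first initialized in a process, and the value is then cached; assigning the variable afterwards is a no-op. A minimal sketch of that caching behavior, using a hypothetical FakeCuda stand-in (not a real torch or CUDA API):

```python
import os


class FakeCuda:
    """Hypothetical stand-in mimicking CUDA's behavior: the set of
    visible devices is read from the environment once, at first
    initialization, and cached for the life of the process."""

    def __init__(self):
        self._visible = None

    def init(self):
        # Only the value present at *first* init matters.
        if self._visible is None:
            self._visible = os.environ.get("CUDA_VISIBLE_DEVICES", "all")
        return self._visible


cuda = FakeCuda()
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(cuda.init())  # -> 0
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(cuda.init())  # -> still 0: changing the variable later has no effect
```

So if anything touches CUDA (for example a module-level model or torch call in the parent process) before a worker runs set_gpu_id, the later assignment is ignored and every worker inherits the same device.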

import os
from typing import List


class ManagedModel(object):
    def __init__(self, gpu_id=None):
        self.model = None
        self.gpu_id = gpu_id
        print("ManagedModel gpu_id:", self.gpu_id)
        self.set_gpu_id(self.gpu_id)

    @staticmethod
    def set_gpu_id(gpu_id=None):
        if gpu_id is not None:
            os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
            print("CUDA_VISIBLE_DEVICES:", os.environ["CUDA_VISIBLE_DEVICES"])

    def init_model(self, *args, **kwargs):
        raise NotImplementedError

    def predict(self, batch: List) -> List:
        raise NotImplementedError

rubby33 commented 4 years ago

Solved.

Following the TextInfillingModel example, I moved my BERT-based classification model into a separate .py file of its own.

Model initialization now works as follows: call self.model.eval() and choose the device inside init_model. The old self.model.to(self.device) call must not be used as before, otherwise all workers end up on the same gpu 0. The working version:

    if torch.cuda.is_available():
        self.device = "cuda"
        print("model to cuda")
    else:
        self.device = "cpu"
        print("model to cpu")

    self.model.to(self.device)

Note that the device string is "cuda", not "cuda:0": since each worker only sees one card through CUDA_VISIBLE_DEVICES, "cuda" resolves to that worker's own GPU.
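Putting the fix together, here is a minimal sketch of the worker model in its own file (my reconstruction, not the actual SentenceManagedBertModel; load_model is a hypothetical placeholder for the real BERT loading code, and in real use the class subclasses service_streamer's ManagedModel). The key point is that torch is only touched inside init_model, after the worker process has restricted CUDA_VISIBLE_DEVICES:

```python
# my_model.py -- the model lives in its own file, as in the
# TextInfillingModel example, so importing the web-server code
# never initializes torch/CUDA in the parent process.
from typing import List


class ManagedBertModel:
    """Sketch of the fixed worker model. In real use this subclasses
    service_streamer's ManagedModel; load_model() is a hypothetical
    stand-in for the actual BERT loading code."""

    def init_model(self):
        # torch is imported lazily: by the time a worker calls init_model,
        # set_gpu_id has already restricted CUDA_VISIBLE_DEVICES, so
        # "cuda" (not "cuda:0") maps to this worker's single visible GPU.
        try:
            import torch
            has_cuda = torch.cuda.is_available()
        except ImportError:  # keeps the sketch runnable without torch
            has_cuda = False
        self.device = "cuda" if has_cuda else "cpu"
        self.model = self.load_model()
        if self.model is not None:
            self.model.eval()
            self.model.to(self.device)

    def load_model(self):
        return None  # the real BERT loading code goes here

    def predict(self, batch: List) -> List:
        raise NotImplementedError
```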