NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0

[BUG] CUDA context error #1079

Open ngocuyen1207 opened 8 months ago

ngocuyen1207 commented 8 months ago

Bug description

The code runs on only one GPU instead of multiple GPUs when it is executed from a .py file. When I run the same code in a Jupyter notebook, there is no problem. The script shows this warning:

2023-11-02 14:51:33,718 - distributed.comm.ucx - WARNING - Worker with process ID 3666900 should have a CUDA context assigned to device 1 (b'GPU-969c643a-e088-20fd-2b92-f8369b3da310'), but instead the CUDA context is on device 0 (b'GPU-6fbed52c-1fae-3eec-431d-dbc3c81e26a3'). This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
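When several workers are started, it can help to pull the expected and actual device out of each warning to see which workers are affected. The following is a minimal standard-library sketch. The helper name `parse_context_warning` and the regular expression are assumptions written against the wording of the warning quoted in this report, so they may not match other versions of the message.

```python
import re

# Pattern written against the warning text quoted in this issue (an assumption:
# the wording may differ between versions of distributed / Dask-CUDA).
CONTEXT_WARNING = re.compile(
    r"Worker with process ID (?P<pid>\d+) should have a CUDA context "
    r"assigned to device (?P<expected>\d+) \(b'(?P<expected_uuid>[^']+)'\), "
    r"but instead the CUDA context is on device (?P<actual>\d+) "
    r"\(b'(?P<actual_uuid>[^']+)'\)"
)


def parse_context_warning(line):
    """Return a dict with pid, expected and actual device, or None if no match."""
    match = CONTEXT_WARNING.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "expected_device": int(match.group("expected")),
        "actual_device": int(match.group("actual")),
        "expected_uuid": match.group("expected_uuid"),
        "actual_uuid": match.group("actual_uuid"),
    }


# The warning line from this report, shortened after the second GPU UUID.
sample = (
    "2023-11-02 14:51:33,718 - distributed.comm.ucx - WARNING - Worker with "
    "process ID 3666900 should have a CUDA context assigned to device 1 "
    "(b'GPU-969c643a-e088-20fd-2b92-f8369b3da310'), but instead the CUDA "
    "context is on device 0 (b'GPU-6fbed52c-1fae-3eec-431d-dbc3c81e26a3')."
)

info = parse_context_warning(sample)
print(info["pid"], info["expected_device"], info["actual_device"])
```

Running the helper over a captured log file, one line at a time, gives a quick list of which process IDs ended up on the wrong device.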

Code to reproduce bug

from merlin.core.utils import Distributed
from multiprocessing import freeze_support

if __name__ == '__main__':
    freeze_support()
    with Distributed():
        print('hi')    
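
The warning suggests that something touched CUDA before the workers were spawned. One way to narrow this down is to check, just before entering `with Distributed():`, which CUDA-enabled packages have already been imported. This is only a sketch using the standard library: the helper name `loaded_cuda_modules` and the candidate package list are assumptions, and a package being imported does not by itself prove that it created a CUDA context.

```python
import sys

# Assumed list of packages that commonly use CUDA; adjust for your environment.
CUDA_CANDIDATES = ("cudf", "cupy", "numba.cuda", "rmm", "torch", "tensorflow")


def loaded_cuda_modules(candidates=CUDA_CANDIDATES, modules=None):
    """Return the candidate names that are already present in sys.modules."""
    modules = sys.modules if modules is None else modules
    return [name for name in candidates if name in modules]


if __name__ == "__main__":
    early = loaded_cuda_modules()
    if early:
        print("Imported before cluster start:", ", ".join(early))
    else:
        print("No candidate CUDA libraries imported yet.")
```

If the list is non-empty in the .py script but empty in the notebook at the same point, the import order is a reasonable place to start looking.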

Environment details

jperez999 commented 7 months ago

Hello @ngocuyen1207, please try running on an updated version of Python. We are tightly coupled to the RAPIDS ecosystem, and support for Python 3.8.0 was dropped a while back; please refer to https://docs.rapids.ai/install. Let us know if you still experience the same issue after upgrading Python.
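
A quick way to confirm which interpreter the script is actually using is a small check at the top of the .py file. This is a sketch only: it treats Python 3.8 and older as unsupported, based on the statement above, and does not encode the exact list of supported versions, which the RAPIDS install guide defines. The function name `python_is_supported` is hypothetical.

```python
import sys

# Assumption taken from the comment above: Python 3.8 is no longer supported.
# The authoritative list of supported versions is in the RAPIDS install guide.
LAST_UNSUPPORTED = (3, 8)


def python_is_supported(version=None):
    """Return True if the (major, minor) version is newer than Python 3.8."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) > LAST_UNSUPPORTED


print("Running Python %d.%d" % tuple(sys.version_info[:2]))
print("Newer than 3.8:", python_is_supported())
```

The interpreter used by a Jupyter kernel and the one used by `python script.py` can differ, so it is worth running the check in both.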