rohitlal125555 opened this issue 5 years ago
This looks like an issue in numba or numba-related features. The traceback seems to indicate that a memory allocation failed. I'd suggest checking the GPU memory utilization. Also, try debugging with the environment variable NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0 to disable deferred deallocation (see http://numba.pydata.org/numba-doc/latest/cuda/memory.html?highlight=numba_cuda_max_pending_deallocs_count#deallocation-behavior for details).
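For reference, a minimal sketch of setting that variable from Python, assuming it is set before numba.cuda is imported (the allocation loop is just a hypothetical placeholder to exercise deallocations):

```python
import os
# Must be set before numba.cuda is imported; alternatively, export it in the shell.
os.environ["NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT"] = "0"

import numpy as np
from numba import cuda

# Hypothetical workload: allocate and drop device arrays in a loop.
# With the variable set to 0, each deallocation happens immediately
# instead of being queued until the pending-dealloc limit is reached.
for _ in range(100):
    d_arr = cuda.to_device(np.zeros(1024, dtype=np.float32))
    del d_arr
```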
If you have further problems with numba, I'd suggest opening a ticket at https://github.com/numba/numba/issues.
Also, it would be useful to know how the threads are using the GPUs. Are all threads using the same GPU? Or is each thread assigned to its own GPU?
@sklam I'd suggest checking the GPU memory utilization. Also, try debugging with the environment variable NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0 to disable deferred deallocation
I've monitored the GPU memory utilization, but it is not high. I have 4 GPUs of 32 GB each on this machine, and only about 4 to 5 GB of memory is in use, with at most 15-20% Volatile GPU Utilization.
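As a cross-check against nvidia-smi, the free/total device memory can also be queried from inside the Python process itself; a small sketch using Numba's context API:

```python
from numba import cuda

# Query free/total bytes on the device bound to the current context
# (Numba's wrapper around cuMemGetInfo); complements what nvidia-smi reports.
free_bytes, total_bytes = cuda.current_context().get_memory_info()
print(f"free: {free_bytes / 1e9:.2f} GB, total: {total_bytes / 1e9:.2f} GB")
```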
Output of nvidia-smi below:
Also, it would be useful to know how the threads are using the GPUs. Are all threads using the same GPU? Or is each thread assigned to its own GPU?
I have different threads/processes running across all 4 GPUs. The load is not perfectly balanced among the 4 GPUs, but it does not differ much between the devices.
I'll also test with an equal load distribution among the 4 GPUs and see if the problem gets resolved.
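For reference, one way to keep each worker on its own device is cuda.select_device; a minimal sketch, assuming one worker process per GPU (which may differ from the actual setup here):

```python
import multiprocessing as mp

import numpy as np
from numba import cuda

def worker(device_id):
    # Bind this process to a single GPU before any allocations happen.
    cuda.select_device(device_id)
    d_arr = cuda.to_device(np.ones(1_000_000, dtype=np.float32))
    # ... launch kernels on this device ...
    cuda.close()  # release this process's CUDA context when done

if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```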
Actual Behavior
Python crashes with Fatal Python error: Aborted
Expected Behavior
The process should run continuously (for months without requiring a restart) with no crash errors.
Detailed Description
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
Kernel: Linux 3.10.0-693.el7.x86_64
Architecture: x86-64
conda info
conda list --show-channel-urls