NVIDIA / gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
MIT License

gdrcopy_sanity failed when GPU Compute Mode is set to EXCLUSIVE #276

Closed. pakmarkthub closed this issue 12 months ago

pakmarkthub commented 1 year ago

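When the GPU Compute Mode is set to EXCLUSIVE_*, some unit tests in gdrcopy_sanity fail. These tests fork and then try to use the same GPU from two processes. Exclusive mode does not allow that, so one of the processes fails with CUDA_ERROR_DEVICE_UNAVAILABLE. Full output of the run: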
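
In the output, the CUDA failure at sanity.cpp:70 is sometimes followed by a failed read() assertion at sanity.cpp:1074. That pattern is consistent with a forked child exiting before it writes to a pipe, which leaves the parent with a short read. The following is a minimal sketch of that mechanism only, not the actual sanity.cpp code. It uses no CUDA, the helper name read_from_child is hypothetical, and the early exit merely stands in for a child that could not get the GPU.

```cpp
#include <cassert>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper, not taken from sanity.cpp. Forks a child that either
// writes one int to a pipe or exits early, as a child might if device
// initialization failed. Returns the number of bytes the parent read,
// or -1 if pipe() or fork() failed.
ssize_t read_from_child(bool child_fails_early)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        // Child: stands in for the process that cannot use the GPU.
        close(fds[0]);
        if (child_fails_early)
            _exit(EXIT_FAILURE);  // exits before writing anything
        int value = 1;
        ssize_t written = write(fds[1], &value, sizeof(value));
        _exit(written == (ssize_t)sizeof(value) ? EXIT_SUCCESS : EXIT_FAILURE);
    }

    // Parent: close the write end so read() returns 0 (EOF) once the child
    // has exited without writing.
    close(fds[1]);
    int child_data = 0;
    ssize_t n = read(fds[0], &child_data, sizeof(child_data));
    close(fds[0]);

    // Reap the child so it does not linger as a zombie.
    int status = 0;
    waitpid(pid, &status, 0);
    return n;
}
```

Calling read_from_child(false) yields sizeof(int) bytes, while read_from_child(true) yields 0, which is the case where a check of the form read(...) == sizeof(int) would fail in the parent.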

CUDA error: CUDA_ERROR_DEVICE_UNAVAILABLE
Assertion "CUDA_SUCCESS == result" failed at sanity.cpp:70
Assertion "(read(read_fd, &child_data, sizeof(int))) == (sizeof(int))" failed at sanity.cpp:1074
CUDA error: CUDA_ERROR_DEVICE_UNAVAILABLE
Assertion "CUDA_SUCCESS == result" failed at sanity.cpp:70
Assertion "(read(read_fd, &child_data, sizeof(int))) == (sizeof(int))" failed at sanity.cpp:1074
CUDA error: CUDA_ERROR_DEVICE_UNAVAILABLE
Assertion "CUDA_SUCCESS == result" failed at sanity.cpp:70
CUDA error: CUDA_ERROR_DEVICE_UNAVAILABLE
Assertion "CUDA_SUCCESS == result" failed at sanity.cpp:70
CUDA error: CUDA_ERROR_DEVICE_UNAVAILABLE
Assertion "CUDA_SUCCESS == result" failed at sanity.cpp:70
CUDA error: CUDA_ERROR_DEVICE_UNAVAILABLE
Assertion "CUDA_SUCCESS == result" failed at sanity.cpp:70
CUDA error: CUDA_ERROR_DEVICE_UNAVAILABLE
Assertion "CUDA_SUCCESS == result" failed at sanity.cpp:70
Assertion "sendfd(pair[1], fd) >= 0" failed at sanity.cpp:1595
CUDA error: CUDA_ERROR_DEVICE_UNAVAILABLE
Assertion "CUDA_SUCCESS == result" failed at sanity.cpp:70
Total: 28, Passed: 20, Failed: 8, Waived: 0

List of failed tests:
    invalidation_fork_access_after_free_cumemalloc
    invalidation_fork_access_after_free_vmmalloc
    invalidation_fork_map_and_free_cumemalloc
    invalidation_fork_map_and_free_vmmalloc
    invalidation_unix_sock_shared_fd_gdr_map_cumemalloc
    invalidation_unix_sock_shared_fd_gdr_map_vmmalloc
    invalidation_unix_sock_shared_fd_gdr_pin_buffer_cumemalloc
    invalidation_unix_sock_shared_fd_gdr_pin_buffer_vmmalloc
Error: Encountered an error or a test failure with status=1

pakmarkthub commented 12 months ago

The fix has been merged. Closing this issue.