RobotecAI / RobotecGPULidar

All tests failed due to `cuda error: operation not supported (code=801)` #322

Closed duongtd23 closed 1 week ago

duongtd23 commented 1 month ago

After building the library (on Ubuntu 22.04), I attempted to run the tests. Unfortunately, all tests failed, and the log says this is due to an unrecoverable error: `cuda error: operation not supported (code=801)`. The full log is shown below. Please let me know if you need more information about my environment. I would very much appreciate any ideas or suggestions on how to fix this issue.

Here is the log:

$ ./setup.py --clean-build --with-pcl
$ ./build/bin/test/RobotecGPULidar_test 
Running main() from /home/nimda/RobotecGPULidar/external/googletest/googletest/src/gtest_main.cc
[==========] Running 366 tests from 65 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EndToEnd
[ RUN      ] EndToEnd.ReadmeExample
[Wed Jul 24 22:13:56 2024]: Logging configured: level=info, file=(disabled), stdout=true
[22:13:56][    17 us][info]: RGL Version 0.17.0 branch=develop commitSHA1=f8b9207d05f5c9fddfb481c13d728c9feeb76875
[22:13:56][ 23455 us][info]: Running on GPU: NVIDIA H100-20C
[22:13:56][    24 us][info]: Built against OptiX SDK version: 7.2.0
[22:13:56][     2 us][info]: Built against OptiX ABI version: 41
[22:13:56][     2 us][info]: Built against CUDA Toolkit version: 12.2
[22:13:56][     8 us][info]: Installed CUDA runtime version: 12.2
[22:13:56][     2 us][info]: Installed CUDA driver version: 12.2
[22:13:56][ 10286 us][info]: Installed NVidia kernel driver version: 535.129.03
RGL version: 0.17.0
[22:14:18][22478887 us][critical]: Unrecoverable error (code=500): cuda error: operation not supported (code=801) @ /home/nimda/RobotecGPULidar/test/../src/memory/MemoryOperations.hpp:85
/home/nimda/RobotecGPULidar/test/src/apiReadmeExample.cpp:35: Failure
Expected equality of these values:
  rgl_node_rays_from_mat3x4f(&useRays, &ray_tf, 1)
    Which is: 500
  rgl_status_t::RGL_SUCCESS
    Which is: 0
[22:14:18][   160 us][critical]: Logging disabled due to the previous fatal error
/home/nimda/RobotecGPULidar/test/src/apiReadmeExample.cpp:36: Failure
Expected equality of these values:
  rgl_node_raytrace(&raytrace, nullptr)
    Which is: 2
  rgl_status_t::RGL_SUCCESS
    Which is: 0
/home/nimda/RobotecGPULidar/test/src/apiReadmeExample.cpp:37: Failure

Some information about my environment (also visible in the log above): Ubuntu 22.04, NVIDIA-SMI 535.129.03, Driver Version 535.129.03, CUDA version 12.2.

The GPU is a virtualized NVIDIA H100 (vGPU). I suspect this is what's causing the problem, but I'm not sure.

prybicki commented 1 month ago

[22:14:18][22478887 us][critical]: Unrecoverable error (code=500): cuda error: operation not supported (code=801) @ /home/nimda/RobotecGPULidar/test/../src/memory/MemoryOperations.hpp:85

This points to:

CHECK_CUDA(cudaMallocAsync(&ptr, bytes, stream->getHandle()));

This call uses so-called stream-ordered memory allocation, which was added in CUDA 11.2.
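For illustration, a minimal standalone sketch of such an allocation (using only the CUDA runtime API, independent of RGL) could look like this:

#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    void* ptr = nullptr;
    // Allocation and deallocation are ordered with respect to other work submitted to `stream`.
    cudaError_t err = cudaMallocAsync(&ptr, 1 << 20, stream);
    if (err != cudaSuccess) {
        printf("cudaMallocAsync failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFreeAsync(ptr, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    printf("Stream-ordered allocation works on this device.\n");
    return 0;
}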

I think the H100 should handle that; I have no idea why the virtualized one fails to. However, let's try to verify the diagnosis - compile and run the following code:

#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    int device;
    cudaDeviceProp deviceProp;

    // Get the current device
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&deviceProp, device);

    int memoryPoolsSupported = 0;
    cudaDeviceGetAttribute(&memoryPoolsSupported, cudaDevAttrMemoryPoolsSupported, device);

    if (memoryPoolsSupported) {
        printf("Stream-ordered memory allocation is supported on this device.\n");
    } else {
        printf("Stream-ordered memory allocation is not supported on this device.\n");
    }

    return 0;
}

If it says that stream-ordered memory allocation is not supported, the problem is not specific to RGL and your setup probably needs some debugging. Let me know the results!

duongtd23 commented 1 month ago

Your guess is correct: stream-ordered memory allocation is not supported in my environment. Is there any way I can use RGL without stream-ordered memory allocation?

$ nvcc test.cu -o test
$ ./test 
Stream-ordered memory allocation is not supported on this device.
prybicki commented 1 month ago

Yes, this sounds possible, although we have never tested it. Here's how to do it:

In MemoryOperations.hpp there's a function that returns a set of methods managing a given kind of memory, e.g. host pageable, host pinned, device synchronous, or device asynchronous. If you modify it to return the synchronous functions when asked for the asynchronous ones, it should theoretically run all of RGL with synchronous memory. You'd need to modify the code to look like this:

else if constexpr (memoryKind == MemoryKind::DeviceSync || memoryKind == MemoryKind::DeviceAsync) {
    return {
        .allocate = [](size_t bytes) {
            void* ptr = nullptr;
            CHECK_CUDA(cudaMalloc(&ptr, bytes));
            return ptr;
        },
        .deallocate = [](void* ptr) {
            CHECK_CUDA(cudaFree(ptr));
        },
        .copy = [](void* dst, const void* src, size_t bytes) {
            CHECK_CUDA(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
        },
        .clear = [=](void* dst, int value, size_t bytes) {
            CHECK_CUDA(cudaMemset(dst, value, bytes));
        }
    };
}
else {
    static_assert("invalid memory kind passed to MemoryOperations::get()");
}

I removed the if-else clause for MemoryKind::DeviceAsync and modified the condition above so that the synchronous functions are returned when either the sync or the async kind is requested.

duongtd23 commented 1 month ago

@prybicki Thank you for suggesting a solution. Unfortunately, even after doing so, another error appears. This is part of the log:

$ ./build/bin/test/RobotecGPULidar_test 
Running main() from /home/nimda/RobotecGPULidar/external/googletest/googletest/src/gtest_main.cc
[==========] Running 366 tests from 65 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EndToEnd
[ RUN      ] EndToEnd.ReadmeExample
[Fri Aug 2 22:07:42 2024]: Logging configured: level=info, file=(disabled), stdout=true
[22:07:42][    27 us][info]: RGL Version 0.18.0 branch=main commitSHA1=b287aa501d52e5909413192dfe6ce5b7ec808c7a
[22:07:42][ 22917 us][info]: Running on GPU: NVIDIA H100-20C
[22:07:42][    43 us][info]: Built against OptiX SDK version: 7.2.0
[22:07:42][     8 us][info]: Built against OptiX ABI version: 41
[22:07:42][     7 us][info]: Built against CUDA Toolkit version: 12.2
[22:07:42][    16 us][info]: Installed CUDA runtime version: 12.2
[22:07:42][     8 us][info]: Installed CUDA driver version: 12.2
[22:07:42][ 10290 us][info]: Installed NVidia kernel driver version: 535.129.03
[22:07:42][  3122 us][critical]: Unrecoverable error (code=500): failed to get primary CUDA context: no error (0)

[22:07:42][    59 us][critical]: Unrecoverable error (code=10): failed to get primary CUDA context: no error (0)

[22:07:42][    10 us][critical]: Logging disabled due to the previous fatal error
/home/nimda/RobotecGPULidar/test/src/apiReadmeExample.cpp:8: Failure
Expected equality of these values:
  rgl_get_version_info(&major, &minor, &patch)
    Which is: 2
  rgl_status_t::RGL_SUCCESS
    Which is: 0
RGL version: 32766.-16544.-1

I come from the Autoware user community and have been trying to set up an environment with Autoware & AWSIM-Labs. Our institute provides a cloud virtual machine with a virtual H100 GPU. I tried to set up Autoware & AWSIM-Labs in this VM, and after seeing that it didn't work, I realized that the root cause was the RGL plugin in AWSIM-Labs failing to launch. That's why I came here looking for help. Unfortunately, in this cloud environment I have neither access to the host machine nor permission to update the NVIDIA driver or CUDA version (if I did, the versions would mismatch with the host machine). Let me know if you need any other specific information.

PS: I also tried version 0.17.0 (the version I was using when opening this issue), but the same error happened.

prybicki commented 1 month ago

This CUDA setup seems really broken. Have you tried running the CUDA samples?

The error says that it was impossible to get the primary CUDA context (which is a GPU counterpart of a CPU process). Unfortunately, there's a bug in the RGL code that makes it report the error message from the previous operation instead, hence `no error (0)`. I fixed it on this branch. Could you try running it and report the actual reason why the primary CUDA context cannot be obtained? That could lead us somewhere.
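For reference, a minimal standalone check of that call (using the CUDA driver API directly, outside of RGL) could look roughly like this:

#include <cuda.h>
#include <stdio.h>

int main() {
    CUresult res = cuInit(0);
    if (res != CUDA_SUCCESS) { printf("cuInit failed: %d\n", (int) res); return 1; }

    CUdevice dev;
    res = cuDeviceGet(&dev, 0);
    if (res != CUDA_SUCCESS) { printf("cuDeviceGet failed: %d\n", (int) res); return 1; }

    // Try to retain the device's primary context, as RGL does during initialization.
    CUcontext ctx;
    res = cuDevicePrimaryCtxRetain(&ctx, dev);
    if (res != CUDA_SUCCESS) {
        const char* msg = nullptr;
        cuGetErrorString(res, &msg);
        printf("cuDevicePrimaryCtxRetain failed: %s (%d)\n", msg ? msg : "unknown", (int) res);
        return 1;
    }
    printf("Primary context retained successfully.\n");
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}

Compile with something like `nvcc primary_ctx.cu -o primary_ctx -lcuda` (the driver API needs explicit linking against libcuda).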

prybicki commented 1 month ago

Also, sharing the output of nvidia-smi would be a good idea.

duongtd23 commented 1 month ago

I tried your latest fix, and now I can see the actual error code:

$ ./build/bin/test/RobotecGPULidar_test 
Running main() from /home/nimda/RobotecGPULidar/external/googletest/googletest/src/gtest_main.cc
[==========] Running 366 tests from 65 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EndToEnd
[ RUN      ] EndToEnd.ReadmeExample
[Sat Aug 3 12:37:55 2024]: Logging configured: level=info, file=(disabled), stdout=true
[12:37:55][    22 us][info]: RGL Version 0.18.0 branch=fix/error-report-primary-ctx commitSHA1=6598df37bfdcd581430512e007b3fee3454acb36
[12:37:55][ 23150 us][info]: Running on GPU: NVIDIA H100-20C
[12:37:55][    23 us][info]: Built against OptiX SDK version: 7.2.0
[12:37:55][     2 us][info]: Built against OptiX ABI version: 41
[12:37:55][     2 us][info]: Built against CUDA Toolkit version: 12.2
[12:37:55][     9 us][info]: Installed CUDA runtime version: 12.2
[12:37:55][     2 us][info]: Installed CUDA driver version: 12.2
[12:37:55][ 10476 us][info]: Installed NVidia kernel driver version: 535.129.03
[12:37:55][  3052 us][critical]: Unrecoverable error (code=500): failed to get primary CUDA context: operation not supported (801)

[12:37:55][    36 us][critical]: Unrecoverable error (code=10): failed to get primary CUDA context: operation not supported (801)

[12:37:55][     3 us][critical]: Logging disabled due to the previous fatal error
/home/nimda/RobotecGPULidar/test/src/apiReadmeExample.cpp:8: Failure
Expected equality of these values:
  rgl_get_version_info(&major, &minor, &patch)
    Which is: 2
  rgl_status_t::RGL_SUCCESS
    Which is: 0
RGL version: 32766.-16544.-1

Searching around error code 801, it seems that we need to enable the so-called "Unified Memory" for the virtual GPU from the host machine. Unfortunately, I have no access to the host machine. I will make a request to my institute to see if they can fix my case, but they might decline it. Meanwhile, I kindly ask you for a possible way to run RGL in my case, for example, completely disabling the async mode. Is there any possible solution?

I tried the cuda-samples, and saw that deviceQuery works:

$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA H100-20C"
  CUDA Driver Version / Runtime Version          12.2 / 12.2
  CUDA Capability Major/Minor version number:    9.0
  Total amount of global memory:                 20281 MBytes (21265973248 bytes)
  (114) Multiprocessors, (128) CUDA Cores/MP:    14592 CUDA Cores
  GPU Max Clock rate:                            1755 MHz (1.75 GHz)
  Memory Clock rate:                             1593 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 52428800 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        233472 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                No
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 7 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.2, NumDevs = 1
Result = PASS

However, asyncAPI failed and output the same error code as RGL:

$ ./asyncAPI 
[./asyncAPI] - Starting...
CUDA error at ../../../Common/helper_cuda.h:888 code=801(cudaErrorNotSupported) "cudaSetDevice(devID)" 

Let me know if you want me to try other sample tests.

As you requested, here is the nvidia-smi output:

$ nvidia-smi 
Sat Aug  3 12:54:41 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100-20C                On  | 00000000:07:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |      0MiB / 20480MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
prybicki commented 1 month ago

@duongtd23

Searching around error code 801, it seems that we need to enable the so-called "Unified Memory" for the virtual GPU from the host machine.

Could you link the source? deviceQuery indeed reports that Unified (Virtual) Memory (aka Managed Memory) is not supported. However, I don't see any connection between not having UVM and the failing call to cuDevicePrimaryCtxRetain. It is a pretty basic CUDA call that I'd expect to work everywhere. Moreover, its documentation says nothing about unified memory. There is, however, an interesting note that the function may fail if the compute mode of the device is set to CU_COMPUTEMODE_PROHIBITED. Maybe that hints at something related to the machine configuration.
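For example, a quick way to double-check the compute mode from code (a sketch using the driver API; the deviceQuery output above already reports "Default") could be:

#include <cuda.h>
#include <stdio.h>

int main() {
    if (cuInit(0) != CUDA_SUCCESS) { printf("cuInit failed\n"); return 1; }

    CUdevice dev;
    if (cuDeviceGet(&dev, 0) != CUDA_SUCCESS) { printf("cuDeviceGet failed\n"); return 1; }

    // 0 = DEFAULT, 2 = PROHIBITED, 3 = EXCLUSIVE_PROCESS (see CUcomputemode)
    int computeMode = -1;
    cuDeviceGetAttribute(&computeMode, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, dev);
    printf("Compute mode attribute: %d\n", computeMode);
    return 0;
}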

My point is that it is not only async or unified memory that's preventing RGL from working on your machine.

Meanwhile, I kindly ask you for a possible way to run RGL in my case, for example, completely disabling the async mode. Is there any possible solution?

Unfortunately, I don't think there's any easy way to adjust RGL to work with such constraints. As I said, it's not only the async calls but also basic GPU context management calls that fail. I tried replacing this call with another one (cuCtxCreate), but I wasn't able to get it working quickly.
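For reference, such a replacement amounts to creating a new context instead of retaining the device's primary one; a rough standalone sketch (illustrative only, not the actual RGL change) would be:

#include <cuda.h>
#include <stdio.h>

int main() {
    if (cuInit(0) != CUDA_SUCCESS) { printf("cuInit failed\n"); return 1; }

    CUdevice dev;
    if (cuDeviceGet(&dev, 0) != CUDA_SUCCESS) { printf("cuDeviceGet failed\n"); return 1; }

    // Create a context owned by this process instead of retaining the primary context.
    CUcontext ctx;
    CUresult res = cuCtxCreate(&ctx, 0, dev);
    if (res != CUDA_SUCCESS) {
        const char* msg = nullptr;
        cuGetErrorString(res, &msg);
        printf("cuCtxCreate failed: %s (%d)\n", msg ? msg : "unknown", (int) res);
        return 1;
    }
    printf("cuCtxCreate succeeded.\n");
    cuCtxDestroy(ctx);
    return 0;
}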

I will make a request to my institute to see if they can fix my case, but they might decline it.

I hope they realize the current configuration prevents you from running a lot of CUDA programs.

duongtd23 commented 1 month ago

@prybicki Sorry, I forgot to paste the link even though I intended to do so. Check the accepted answer in this post: https://forums.developer.nvidia.com/t/cuda-failure-status-801-error-1-in-buffer-allocation/265302/37.

I am not familiar with CUDA driver API development; I just found that answer by Googling and thought it might be the solution for my case.

My point is that it is not only async or unified memory that's preventing RGL from working on your machine.

Having read what you wrote, however, I doubt that enabling "unified memory" will solve my issue. Anyway, let me try it and see what happens. Thanks so much for your help! If you have any other suggestions, I would appreciate them very much.