NVIDIA-Genomics-Research / rapids-single-cell-examples

Examples of single-cell genomic analysis accelerated with RAPIDS

Residual GPU Memory usage #96

Open · r614 opened this issue 2 years ago

r614 commented 2 years ago

Hi! I am trying to use the scanpy RAPIDS functions to run multiple parallel operations on a server.

The problem I am running into is that after running any scanpy function with RAPIDS enabled, some GPU memory remains allocated after the function call has returned. I am assuming this is either a memory leak or the result itself being kept on the GPU.

During the scanpy.pp.neighbors + scanpy.tl.umap call:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   51C    P0    60W /  70W |   7613MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

After the function has returned:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   51C    P0    35W /  70W |   1564MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

We aren't running any GPU load besides the UMAP call, and idle memory usage is ~75 MiB.

Happy to elaborate more and help find a fix for this. Not sure if I am missing something really easy (maybe a cupy.asnumpy somewhere?), so any info would be super helpful!
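For context, a minimal sketch of the kind of calls we're making (the parameter values and the path are placeholders, not our exact pipeline):

```python
import scanpy as sc

# Placeholder path; our real loading code is more involved.
adata = sc.read_h5ad("dataset.h5ad")

# Neighbors graph + UMAP via scanpy's RAPIDS backend.
sc.pp.neighbors(adata, n_neighbors=15, method="rapids")
sc.tl.umap(adata, method="rapids")

# After this returns, nvidia-smi still reports ~1.5 GiB in use on the GPU.
```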

r614 commented 2 years ago

Follow-up issues after more experimentation, not sure if related:

cjnolet commented 2 years ago

@r614 A CUDA context is created on the GPU before any kernels execute; it stores some metadata and other things like loaded libraries. The context is usually initialized on the first call into the CUDA runtime API (launching a kernel, for example) and generally lasts for the lifetime of the process. This small amount of memory (on the order of tens to hundreds of MB) is expected.

IIRC, Scanpy copies results back to the CPU, and the GPU memory should eventually be cleaned up when the corresponding Python objects are cleaned up. However, it's always possible this doesn't happen immediately and requires waiting for the garbage collector.

Managed memory is a slight exception to the above. You can use it to oversubscribe the GPU memory so you don't immediately get out-of-memory errors, but that comes at the cost of increased thrashing potential as memory is paged into and out of the GPU as needed. Unfortunately, PCA does require computing the eigenpairs of a covariance matrix, which in your case would have 24929^2 entries, roughly 2.5 GB of 32-bit float values.
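For reference, the pattern our example notebooks use to enable managed memory looks roughly like this (the exact RMM/CuPy allocator hook can move between RMM versions, so treat it as a sketch):

```python
import cupy as cp
import rmm

# Re-initialize RMM with CUDA managed (unified) memory so allocations
# can oversubscribe physical GPU memory.
rmm.reinitialize(managed_memory=True)

# Route CuPy allocations through RMM as well.
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
```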

I recall at one point there was an additional limit imposed by the eigensolver itself (from cusolver directly), which wouldn't allow the number of columns squared to be larger than 2^31. That seems like it might be what's happening here. Can you post the output of conda list? I think this bug was fixed recently, but I can't recall whether the fix made it into CUDA 11.5.

Another benefit of the highly variable gene feature selection we do in our examples is that it avoids these limitations in the PCA altogether (see the sketch below).
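Roughly (the gene and component counts are just illustrative, and the path is a placeholder):

```python
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # placeholder path

# Keep only the top highly variable genes so the covariance matrix inside
# PCA has n_top_genes^2 entries rather than 24929^2.
sc.pp.highly_variable_genes(adata, n_top_genes=5000, flavor="cell_ranger")
adata = adata[:, adata.var.highly_variable]

sc.pp.pca(adata, n_comps=50)
```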

r614 commented 2 years ago

Thanks for the detailed reply!

Do you know if there's a workaround for forcing a new context / garbage collection at the API level, maybe something akin to torch.cuda.empty_cache()? The garbage collection doesn't seem to trigger even after long periods of inactivity, and wrapping each scanpy/CUDA task in its own sub-process would add a lot of complexity.
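For concreteness, is something like this (untested guess on my part) the right lever, or does the memory live somewhere RAPIDS-specific?

```python
import gc
import cupy as cp

# After the scanpy call returns and we've dropped our references
# to any device arrays:
gc.collect()

# Hand the blocks cached in CuPy's memory pools back to the driver,
# roughly the CuPy analogue of torch.cuda.empty_cache().
cp.get_default_memory_pool().free_all_blocks()
cp.get_default_pinned_memory_pool().free_all_blocks()
```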

Will post the conda list output once I get my environment up again later today. Would also love to get a stable managed-memory setup working: how much GPU memory would you recommend for computations on a dataset of this size? We hit this on a 16 GB GPU and ran into OOM errors without unified memory.