Memory error solved by emptying CUDA cache

theophilec commented 3 years ago

Hi FALKON team!

While using Falkon, I stumbled on what looks like a memory bug in the library.

Code to reproduce

import torch
from falkon.kernels import LinearKernel
from falkon import Falkon

n = 50000
d = 51000
l = 10

X = torch.randn((n, d))
y = torch.nn.functional.one_hot(torch.randint(0, l, (n,))).float()
sigma = 1
penalties = [1e-4, 1e-5]
for i in range(2):
    print(f"{i}")
    kernel = LinearKernel(sigma=sigma)
    model = Falkon(
        kernel=kernel,
        penalty=penalties[i],
        M=40000,
        maxiter=10,
        seed=0,
    )
    model.fit(X, y)
    predictions = model.predict(X)
    # torch.cuda.empty_cache()
    # If the line above stays commented out, FALKON raises a CUDA out-of-memory error.

Expected behavior

Fit different models, no errors.

Actual behavior

RuntimeError: CUDA out of memory. Tried to allocate 9.91 GiB (GPU 1; 31.75 GiB total capacity; 12.38 GiB already allocated; 8.59 GiB free; 21.75 GiB reserved in total by PyTorch)

In the above code,

adding the torch.cuda.empty_cache() line eliminates the issue.
changing d = 51000 to d = 30000 eliminates the issue.

Environment

Ubuntu 18.04 LTS
4x Tesla V100 with 32 GB RAM each, CUDA 11.0
Python 3.8.8 with pytorch 1.9
Falkon compiled with pip

Let me know if I can provide any further information or assistance in fixing the issue! Thanks!

Giodiro commented 3 years ago

Hi @theophilec ! Thanks for the bug report!

In trying to reproduce the OOM I set up an environment similar to yours (the main differences are the GPU model and CUDA 11.1 instead of 11.0), but I cannot seem to trigger the out-of-memory error. That said, I remember running into issues like this (memory not freed between runs) before, so I am fairly confident that with a bit more help from your side the issue will become clearer.

If possible, could you run the same code with a few more debugging statements added? Something like this would be helpful:

import torch
from falkon.kernels import LinearKernel
from falkon import Falkon, FalkonOptions

n = 50000
d = 51000
l = 10

X = torch.randn((n, d))
y = torch.nn.functional.one_hot(torch.randint(0, l, (n,))).float()
opt = FalkonOptions(debug=True)
sigma = 1
penalties = [1e-4, 1e-5]
for i in range(2):
    print(f"Fitting {i}")
    kernel = LinearKernel(sigma=sigma, opt=opt)
    model = Falkon(
        kernel=kernel,
        penalty=penalties[i],
        M=40000,
        maxiter=10,
        seed=0,
        options=opt,
    )
    model.fit(X, y)
    predictions = model.predict(X)
    print(f"Memory used after fit {i}")
    for device in range(torch.cuda.device_count()):
        print(torch.cuda.memory_summary(device))
    # torch.cuda.empty_cache()
    # If the line above stays commented out, FALKON raises a CUDA out-of-memory error.

theophilec commented 3 years ago

Sounds good @Giodiro!

Here is the result of the execution:

Fitting 0
MainProcess.MainThread::[Calcuating Preconditioner of size 40000]
Preconditioner will run on 4 GPUs
--MainProcess.MainThread::[Kernel]
--MainProcess.MainThread::[Kernel] complete in 30.483s
--MainProcess.MainThread::[Cholesky 1]
Using in-core POTRF
--MainProcess.MainThread::[Cholesky 1] complete in 2.679s
--MainProcess.MainThread::[Copy triangular]
--MainProcess.MainThread::[Copy triangular] complete in 0.680s
--MainProcess.MainThread::[LAUUM(CUDA)]
--MainProcess.MainThread::[LAUUM(CUDA)] complete in 1.934s
--MainProcess.MainThread::[Cholesky 2]
Using in-core POTRF
--MainProcess.MainThread::[Cholesky 2] complete in 3.059s
MainProcess.MainThread::[Calcuating Preconditioner of size 40000] complete in 38.836s
50000*40000 Kernel matrix will be stored
MainProcess.MainThread::[Computing Falkon iterations]
Optimizer will run on 4 GPUs
MainProcess.MainThread::[Computing Falkon iterations] complete in 9.166s
Memory used after fit 0
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |   24124 MB |   61809 MB |   61809 MB |
|       from large pool |       0 B  |   24124 MB |   61809 MB |   61809 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |   24124 MB |   61809 MB |   61809 MB |
|       from large pool |       0 B  |   24124 MB |   61809 MB |   61809 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   12126 MB |   25950 MB |   61732 MB |   49606 MB |
|       from large pool |   12124 MB |   25946 MB |   61722 MB |   49598 MB |
|       from small pool |       2 MB |       4 MB |      10 MB |       8 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |    9791 KB |   27595 KB |   27595 KB |
|       from large pool |       0 B  |    7743 KB |    9163 KB |    9163 KB |
|       from small pool |       0 B  |    2047 KB |   18432 KB |   18432 KB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       2    |       9    |      18    |      16    |
|       from large pool |       1    |       7    |      13    |      12    |
|       from small pool |       1    |       2    |       5    |       4    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       7    |      24    |      24    |
|       from large pool |       0    |       5    |       6    |       6    |
|       from small pool |       0    |       2    |      18    |      18    |
|===========================================================================|

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |   25482 MB |   76711 MB |   76711 MB |
|       from large pool |       0 B  |   25482 MB |   76711 MB |   76711 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |   25482 MB |   76711 MB |   76711 MB |
|       from large pool |       0 B  |   25482 MB |   76711 MB |   76711 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   12126 MB |   27306 MB |   64428 MB |   52302 MB |
|       from large pool |   12124 MB |   27300 MB |   64416 MB |   52292 MB |
|       from small pool |       2 MB |       6 MB |      12 MB |      10 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |    4066 MB |    8158 MB |    8158 MB |
|       from large pool |       0 B  |    4064 MB |    8136 MB |    8136 MB |
|       from small pool |       0 B  |       1 MB |      21 MB |      21 MB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       7    |      36    |      36    |
|       from large pool |       0    |       5    |      16    |      16    |
|       from small pool |       0    |       2    |      20    |      20    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       7    |      36    |      36    |
|       from large pool |       0    |       5    |      16    |      16    |
|       from small pool |       0    |       2    |      20    |      20    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       2    |      10    |      19    |      17    |
|       from large pool |       1    |       7    |      13    |      12    |
|       from small pool |       1    |       3    |       6    |       5    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       4    |      27    |      27    |
|       from large pool |       0    |       2    |       7    |       7    |
|       from small pool |       0    |       2    |      20    |      20    |
|===========================================================================|

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 2                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from large pool |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from large pool |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   12126 MB |   27304 MB |   64426 MB |   52300 MB |
|       from large pool |   12124 MB |   27300 MB |   64416 MB |   52292 MB |
|       from small pool |       2 MB |       4 MB |      10 MB |       8 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |    5133 KB |   25897 KB |   25897 KB |
|       from large pool |       0 B  |    3086 KB |    7465 KB |    7465 KB |
|       from small pool |       0 B  |    2047 KB |   18432 KB |   18432 KB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       2    |       9    |      18    |      16    |
|       from large pool |       1    |       7    |      13    |      12    |
|       from small pool |       1    |       2    |       5    |       4    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       4    |      23    |      23    |
|       from large pool |       0    |       2    |       5    |       5    |
|       from small pool |       0    |       2    |      18    |      18    |
|===========================================================================|

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 3                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from large pool |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from large pool |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   12126 MB |   27304 MB |   64426 MB |   52300 MB |
|       from large pool |   12124 MB |   27300 MB |   64416 MB |   52292 MB |
|       from small pool |       2 MB |       4 MB |      10 MB |       8 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |    5133 KB |   25897 KB |   25897 KB |
|       from large pool |       0 B  |    3086 KB |    7465 KB |    7465 KB |
|       from small pool |       0 B  |    2047 KB |   18432 KB |   18432 KB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       2    |       9    |      18    |      16    |
|       from large pool |       1    |       7    |      13    |      12    |
|       from small pool |       1    |       2    |       5    |       4    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       4    |      23    |      23    |
|       from large pool |       0    |       2    |       5    |       5    |
|       from small pool |       0    |       2    |      18    |      18    |
|===========================================================================|

Fitting 1
MainProcess.MainThread::[Calcuating Preconditioner of size 40000]
Preconditioner will run on 4 GPUs
--MainProcess.MainThread::[Kernel]
--MainProcess.MainThread::[Kernel] complete in 17.233s
MainProcess.MainThread::[Calcuating Preconditioner of size 40000] complete in 17.233s
Traceback (most recent call last):
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/utils/threading.py", line 16, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/mmv_ops/fmm_cuda.py", line 144, in _generic_fmm
    gX2_list = [create_same_stride((m, d), X2, gpu_dtype, tc_device) for _ in range(num_streams)]
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/mmv_ops/fmm_cuda.py", line 144, in <listcomp>
    gX2_list = [create_same_stride((m, d), X2, gpu_dtype, tc_device) for _ in range(num_streams)]
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/utils/tensor_helpers.py", line 76, in create_same_stride
    return create_C(size, dtype, device, pin_memory)
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/utils/tensor_helpers.py", line 156, in create_C
    return _new_strided_tensor(tuple(size), stride, dtype, device, pin_memory)
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/utils/tensor_helpers.py", line 27, in _new_strided_tensor
    return torch.empty_strided(
RuntimeError: CUDA out of memory. Tried to allocate 9.91 GiB (GPU 1; 31.75 GiB total capacity; 12.38 GiB already allocated; 8.59 GiB free; 21.75 GiB reserved in total by PyTorch)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/theophile/falkon_bug/falkon_memory_bug_reproduction_2.py", line 25, in <module>
    model.fit(X, y)
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/models/falkon.py", line 209, in fit
    precond.init(ny_points, weight_vec=ny_weight_vec)
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/preconditioner/flk_preconditioner.py", line 101, in init
    self.kernel(X, X, out=C, opt=self.params)
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/kernels/kernel.py", line 172, in __call__
    return mm_impl(X1, X2, self, out, params)
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/mmv_ops/fmm_cuda.py", line 283, in fmm_cuda
    _start_wait_processes(_generic_fmm, args)
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/mmv_ops/utils.py", line 55, in _start_wait_processes
    p.join()
  File "/home/theophile/anaconda3/lib/python3.8/site-packages/falkon/utils/threading.py", line 23, in join
    raise RuntimeError('Exception in thread %s' % (self.name)) from self.exc
RuntimeError: Exception in thread GPU-1

theophilec commented 3 years ago

PS: if I empty the cache and then show the summaries again, the ~12 GB of Cur Usage under GPU reserved memory disappears. See below:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |   24124 MB |   61809 MB |   61809 MB |
|       from large pool |       0 B  |   24124 MB |   61809 MB |   61809 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |   24124 MB |   61809 MB |   61809 MB |
|       from large pool |       0 B  |   24124 MB |   61809 MB |   61809 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |   25950 MB |   61732 MB |   61732 MB |
|       from large pool |       0 B  |   25946 MB |   61722 MB |   61722 MB |
|       from small pool |       0 B  |       4 MB |      10 MB |      10 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |    9791 KB |   27595 KB |   27595 KB |
|       from large pool |       0 B  |    7743 KB |    9163 KB |    9163 KB |
|       from small pool |       0 B  |    2047 KB |   18432 KB |   18432 KB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       9    |      18    |      18    |
|       from large pool |       0    |       7    |      13    |      13    |
|       from small pool |       0    |       2    |       5    |       5    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       7    |      24    |      24    |
|       from large pool |       0    |       5    |       6    |       6    |
|       from small pool |       0    |       2    |      18    |      18    |
|===========================================================================|

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |   25482 MB |   76711 MB |   76711 MB |
|       from large pool |       0 B  |   25482 MB |   76711 MB |   76711 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |   25482 MB |   76711 MB |   76711 MB |
|       from large pool |       0 B  |   25482 MB |   76711 MB |   76711 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |   27306 MB |   64428 MB |   64428 MB |
|       from large pool |       0 B  |   27300 MB |   64416 MB |   64416 MB |
|       from small pool |       0 B  |       6 MB |      12 MB |      12 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |    4066 MB |    8158 MB |    8158 MB |
|       from large pool |       0 B  |    4064 MB |    8136 MB |    8136 MB |
|       from small pool |       0 B  |       1 MB |      21 MB |      21 MB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       7    |      36    |      36    |
|       from large pool |       0    |       5    |      16    |      16    |
|       from small pool |       0    |       2    |      20    |      20    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       7    |      36    |      36    |
|       from large pool |       0    |       5    |      16    |      16    |
|       from small pool |       0    |       2    |      20    |      20    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |      10    |      19    |      19    |
|       from large pool |       0    |       7    |      13    |      13    |
|       from small pool |       0    |       3    |       6    |       6    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       4    |      27    |      27    |
|       from large pool |       0    |       2    |       7    |       7    |
|       from small pool |       0    |       2    |      20    |      20    |
|===========================================================================|

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 2                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from large pool |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from large pool |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |   27304 MB |   64426 MB |   64426 MB |
|       from large pool |       0 B  |   27300 MB |   64416 MB |   64416 MB |
|       from small pool |       0 B  |       4 MB |      10 MB |      10 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |    5133 KB |   25897 KB |   25897 KB |
|       from large pool |       0 B  |    3086 KB |    7465 KB |    7465 KB |
|       from small pool |       0 B  |    2047 KB |   18432 KB |   18432 KB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       9    |      18    |      18    |
|       from large pool |       0    |       7    |      13    |      13    |
|       from small pool |       0    |       2    |       5    |       5    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       4    |      23    |      23    |
|       from large pool |       0    |       2    |       5    |       5    |
|       from small pool |       0    |       2    |      18    |      18    |
|===========================================================================|

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 3                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from large pool |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from large pool |       0 B  |   25482 MB |   64504 MB |   64504 MB |
|       from small pool |       0 B  |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |   27304 MB |   64426 MB |   64426 MB |
|       from large pool |       0 B  |   27300 MB |   64416 MB |   64416 MB |
|       from small pool |       0 B  |       4 MB |      10 MB |      10 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |    5133 KB |   25897 KB |   25897 KB |
|       from large pool |       0 B  |    3086 KB |    7465 KB |    7465 KB |
|       from small pool |       0 B  |    2047 KB |   18432 KB |   18432 KB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       7    |      32    |      32    |
|       from large pool |       0    |       5    |      14    |      14    |
|       from small pool |       0    |       2    |      18    |      18    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       9    |      18    |      18    |
|       from large pool |       0    |       7    |      13    |      13    |
|       from small pool |       0    |       2    |       5    |       5    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       4    |      23    |      23    |
|       from large pool |       0    |       2    |       5    |       5    |
|       from small pool |       0    |       2    |      18    |      18    |
|===========================================================================|

Giodiro commented 3 years ago

Okay, so it seems that memory is in fact being freed correctly (as shown by allocated memory being 0 at the end of fit). I think the error message gives a clue, in that the free memory on the GPU consists of:

the amount not reserved by PyTorch (10 GiB)
the amount that is marked free inside PyTorch's allocator (8.59 GiB)

which together should be enough for the 9.91 GiB allocation to succeed.
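The figures in the error message can be cross-checked with a few lines of arithmetic. The sketch below only parses the message quoted in this thread; the dictionary keys are illustrative names, and reading `reserved - allocated` as "held by PyTorch but currently unused" is an assumption about how the caching allocator reports its numbers. It does not settle which reported figure corresponds to which pool of free memory.

```python
import re

# Error message copied from the report in this thread.
MSG = (
    "CUDA out of memory. Tried to allocate 9.91 GiB (GPU 1; 31.75 GiB total "
    "capacity; 12.38 GiB already allocated; 8.59 GiB free; 21.75 GiB reserved "
    "in total by PyTorch)"
)


def parse_oom(msg):
    """Extract the GiB figures from an OOM message in the format quoted above."""
    pattern = (
        r"Tried to allocate ([\d.]+) GiB .*?([\d.]+) GiB total capacity; "
        r"([\d.]+) GiB already allocated; ([\d.]+) GiB free; "
        r"([\d.]+) GiB reserved"
    )
    match = re.search(pattern, msg)
    if match is None:
        raise ValueError("message does not match the expected format")
    requested, total, allocated, free, reserved = map(float, match.groups())
    return {
        "requested": requested,
        "total": total,
        "allocated": allocated,
        "free": free,
        "reserved": reserved,
        # Derived quantities (assumed interpretation, see the note above).
        "unreserved": total - reserved,
        "cached_unused": reserved - allocated,
    }


info = parse_oom(MSG)
for name, value in info.items():
    print(f"{name:>14}: {value:6.2f} GiB")
```

Whichever pair of figures is summed, the total exceeds the 9.91 GiB request, which is why the failure points at how the free memory is laid out rather than at how much of it there is.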

So it is possible that the memory used by the large allocation in the first call to fit (which succeeds) becomes fragmented at the beginning of the second call to fit, before the large allocation that crashes. That allocation would then fail because there is not enough contiguous free memory, even though enough memory is free in total.
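The fragmentation hypothesis can be illustrated with a toy model. This is not PyTorch's caching allocator: the block sizes are made up, and the only rule modelled is that a request must fit inside a single free block and that free blocks are never merged.

```python
def can_allocate(free_blocks, request):
    """Return True if at least one free block is large enough for the request.

    Toy model: every request needs one contiguous block, and separate free
    blocks cannot be combined.
    """
    return any(block >= request for block in free_blocks)


free_blocks_gib = [6.0, 4.0, 2.0]  # hypothetical fragmented free memory
request_gib = 9.91                 # size of the allocation that fails

total_free = sum(free_blocks_gib)
print(f"total free: {total_free:.2f} GiB, request: {request_gib:.2f} GiB")
print("fits with fragmented blocks:", can_allocate(free_blocks_gib, request_gib))
print("fits if memory were contiguous:", can_allocate([total_free], request_gib))
```

In this model the request fails although the total free memory is larger than the request, which is the situation the hypothesis describes.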

There is a PyTorch issue which is relevant: https://github.com/pytorch/pytorch/issues/35901 and a related pull-request https://github.com/pytorch/pytorch/pull/44742 but I'm not sure whether it will be merged in the next release or not!

To debug this further and make sure that my interpretation is correct, I checked the non-releasable memory just before the failing allocation (https://github.com/FalkonML/falkon/blob/4bae112944212488eaf264f74dd7be9fe6ca5858/falkon/mmv_ops/fmm_cuda.py#L144) and can confirm that it increases between the first and second fit. I'm also playing with the data sizes to see if I can cause a crash on my GPU as well, but haven't managed to yet.
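A minimal sketch of how the non-releasable figure can be read programmatically, assuming a PyTorch build with CUDA. The helper name `non_releasable_bytes` is hypothetical; `torch.cuda.memory_stats` is the PyTorch call that backs `memory_summary()`, and its `inactive_split_bytes` counter is the one shown as "Non-releasable memory". The function returns None when PyTorch or a GPU is unavailable.

```python
def non_releasable_bytes(device=0):
    """Return the current non-releasable bytes on `device`, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    stats = torch.cuda.memory_stats(device)
    # "inactive_split_bytes" is the counter behind the
    # "Non-releasable memory" row printed by torch.cuda.memory_summary().
    return stats.get("inactive_split_bytes.all.current", 0)


value = non_releasable_bytes(0)
if value is None:
    print("CUDA not available, nothing to report")
else:
    print(f"non-releasable on device 0: {value / 2**20:.1f} MiB")
```

Calling this from user code before and after each fit is only a coarse proxy. Reproducing the check described above would require placing the call inside the library, next to the allocation that fails.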

Not sure if you have any ideas on how to reduce fragmentation, but I think the cleanest / easiest course of action until the PyTorch oversize blocks patch lands is to empty the PyTorch allocator after each iteration (which is what you've been doing!).
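The workaround can be wrapped in a small helper, sketched here under the assumption that nothing else needs the cached blocks between fits. The name `release_cached_gpu_memory` is hypothetical; `torch.cuda.empty_cache()` is the same call used in the reproduction script.

```python
import gc


def release_cached_gpu_memory():
    """Drop unreachable Python objects, then return cached CUDA blocks to the driver.

    Returns True if the CUDA cache was emptied, False if PyTorch or CUDA
    is not available.
    """
    gc.collect()  # make sure dead tensors are actually released first
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


emptied = release_cached_gpu_memory()
print("CUDA cache emptied:", emptied)
```

In the reproduction script, the call would go at the end of each loop iteration, after `model.predict(X)`.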

I'm sorry this is a bit of a disappointing fix!

theophilec commented 3 years ago

This makes sense, thanks.

Indeed, the PyTorch issue seems relevant. Do you know why I wasn't observing non-releasable memory in my memory_summary() calls while you were?

No problem in any case: the current state of affairs works for me for now. Hopefully, they will merge the PR. :)
