Thanks for raising this issue. What version of PyTorch are you using? A quick calculation suggests ~31 MiB of memory usage (8 bytes per float64 entry, and the Gram matrix would be 2000x2000). Running a simple test on the CPU shows that this amount is allocated 7 times (due to inefficiencies such as unnecessary copying and non-in-place operations), which we should work on improving.
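For reference, here is the back-of-the-envelope calculation (just my own sketch):

```python
# Back-of-the-envelope: a 2000x2000 Gram matrix of float64 entries.
n = 2000
bytes_per_entry = 8  # float64
gram_bytes = n * n * bytes_per_entry
print(f"{gram_bytes / 2**20:.1f} MiB")  # prints 30.5 MiB, i.e. ~31 MiB
```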
The fact that it allocates 15 GB is way too much though. Does this happen on the first call of that function, or could it be a memory leak from earlier? Does it also happen if you start from a blank slate? What is the shape of the inputs tensor you pass?
I don't have a GPU at hand, but try checking the memory usage statistics for CUDA: https://pytorch.org/docs/stable/cuda.html#memory-management. Perhaps they can pinpoint a particularly large tensor, or memory that doesn't get released (a leak).
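Something along these lines, as a minimal sketch (using `model.gpr.kernel(inputs)` from your example, and assuming the model and inputs are already on the GPU; these are all standard torch.cuda calls):

```python
import torch

torch.cuda.reset_peak_memory_stats()

K = model.gpr.kernel(inputs)  # the call that spikes VRAM

print(torch.cuda.memory_allocated() / 2**20, "MiB allocated now")
print(torch.cuda.max_memory_allocated() / 2**20, "MiB peak during the call")
print(torch.cuda.memory_summary())  # detailed per-pool breakdown
```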
This is in PyTorch 1.13.0+cu116. The issue usually arises after more than one call, but not always... I have kept diagnostics open during the whole script, and the VRAM occupied spikes whenever any kernel-related function is called, e.g. model.gpr.kernel(inputs) or model.gpr.kernel.Ksub(inputs).
inputs is roughly a [3000 x 2] array.
Let me see what we can do to debug this. @felipe-tobar, do you think we can get access to a GPU?
Best regards, Taco de Wolff
Sure, I’ll explain in a different email
This may be due to the use of double precision by default; as of v0.3.5 the default has changed to float32. Unfortunately, PyTorch's VRAM footprint is quite large regardless, since it has to fit the whole CUDA context as well as the PyTorch context.
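You can see that overhead directly: PyTorch's own allocation counter stays near zero while nvidia-smi reports the context (a small sketch):

```python
import torch

x = torch.zeros(1, device="cuda")  # first CUDA op creates the CUDA context
print(torch.cuda.memory_allocated(), "bytes tracked by PyTorch (tiny)")
# nvidia-smi will show hundreds of MB for this process regardless:
# the difference is the CUDA context plus PyTorch's kernels, not your data.
```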
I've done some tests to check the scalability of the various parameters; see the results below. Each variable is varied while the others are kept constant. For the output dimension, the total of 1600 training points is divided evenly over the channels, so each channel gets 1600 / (number of channels) points and the total stays at 1600.
Conclusions:
Therefore, increasing the number of channels will quickly degrade performance as shown, since memory scales quadratically in the number of channels; on top of that it is really slow (14 seconds for 16 output dimensions, 1600 training points in total, 2 input dimensions, 2 mixture components, and 100 iterations).
I've done some small tests regarding memory usage for the MOSM with (output_dims, data_points, input_dims, components):
The increase in memory usage for two input dimensions is surprising; otherwise the scaling looks fine to me. I'm not sure why we allocate 234 MB for a 3000x3000 Gram matrix and all intermediate tensors though; the matrix itself is only 72 MB in float64, so that suggests roughly three matrix-sized intermediates. Still, it isn't nearly as much as 15 GB @sgalee2! Something must be off in your case, and I can't replicate it.
I've reverted the default dtype back to float64 to avoid precision errors. But you can set it manually to float32, which would cut the required VRAM roughly in half.
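If you want to try float32, plain PyTorch offers a global default (sketch below; whether mogptk picks this up depends on how it constructs its tensors, so consider it illustrative):

```python
import torch

# Make new floating-point tensors default to float32 instead of float64.
torch.set_default_dtype(torch.float32)

# The saving for a 3000x3000 Gram matrix:
n = 3000
print(n * n * 8 / 1e6, "MB in float64")  # 72.0 MB
print(n * n * 4 / 1e6, "MB in float32")  # 36.0 MB, i.e. half
```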
It occurred to me that perhaps your input data was incorrect and you really passed 3000 data points per channel, not 150 per channel, for all 20 channels. That would quickly deplete VRAM, since it would require about 18 GB in float64.
I was wrong in my previous post about being able to reduce memory usage further (or so I believe), since PyTorch needs to store all intermediate results (as calculated in the Exact model or the kernel). This is why batching is so important for other GPU workloads, but it is unavailable in our use case: we must fit the entire data set in the model at once, which comes down to $\mathcal{O}(Q M^2 N^2)$ memory usage up to a constant factor, with $Q$ the number of components, $M$ the number of channels, and $N$ the number of data points per channel. This is why reducing your data set, or using inducing points, has been such a prominent advance in the field. A rough estimator of that scaling is sketched below.
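This is my own sketch of that estimate; the `intermediates` factor is a guess, not a measured constant of the library:

```python
def kernel_memory_bytes(Q, M, N, bytes_per_element=8, intermediates=3):
    # O(Q * M^2 * N^2): the Gram matrix over all channels is (M*N) x (M*N),
    # with roughly Q per-component intermediates of the same size on top.
    return intermediates * Q * (M * N) ** 2 * bytes_per_element

# 20 channels at 150 points/channel vs. 3000 points/channel (Q = 2):
print(kernel_memory_bytes(Q=2, M=20, N=150) / 1e9)   # ~0.43 GB
print(kernel_memory_bytes(Q=2, M=20, N=3000) / 1e9)  # ~173 GB, a 400x jump
```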
Considering this issue closed since there is no bug in the library AFAIK.
I am doing some simple analysis of model covariance matrices with the multi-output spectral mixture kernel. The model in question has 20 channels with 150 data points per channel.
This results in a covariance matrix with dimensions less than $10^4 \times 10^4$, which in float64 should be pretty small in storage... However, when I call
model.gpr.kernel(inputs)
the resulting covariance matrix cannot be fully formed before my VRAM (15 GB) is entirely occupied.
Any ideas as to why this memory leak is happening?
Thanks (again)!