Evovest / EvoTrees.jl

Boosted trees in Julia
https://evovest.github.io/EvoTrees.jl/dev/
Apache License 2.0
177 stars 21 forks

how to free gpu memory after training with MLJ interface #171

Closed xgdgsc closed 2 years ago

xgdgsc commented 2 years ago

running

mach = machine(EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100), X, Y, cache=false)
mach1 = machine(EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100), X, Y1, cache=false)
mach2 = machine(EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100), X, Y2, cache=false)
...

could add to gpu memory pool usage by several GBs after each line run. Is it possible to free everything used in GPU training as I would only need CPU when prediction?

jeremiedb commented 2 years ago

Would you have a reproducible example to provide? When trying the creation of various machines (mach1, mach2...), I didn't notice any increase in GPU memory usage. I did notice an increase following calls to fit!, though. However, it appears that the GC mechanics end up reclaiming memory as needed. Did you hit OOM-on-GPU errors?

Also, you could give the dev branch a try, in which I've added a CUDA.reclaim() call, though I'm not clear whether it brings any help.

Finally, regarding predictions on CPU from a GPU-trained model, you can convert a GBTreeGPU to a regular GBTree CPU model using convert:

model_cpu = convert(EvoTrees.GBTree, model);
pred_cpu = predict(model_cpu, x_train)
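The full pattern might look like the following sketch. It assumes the GBTree/GBTreeGPU types mentioned above and that no other reference to the GPU model remains; combining the conversion with the CUDA.reclaim() call mentioned earlier should let CUDA release the training buffers:

```julia
using EvoTrees, CUDA

model_cpu = convert(EvoTrees.GBTree, model)  # CPU copy for prediction
model = nothing                              # drop the last reference to the GPU-resident model
GC.gc(true)                                  # collect the now-dead GPU arrays
CUDA.reclaim()                               # return freed pool blocks to the driver

pred_cpu = predict(model_cpu, x_train)       # prediction now runs entirely on CPU
```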
xgdgsc commented 2 years ago

Yes. I forgot to mention the fit! part. After fitting, it grows from 250MB to 7GB, and after a call to CUDA.reclaim() it drops to 4GB.

I was running a timeseries CV like:

function runMLJModel(targetModel, X, Y; train_size = 0.8,nfolds=5,verbosity=1,cache=false)
    mach = machine(targetModel, X, Y, cache=cache)
    tscv = TimeSeriesCV(; nfolds=nfolds)
    evalResult=evaluate!(mach,  resampling=tscv, measure=[rmse,mae], verbosity=verbosity)#,acceleration=CPUThreads())
    evalResult,mach
end

tree_model_gpu = EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100)

evalResult,mach=runMLJModel(tree_model_gpu, X,Y,train_size=0.8,nfolds=5)
jeremiedb commented 2 years ago

From what I've observed on my end (RTX A4000), there appears to be some instability in training time on GPU, though garbage collection appears to work appropriately, so I don't OOM over repeated runs. Do you experience OOM crashes? Outside of the CUDA.reclaim(), which will be part of v0.12.1 I'm about to release, I'm afraid I don't have other short-term fixes. I'm aware of quite a few caveats regarding the GPU implementation, notably as I relied on some not-so-clean scalar operations in a few places to handle CPU/GPU transfers. In short, I'm pretty sure there are quite a few low-hanging fruits to improve GPU performance, but it's an area I don't expect to be able to invest much effort in over the short term.

xgdgsc commented 2 years ago

Thanks. I tested the latest version, and it seems GPU memory is not freed and I get OOM very soon. Could using TimeSeriesCV be making it more obvious?

jeremiedb commented 2 years ago

Just to clarify, is the situation worse with v0.12.1, or did it simply not bring any improvement? I doubt the usage of TimeSeriesCV could be a cause of the GPU memory issue; I'm pretty sure it has to do with how allocations are handled within EvoTrees. For info, what are the dimensions of your data and your GPU model?

xgdgsc commented 2 years ago
size(X)
(11664400, 36)
size(Y)
(11664400,)
evalResultG, machG = runMLJModel(EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100), X, Y,  train_size=0.8, nfolds=5)

didn't bring any improvement

before:
Memory pool usage: 0 bytes (0 bytes reserved)
after:
Memory pool usage: 8.801 GiB (18.031 GiB reserved)
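For reference, before/after readings in this format can be taken with CUDA.jl's memory_status(), e.g. (a minimal sketch, assuming CUDA.jl is loaded):

```julia
using CUDA

CUDA.memory_status()   # prints GPU memory and pool usage to stdout
# ... run training here ...
CUDA.memory_status()   # compare pool usage after the run
```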
jeremiedb commented 2 years ago

Thanks! Can you share your GPU model (how much memory)? Also, although not related to the memory issue, with v0.12.2 I've just fixed an issue that resulted in a potentially important slowdown when training on GPU with a large number of threads (like 8+). So if your Julia environment uses a large number of threads, you may experience an improvement, in the ~40%-60% range on my end.

xgdgsc commented 2 years ago

RTX 3090 24G

jeremiedb commented 2 years ago

I just performed a test on a dataset of the same size, and it ran successfully on my smaller RTX A4000 with 16G. The test involved looping 10 iterations with fit_evotree. Therefore, I suspect there might be something related to how MLJ saves the models' cache. Could you confirm whether the following approach works fine on your end?

using Revise
using Statistics
using StatsBase: sample
using EvoTrees
using BenchmarkTools
using CUDA

nrounds = 200
nthread = Base.Threads.nthreads()

@info nthread

# EvoTrees params
params_evo = EvoTreeRegressor(
    T=Float64,
    loss="linear",
    nrounds=nrounds,
    alpha=0.5,
    lambda=0.0,
    gamma=0.0,
    eta=0.05,
    max_depth=6,
    min_weight=1.0,
    rowsample=1.0,
    colsample=1.0,
    nbins=64,
    device = "gpu"
)

nobs = Int(11_664_400)
num_feat = Int(36)
@info "testing with: $nobs observations | $num_feat features."
x_train = rand(nobs, num_feat)
y_train = rand(size(x_train, 1))

@info "evotrees train GPU:"
params_evo.device = "gpu"
@time m_evo_gpu = fit_evotree(params_evo; x_train, y_train);
for i in 1:5
    @time m_evo_gpu = fit_evotree(params_evo; x_train, y_train);
end

Note that using T=Float32 instead of Float64 should also help keep memory under control and improve training speed. I could nonetheless fit a depth of 6 with Float64 on a 16G GPU.

xgdgsc commented 2 years ago

Memory pool usage still goes up from 200MB to 4GB after running for me.

jeremiedb commented 2 years ago

If there's no longer an OOM or other breaking error, I'm not sure I understand how the 4GB memory consumption is actually problematic. Could you clarify?

Otherwise, adding a GC call prior to CUDA.reclaim seems to help release more memory. That is, something like:

GC.gc(true)
CUDA.reclaim()
xgdgsc commented 2 years ago

Thanks. It's fine after GC.gc(true).

jeremiedb commented 2 years ago

A gc + reclaim has been added for GPU following the fit_evotree routine in 0.12.4. It needs to be called manually when using MLJ, since having a gc call after each tree would impair fitting performance (as much as 50% of time spent on GC), while memory is otherwise properly handled by CUDA during the fitting process.
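When going through MLJ, the manual cleanup described above might look like this sketch (assuming a fitted machine as in the earlier examples):

```julia
using MLJ, EvoTrees, CUDA

model = EvoTreeRegressor(loss=:linear, device="gpu", max_depth=5, eta=0.01, nrounds=100)
mach = machine(model, X, Y, cache=false)
fit!(mach)

# MLJ does not trigger the post-fit cleanup that fit_evotree does in 0.12.4,
# so release GPU memory manually between fits:
GC.gc(true)      # collect unreferenced GPU buffers
CUDA.reclaim()   # hand freed pool memory back to the driver
```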