pkese opened this issue (closed 3 years ago):

I've built a test model with TorchSharp and it's chewing through somewhere between 0.5 and 1 GB of memory per second while training, until it runs out of memory (all running on CPU).

Following the hints in https://github.com/xamarin/TorchSharp/blob/master/docfx/articles/memory.md I'm keeping a list of all tensors that I touch inside each training batch and then manually calling `.Dispose()` on all those referenced tensors after each batch. In addition I do a manual `GC.Collect()` every N batches.

Are there any additional hints on what to do or how to avoid memory problems?

My models are using `NN.Embedding` and `tensor.conv1d`.
This certainly sounds like a bug. Something is not disposing.
I don't have any specific guidelines - can you share the code so we can assess the primitives you're using?
I'll try to extract and share the gist of that code in a way that should keep my employer and their clients happy.
But I may take a day or two (or over the weekend) to find time for that.
I have some slightly updated advice waiting in a PR. The source is here: https://github.com/NiklasGustafsson/TorchSharp/blob/examples/docfx/articles/memory.md
Basically, you have to be very aggressive with GC -- for some training, I have had to set N=1, i.e. call GC.Collect() every batch. Granted, this has been with GPU-based training, so it may be a little bit different, but the general problem is the same.
It's worth taking a look at how data is loaded -- you may have to discard not just temporaries inside the calculation, but also the input tensors after each batch. I've not had to do that on any of the sample data sets, but if you have huge inputs, then you may have to deal with the unfortunate issue of re-loading input from disk every epoch.
Anyway, not sure any of this addresses your problem, it's just some random thoughts.
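To make the disposal pattern concrete, here is a minimal sketch of what "dispose everything, collect every batch" looks like -- no real model or optimizer, just the bookkeeping:

```fsharp
open System
open TorchSharp
open TorchSharp.Tensor

let device = Torch.InitializeDevice Device.CPU

// One "batch": keep a handle on every tensor created here and dispose
// them all afterwards -- inputs and temporaries alike.
let step () =
    let xs   = Float32Tensor.rand([| 64L; 32L |], device, false)
    let ys   = xs.exp()      // temporary
    let loss = ys.mean()     // temporary
    // loss.backward() and optimizer.step() would go here in real training
    for t in [ xs; ys; loss ] do t.Dispose()

for _ in 0 .. 99 do
    step ()
    GC.Collect()   // N = 1, i.e. collect on every batch
```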
This turned out to be quite interesting ...
The model works fine when the parameters (`logKernel`, `logScale`, `logEmbeddings`) are defined as private variables:

```fsharp
type Model(device, nTraces, convLen, nComponents) =
    let logKernel = Float32Tensor.rand([|nComponents; 1L; convLen|], device, true)
    let logScale = Float32Tensor.from([|-7.0f|], true)
    let logEmbeddings = TorchSharp.NN.Modules.Embedding(nTraces, nComponents)

    member this.parameters = [| logKernel; logScale; logEmbeddings.Weight |]

    member n.forward (xs: TorchTensor, indices: TorchTensor) =
        let factors = (logEmbeddings.forward indices + logScale).exp().unsqueeze(2L)
        let kernel = logKernel.exp()
        let ins = xs.unsqueeze(1L).expand([|-1L; nComponents; -1L|])
        let compOuts = ins.conv1d(kernel, groups=nComponents)
        let outs = (compOuts * factors).sum([|1L|], keepDimension=false) + epsilon //* globalFactors.exp()
        outs
```
However, if these parameters are defined as class properties (below), then the memory issue occurs (consuming about 500 MB per second while training):

```fsharp
type Model(device, nTraces, convLen, nComponents) =
    member _.logKernel = Float32Tensor.zeros([|nComponents; 1L; convLen|], device, true)
    member _.logScale = Float32Tensor.from([|-7.0f|], true)
    member _.logEmbeddings = TorchSharp.NN.Modules.Embedding(nTraces, nComponents)

    member n.parameters = [| n.logKernel; n.logScale; n.logEmbeddings.Weight |]

    member n.forward (xs: TorchTensor, indices: TorchTensor) =
        let factors = (n.logEmbeddings.forward indices + n.logScale).exp().unsqueeze(2L)
        let kernel = n.logKernel.exp()
        let ins = xs.unsqueeze(1L).expand([|-1L; nComponents; -1L|])
        let compOuts = ins.conv1d(kernel, groups=nComponents)
        let outs = (compOuts * factors).sum([|1L|], keepDimension=false) + epsilon //* globalFactors.exp()
        outs
```
I'll post the whole notebook somewhere (GitHub doesn't want to render the new-format .ipynb gists).
Nah, I can't make GitHub show the notebook (even if I delete all the graphics), but I've uploaded the file to https://github.com/pkese/fs-codespace-test/blob/main/TorchSharp-Leak-Test.ipynb
It can be downloaded and viewed in a notebook editor (I'm using vscode-insiders for editing).
...or here's the code:
#r "nuget: libtorch-cpu, 1.8.0.7"
#r "nuget: TorchSharp, 0.91.52518"
open TorchSharp
open TorchSharp.Tensor
open TorchSharp.NN
let device = Torch.InitializeDevice Device.CPU
(*
#### PCA Deconvolution
We render 1000s of traces with 1D random `xs` and calculated `ys` and wish to infer parameters of transform.
The transform is done by:
- choosing a random mixture of 4 component functions (e.g. sin(x), cos(x))
- storing the mixture into a kernel
- convolving the kernel over `xs` to get `ys`
The task is to reconstruct the:
- shape of 4 components forming kernels
- mixture weights of 4 components for each sample (embedding lookups)
*)
let xsLen = 32
let convLen = 16
let components = [| // component functions that we will later try to reconstruct (deconvolve)
fun x -> Math.Sin (x/float convLen*3.14) * 0.5 + 0.5
fun x -> Math.Cos (x/float convLen*3.14) * 0.5 + 0.5
fun x -> -Math.Sin (x/float convLen*3.14) * 0.5 + 0.5
fun x -> -Math.Cos (x/float convLen*3.14) * 0.5 + 0.5
|]
let nComponents = components.Length
let random = Random()
module SampleGenerator =
let dirichlet n =
let ps = Array.init n (fun _ -> random.NextDouble()**2.0)
let sum = ps |> Array.sum
ps |> Array.map (fun p -> p / sum)
let randomComponentMixtureKernel len =
let mixture = dirichlet components.Length
let compMix = Array.zip components mixture
Array.init len (fun i -> compMix |> Array.sumBy (fun (compFn,weight) -> (compFn (float i) * weight)))
let conv (kernel: float[]) (xs: float[]) =
Array.init (xs.Length-kernel.Length+1) (fun i ->
let mutable sum = 0.0
for j in 0 .. kernel.Length-1 do sum <- sum + xs.[i+j] * kernel.[j]
sum)
let flipCoin (rate: float) (trials: float) = Math.Round(trials * rate)
let renderSample xsLen convLen =
let kernel = randomComponentMixtureKernel convLen
let xs = Array.init xsLen (fun _ -> random.Next 1000 |> float)
let rate = random.NextDouble() * 0.01
let ys = xs |> conv kernel |> Array.map (flipCoin rate)
xs, ys
//SampleGenerator.renderSample xsLen convLen
let nTraces = 10000
let renderDataset xsLen convLen nTraces =
let tensor (xs: float[]) = Float32Tensor.from(Array.map float32 xs, false)
let xs, ys =
Array.init nTraces (fun _ ->
let xs, ys = SampleGenerator.renderSample xsLen convLen
tensor xs, tensor ys)
|> Array.unzip
let mutable i0=0
fun batchSize ->
let xs, ys, indices =
Array.init batchSize (fun i ->
i0 <- (i0+1) % nTraces
xs.[i0], ys.[i0], i0)
|> Array.unzip3
xs.stack 0L, ys.stack 0L, Int32Tensor.from(indices, false)
let generateBatch = renderDataset xsLen convLen nTraces
let inline fscalar x = TorchScalar.op_Implicit (float32 x)
let epsilon = fscalar 10e-12
// this model works fine
type Model(device, nTraces, convLen, nComponents) =
inherit CustomModule("deconv")
let nTraces, nComponents, convLen = int64 nTraces, int64 nComponents, int64 convLen
let logKernel = Float32Tensor.rand([|nComponents; 1L; convLen|], device, true)
let logScale = Float32Tensor.from([|-7.0f|], true)
let logEmbeddings = TorchSharp.NN.Modules.Embedding( nTraces, nComponents )
member this.parameters = [| logKernel; logScale; logEmbeddings.Weight |]
override _.forward (x:TorchTensor) = failwithf "wrong method"
member n.forward (xs:TorchTensor, indices:TorchTensor) =
let factors = (logEmbeddings.forward indices + logScale).exp().unsqueeze(2L)
let kernel = logKernel.exp()
let ins = xs.unsqueeze(1L).expand([|-1L; nComponents; -1L;|])
//printfn "ins=%A kernel=%A" ins.shape kernel.shape
let compOuts = ins.conv1d(kernel, groups=nComponents)
//printfn "conv=%A, factors=%A, kernel=%A" compOuts.shape factors.shape kernel.shape
let outs = (compOuts * factors).sum([|1L|], keepDimension=false) + epsilon //* globalFactors.exp()
outs
member _.Kernel with get () = logKernel.exp()
member _.Scale with get () = logScale
member n.modelLoss() =
(logKernel.exp().sum([|2L|], keepDimension=true) - fscalar (convLen/2L)).abs().mean()
//+ (logEmbeddings.Weight.exp().sum([|1L|], keepDimension=true) - fscalar 1.0).mean()
// this model leaks memory
(*
type Model(device, nTraces, convLen, nComponents) =
inherit CustomModule("deconv")
let nTraces, nComponents, convLen = int64 nTraces, int64 nComponents, int64 convLen
member _.logKernel = Float32Tensor.zeros([|nComponents; 1L; convLen|], device, true)
member _.logScale = Float32Tensor.from([|-7.0f|], true)
member _.logEmbeddings = TorchSharp.NN.Modules.Embedding( nTraces, nComponents )
member n.parameters = [| n.logKernel; n.logScale; n.logEmbeddings.Weight |]
override _.forward (x:TorchTensor) = failwithf "wrong method"
member n.forward (xs:TorchTensor, indices:TorchTensor) =
let factors = (n.logEmbeddings.forward indices + n.logScale).exp().unsqueeze(2L)
let kernel = n.logKernel.exp()
let ins = xs.unsqueeze(1L).expand([|-1L; nComponents; -1L;|])
//printfn "ins=%A kernel=%A" ins.shape kernel.shape
let compOuts = ins.conv1d(kernel, groups=nComponents)
//printfn "conv=%A, factors=%A, kernel=%A" compOuts.shape factors.shape kernel.shape
let outs = (compOuts * factors).sum([|1L|], keepDimension=false) + epsilon //* globalFactors.exp()
outs
member n.Kernel with get () = n.logKernel.exp()
member n.Scale with get () = n.logScale
member n.modelLoss() =
(n.logKernel.exp().sum([|2L|], keepDimension=true) - fscalar (convLen/2L)).abs().mean()
*)
let net = new Model(device, nTraces, convLen, nComponents)
let inline xlogy(x:TorchTensor, y:TorchTensor) = x * y.log() // xlogy is not exposed in TorchSharp
let poissonLoss (k:TorchTensor) (mu:TorchTensor) =
let logPmf = xlogy(k,mu) - (k+fscalar 1.0).lgamma() - mu
-logPmf
let criterion ys ys' = (poissonLoss ys ys').clamp_max(fscalar 10000.0).mean()
let optimizer = TorchSharp.NN.Optimizer.Adam(net.parameters, 0.02)
//let optimizer = TorchSharp.NN.Optimizer.SGD(net.parameters, 0.1)
let batchSize = 384
let mutable cumLoss = 0.0
let mutable nItems = 0
for i in 0..200000 do
optimizer.zero_grad()
let xs, ys, indices = generateBatch batchSize
let ys' = net.forward(xs,indices)
let loss = criterion ys ys' + net.modelLoss()
loss.backward()
optimizer.step()
cumLoss <- cumLoss + loss.ToDouble()
nItems <- nItems + 1
if i%10000 = 0 then
System.Console.Write $"step %6d{i}: loss=%.4f{cumLoss / float nItems} loss0=%.4f{loss.ToDouble()} scale=%.4f{net.Scale.ToDouble()}"
cumLoss <- 0.0
nItems <- 0
if i%1000 = 0 then
GC.Collect()
//#r "nuget: Plotly.NET, 2.0.0-beta9"
#r "nuget: Plotly.NET.Interactive, 2.0.0-beta9"
open Plotly.NET
// helper
type TorchTensor with
member t.toArray() =
match t.shape with
| [|n|] -> Array.init (int n) (fun i -> t.[int64 i].ToSingle())
| _ -> failwithf "requires 1-dimensional tensor, got %A" t.shape
let traces = [
for i in 0L..3L ->
let ys = net.Kernel.[i].[0L].toArray()
let xs = [|0..ys.Length|]
Chart.Line(xs, ys)
]
traces
|> Chart.Combine
|> Chart.withSize(1000.,600.)
//|> Chart.Show // not needed for notebook
And here's the resulting plot for the above (some sin(x)/cos(x)-like kernel components that it manages to reconstruct).
@pkese my F# is a bit rusty, but I believe `member _.logKernel = Float32Tensor.zeros(...)` defines a property with a getter whose body is the right-hand side of the `=`. I.e. your code invokes `Float32Tensor.zeros` every time `.logKernel` is accessed, thereby creating a new tensor.
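A tiny self-contained illustration of the difference (plain F#, no TorchSharp; `Example` is just a made-up class):

```fsharp
type Example() =
    // a `let` binding: evaluated once, at construction time
    let stored = (printfn "let body ran"; 42)
    // a property: the body runs on EVERY access
    member _.Fresh  = (printfn "property body ran"; 42)
    member _.Stored = stored

let e = Example()    // prints "let body ran" once
e.Stored |> ignore   // prints nothing
e.Fresh  |> ignore   // prints "property body ran"
e.Fresh  |> ignore   // ...and prints it again: a new value on each access
```

With a tensor on the right-hand side, every such access allocates a fresh native tensor that nothing ever disposes.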
Of course 🤦‍♂️. Thanks.
Sorry everyone for my ignorance.
Now this keeps chasing me...
Funny how I managed to write F# code for several years without ever getting bitten by this inconsistency in my mental model. I was diligently writing explicit getters `member _.x with get () = ...` when I expected the code to get evaluated each time. On the other hand, I don't often write classes in F#, and if the rest of the code was purely functional, then the result was reproducibly correct.
@pkese do a global search on your projects directory for a regex pattern matching this code, e.g. something like `member\ +[a-z_]+\.[a-zA-Z_0-9]+\ *\=`. Should ease it a bit or make your hair stand 😄
@lostmsu Ha ha -- I should totally do that :smile: :smile: :smile:
@NiklasGustafsson
Thanks. I'm really looking forward to your new additions landing... I'm using `tensor.copy_(...)` in my Python code for directly manipulating weights (normalizing some vectors) after each backward step, so it's wonderful that this got included in TorchSharp too -- see the sketch below for the kind of thing I mean. Any chance of adding `xlogy` as well (I think they just implemented it in 1.8)?
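The in-place step I mean looks roughly like this -- a sketch only, not the actual code: `renormalize` is a made-up helper, the exact `norm()`/`copy_` signatures in this TorchSharp version are assumptions, and in PyTorch the `copy_` would need to run under `torch.no_grad()` (I haven't checked the TorchSharp equivalent):

```fsharp
// Sketch: overwrite a weight tensor with its normalized values after
// optimizer.step(). `w` stands for any parameter tensor.
let renormalize (w: TorchTensor) =
    let norm = w.norm()              // 0-dim tensor holding the norm
    w.copy_(w / norm) |> ignore      // in-place update of the weights
```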
@pkese, `xlogy()` should already be available. I remember adding that a while back.
Oh, indeed there is. I just didn't notice it. Thanks.