dotnet / TorchSharp

A .NET library that provides access to the library that powers PyTorch.
MIT License

Any hints on how to diagnose memory issues? #252

Closed. pkese closed this issue 3 years ago.

pkese commented 3 years ago

I've built a test model with TorchSharp and it's chewing through somewhere between 0.5 and 1 GB of memory per second while training, until it hits an OOM (all running on CPU).

Following the hints in https://github.com/xamarin/TorchSharp/blob/master/docfx/articles/memory.md, I'm keeping a list of all tensors I touch inside each training batch and then manually calling .Dispose() on each of them after the batch.

In addition, I call GC.Collect() manually every N batches.
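
Concretely, my per-batch bookkeeping looks roughly like this (a minimal sketch; the live, track and endOfBatch names are just for illustration):

open System
open TorchSharp.Tensor

// Keep a handle on every tensor touched during a batch...
let live = System.Collections.Generic.List<TorchTensor>()
let track (t: TorchTensor) = live.Add t; t

// ...then dispose them all once the batch is done.
let endOfBatch (batchIndex: int) (gcEveryN: int) =
    for t in live do t.Dispose()
    live.Clear()
    if batchIndex % gcEveryN = 0 then GC.Collect()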

Are there any additional hints on what to do or how to avoid memory problems?

My models are using NN.Embedding and tensor.conv1d.

dsyme commented 3 years ago

This certainly sounds like a bug. Something is not disposing.

I don't have any specific guidelines - can you share the code so we can assess the primitives you're using?

pkese commented 3 years ago

I'll try to extract and share the gist of that code in a way that should keep my employer and their clients happy.

It may take me a day or two (or the weekend) to find time for that, though.

NiklasGustafsson commented 3 years ago

I have some slightly updated advice waiting in a PR. The source is here: https://github.com/NiklasGustafsson/TorchSharp/blob/examples/docfx/articles/memory.md

Basically, you have to be very aggressive with GC -- for some training, I have had to set N=1, i.e. call GC.Collect() every batch. Granted, this has been with GPU-based training, so it may be a little bit different, but the general problem is the same.

It's worth taking a look at how data is loaded -- you may have to discard not just temporaries inside the calculation, but also the input tensors after each batch. I've not had to do that on any of the sample data sets, but if you have huge inputs, then you may have to deal with the unfortunate issue of re-loading input from disk every epoch.
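
For example, the N=1 variant would look roughly like this (a sketch only, borrowing generateBatch, net, criterion and optimizer from the training loop posted further down in this thread):

for i in 0 .. 200000 do
    optimizer.zero_grad()
    let xs, ys, indices = generateBatch batchSize
    use ys' = net.forward(xs, indices)
    use loss = criterion ys ys'
    loss.backward()
    optimizer.step()
    // discard the batch inputs as well, not just the temporaries
    xs.Dispose(); ys.Dispose(); indices.Dispose()
    GC.Collect() // N = 1: collect after every single batch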

Anyway, I'm not sure any of this addresses your problem -- these are just some random thoughts.

pkese commented 3 years ago

This turned out to be quite interesting ...

The model works fine when the parameters (logKernel, logScale, logEmbeddings) are defined as private let bindings:

type Model(device, nTraces, convLen, nComponents) =
    let logKernel = Float32Tensor.rand([|nComponents; 1L; convLen|], device, true)
    let logScale = Float32Tensor.from([|-7.0f|], true)
    let logEmbeddings = TorchSharp.NN.Modules.Embedding( nTraces, nComponents )

    member this.parameters = [| logKernel; logScale; logEmbeddings.Weight |]

    member n.forward (xs:TorchTensor, indices:TorchTensor) =
        let factors = (logEmbeddings.forward indices + logScale).exp().unsqueeze(2L)
        let kernel = logKernel.exp()
        let ins = xs.unsqueeze(1L).expand([|-1L; nComponents; -1L;|])
        let compOuts = ins.conv1d(kernel, groups=nComponents)
        let outs = (compOuts * factors).sum([|1L|], keepDimension=false) + epsilon //* globalFactors.exp()
        outs

However, if these parameters are defined as class properties (below), the memory issue appears, consuming about 500 MB per second while training:

type Model(device, nTraces, convLen, nComponents) =
    member _.logKernel = Float32Tensor.zeros([|nComponents; 1L; convLen|], device, true)
    member _.logScale = Float32Tensor.from([|-7.0f|], true)
    member _.logEmbeddings = TorchSharp.NN.Modules.Embedding( nTraces, nComponents )

    member n.parameters = [| n.logKernel; n.logScale; n.logEmbeddings.Weight |]

    member n.forward (xs:TorchTensor, indices:TorchTensor) =
        let factors = (n.logEmbeddings.forward indices + n.logScale).exp().unsqueeze(2L)
        let kernel = n.logKernel.exp()
        let ins = xs.unsqueeze(1L).expand([|-1L; nComponents; -1L;|])
        let compOuts = ins.conv1d(kernel, groups=nComponents)
        let outs = (compOuts * factors).sum([|1L|], keepDimension=false) + epsilon //* globalFactors.exp()
        outs

I'll post the whole notebook somewhere (GitHub doesn't want to render the new-format .ipynb gists).

pkese commented 3 years ago

Nah, I can't make GitHub render the notebook (even after deleting all the graphics), but I've uploaded the file to https://github.com/pkese/fs-codespace-test/blob/main/TorchSharp-Leak-Test.ipynb

It can be downloaded and opened in a notebook editor (I'm using vscode-insiders for editing).

pkese commented 3 years ago

...or here is the code:

#r "nuget: libtorch-cpu, 1.8.0.7"
#r "nuget: TorchSharp, 0.91.52518"

open System // needed for Math, Random and GC used below
open TorchSharp
open TorchSharp.Tensor
open TorchSharp.NN

let device = Torch.InitializeDevice Device.CPU

(*
#### PCA Deconvolution

We render thousands of traces with random 1D `xs` and computed `ys`, and wish to infer the parameters of the transform.

The transform is done by:
- choosing a random mixture of 4 component functions (e.g. sin(x), cos(x))
- storing the mixture into a kernel
- convolving the kernel over `xs` to get `ys`

The task is to reconstruct:
- the shape of the 4 components forming the kernels
- the mixture weights of the 4 components for each sample (embedding lookups)

*)

let xsLen = 32
let convLen = 16
let components = [| // component functions that we will later try to reconstruct (deconvolve)
    fun x ->  Math.Sin (x/float convLen*3.14) * 0.5 + 0.5
    fun x ->  Math.Cos (x/float convLen*3.14) * 0.5 + 0.5
    fun x -> -Math.Sin (x/float convLen*3.14) * 0.5 + 0.5
    fun x -> -Math.Cos (x/float convLen*3.14) * 0.5 + 0.5
|]
let nComponents = components.Length

let random = Random()
module SampleGenerator =
    let dirichlet n =
        let ps = Array.init n (fun _ -> random.NextDouble()**2.0)
        let sum = ps |> Array.sum
        ps |> Array.map (fun p -> p / sum)
    let randomComponentMixtureKernel len =
        let mixture = dirichlet components.Length
        let compMix = Array.zip components mixture
        Array.init len (fun i -> compMix |> Array.sumBy (fun (compFn,weight) -> (compFn (float i) * weight)))
    let conv (kernel: float[]) (xs: float[]) =
        Array.init (xs.Length-kernel.Length+1) (fun i ->
            let mutable sum = 0.0
            for j in 0 .. kernel.Length-1 do sum <- sum + xs.[i+j] * kernel.[j]
            sum)
    let flipCoin (rate: float) (trials: float) = Math.Round(trials * rate)
    let renderSample xsLen convLen =
        let kernel = randomComponentMixtureKernel convLen
        let xs = Array.init xsLen (fun _ -> random.Next 1000 |> float)
        let rate = random.NextDouble() * 0.01
        let ys = xs |> conv kernel |> Array.map (flipCoin rate)
        xs, ys

//SampleGenerator.renderSample xsLen convLen

let nTraces = 10000

let renderDataset xsLen convLen nTraces =
    let tensor (xs: float[]) = Float32Tensor.from(Array.map float32 xs, false)
    let xs, ys =
        Array.init nTraces (fun _ -> 
            let xs, ys = SampleGenerator.renderSample xsLen convLen
            tensor xs, tensor ys)
        |> Array.unzip
    let mutable i0=0
    fun batchSize ->
        let xs, ys, indices =
            Array.init batchSize (fun i -> 
                i0 <- (i0+1) % nTraces
                xs.[i0], ys.[i0], i0)
            |> Array.unzip3
        xs.stack 0L, ys.stack 0L, Int32Tensor.from(indices, false)

let generateBatch = renderDataset xsLen convLen nTraces

let inline fscalar x = TorchScalar.op_Implicit (float32 x)
let epsilon = fscalar 10e-12

// this model works fine

type Model(device, nTraces, convLen, nComponents) =
    inherit CustomModule("deconv")
    let nTraces, nComponents, convLen = int64 nTraces, int64 nComponents, int64 convLen
    let logKernel = Float32Tensor.rand([|nComponents; 1L; convLen|], device, true)
    let logScale = Float32Tensor.from([|-7.0f|], true)
    let logEmbeddings = TorchSharp.NN.Modules.Embedding( nTraces, nComponents )

    member this.parameters = [| logKernel; logScale; logEmbeddings.Weight |]

    override _.forward (x:TorchTensor) = failwithf "wrong method"

    member n.forward (xs:TorchTensor, indices:TorchTensor) =
        let factors = (logEmbeddings.forward indices + logScale).exp().unsqueeze(2L)
        let kernel = logKernel.exp()
        let ins = xs.unsqueeze(1L).expand([|-1L; nComponents; -1L;|])
        //printfn "ins=%A kernel=%A" ins.shape kernel.shape
        let compOuts = ins.conv1d(kernel, groups=nComponents)
        //printfn "conv=%A, factors=%A, kernel=%A" compOuts.shape factors.shape kernel.shape
        let outs = (compOuts * factors).sum([|1L|], keepDimension=false) + epsilon //* globalFactors.exp()
        outs

    member _.Kernel with get () = logKernel.exp()
    member _.Scale with get () = logScale
    member n.modelLoss() =
        (logKernel.exp().sum([|2L|], keepDimension=true) - fscalar (convLen/2L)).abs().mean()
        //+ (logEmbeddings.Weight.exp().sum([|1L|], keepDimension=true) - fscalar 1.0).mean()

// this model leaks memory
(*
type Model(device, nTraces, convLen, nComponents) =
    inherit CustomModule("deconv")
    let nTraces, nComponents, convLen = int64 nTraces, int64 nComponents, int64 convLen
    member _.logKernel = Float32Tensor.zeros([|nComponents; 1L; convLen|], device, true)
    member _.logScale = Float32Tensor.from([|-7.0f|], true)
    member _.logEmbeddings = TorchSharp.NN.Modules.Embedding( nTraces, nComponents )

    member n.parameters = [| n.logKernel; n.logScale; n.logEmbeddings.Weight |]

    override _.forward (x:TorchTensor) = failwithf "wrong method"

    member n.forward (xs:TorchTensor, indices:TorchTensor) =
        let factors = (n.logEmbeddings.forward indices + n.logScale).exp().unsqueeze(2L)
        let kernel = n.logKernel.exp()
        let ins = xs.unsqueeze(1L).expand([|-1L; nComponents; -1L;|])
        //printfn "ins=%A kernel=%A" ins.shape kernel.shape
        let compOuts = ins.conv1d(kernel, groups=nComponents)
        //printfn "conv=%A, factors=%A, kernel=%A" compOuts.shape factors.shape kernel.shape
        let outs = (compOuts * factors).sum([|1L|], keepDimension=false) + epsilon //* globalFactors.exp()
        outs

    member n.Kernel with get () = n.logKernel.exp()
    member n.Scale with get () = n.logScale
    member n.modelLoss() =
        (n.logKernel.exp().sum([|2L|], keepDimension=true) - fscalar (convLen/2L)).abs().mean()
*)

let net = new Model(device, nTraces, convLen, nComponents)

let inline xlogy(x:TorchTensor, y:TorchTensor) = x * y.log() // xlogy is not exposed in TorchSharp

let poissonLoss (k:TorchTensor) (mu:TorchTensor) =
    let logPmf = xlogy(k,mu) - (k+fscalar 1.0).lgamma() - mu
    -logPmf

let criterion ys ys' = (poissonLoss ys ys').clamp_max(fscalar 10000.0).mean()

let optimizer = TorchSharp.NN.Optimizer.Adam(net.parameters, 0.02)
//let optimizer = TorchSharp.NN.Optimizer.SGD(net.parameters, 0.1)

let batchSize = 384
let mutable cumLoss = 0.0
let mutable nItems = 0
for i in 0..200000 do
    optimizer.zero_grad()
    let xs, ys, indices = generateBatch batchSize
    let ys' = net.forward(xs,indices)
    let loss = criterion ys ys' + net.modelLoss()
    loss.backward()
    optimizer.step()

    cumLoss <- cumLoss + loss.ToDouble()
    nItems <- nItems + 1
    if i%10000 = 0 then
        System.Console.Write $"step %6d{i}: loss=%.4f{cumLoss / float nItems} loss0=%.4f{loss.ToDouble()} scale=%.4f{net.Scale.ToDouble()}"
        cumLoss <- 0.0
        nItems <- 0
    if i%1000 = 0 then
        GC.Collect()

//#r "nuget: Plotly.NET, 2.0.0-beta9"
#r "nuget: Plotly.NET.Interactive, 2.0.0-beta9"
open Plotly.NET

// helper
type TorchTensor with
    member t.toArray() =
        match t.shape with
        | [|n|] -> Array.init (int n) (fun i -> t.[int64 i].ToSingle())
        | _ -> failwithf "requires 1-dimensional tensor, got %A" t.shape

let traces = [
    for i in 0L..3L ->
        let ys = net.Kernel.[i].[0L].toArray()
        let xs = [|0 .. ys.Length-1|]  // indices must match the length of ys
        Chart.Line(xs, ys)
]
traces
|> Chart.Combine
|> Chart.withSize(1000.,600.)
//|> Chart.Show // not needed for notebook
pkese commented 3 years ago

And here is the plot for the above (the sin(x)/cos(x)-like kernel components it manages to reconstruct): [image]

lostmsu commented 3 years ago

@pkese my F# is a bit rusty, but I believe member _.logKernel = Float32Tensor.zeros(...) defines a property with a getter whose body is the right-hand side of the =. I.e. your code invokes Float32Tensor.zeros every time .logKernel is accessed, thereby creating a new tensor on each access.
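
A tiny repro of the difference, outside TorchSharp (the counter is just for illustration):

type Demo() =
    static let mutable counter = 0
    let v = (counter <- counter + 1; counter)           // evaluated once, at construction
    member _.Stable = v                                 // getter returns the cached value
    member _.Fresh = (counter <- counter + 1; counter)  // getter body re-runs on every access

let d = Demo()
printfn "%d %d" d.Stable d.Stable  // 1 1 -- same value both times
printfn "%d %d" d.Fresh d.Fresh    // 2 3 -- a new value on each access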

pkese commented 3 years ago

Of course 🤦‍♂️. Thanks.

Sorry everyone for my ignorance.

pkese commented 3 years ago

Now this keeps chasing me...

Funny how I managed to write code in F# for several years without ever getting bitten by this inconsistency in my mental model.

I was diligently writing explicit getters member _.x with get () = ... whenever I expected the code to be evaluated on each access.

On the other hand, I don't often write classes in F#, and as long as the rest of the code was purely functional, the result was reproducibly correct.

lostmsu commented 3 years ago

@pkese do a global search over your projects directory for a regex matching this kind of code, e.g. something like member\ +[a-z_]+\.[a-zA-Z_0-9]+\ *\=. That should either put your mind at ease or make your hair stand on end 😄
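
E.g. as a quick F# script (illustrative only -- adjust the root directory, and add *.fsx if you use scripts):

open System.IO
open System.Text.RegularExpressions

let pattern = Regex @"member\ +[a-z_]+\.[a-zA-Z_0-9]+\ *\="
for file in Directory.EnumerateFiles(".", "*.fs", SearchOption.AllDirectories) do
    File.ReadLines file
    |> Seq.iteri (fun i line ->
        if pattern.IsMatch line then printfn "%s:%d: %s" file (i + 1) (line.Trim()))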

pkese commented 3 years ago

@lostmsu Ha ha -- I should totally do that :smile: :smile: :smile:

pkese commented 3 years ago

@NiklasGustafsson

Thanks. I'm really looking forward to your new additions landing ... I'm using tensor.copy_(...) in my Python code to manipulate weights directly (normalizing some vectors) after each backward step, so it's wonderful that this got included in TorchSharp too.
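
For reference, the kind of post-step manipulation I mean looks roughly like this (a sketch only -- I'm assuming copy_ takes a source tensor as in PyTorch, and glossing over autograd bookkeeping):

// Renormalize log-space weights in place so that exp() of each row sums to 1
// (the copy_ signature here is my assumption).
let renormalize (w: TorchTensor) =
    let logSums = w.exp().sum([|1L|], keepDimension=true).log()
    w.copy_(w - logSums) |> ignore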

Any chance of adding xlogy as well? (I think it was only just implemented in PyTorch 1.8.0.)

NiklasGustafsson commented 3 years ago

@pkese, xlogy() should already be available. I remember adding that a while back.

pkese commented 3 years ago

Oh, indeed there is. I just didn't notice it. Thanks.