dotnet / TorchSharp

A .NET library that provides access to the library that powers PyTorch.
MIT License

Memory Corruption Bug? #121

Closed MKlauck closed 4 years ago

MKlauck commented 5 years ago

Hi,

as already mentioned in the xamarin chat, I observed a strange behavior in TorchSharp regarding memory corruption. It is already known that if I create tensors from C# arrays, I have to manage the disposal of the tensors myself. In the minimal working test example mentioned below, I made sure that neither the garbage collection nor any other process is able to free or modify the data in the array used to create the tensor, because everything is stored in a class variable. But still, at some point the memory gets corrupted.

In OracleInit() a Linear is created and weights are read from a file, parsed as floats and stored in an array. Afterwards these weights are overwritten again by 1. Then a tensor is created from the weights array and the module's weights are set to this tensor.

In OracleTransition() the module is initialized once and afterwards it is checked that the first weight is still 1.

In Main() in Program.cs this check is called in a loop as long as the value is correct. This is the case for several hundred calls, but at some point the weights are overwritten although we only read from them.

What I would expect from the program is that it runs forever because it always checks if the value is still 1 and since the value is not changed anywhere this should always be the case.

The really strange thing is that if the same program is executed without the lines that read from the file and parse the values into the float array, no error occurs. These two lines should not have any effect on the program's behavior, because the weights we work with are overwritten directly before being used.

You can find the code in this repository: https://dgit.cs.uni-saarland.de/Michaela/testtorchsharp To include TorchSharp directly, please clone with --recurse-submodules. When using Visual Studio, TorchSharp has to be built manually once in the beginning.

Thank you very much for your help.

Best, Michaela

interesaaat commented 5 years ago

Thanks Michaela. Indeed, if you switch From with Ones it works, so there is definitely something wrong there. I am looking into it.

One minor thing: if you run your test with Ones, you will see that it consumes a lot of memory. This is because Weight is not cached and always creates a new tensor. If you wrap it in a using statement, you will avoid this.

interesaaat commented 5 years ago

Mmm. It's interesting, because if you copy the array into the tensor, for example by doing fc1_w_tensor = fc1_weights.ToTorchTensor(dim_w_1, doCopy: true);, the error persists :(

interesaaat commented 5 years ago

Ok, it was indeed a bug. I fixed it and will submit a PR shortly. Thanks for the issue, BTW; I was chasing this bug for a while and your test program has helped a lot!

dsyme commented 4 years ago

@interesaaat I'm trialling an alternative fix to this problem.

Torch lets you pass in a deleter function, see https://pytorch.org/cppdocs/api/function_namespacetorch_1ad7fb2a7759ef8c9443b489ddde494787.html

The deleter function (a std::function<void(void*)>) will be called on the data when the tensor's data would normally be deallocated.

So I am trialling passing in a delegate. (Currently this allocates one delegate per tensor; I'm not sure if there's any way to avoid this. We would need a way to go from the array address to the GCHandle?)

                    var dataHandle = GCHandle.Alloc(rawArray, GCHandleType.Pinned);
                    var addr = dataHandle.AddrOfPinnedObject();
                    var gchp = GCHandle.ToIntPtr(dataHandle);
                    var deleter = new GCHandleDeleter((IntPtr ptr) => GCHandle.FromIntPtr(gchp).Free());
                    return new TorchTensor(THSTensor_new(addr, deleter, dimensions, dimensions.Length, (sbyte)ATenScalarMapping.Byte, requiresGrad));

dsyme commented 4 years ago

The fix is contained in https://github.com/xamarin/TorchSharp/pull/127, if it proves sound

dsyme commented 4 years ago

Closing as this was fixed in #127