MagmaDNN: a simple deep learning framework in C++

[BUG] Stack Smashing on stack allocated tensors #14

Open · Dando18 opened 5 years ago

Dando18 commented 5 years ago

Describe the bug
Passing stack-allocated tensor objects to some MagmaDNN functions causes stack smashing errors. The bug is intermittent: it does not occur on every run.

To Reproduce

model::NeuralNetwork<float> model( ... );

/* stack allocated */
Tensor<float> x( ... );
Tensor<float> y( ... );

/* this causes a stack smashing error on a non-negligible percentage of runs */
model.fit(&x, &y, ... );

Expected behavior
No stack smashing.

Environment:

Additional context
The crash appears to stem from magmadnn::Tensor<T>'s copy constructor and the MemoryManager copy it triggers.
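For illustration only, a minimal sketch (a hypothetical BadTensor type, not the actual MagmaDNN code) of the double-ownership failure mode that a shallow copy constructor creates:

#include <cstddef>

/* If the copy constructor shallow-copies the raw data pointer, two
   objects end up owning one buffer; the second destructor then frees
   memory that is already gone, producing exactly the kind of
   corruption that trips the stack protector. */
struct BadTensor {
    float* data;
    explicit BadTensor(std::size_t n) : data(new float[n]) {}
    /* implicitly generated copy ctor copies `data`: double ownership */
    ~BadTensor() { delete[] data; }  /* second delete is undefined behavior */
};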

Dando18 commented 5 years ago

A new branch has been created to overhaul MagmaDNN's memory system.

New Branch: memory-fixes

The goal of this branch is to replace most pointer use in MagmaDNN with references or smart pointers, in line with modern C++ practice.

Dando18 commented 5 years ago

Here is an outline of how MagmaDNN's memory management will be refactored into a more modern C++ style.

Remove pointers where possible

In MagmaDNN v1.0, just about everything is passed around by pointer. All of the ::magmadnn::math functions take Tensor<T>* pointers rather than plain tensors. This was necessary because the ::magmadnn::Tensor<T> class lacks copy/move/assign semantics. Operators and other tensor containers likewise allocate, instantiate, and manage raw tensor pointers.

This is a C-style and unsafe way to program. Modern C++ favors references over pointers, and where resource management is necessary (::magmadnn::MemoryManager) it is preferable to use smart pointers rather than raw pointers.

First, we shall change Tensor and the MemoryManager class to use smart pointers. Tensors will use reference counting to avoid copying data pointers.

Then the Tensor class needs proper copy/move/assign semantics. This should be simple if std::shared_ptr is used for the MemoryManager: default copying should then suffice, as in the sketch below.
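A minimal sketch of that design, assuming hypothetical class shapes and member names rather than the final API:

#include <memory>
#include <utility>

namespace magmadnn {

class MemoryManager { /* owns the host/device buffers */ };

class Tensor {
public:
    explicit Tensor(std::shared_ptr<MemoryManager> mem) : mem_(std::move(mem)) {}

    /* Defaulted special members now do the right thing: copying a
       Tensor bumps the shared_ptr reference count instead of
       shallow-copying the underlying data pointer. */
    Tensor(const Tensor&) = default;
    Tensor(Tensor&&) noexcept = default;
    Tensor& operator=(const Tensor&) = default;
    Tensor& operator=(Tensor&&) noexcept = default;

private:
    std::shared_ptr<MemoryManager> mem_;
};

}  /* namespace magmadnn */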

Finally, the ::magmadnn::math library needs to be rewritten to take tensor references rather than pointers. This is a more idiomatic C++ interface, and it removes the possibility of nullptr arguments.

An example function signature:

void add(const Tensor& a, const Tensor& b, Tensor& out) { ... }

Operations will still need pointers to children and parents; this is inherent to the graph data structure. However, rather than ::magmadnn::op::add( ... ) returning a raw allocated pointer, it should now add a std::unique_ptr to a central Graph class and return a non-owning raw pointer. The raw pointer can then be used in an algorithmic sense rather than for resource management, as sketched below.
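A rough sketch of that ownership scheme (the Graph class shape and the add_node name are assumptions for illustration):

#include <memory>
#include <vector>

namespace magmadnn {

class Operation {
public:
    virtual ~Operation() = default;
};

class Graph {
public:
    /* The graph takes ownership of every node; the caller receives a
       non-owning pointer, suitable only for wiring up and traversing
       the graph. */
    Operation* add_node(std::unique_ptr<Operation> node) {
        nodes_.push_back(std::move(node));
        return nodes_.back().get();
    }

private:
    std::vector<std::unique_ptr<Operation>> nodes_;
};

}  /* namespace magmadnn */

Under this scheme, ::magmadnn::op::add( ... ) would construct the node, hand ownership to the graph, and pass the returned non-owning pointer back to the caller.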

Const Correctness

A significant portion of MagmaDNN routines are not const correct. We will enforce const correctness on member functions and on ::magmadnn::math function parameters, for type safety, to prevent bugs, and to provide a cleaner and more complete interface. An illustration follows.
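For instance (hypothetical accessors, shown only to illustrate the convention), read-only queries become const member functions and math inputs are taken by const reference:

#include <cstddef>

class Tensor {  /* illustrative fragment, not the real class */
public:
    std::size_t size() const;            /* query: does not modify *this */
    float get(std::size_t idx) const;    /* read access: const */
    void set(std::size_t idx, float v);  /* mutation: non-const */
};

/* Inputs are read-only; only the output tensor is mutable. */
void add(const Tensor& a, const Tensor& b, Tensor& out);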

Remove Type Templating

Currently, Operations and math functions are restricted by a type template parameter. We will remove the type template argument from most of these; for example, ::magmadnn::Tensor<T> will become just ::magmadnn::Tensor. The data type will instead be signified by some enum DataType, which specifies the supported element types. A sketch follows.
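A rough sketch of the type-erased interface, with enumerator names that are assumptions rather than the final design:

#include <cstdint>

namespace magmadnn {

/* Runtime tag for the element type, replacing the template parameter. */
enum class DataType : std::uint8_t { FLOAT, DOUBLE, HALF };

class Tensor {
public:
    explicit Tensor(DataType dtype) : dtype_(dtype) {}
    DataType dtype() const { return dtype_; }

private:
    DataType dtype_;
};

/* Math routines dispatch on the runtime dtype instead of a template
   argument. */
void add(const Tensor& a, const Tensor& b, Tensor& out);

}  /* namespace magmadnn */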

Valgrind and Cuda-memcheck Testing

We will run various MagmaDNN routines under valgrind and cuda-memcheck to verify their memory behaviour and performance.

Dando18 commented 5 years ago

Just to comment on the memory strategy proposed above: it will break a lot of (pretty much all) existing code. This is unfortunate, but the memory update is necessary for growing the framework, and there are also not that many users at the moment.