[Discussion] Sharing Operators between DL Frameworks

tqchen commented 7 years ago

This discussion started from https://github.com/dmlc/minpy/issues/129, with @soumith THC is a tensor library that backs torch. I open this issue in MXNet repo so more developers can see it.

First of all, it is possible reuse operator libraries between frameworks, for example

Support for THC and Torch Module was done in Torch Plugin, with interfacing to torch's lua library.
MXNet supports reuse operators from caffe

It is always interesting to see interchangeability happen. For example, schedule pytorch operations in mxnet's async engine, or run mxnet's declarative API to directly share data with pytorch's array.

However, there is some engineering obstacles in doing so, which I would like to explain what these obstacles are, and hopefully this can motivate the community to move forward, and make this easier.

Coupled Operator Data Structure Components

An operator can mean many things, here are some basic components on what the operators are:

Data structure that holds(shape) pointers to the array
Possible memory allocator to handle run-time memory allocation
Resource handles, if external resources is needed
Scheduling related objects if array support synchronize execution

Why such coupling prevents reuse? There are two reasons

Many systems have their own memory allocator and ways of resource handling code.
While having memory allocator enables runtime memory allocations, sometimes memory allocation is not preferred at all(e.g. BLAS calls where all memory are pre-allocated)

To resolve this problem, an operator library design should enable operators that accept user managed memory resources, when possible, not introduce allocator or resource management, but give hints to the user(CuDNN's workspace requirement eliminates the need to internal memory allocator).

From this point of view, CuDNN an cuBLAS are good examples. THC is nice, but still encapsulate memory allocator(which is needed sometimes for dynamic operators).

Lack of Unified Operator Interface

The second obstacle is mainly lack of common operator interface. This is a problem of CUDNN and THC that prevents reusing. Take CuDNN for example, each CuDNN API is a C function, with its own interface, to adopt the operator, there need to be one(or multiple) adapting function per operator.

Consider instead, if there is an unified operator interface(the following is a mock design), where each TBlob is a reference to the data fields and shape, and every function gets registered to the registry with their name

using FCompute = std::function<void (
   array_view<TBlob> ins, array_view<TBlob> outs, map kwargs, stream stream)>

Then it only takes one function to extract, and reuse all operators and automatically expose them to front end. In MXNet, it even directly generates the symbolic counterpart from the same imperative operator, if gradient is provided.

Problem of One Unified Operator Interface

There is always a flip side of the coin. Assume that we go with a unified operator interface. As a matter of fact, that is what MXNet, TensorFlow and Caffe have done. The problem now becomes what the interface should look like? One trap that framework designer always falls into is that we need one interface that rules them all.

Since one interface rules them all, we want to support all possible operators, what about the ones that need runtime memory allocations? Maybe add memory allocator to it, what about the ones that is asynchronize? In the end, the interface have to include memory-allocator, scheduling module in some way, and that introduces the "Coupled Operator Data Structure Components" problem. The operator interface become deeply coupled with the rest of the framework and not reusable.

A Better Solution: A Few Unified Interfaces

Can we get the best of both worlds, having as few data structures and interfaces as possible, while still not introducing coupling to allocator and scheduling as much as possible? I think the answer is yes and we need to jump out from the ideal of one interface that rules all the operators.

I can categorize the operators roughly in three categories

type1: Basic operators: The ones that can do shape inference based on input shape, can take memory pointer, stream and go
type2: Basic+ operators: Same as basic operator, but also need to declare some additional resources(workspace)
type3: Complicated operators: The ones that requires runtime memory allocator, its output shape depends on content of the data.

If we design for general operator interface, the answer will usually looks like type3. However, type 1 and 2 dominates 90%+ of the major operators we are using. If we design one operator interfaces for each type, this problem is solved. So that frameworks can pull and interact with each type in their own way. It is much easier to do things like static memory planning if type1 and type2 are explicitly introduced. This is one additional layer of wrapping on top of THC and CuDNN is is lacking so far.

A registry system like NNVM could come very handy to easily resgister these informations, and get pull out by the libraries.

The Hope

I have always hopped that there is a minimum set of operator interface standard in C++, that can be shared across libraries. I think we have a good idea on what the solution looks like. While most system tends to become opague and coupled, I think this kind of transparent way can help evolve the community in a healthy way. This being said, there is always effort to make these happen. This involves a open discussion on what the interfaces should be and commitment from framework builders. I would really love to see this happen, and that is why I spend more than one hour writing this.

Unfortunately, most frameworks already have kinda of "enough collection of operators", so having a unified operator interface will contribute little to each framework in terms of usability in short term. Naturally this would be given lower priority. That is why commitment is needed to bring this out for longer term benefit

tqchen commented 7 years ago

I also had similar discussion with @Yangqing and @ajtulloch before.

Yangqing commented 7 years ago

Great initiative! I think a lot of components can be shared if we refactor them in simple APIs. Would love to work together on this front.

soumith commented 7 years ago

The fundamental issue into having a Unified interface is that it needs a full buy-in. Anything short will make it a partial or full failure.

For this reason, I think what the CuDNN team did is actually correct.

For this reason, i think focusing on simplicity and reducing the friction of buy-in, and allowing a way to have partial buy-in will make more folks participate.

So, I think we should define:

one function that consumes:
- a void* pointer
- a long* pointer of sizes
- a long* pointer of strides
- nDims and it spits out a common Tensor descriptor
Each frameworks defines a convertor to and from the common descriptor
Frameworks factor out their ops to be used as independently compiled libs

Keeping it stupid and simple like this is the path of least resistance that will get us forward.

I dont feel confident that defining and maintaining a common registry will practically break ground, especially because it has a huge initial overhead from each of the framework writers (who are all busy with their own problems).

What do you guys think?

soumith commented 7 years ago

what I proposed only works for stateless operations initially, but i think that's where we should start from. Defining statefulness right now will lead to disagreements and complications.

tqchen commented 7 years ago

I think stateless is a good starting point(essentially the type 1 operator). I do however, would like to have a few set unified interfaces in someway, and a registry that is decentralized.

So the scenario I hoped looks like

#include <common_nn_op.h>

void InitMXNetOps() {
   for (auto reg: Registry::ListBinaryOps()) {
       register(reg->name, reg->function);
   }
}

This enables one function to import all the operators that is provided in the operator library. It would indeed require a bit of registry code from the operator library side, for example, a wrap around THC or the library @soumith suggested.

This reduces the effort of importing and adapting new operators. However, the interface indeed need to be simple enough, like the one that contains a few tensor data structures.

tqchen commented 7 years ago

As I mentioned earlier, I do not agree on one unified operator interface, but I do like to see if there can be a few candidates that we can agree on. For example, binary operators.

BinaryOp(const Tensor& lhs, const Tensor& rhs, Tensor*out);
BianryOpShape(const Shape& lhs, const Shape& rhs, Shape* out);

The idea is to reduce the overhead of adaptation code, which otherwise need to happen for each operator, and it makes the framework builder harder to switch in.

piiswrong commented 7 years ago

The easiest way of doing this is copy-pasting kernels, which we having been doing for a while.

a blas like interface is a good idea. But this is only worth while if the operator is complicated enough (i.e. longer than the code required to call it...). Sharing elementwise add probably isn't necessary. having a TensorDescriptor like data structure further complicates this since you need to spend 20+ lines constructing these descriptors.

Unified operator interface is in theory the right way to do things, but obviously we all think our own interface is the best interface. So not sure if this will go anywhere anytime soon.

Examples of operators that are worth sharing: broadcast-reduce ops, embedding.

One thing we can do without having to agree to anything is some "principled copy pasta" wiki page where we share operator implementations without necessarily using the same interface. A easily pair testing framework for verifying correctness on top of that is also good.

piiswrong commented 7 years ago

Also instead of sharing compiled code, a header only library where you use all data structure and array indexing as macros that can be redefined for each framework is much easier.

For example mxnet doesn't support stride while torch does. So indexing works differently. This can be solved by defining different macros.

tqchen commented 7 years ago

I see there are two ends of the current existing discussions

Build One Library that everybody will use Can use simple common data structure, with each framework calls into the same function. @soumith 's proposal is a better solution for this end. Since as long as the data structure is agreed, there is no problem of calling into the functions.

The problem is it is hard to convince developer to be fully committed to a shared core library.

Being able to import operators from other libraries The major concern of me to do this is what is the overhead of importing? That is why some simple common interface, as with data structure might be desirable. So the cost of importing does not involve effort per operator, but instead one effort for importing all operators that all frameworks currently define.

What are the set of Interfaces To be clear, I do not think the MXNet's(nor interface of existing frameworks) interface is the best way to do operator sharing. But I do think there are set of cleaner and minimum interfaces that we might agree on, just like we can agree on the data structures

Yangqing commented 7 years ago

Interface wise, if I may - the cudnn type interface is a good start and this is what I have been telling other vendors too. If there are implementations for e.g. OpenCL, OpenGL, Vulkan etc, this will make the frameworks' life much better.

jermainewang commented 7 years ago

I think the interface includes two parts: (1) the function routines (2) tensor data structure. For example, I remember in THC, it has support for stride and offset, which is lacking in MXNet. If we use cudnn's way, then we need to include all these information in function arguments, which may be a problem for future extension.

Yangqing commented 7 years ago

It makes sense to start with a non-strided version, I think - Caffe/2 does not use stride either and assumes a 256-byte alignment for storage.

RE cudnn - I actually think that having a pure C interface is good for extension, since it would make cross-language integration much easier. For example Python C extension doesn't do a very good C++ support.

(Oh by the way, pybind11 is awesome.)

tqchen commented 7 years ago

I am all for C interface for stable ABI, on the other hand. As a matter of fact, almost all dmlc projects interfaces through C API.

It is always possible to have an auxiliary c++ registry if we can categorize the functions, and return the function handles.

bhack commented 7 years ago

Anybody is interested in Tensor API?

bhack commented 7 years ago

You can find the source code of an inital AMD implementation of the standard

tqchen commented 7 years ago

My concern with the Tensor API(and there are a few of its kind) is that they are opaque, not only is this a standardization of Tensor, but it is also a standardization of graph based dl framework

It is quite easy to see motivation of such design, it maximize the possibility of some optimizations, however,
- It means it that manages memory, computation for you
- Eventually you will see that evolves into a deep learning library of its own
- It means huge effort for each vendor to implement all the scheduling and optimizations that could be shared across hardwares(via using a standard deep learning framework), this might benefit big vendors who have more resources, and provide higher chance of lock in, but was bad for vendors who cannot catch up.
- It also means we might loose possibility for imperative execution(if execution have to rely on a graph)

Personally, I think what we really need is separation of these things(Tensor data structure, the computation, memory management, scheduling), adopt unix philosophy, do one module more transparently and interact with others, specifically:

The memory layout of tensor should be transparent, with some indication of variants and layout when necessary in the operation implementations.
Allow the user to pass in the memory, and workspace instead of relying on internal allocator
Make computation as stateless as possible.
Allow easy hookup to existing kernels, layers
- e.g. hook that to Torch's TH Tensor, Caffe's layer and MXNet's TBlob

tqchen commented 7 years ago

I think what we are discussing here is not really to support XYZ features, that job of deep learning frameworks, but to come up with a minimum module that can be shared across frameworks.

As I may quote "Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away"

tqchen commented 7 years ago

seems there is enough interest in this issue. What I may suggest is for some of us to post strawman design of Tensor structure and possible operators and we can comment from these points, these will help things to move toward a concrete direction.

I largely agree with @soumith and @Yangqing on a C language structured minimum tensor object (maybe design preferred compact layout, with optional stride support)

bhack commented 7 years ago

How much hw vendors are involved in openvx neural network extension? Nvidia is in the team list as you see and had a strategy to implement Openvx computing graph over cuda (see VisionWorks). Also Samsung, Intel, Amd, Arm are in the committee. In tiny-dnn, that it is strongly c++1x features oriented we have also evaluated array_ref proposal but I think many here are not interested in the differents c++ standardization efforts.

nouiz commented 7 years ago

I also like the idea. Not sure how much time we can put in this. INR thing that can help all team find the time would be to identify one feature we don't have and others have. If we can tell, we need this and an relatively easy way to get it and is a long term solution is do it, then it will probably get done.

Otherwise I'm pretty sure this will get too low in most people priority stack in a few weeks.

What do you think of trying to do that now?

About the registry of fct ptr, it isn't enough. A big part of time wrapping a lib is in the error handling and other stuff like this. A registry would need to tell the signature of the fct, ... So it will get complicated and not used by many. So I think blas, cudnn like interface of the best.

Ping @lamblin @abergeron @bartvm so they know about this.

Le jeu. 2 févr. 2017 12:41, bhack notifications@github.com a écrit :

How much hw vendors are involved in openvx neural network extension? Nvida is in the team list as you see and had a strategy to implement Openvx computing graph over cuda (see VisionWorks). Also Samsung, Intel, Amd, Arm are in the committee. In tiny-dnn, that it is strongly c++1x feature oriented we also evaluated array_ref https://github.com/kokkos/array_ref/tree/master/proposals proposal but I think many here are not interested in the differents c++ standardization efforts.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/dmlc/mxnet/issues/4735#issuecomment-277027997, or mute the thread https://github.com/notifications/unsubscribe-auth/AALC-_8WCRWh_Ie6tac61WafZkN6Xc-Jks5rYhVFgaJpZM4Lobt0 .

bhack commented 7 years ago

@nouiz What is your opinion of the openvx kernel/node api?

abergeron commented 7 years ago

I don't think it is that interesting since it doesn't have any support for looping or branching.

Also it seems heavily oriented towards image processing.

piiswrong commented 7 years ago

I think anything that tries to manage memory in an opaque way won't get much adoption.

The challenge of a blas like interface is how to minimize wrapping code. Currently you need to write hundreds of lines of code to call convolution with cudnn. For small ops it's completely not worth it.

A C++ interface using templates can still have a blas like philosophy. That sounds more promising.

bhack commented 7 years ago

This is the graph formalism and this the neural network extension overview. It is still provisional and we could work upstream if we want.

bhack commented 7 years ago

/cc @naibaf7

bhack commented 7 years ago

If anybody want to take a look we have an in internal header only tensor and tensor storage under construction. Any feedback it is appreciated.

edgarriba commented 7 years ago

Hi! I like this initiative. Simple C signatures seems fair for everybody. Having this as header only project could make it more portable and easy to plug to whatever framework. For design maybe we can start by sharing some UML prototypes.

piiswrong commented 7 years ago

@bhack Had a quick look. One thing it's missing is the ability to wrap around external memory. Also host data and device data should be in separate tensors. Not every tensor have a host mirror

bhack commented 7 years ago

@piiswrong yes both are in the roadmap.. we can already cover this with upstream CLCUDAPI header using a start and end interator of a c++ container and we have distinct BufferHost and Buffer concepts.

bhack commented 7 years ago

What will be a minimal MVP for a Tensor in the scope of this issue? Is A Tensor interface the first step of a plan? Will each framework needs to handle conversion from/to this "rosetta stone" Tensor?

tqchen commented 7 years ago

Here is what I would recommend

A minimum C style Tensor object, which most functions wraps into, for example

typedef struct {
void* data;
size_t ndim;
size_t* shape;
size_t* strides;
} CTensor;

Optionally, a header only C++ Tensor object that provides automatic conversion from to the C tensor, which might provide some util like shape management (maybe not memory management)

class Tensor {
  public:
     operator CTensor() const;
};

As long as the operator CTensor() is provided. It is likely you can call C API with the same signature without doing manual converison.

abergeron commented 7 years ago

I think a minimum tensor object should have support from strides.

piiswrong commented 7 years ago

Is there any reason for a pure c interface? I think most people will be happy with c++ The problem with c is you have to encode things like data type with flag field instead of template arguments

Tianqi Chen notifications@github.com于2017年2月3日周五下午3:14写道：

Here is what I would recommend

A minimum C style Tensor object, which most functions wraps into, for example

typedef struct { void data; size_t ndim; size_t shape; } CTensor;

Optionally, a header only C++ Tensor object that provides automatic conversion from to the C tensor, which might provide some util like shape management (maybe not memory management)

class Tensor { public: operator CTensor() const; };

As long as the operator CTensor() is provided. It is likely you can call C API with the same signature without doing manual converison.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dmlc/mxnet/issues/4735#issuecomment-277388740, or mute the thread https://github.com/notifications/unsubscribe-auth/AAiudKg8vgEOChJhfS6wkhSRbhw8nrdDks5rY7TXgaJpZM4Lobt0 .

bhack commented 7 years ago

So how ops will access to the tensor associated memory (I.e. device, framework, context)?

nouiz commented 7 years ago

I think the MVP for each framework, it is at least for Theano for now, it is to have this interface plus one operation we could reuse.

For example, CTC is in an external repo not in Theano. If the source of CTC offer this interface, then using it while moving it in Theano would be a minimal MVP. But as it is already available, another operation would be better.

Le ven. 3 févr. 2017 17:13, bhack notifications@github.com a écrit :

What will be a minimal MVP for a Tensor in the scope of this issue? Is A Tensor interface the first step of a plan? Will each framework needs to handle conversion from/to this "rosetta stone" Tensor?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dmlc/mxnet/issues/4735#issuecomment-277376871, or mute the thread https://github.com/notifications/unsubscribe-auth/AALC-2P3igraPl0XWJvRjn-8MaJ-IcyIks5rY6ahgaJpZM4Lobt0 .

edgarriba commented 7 years ago

I vote for C++ and strides

piiswrong commented 7 years ago

Another thing we need to decide is whether ndim and dtype are template argument or field. It depends on whether you what to switch for type outside or inside API

Edgar Riba notifications@github.com于2017年2月3日周五下午3:30写道：

I vote for C++ and strides

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dmlc/mxnet/issues/4735#issuecomment-277391385, or mute the thread https://github.com/notifications/unsubscribe-auth/AAiudCLPDkLWHBNktDrDMpizmYsfSgAGks5rY7ijgaJpZM4Lobt0 .

tqchen commented 7 years ago

The main advantage of C API is ABI stability. There is no standard C++ ABI, which means the compiled library can depend on compiler version even on the same platform.

For example, it is quite common to compile CUDA code with MSVC on windows, while link that library from MinGW if you are building an R binding(because R's windows builds on minGW). This is impossible if you use C++. If the code goes C++, essentially only source can be distributed, instead of binary. Which might make it a bit vendor unfriendly, if they want to distribute binary (like CuDNN).

On the other hand, c++1x is great and I think it would be great to have a header only library that wraps the C API, which allows simpler syntax.

tqchen commented 7 years ago

In terms of context, device and resources. There are usually two ways.

Asking the target to return opaque handles for streams and context, pass that back in each function call.
Make the function invariant from the streams and context, but allow runtime setting of these things through a Thread Local storage, which expose a function

The second way might be cleaner, but do have a overhead of fetching a TLS each function call, which is negligible(in micro second level). as a matter of fact, most runtime API like CUDA utilize TLS to make calls threadsafe.

tqchen commented 7 years ago

I think having strides is good, though many function can only support non-strided version, which a failure flag to ask the framework to call MakeContiguous first.

The danger in potential MakeContiguous() is that allocator gets involved(or the library have its own private workspace), which need a bit careful consideration.

naibaf7 commented 7 years ago

I agree with @edgarriba.

But I think in general operators should only be shared alike a BLAS and leave everything else up to the DNN framework. Stride and format of tensors should be kept open, but the DNN operators should specify support for formats alike BLAS libraries.

Unsupported tensors should be resolved using additional DNN operators for in-place and out-of-place tensor reordering.
Leave memory management and tensors themselves up to the framework. Device initialization and bookkeeping as well.
For internal/working memory of operators, there should be the option to pass an allocator/memory manager as lambda function/function pointer into the operator, or the operator can be allowed to allocate with internal/own allocators. This is very important for memory bookkeeping in more complex DNN frameworks. It can be designed a bit like memory managers inside OS kernels.
Operators can define themselves as stateless or stateful, and can also tell the user if it is a single-instance (multi-use) or a multi-instance (single use; needs to be instantiated per network layer) operator.
Stateless functions are alike BLAS functions and do not need instantiation.
Stateful operators need to define life cycle functions that allow to create the operator and reshape, compile, cache, use (forward/backward) and destruct it (and possibly more). Here the host framework has to keep a pointer to the instance of the operator. LibDNN and cuDNN are stateful, BLAS operations are not. Max pooling is stateful, average pooling can be stateless, activations are stateless (etc).
Operators can be single-device or multi-device in their execution.

tqchen commented 7 years ago

For memory interface for complicated operators, I think many of them could be simpler, without an allocator, instead having two functions, CuDNN is actually

A workspace requirement function that have the same signature of the execution function, but allow data field to be nullptr, and returns workspace requirement
A execution function that takes workspace pointer as additional argument.

This cannot cover all the complicated operators(there are some that depends on content of the data) but already include most cases. This removes the need of a memory allocator or lambda function.

naibaf7 commented 7 years ago

@tqchen Gets messy very fast with multi-device operators that are coming up, or if workspace memory needs to be consolidated between multiple operators to save memory. I think it will be hard to get around the memory allocator duality and additional life-cycle functions for stateful operators.

tqchen commented 7 years ago

@naibaf7 I cannot speak for the multi-device operators. But the workspace consolidation problem can be handled easily from the framework side. As a matter of fact, it can even be done statically when a computational graph is available, without relying on dynamic memory allocation

naibaf7 commented 7 years ago

@tqchen If the workspace memory for an operator is fixed maybe, but it's often not; also reshaping a network or operators that switch and autotune algorithms can have dynamic memory requirements. I wouldn't want to take this possibility away for future operators that might come up.

tqchen commented 7 years ago

The assumption is that the workspace for an operator is fixed for the fixed input tensor shape, while the requirement can be re-calculated when the shape goes up. The requirement can be a rough estimation of a maximum space needed, as CuDNN did.

This can always fall back into the dynamic memory approach from the caller side, but leaves that decision to the user of the library.

piiswrong commented 7 years ago

What's the argument against lambs allocator? It's cleaner and more flexible.

Tianqi Chen notifications@github.com于2017年2月3日周五下午4:51写道：

For memory interface for complicated operators, I think many of them could be simpler, without an allocator, instead having two functions, CuDNN is actually

A workspace requirement function that have the same signature of the execution function, but allow data field to be nullptr, and returns workspace requirement

A execution function that takes workspace pointer as additional argument.

This cannot cover all the complicated operators(there are some that depends on content of the data) but already include most cases. This removes the need of a memory allocator or lambda function.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dmlc/mxnet/issues/4735#issuecomment-277402694, or mute the thread https://github.com/notifications/unsubscribe-auth/AAiudKLOfwM9RhavhQ8I_D8QzOavv-7wks5rY8uPgaJpZM4Lobt0 .

tqchen commented 7 years ago

It somewhat prevent the chance of static allocation of workspace. The workspace requirement interface is more restrictive, and enables the two phase strategy(allocation then execution)

I think the argument is not against using allocator when necessary, there are some cases where it is inevitable. but instead provide three categorizations

Those that does not requires workspace memory
Those that can declare workspace requirement before execution
Those that need a allocator

The former ones can be relaxed to the later ones. In general putting operator into the most restrictive types leaves chances for the user to decide what to do with them.

naibaf7 commented 7 years ago

@tqchen Yup. I agree with that last statement of yours. Those that do require an allocator can also be kept simple: They can use their internal allocator and destructor for device memory IF the framework does not care about having full memory management administration of devices. For more restricted operators, the two-phase workspace configuration and execution can be incorporated in the life-cycle functions.

edgarriba commented 7 years ago

@tqchen nice! go ahead and create a repo with simple C structures so that we can start to iterate.

apache / mxnet