apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[Discussion] Sharing Operators between DL Frameworks #4735

Closed tqchen closed 7 years ago

tqchen commented 7 years ago

See the discussion repo at https://github.com/dmlc/dlpack

This discussion started from https://github.com/dmlc/minpy/issues/129 with @soumith. THC is the tensor library that backs Torch. I am opening this issue in the MXNet repo so more developers can see it.

First of all, it is possible to reuse operator libraries between frameworks.

It is always interesting to see interchangeability happen: for example, scheduling PyTorch operations in MXNet's async engine, or running MXNet's declarative API so that data is shared directly with PyTorch's arrays.

However, there are some engineering obstacles to doing so. I would like to explain what these obstacles are, in the hope that this motivates the community to move forward and make such reuse easier.

Coupled Operator Data Structure Components

An operator can mean many things. Here are some basic components that an operator implementation typically bundles together: the tensor data structure it consumes and produces, the memory allocator, and the scheduling/execution context.

Why does such coupling prevent reuse? There are two reasons:

To resolve this problem, an operator library should be designed so that operators accept user-managed memory resources. When possible, it should not introduce its own allocator or resource management, but instead give hints to the user (for example, cuDNN's workspace-size query removes the need for an internal memory allocator).

From this point of view, cuDNN and cuBLAS are good examples. THC is nice, but it still encapsulates a memory allocator (which is sometimes needed for dynamic operators).
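
To make the "hints instead of an internal allocator" pattern concrete, here is a minimal sketch around cuDNN's convolution forward call (descriptor setup and error checking omitted; a framework would normally serve the workspace from its own memory pool rather than calling cudaMalloc directly):

#include <cudnn.h>
#include <cuda_runtime.h>

void conv_forward_user_managed(cudnnHandle_t handle,
                               cudnnTensorDescriptor_t x_desc, const float* x,
                               cudnnFilterDescriptor_t w_desc, const float* w,
                               cudnnConvolutionDescriptor_t conv_desc,
                               cudnnTensorDescriptor_t y_desc, float* y) {
  const cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
  // cuDNN only *reports* how much scratch memory it needs...
  size_t ws_bytes = 0;
  cudnnGetConvolutionForwardWorkspaceSize(handle, x_desc, w_desc, conv_desc, y_desc,
                                          algo, &ws_bytes);
  // ...and the caller decides where that memory comes from.
  void* workspace = nullptr;
  if (ws_bytes > 0) cudaMalloc(&workspace, ws_bytes);
  const float alpha = 1.0f, beta = 0.0f;
  cudnnConvolutionForward(handle, &alpha, x_desc, x, w_desc, w, conv_desc, algo,
                          workspace, ws_bytes, &beta, y_desc, y);
  if (workspace) cudaFree(workspace);
}

Because the library never owns the workspace, the adopting framework stays free to plan and reuse that memory however it likes.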

Lack of Unified Operator Interface

The second obstacle is mainly the lack of a common operator interface. This is a problem with cuDNN and THC that prevents reuse. Take cuDNN for example: each cuDNN API is a C function with its own interface, so to adopt an operator there needs to be one (or more) adapter function per operator.

Consider instead a unified operator interface (the following is a mock design), where each TBlob is a reference to the data fields and shape, and every function is registered to a registry under its name:

using FCompute = std::function<void(
    array_view<TBlob> ins, array_view<TBlob> outs,
    const std::map<std::string, std::string>& kwargs,  // operator attributes
    StreamHandle stream)>;                              // execution stream, e.g. a cudaStream_t

Then it only takes one function to extract and reuse all the operators, and to automatically expose them to the front end. In MXNet, the symbolic counterpart is even generated directly from the same imperative operator when a gradient is provided.
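
As a rough illustration of the single-adapter point, here is a hypothetical, self-contained sketch (not MXNet's or NNVM's actual registry API; array_view is approximated by a plain vector and StreamHandle is a placeholder) of a name-to-FCompute registry:

#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct TBlob {                        // a reference to the data fields and shape
  void* data;
  std::vector<int64_t> shape;
  int dtype;
};

template <typename T>
using array_view = std::vector<T>;    // plain vector standing in for a non-owning view

using StreamHandle = void*;           // placeholder for an execution stream, e.g. a cudaStream_t

using FCompute = std::function<void(
    array_view<TBlob> ins, array_view<TBlob> outs,
    const std::map<std::string, std::string>& kwargs, StreamHandle stream)>;

// name -> implementation; a framework discovers operators by walking this map.
std::map<std::string, FCompute>& OpRegistry() {
  static std::map<std::string, FCompute> registry;
  return registry;
}

// The single adapter a framework needs in order to expose every registered operator.
void InvokeOp(const std::string& name, array_view<TBlob> ins, array_view<TBlob> outs,
              const std::map<std::string, std::string>& kwargs, StreamHandle stream) {
  OpRegistry().at(name)(ins, outs, kwargs, stream);
}

With something like this in place, a framework wraps InvokeOp once (plus its own scheduling and memory planning around it) instead of writing one adapter per operator.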

Problem of One Unified Operator Interface

There is always a flip side to the coin. Assume we go with a unified operator interface; as a matter of fact, that is what MXNet, TensorFlow and Caffe have done. The problem now becomes: what should the interface look like? One trap that framework designers always fall into is thinking that we need one interface to rule them all.

Since one interface rules them all, we want to support all possible operators. What about the ones that need runtime memory allocation? Maybe add a memory allocator to it. What about the ones that are asynchronous? In the end, the interface has to include the memory allocator and the scheduling module in some way, and that reintroduces the "Coupled Operator Data Structure Components" problem: the operator interface becomes deeply coupled with the rest of the framework and is no longer reusable.

A Better Solution: A Few Unified Interfaces

Can we get the best of both worlds: having as few data structures and interfaces as possible, while introducing as little coupling to the allocator and scheduler as possible? I think the answer is yes, and we need to step away from the ideal of one interface that rules all operators.

I can roughly categorize the operators into three types:

  1. Type 1: basic operators whose inputs and outputs are pre-allocated by the caller and which need no additional resources.
  2. Type 2: operators that additionally need a temporary workspace, whose size can be declared up front (as cuDNN does with its workspace queries).
  3. Type 3: dynamic operators, whose output shapes or resource needs depend on the input data, so they require runtime memory allocation or asynchronous execution.

If we design for a general operator interface, the answer will usually look like type 3. However, types 1 and 2 account for 90%+ of the major operators we are using. If we design one operator interface for each type, the problem is solved: frameworks can pull in and interact with each type in their own way. It is much easier to do things like static memory planning if type 1 and type 2 are explicitly introduced. This is one additional layer of wrapping on top of THC and cuDNN that is lacking so far; a sketch of what it could look like follows.
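
For concreteness, here is one hypothetical shape the "one interface per type" idea could take (illustrative names, not an existing MXNet or NNVM API), building on the placeholders from the earlier sketch:

// Reuses the TBlob / array_view / StreamHandle placeholders from the sketch above.
#include <cstddef>
#include <functional>

// Type 1: outputs pre-allocated by the caller, no extra resources needed.
using FComputeBasic = std::function<void(
    array_view<TBlob> ins, array_view<TBlob> outs, StreamHandle stream)>;

// Type 2: like type 1, plus a caller-provided temporary workspace. Exposing the
// size query separately is what makes static memory planning straightforward.
using FWorkspaceSize = std::function<size_t(array_view<TBlob> ins)>;
using FComputeWorkspace = std::function<void(
    array_view<TBlob> ins, array_view<TBlob> outs,
    void* workspace, size_t workspace_bytes, StreamHandle stream)>;

// Type 3 (dynamic or asynchronous operators) would still need a richer context,
// but that complexity no longer leaks into the interfaces for types 1 and 2.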

A registry system like NNVM could come in very handy for registering this information so that it can be pulled out by the libraries.

The Hope

I have always hoped for a minimal standard set of operator interfaces in C++ that can be shared across libraries. I think we have a good idea of what the solution looks like. While most systems tend to become opaque and coupled, I think this kind of transparency can help the community evolve in a healthy way. That being said, it always takes effort to make these things happen. It requires an open discussion on what the interfaces should be, and commitment from framework builders. I would really love to see this happen, and that is why I spent more than an hour writing this.

Unfortunately, most frameworks already have a more or less "sufficient collection of operators", so a unified operator interface will contribute little to each framework's usability in the short term. Naturally it would be given lower priority. That is why commitment is needed to carry this through for the longer-term benefit.

mli commented 7 years ago

how about libop?

i prefer not to use "tensor" because mathematically a tensor has a rich set of properties, while most of the operators we use are just element-wise, so "n-dimensional array" is a better name for the data structure.

Yangqing commented 7 years ago

Following the idea of BLAS we can probably call it BDAS (basic deep-learning algebra subprograms) - sounds like "badass".

piiswrong commented 7 years ago

LOL. In that spirit how about BAsic Neural Artificial Network Algebra Subroutines (BANANAS)

mli commented 7 years ago

Or Deep Learning PACKage, DLPACK, motivated from LAPACK. We are providing more than basic subprograms.

bhack commented 7 years ago

DLPACK is not so bad.

futurely commented 7 years ago

TensorFlow & Keras combined have the largest user base and are growing most rapidly. You should bring those guys on board for this proposal to make the biggest impact.

http://www.timqian.com/star-history/#tensorflow/tensorflow&fchollet/keras&dmlc/mxnet&BVLC/caffe&Microsoft/CNTK&torch/torch7&Theano/Theano

bhack commented 7 years ago

/cc @fchollet

cyberfire commented 7 years ago

Just two points:

  1. Before naming the project, the targets and boundaries of the project should be made clear first: how many issues there are, and which issues are to be resolved at what stage.
  2. Use an agile approach and start from easy targets: ones that can probably be resolved quickly and on which it is easy to get agreement from all participants.

jli05 commented 7 years ago

As a side topic, I personally think that how to allow MXNet to scale out over micro-kernel multi-server OSes and scale down to battery-limited devices is also important.

tqchen commented 7 years ago

created a repo here https://github.com/dmlc/dlpack

tqchen commented 7 years ago

Let us move the discussion to https://github.com/dmlc/dlpack/issues.