[Discussion] Sharing Operators between DL Frameworks

tqchen commented 7 years ago

This discussion started from https://github.com/dmlc/minpy/issues/129, with @soumith THC is a tensor library that backs torch. I open this issue in MXNet repo so more developers can see it.

First of all, it is possible reuse operator libraries between frameworks, for example

Support for THC and Torch Module was done in Torch Plugin, with interfacing to torch's lua library.
MXNet supports reuse operators from caffe

It is always interesting to see interchangeability happen. For example, schedule pytorch operations in mxnet's async engine, or run mxnet's declarative API to directly share data with pytorch's array.

However, there is some engineering obstacles in doing so, which I would like to explain what these obstacles are, and hopefully this can motivate the community to move forward, and make this easier.

Coupled Operator Data Structure Components

An operator can mean many things, here are some basic components on what the operators are:

Data structure that holds(shape) pointers to the array
Possible memory allocator to handle run-time memory allocation
Resource handles, if external resources is needed
Scheduling related objects if array support synchronize execution

Why such coupling prevents reuse? There are two reasons

Many systems have their own memory allocator and ways of resource handling code.
While having memory allocator enables runtime memory allocations, sometimes memory allocation is not preferred at all(e.g. BLAS calls where all memory are pre-allocated)

To resolve this problem, an operator library design should enable operators that accept user managed memory resources, when possible, not introduce allocator or resource management, but give hints to the user(CuDNN's workspace requirement eliminates the need to internal memory allocator).

From this point of view, CuDNN an cuBLAS are good examples. THC is nice, but still encapsulate memory allocator(which is needed sometimes for dynamic operators).

Lack of Unified Operator Interface

The second obstacle is mainly lack of common operator interface. This is a problem of CUDNN and THC that prevents reusing. Take CuDNN for example, each CuDNN API is a C function, with its own interface, to adopt the operator, there need to be one(or multiple) adapting function per operator.

Consider instead, if there is an unified operator interface(the following is a mock design), where each TBlob is a reference to the data fields and shape, and every function gets registered to the registry with their name

using FCompute = std::function<void (
   array_view<TBlob> ins, array_view<TBlob> outs, map kwargs, stream stream)>

Then it only takes one function to extract, and reuse all operators and automatically expose them to front end. In MXNet, it even directly generates the symbolic counterpart from the same imperative operator, if gradient is provided.

Problem of One Unified Operator Interface

There is always a flip side of the coin. Assume that we go with a unified operator interface. As a matter of fact, that is what MXNet, TensorFlow and Caffe have done. The problem now becomes what the interface should look like? One trap that framework designer always falls into is that we need one interface that rules them all.

Since one interface rules them all, we want to support all possible operators, what about the ones that need runtime memory allocations? Maybe add memory allocator to it, what about the ones that is asynchronize? In the end, the interface have to include memory-allocator, scheduling module in some way, and that introduces the "Coupled Operator Data Structure Components" problem. The operator interface become deeply coupled with the rest of the framework and not reusable.

A Better Solution: A Few Unified Interfaces

Can we get the best of both worlds, having as few data structures and interfaces as possible, while still not introducing coupling to allocator and scheduling as much as possible? I think the answer is yes and we need to jump out from the ideal of one interface that rules all the operators.

I can categorize the operators roughly in three categories

type1: Basic operators: The ones that can do shape inference based on input shape, can take memory pointer, stream and go
type2: Basic+ operators: Same as basic operator, but also need to declare some additional resources(workspace)
type3: Complicated operators: The ones that requires runtime memory allocator, its output shape depends on content of the data.

If we design for general operator interface, the answer will usually looks like type3. However, type 1 and 2 dominates 90%+ of the major operators we are using. If we design one operator interfaces for each type, this problem is solved. So that frameworks can pull and interact with each type in their own way. It is much easier to do things like static memory planning if type1 and type2 are explicitly introduced. This is one additional layer of wrapping on top of THC and CuDNN is is lacking so far.

A registry system like NNVM could come very handy to easily resgister these informations, and get pull out by the libraries.

The Hope

I have always hopped that there is a minimum set of operator interface standard in C++, that can be shared across libraries. I think we have a good idea on what the solution looks like. While most system tends to become opague and coupled, I think this kind of transparent way can help evolve the community in a healthy way. This being said, there is always effort to make these happen. This involves a open discussion on what the interfaces should be and commitment from framework builders. I would really love to see this happen, and that is why I spend more than one hour writing this.

Unfortunately, most frameworks already have kinda of "enough collection of operators", so having a unified operator interface will contribute little to each framework in terms of usability in short term. Naturally this would be given lower priority. That is why commitment is needed to bring this out for longer term benefit

edgarriba commented 7 years ago

I can start with the c++ wrapper once the baby starts to walk

piiswrong commented 7 years ago

I think a c++ wrapper on top of pure c interface is overkill.

This library is intended to be used through DL frameworks, not standalone. You wouldn't create a frontend binding just for this library, which would be essentially reinventing Torch. The top priority for it should be easy compiling/linking. If possible it should be a header only library with c++11 interface from ground up. Although if we want to support opencl it's hard to make it header only

edgarriba commented 7 years ago

This library is intended to be used through DL frameworks, not standalone.

@piiswrong It is? Apart from the name of the issue :D

piiswrong commented 7 years ago

A tensor should have the following attributes: ndim, dtype, device (cpu/gpu/ocl), data_ptr, shape, stride

We need to decide which one goes to field and which one goes to template argument. If ndim is in template argument we enjoy the benefit of on stack allocation of shape/stride. The down side is caller need to switch over ndim

piiswrong commented 7 years ago

@edgarriba I hope so. Otherwise it's reinventing Torch, which doesn't make much sense. I'm hoping we can solve, or at least mitigate, the problem that there are too many redundant work in DL frameworks, not add to it ;)

Yangqing commented 7 years ago

Honestly speaking, I would vote for a C interface, not because I personally like C, but because over the years, I have seen a lot of opinions around C++. Tianqi's argument about ABI compatibility is one prominent reason for using C as a compatibility layer. Also, languages such as Python and Torch would need a C FFI in any case, and having C++ on top of C is definitely better than having C on top of C++.

And the heavy use of templates, especially template metaprogramming (I am sure we are not talking about this here, but FWIW) is also pretty controversial. One can find a lot of similar cases in the argument for and against Eigen actually.

piiswrong commented 7 years ago

It depends on the purpose of the library.

If you want to use it standalone, then having C interface is better.

But if you want to use it through frame works, C++ is much more convenient because you can pass the Tensor object around in your framework and eventually replace your own data blob with a standard Tensor. Using C interface means you have to convert to/from your data blob to Tensor repeatedly in each operator. It's slow and need a ton of wrapper code. For small operators its not worth doing.

Take cub https://github.com/NVlabs/cub for an example, it's a header only library with c++ interface.

Torch's internal THC also uses C++. It's only the interface that's pure C.

Yangqing commented 7 years ago

Anyway, just my 2 cents :)

piiswrong commented 7 years ago

I also don't like template metaprogramming, because it's a pain to maintain. Very few people can understand it.

But simple template arguments like template <typename Device, typename ndim> is OK

soumith commented 7 years ago

I'm against c++ for the abi. let's stick to C and build a c++ wrapper on top if needed. Hourglass interfaces are great

bhack commented 7 years ago

Just to add more info on the ABI compatibility issue.

bhack commented 7 years ago

A resource for Hourglass

piiswrong commented 7 years ago

Just to make sure every one is on the same page:

How do you think this lib will be used? From framework backend? From frontend? Header only? Compiled together with framework? Compiled separately and linked to framework?
ABI is only relevant if you want to distribute this lib as binaries and link it against frameworks dynamically. What's the use case for this?
What kinds of operators do you think this lib should cover? elementwise? broadcast/reduce? NN operators?

bhack commented 7 years ago

I agree with @piiswrong. Probably an agreement on this kind of general boundings it is required.

kernel8liang commented 7 years ago

this let me recall my first boss, but this is great.

tqchen commented 7 years ago

Here is my opinion on @piiswrong 's points, to summarized, I think the advantage of such shared belief is not only for sharing between projects, but also encourage others to help.

This library will be used from back-end, the standard data structure will be header only.
- Additional library can be compiled and linked optionally
- This is important to encourage not only the participant in this issue to conform to the standard, but also encourage third party vendors to do so, and expose binaries.
ABI compatibility is great if some of the library can be compiled and linked separately for vendors to implement some of the pre-defined interfaces.
I think the basic agreement is on the common tensor structure, plus the categorization principle of the operators.
- Ideally I think it should cover ewise, broadcast/reduce and NN
- We should also encourage implementers to conform to the same data structure, and possible recommended signatures,

My guess is that when there is sharing, there is interface and compatibility issue and we always need to consider what is the minimum interface out there.

tqchen commented 7 years ago

I agree with @piiswrong that it might be helpful to have everyone agree on certain things before we proceed. One way to do so is each of us post a code snippet on what the data structure and interface should look like, then we summarize the points that need to be taken.

Once we have enough agreement on what the initial data structure and function signature look, we can open a repo and iterate from there.

Here is my version of strawman's design(based on my summarization of majority opinion here), that summarizes some of the opinions in here.

typedef {
  int device_id;
  int device_type;
} CContext;

typedef struct {
   // alignment can be checked directly in address
   void* data;
   size_t ndim;
   size_t* shape;
   // can be nullptr, indicating compact version.
   size_t* strides;
   // device
   CContext ctx;
   // data type
   int dtype;
} CTensor;

// utilize thread local storage to get last error encountered.
const char* GetLastError();
// used by implementer to set error message.
void SetLastError(const char* msg);
// C API returns 0 when success, -1 when failure, 
//  -2 indicating contiguous input is needed, ask user to convert before calling again
int BroadCastAdd(CTensor* x, CTensor* y, CTensor* out);

bhack commented 7 years ago

Need we to handle not contiguos cases?

tqchen commented 7 years ago

@bhack I would rather not in most cases, and ask user to get a contiguous version before calling(I guess @Yangqing shares the same opinion), but I think majority opinions here want stride as argument.

piiswrong commented 7 years ago

Suppose you have a c++ interface and source release from venders and now you need a new simple unary op, you can easily do it by defining a new inline function and pass it to template argument.

But if you have c interface and binary from vendor there is no way to do it yourself.

I think we can push for source release from vendors. Nv and intel are already doing this with mkldnn and cub.

bhack commented 7 years ago

On c++ side I was also thinking what kind of limits we could have with a plugin like cross-platform solution in c++

Yangqing commented 7 years ago

Yeah I agree with TQ that we prefer non strider version, but if things are better strides for performance and or other reasons, it is not a hard constraint - we will need to stride stuff sometimes anyway.

Per Eric's question about C++, I suspect we will never have the vendors give us all the source code for the high performance part :p

naibaf7 commented 7 years ago

@Yangqing But it should be easy in most of the cases to write the compatible wrapper around the high performance code... nice to see you here btw. :)

piiswrong commented 7 years ago

We can have both versions. Strided and optionally non strided for each function.

Let's see how much source they are willing to give out @ap-hynninen @glingyan @zhenlinluo

bhack commented 7 years ago

What is the device type and id in cases like https://github.com/CNugteren/CLCudaAPI/issues/11 and generally with unified memory devices?

tqchen commented 7 years ago

@bhack In my understanding, the UMA does not prevent you from having a primary device id, except that that it will permit operators to take input from different device_id and launch on a certain device

The same thing goes with OpenCL kernels. Although OpenCL creates an illusion that a buffer can belong to any device. Having a primary device id per memory is still helpful. The fact is that the underlying driver will create mirror copies on certain devices, having a tensor primary operate on one device will certainly have the benefits(in terms of scheduling etc)

naibaf7 commented 7 years ago

@tqchen How should offsets inside memory objects/tensors be provided? What if multiple tensors are packed inside one OpenCL buffer object?

soumith commented 7 years ago

i like @tqchen 's structs and APIs.

@naibaf7 *data can be an offsetted pointer?

Yangqing commented 7 years ago

Yeah offset pointer would be great, and if I may, can we enforce things to be aligned to like 256 or 512? That might make things maximally compatible with multiple optimization mechanisms.

On Sat, Feb 4, 2017 at 5:09 PM Soumith Chintala notifications@github.com wrote:

i like @tqchen https://github.com/tqchen 's structs and APIs.

@naibaf7 https://github.com/naibaf7 *data can be an offsetted pointer?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dmlc/mxnet/issues/4735#issuecomment-277489931, or mute the thread https://github.com/notifications/unsubscribe-auth/AAho7x0E9D6byc96UsJ1pV_se6Dp-dGBks5rZSFYgaJpZM4Lobt0 .

-- Sent from Gmail Mobile - apologies for any typoz.

soumith commented 7 years ago

To address @naibaf7 's concerns, and to take into account cleaner Tensor views, here's what I propose (we do this in Torch already):

typedef {
  int device_id;
  int device_type;
} CContext;

typedef {
   void* data;         // data pointer
   CContext ctx;       // device
   int dtype;          // data type
   int refcount;
} Buffer;

typedef struct {
   Buffer data; 
   int offset;             // offset in buffer
   size_t ndim;
   size_t* shape;
   size_t* strides;    // can be nullptr, indicating compact version.
   int refcount;
} CTensor;

This way two problems are solved:

buffer offsets are explicit
When you have to create a View "B" on a Tensor "A", you simply point B to the underlying Buffer of A. That way, A can be freed anytime, independent of B.

In Torch, we call them Storage and Tensor, but whatever naming is more standardized...

I'm not super particular about this change, just proposing it based on feedback.

naibaf7 commented 7 years ago

@soumith Okay, however that's not going so nicely with OpenCL right now. Maybe a separate offset parameter should also be considered. EDIT: Too late, I like your updated suggestion :).

@Yangqing Should such alignment requirements be enforced or should there be metadata for operators that can specify optimal and necessary workspace conditions?

piiswrong commented 7 years ago

A tensor shouldn't own memory since it's only a wrapper around memory managed by the framework. So you won't need an offset.

soumith commented 7 years ago

You would need an offset even if the Tensor doesn't own memory. The Tensor can only be constructed on a slice of a larger Buffer. This is what @naibaf7 is saying we should cover, and I know from Torch's experience that this comes up very often.

naibaf7 commented 7 years ago

@piiswrong I think that concept is also maintained by what @soumith says. Basically the CTensor and eventually CMatrix or CVector structs carry metadata of a tensor/matrix/vector, one of which is a CBuffer struct that specifies what memory from the host framework is being used. CMatrix and CVector could also be used to specify dimension reduced areas of a tensor in order to simplify the use of BLAS libraries alongside the new DNN operators. The operators can then internally combine the information, such as checking memory requirements and passing buffer+offset into the kernel (OpenCL) or compute the unified memory pointer (CUDA, CPU) and execute the compute kernel.

tqchen commented 7 years ago

refcount is not needed if memory container is managed by the framework. Note that you can always add it back and maintain compatibility by sub-classing Tensor. If memory is not managed, maybe Buffer is also not needed (only have a Tensor with possible offset).

@piiswrong offset is useful for things like OpenCL, where the cl_mem object is opague and you cannot do explicit addressing.

bhack commented 7 years ago

I think that we are basically converging on a solution. But with clSVMAlloc what device id we set?

naibaf7 commented 7 years ago

@bhack I think this doesn't need to be of concern for the interface. That kind of memory can be associated with multiple CBuffer/CTensor objects of multiple devices, if the host framework knows that this is safe to do.

bhack commented 7 years ago

@naibaf7 Ok but so the device id is it not an info more oriented to the "executor"?

edgarriba commented 7 years ago

And what interfaces do you propose for constructors? Which are the required and optionals? Besides, do we want to support tensor with dynamic size or just static?

FYI, in case you want to start to play I've setup this https://godbolt.org/g/YMnxHB

tqchen commented 7 years ago

I think the data offset @soumith proposed looks good. Here is one more modification I propose(also incorporated @Yangqing comment about requrement of data alignment):

typedef {
  int device_id;
  int device_type;
} CContext;

typedef struct {
   void* data;           // data pointer(or cl_mem handle in OpenCL), align to 256 
   CContext ctx;       // device
   int ndim;
   int dtype;              // data type
   size_t offset;         // offset in buffer
   size_t* shape;
   size_t* strides;    // can be nullptr, indicating compact version.
} CTensor;

I propose to merge the buffer into the Tensor structure and remove ref counting, since CTensor is not a container type (which manages memory).

As a matter of fact, if we really want the data structure to be able to do memory management, Buffer should not be a member of Tensor, but instead we need Buffer* to be member(so ref count on unique Buffer object can be done correctly). This is a bit overkill when we don't want memory management.

Here is what we can do in frameworks side to bring the memory management back. I use C++ style with shared_ptr for simplicity, allow thread safe destruction(can also use refcount).

// memory container
class TensorContainer: public CTensor {
  public:
    // This is debatable
    // Ideally should be small_vector<4, size_t> shape
    // so small shape sits in stack.
    // The dimension also duplicates with ndim field.
    std::vector<size_t> shape_;
    void Reshape(const std::vector<size_t>& shape) {
       shape_ = shape;
       this->shape = BeginPtr(shape);
       this->ndim = shape.size();
       // reset memory
    }
    ~TensorContainer() {
       if (origin_ == nullptr) {
         allocator.delete(data);
       }
    } 
  private:
    // If it is a view, buffer_ is the pointer to the original container.
    // Otherwise, buffer_ is nullptr
    std::shared_ptr<TensorContainer> origin_;
};

// handle type that uses the container.
class Tensor  {
  public: 
     // can pass to argument that takes CTensor
    operator CTensor() const { 
      const CTensor& dat = *data_;
      return dat; 
    }     
     // can pass to argument that takes CTensor*
    operator CTensor*() const { 
      return dat.get();
    }     
  private:
    std::shared_ptr<TensorContainer> data_;
};

tqchen commented 7 years ago

@edgarriba I think there are two concepts here. The CTensor which serves as memory handle(like pointer) which may not have memory management.

And a Container type that does memory management, hopefully via c++1x which wraps the C API. I think on that end. The requirement on dynamic size etc. is on the container type.

bhack commented 7 years ago

So the op will consume or write data only on the device type it finds in device_id right?

tqchen commented 7 years ago

I think we are not yet very clear on operator specification. Especially, where does the context get set(e.g. cudaSetDevice), how to specify context resources(e.g. cuDNN handle).

It can be done by a thread local workspace that can be mutated by setter(e.g. cublasSetStream)
Alternatively, this can be done by an additional handle argument (which could make interface a bit more complicated)

tqchen commented 7 years ago

This has been inactive for a few days. I would like to ping back and see if the people involve in this thread are interested in any form of moving forward.

We can start with a neutral repo with the involved project lead as owners(I can do it under dmlc, my own repo, or we can start a org), eventually we can move the repo to a neutral organization.

The first things to be finalized are

The tensor interface
Example operator interface in three category
The C++ wrapper of the interface

I do not have a very good idea on how we should name the project. One possible name could be ctensor (as c indicates common). Please suggest better name candidates you have in mind

bhack commented 7 years ago

@tqchen It Is ok for us (tiny-dnn).. I don't know if it is limited to Tensor for naming cause we are already talking about ops interfaces.

tqchen commented 7 years ago

I think we are not restricted to tensor for naming, I just proposed one because I am kinda of running out of ideas for naming, so please suggest names

edgarriba commented 7 years ago

CTensor sounds good to me

bhack commented 7 years ago

Common Dnn Api/Core Dnn Api/Common Kernel Api etc..?

naibaf7 commented 7 years ago

@tqchen I'd happy to wrap my LibDNN standalone library with the new interface once a reasonable collection of headers is available, to see how well that would work :) And I guess @edgarriba could test out the host-side code in tiny-dnn?

@bhack What do you mean by that? Probably nothing beyond the interface for operators should be shared in this project?

bhack commented 7 years ago

@naibaf7 I think that the boundaries are operators API that ingest this common tensor design.. I don't think that executors/scheduling will come on the table. What is your vision?

apache / mxnet