tqchen closed this issue 7 years ago
I can start with the c++ wrapper once the baby starts to walk
I think a c++ wrapper on top of pure c interface is overkill.
This library is intended to be used through DL frameworks, not standalone. You wouldn't create a frontend binding just for this library; that would essentially be reinventing Torch. The top priority should be easy compiling/linking. If possible it should be a header-only library with a C++11 interface from the ground up, although if we want to support OpenCL it's hard to make it header-only.
This library is intended to be used through DL frameworks, not standalone.
@piiswrong It is? Apart from the name of the issue :D
A tensor should have the following attributes: ndim, dtype, device (cpu/gpu/ocl), data_ptr, shape, stride
We need to decide which attributes go into fields and which go into template arguments. If ndim is a template argument, we enjoy the benefit of on-stack allocation of shape/stride; the downside is that the caller needs to switch over ndim.
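To make the tradeoff concrete, here is a hypothetical sketch (names like `StaticTensor` and `DispatchNdim` are mine, not part of any proposal): with ndim as a template argument, shape and strides live on the stack inside the object, but a caller holding a runtime ndim has to dispatch over it.

```cpp
#include <cstddef>
#include <array>

// ndim as a template argument: shape/strides need no heap allocation.
template <int ndim>
struct StaticTensor {
  void* data;
  std::array<size_t, ndim> shape;
  std::array<size_t, ndim> strides;
};

// The cost: a caller with a runtime ndim must switch over it to reach
// the right instantiation.
template <typename Func>
void DispatchNdim(int ndim, Func&& f) {
  switch (ndim) {
    case 1: f(StaticTensor<1>{}); break;
    case 2: f(StaticTensor<2>{}); break;
    case 3: f(StaticTensor<3>{}); break;
    case 4: f(StaticTensor<4>{}); break;
    default: /* fall back to a dynamic-ndim path */ break;
  }
}
```

With ndim as a runtime field (as in the C struct proposals below), the switch disappears but shape/strides become heap pointers.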
@edgarriba I hope so. Otherwise it's reinventing Torch, which doesn't make much sense. I'm hoping we can solve, or at least mitigate, the problem that there is too much redundant work across DL frameworks, not add to it ;)
Honestly speaking, I would vote for a C interface, not because I personally like C, but because over the years, I have seen a lot of opinions around C++. Tianqi's argument about ABI compatibility is one prominent reason for using C as a compatibility layer. Also, languages such as Python and Torch would need a C FFI in any case, and having C++ on top of C is definitely better than having C on top of C++.
And the heavy use of templates, especially template metaprogramming (I am sure we are not talking about this here, but FWIW) is also pretty controversial. One can find a lot of similar cases in the argument for and against Eigen actually.
It depends on the purpose of the library.
If you want to use it standalone, then having C interface is better.
But if you want to use it through frameworks, C++ is much more convenient, because you can pass the Tensor object around in your framework and eventually replace your own data blob with a standard Tensor. Using a C interface means you have to convert your data blob to/from Tensor repeatedly in each operator. That is slow and needs a ton of wrapper code; for small operators it's not worth doing.
Take cub https://github.com/NVlabs/cub as an example: it's a header-only library with a C++ interface.
Torch's internal THC also uses C++. It's only the interface that's pure C.
Anyway, just my 2 cents :)
I also don't like template metaprogramming, because it's a pain to maintain. Very few people can understand it.
But simple template arguments like template <typename Device, int ndim> are OK.
I'm against C++ for ABI reasons. Let's stick to C and build a C++ wrapper on top if needed. Hourglass interfaces are great.
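For readers unfamiliar with the hourglass pattern: the implementation and the convenience layer can both be C++, with only a narrow C ABI in the middle. A minimal sketch (all names and the stub body are hypothetical, for illustration only):

```cpp
#include <stdexcept>

extern "C" {
// The narrow waist: a plain C ABI that stays stable across compilers.
typedef struct CTensorHandle { double value; } CTensorHandle;

static const char* hourglass_last_error = "";
const char* HourglassGetLastError() { return hourglass_last_error; }

// Stub body just for illustration; a real implementation would
// dispatch an actual kernel behind this C entry point.
int HourglassTensorAdd(CTensorHandle* x, CTensorHandle* y,
                       CTensorHandle* out) {
  if (x == nullptr || y == nullptr || out == nullptr) {
    hourglass_last_error = "null argument";
    return -1;
  }
  out->value = x->value + y->value;
  return 0;
}
}  // extern "C"

// A thin, header-only C++ layer a framework could build on top.
class Tensor {
 public:
  explicit Tensor(CTensorHandle* h) : h_(h) {}
  Tensor& operator+=(const Tensor& rhs) {
    if (HourglassTensorAdd(h_, rhs.h_, h_) != 0)
      throw std::runtime_error(HourglassGetLastError());
    return *this;
  }
 private:
  CTensorHandle* h_;
};
```

The C layer carries the ABI; the C++ layer adds ergonomics without leaking C++ symbols across the library boundary.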
Just to make sure everyone is on the same page:
I agree with @piiswrong. We probably need agreement on these kinds of general boundaries first.
This reminds me of my first boss, but this is great.
Here is my opinion on @piiswrong's points. To summarize, I think the advantage of such a shared belief is not only sharing between projects, but also encouraging others to help.
My guess is that wherever there is sharing, there are interface and compatibility issues, and we always need to consider what the minimum interface out there should be.
I agree with @piiswrong that it might be helpful to have everyone agree on certain things before we proceed. One way to do so is each of us post a code snippet on what the data structure and interface should look like, then we summarize the points that need to be taken.
Once we have enough agreement on what the initial data structures and function signatures look like, we can open a repo and iterate from there.
Here is my strawman design, based on my summary of the majority opinion in this thread:
typedef struct {
  int device_id;
  int device_type;
} CContext;
typedef struct {
  // alignment can be checked directly on the address
  void* data;
  size_t ndim;
  size_t* shape;
  // can be nullptr, indicating a compact (contiguous) layout
  size_t* strides;
  // device
  CContext ctx;
  // data type
  int dtype;
} CTensor;
// uses thread-local storage to get the last error encountered
const char* GetLastError();
// used by the implementer to set the error message
void SetLastError(const char* msg);
// C APIs return 0 on success and -1 on failure;
// -2 indicates contiguous input is needed: ask the user to convert before calling again
int BroadCastAdd(CTensor* x, CTensor* y, CTensor* out);
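To make the error convention concrete, here is a compilable sketch under stated assumptions: the `BroadCastAdd` body is a stub that only demonstrates the return codes (the real kernel is omitted), while the thread-local error helpers follow the comment in the proposal.

```cpp
#include <cstddef>
#include <string>

// Minimal stand-ins for the structs proposed above.
typedef struct { int device_id; int device_type; } CContext;
typedef struct {
  void* data;
  size_t ndim;
  size_t* shape;
  size_t* strides;  // nullptr means compact layout
  CContext ctx;
  int dtype;
} CTensor;

// Thread-local last-error storage, per the proposal's comment.
static thread_local std::string last_error;
const char* GetLastError() { return last_error.c_str(); }
void SetLastError(const char* msg) { last_error = msg; }

// Stub operator illustrating the return-code convention:
// 0 success, -1 failure, -2 "make inputs contiguous and retry".
int BroadCastAdd(CTensor* x, CTensor* y, CTensor* out) {
  if (x == nullptr || y == nullptr || out == nullptr) {
    SetLastError("null tensor argument");
    return -1;
  }
  if (x->strides != nullptr || y->strides != nullptr) {
    return -2;  // caller compacts the inputs, then calls again
  }
  // ... real kernel dispatch would go here ...
  return 0;
}
```

The caller-side pattern is then: on -2, produce a contiguous copy and retry; on -1, surface `GetLastError()` to the user.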
Do we need to handle non-contiguous cases?
@bhack I would rather not in most cases, and instead ask the user to get a contiguous version before calling (I guess @Yangqing shares the same opinion), but I think the majority opinion here wants strides as an argument.
Suppose you have a C++ interface and a source release from vendors, and now you need a new simple unary op: you can easily do it by defining a new inline function and passing it as a template argument.
But if you have a C interface and a binary from the vendor, there is no way to do it yourself.
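A minimal sketch of the point above, with assumed names (`UnaryMap` and `Softsign` are illustrative, not from any vendor library): a templated elementwise kernel lets a user add a new op without touching the library.

```cpp
#include <cstddef>
#include <cmath>

// A generic elementwise kernel, as a vendor source release might ship it.
template <typename Op, typename T>
void UnaryMap(const T* in, T* out, size_t n) {
  for (size_t i = 0; i < n; ++i) out[i] = Op::Map(in[i]);
}

// A user-defined op, added by defining one inline functor and passing
// it as the template argument -- no new library entry point needed.
struct Softsign {
  template <typename T>
  static T Map(T x) { return x / (T(1) + std::fabs(x)); }
};
```

Usage: `UnaryMap<Softsign>(in, out, n);`. With a pure C binary interface, adding Softsign would instead require a new exported symbol from the vendor.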
I think we can push for source releases from vendors. NV and Intel are already doing this with MKL-DNN and cub.
On the C++ side I was also wondering what kind of limits we would have with a plugin-like cross-platform solution in C++.
Yeah, I agree with TQ that we prefer the non-strided version, but if strides are better for performance or other reasons, it is not a hard constraint; we will need to handle strided data sometimes anyway.
Per Eric's question about C++, I suspect we will never have the vendors give us all the source code for the high performance part :p
@Yangqing But it should be easy in most of the cases to write the compatible wrapper around the high performance code... nice to see you here btw. :)
We can have both versions for each function: a strided one, and optionally a non-strided one.
Let's see how much source they are willing to give out @ap-hynninen @glingyan @zhenlinluo
What is the device type and id in cases like https://github.com/CNugteren/CLCudaAPI/issues/11 and generally with unified memory devices?
@bhack In my understanding, UMA does not prevent you from having a primary device id; it just permits operators to take input from different device_ids and launch on a chosen device.
The same goes for OpenCL kernels. Although OpenCL creates the illusion that a buffer can belong to any device, having a primary device id per memory is still helpful. In fact, the underlying driver will create mirror copies on certain devices, so having a tensor primarily operate on one device will certainly have benefits (in terms of scheduling, etc.).
@tqchen How should offsets inside memory objects/tensors be provided? What if multiple tensors are packed inside one OpenCL buffer object?
i like @tqchen 's structs and APIs.
@naibaf7 *data can be an offsetted pointer?
Yeah, an offset pointer would be great, and if I may, can we enforce alignment to something like 256 or 512 bytes? That might make things maximally compatible with multiple optimization mechanisms.
To address @naibaf7 's concerns, and to take into account cleaner Tensor views, here's what I propose (we do this in Torch already):
typedef struct {
  int device_id;
  int device_type;
} CContext;
typedef struct {
  void* data;   // data pointer
  CContext ctx; // device
  int dtype;    // data type
  int refcount;
} Buffer;
typedef struct {
  Buffer data;
  int offset;      // offset in buffer
  size_t ndim;
  size_t* shape;
  size_t* strides; // can be nullptr, indicating compact version
  int refcount;
} CTensor;
This way two problems are solved:
In Torch, we call them Storage and Tensor, but whatever naming is more standardized...
I'm not super particular about this change, just proposing it based on feedback.
@soumith Okay, however that's not going so nicely with OpenCL right now. Maybe a separate offset parameter should also be considered. EDIT: Too late, I like your updated suggestion :).
@Yangqing Should such alignment requirements be enforced or should there be metadata for operators that can specify optimal and necessary workspace conditions?
A tensor shouldn't own memory since it's only a wrapper around memory managed by the framework. So you won't need an offset.
You would need an offset even if the Tensor doesn't own memory. The Tensor can only be constructed on a slice of a larger Buffer. This is what @naibaf7 is saying we should cover, and I know from Torch's experience that this comes up very often.
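A tiny illustration of why a non-owning tensor still needs an offset (`BufferView` and `RowView` are hypothetical names, not from the proposal): a view over one row of a matrix packed inside a larger, framework-owned buffer.

```cpp
#include <cstddef>

// A non-owning view: the framework owns `base`; the view only records
// where inside that buffer its data starts.
struct BufferView {
  float* base;     // memory owned by the framework, not the view
  size_t offset;   // element offset into the buffer
  size_t shape[1];
  float* ptr() { return base + offset; }
};

// View over row `r` of a row-major matrix with `cols` columns.
BufferView RowView(float* base, size_t cols, size_t r) {
  return BufferView{base, r * cols, {cols}};
}
```

On CPU/CUDA the offset could be folded into the pointer, but for OpenCL's opaque cl_mem handles an explicit offset field is the only option, which is the case @naibaf7 raised.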
@piiswrong I think that concept is also maintained by what @soumith says. Basically the CTensor and eventually CMatrix or CVector structs carry metadata of a tensor/matrix/vector, one of which is a CBuffer struct that specifies what memory from the host framework is being used. CMatrix and CVector could also be used to specify dimension reduced areas of a tensor in order to simplify the use of BLAS libraries alongside the new DNN operators. The operators can then internally combine the information, such as checking memory requirements and passing buffer+offset into the kernel (OpenCL) or compute the unified memory pointer (CUDA, CPU) and execute the compute kernel.
refcount is not needed if the memory container is managed by the framework. Note that you can always add it back and maintain compatibility by sub-classing Tensor. If memory is not managed, maybe Buffer is also not needed (only a Tensor with a possible offset).
@piiswrong offset is useful for things like OpenCL, where the cl_mem object is opaque and you cannot do explicit addressing.
I think we are basically converging on a solution. But with clSVMAlloc, what device id do we set?
@bhack I think this doesn't need to be of concern for the interface. That kind of memory can be associated with multiple CBuffer/CTensor objects of multiple devices, if the host framework knows that this is safe to do.
@naibaf7 OK, but isn't the device id then more of a piece of info oriented to the "executor"?
And what interfaces do you propose for constructors? Which are required and which optional? Besides, do we want to support tensors with dynamic size, or just static?
FYI, in case you want to start to play I've setup this https://godbolt.org/g/YMnxHB
I think the data offset @soumith proposed looks good. Here is one more modification I propose (also incorporating @Yangqing's comment about the data alignment requirement):
typedef struct {
  int device_id;
  int device_type;
} CContext;
typedef struct {
  void* data;      // data pointer (or cl_mem handle in OpenCL), aligned to 256
  CContext ctx;    // device
  int ndim;
  int dtype;       // data type
  size_t offset;   // offset in buffer
  size_t* shape;
  size_t* strides; // can be nullptr, indicating compact version
} CTensor;
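As a sanity check of the offset semantics in this merged struct, here is a hypothetical helper (`RowView` is my name, not part of the proposal) that builds a view over one row of a 2-D tensor while sharing the parent's memory:

```cpp
#include <cstddef>

// The merged struct from the proposal above, reproduced so the sketch
// is self-contained.
typedef struct { int device_id; int device_type; } CContext;
typedef struct {
  void* data;      // base pointer (or cl_mem handle), aligned to 256
  CContext ctx;    // device
  int ndim;
  int dtype;       // data type
  size_t offset;   // element offset into the buffer
  size_t* shape;
  size_t* strides; // nullptr indicates compact layout
} CTensor;

// A compact 1-D view over row `row` of a compact 2-D tensor.
// The view shares `data` with the parent; only metadata changes.
CTensor RowView(const CTensor& t, size_t row, size_t* out_shape) {
  CTensor v = t;
  out_shape[0] = t.shape[1];
  v.ndim = 1;
  v.shape = out_shape;
  v.offset = t.offset + row * t.shape[1];
  return v;
}
```

Note the view carries no refcount: per the discussion above, lifetime of the underlying buffer stays with the framework.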
I propose to merge the buffer into the Tensor structure and remove ref counting, since CTensor is not a container type (one that manages memory).
As a matter of fact, if we really want the data structure to do memory management, Buffer should not be a member of Tensor; instead we would need a Buffer* member (so refcounting on the unique Buffer object can be done correctly). That is a bit of an overkill when we don't want memory management.
Here is what we can do on the framework side to bring memory management back. I use C++ style with shared_ptr for simplicity, allowing thread-safe destruction (one could also use a refcount).
// memory container
class TensorContainer : public CTensor {
 public:
  // This is debatable.
  // Ideally this should be a small_vector<4, size_t>
  // so that small shapes sit on the stack.
  // The dimension also duplicates the ndim field.
  std::vector<size_t> shape_;
  void Reshape(const std::vector<size_t>& shape) {
    shape_ = shape;
    this->shape = shape_.data();
    this->ndim = shape_.size();
    // reset memory
  }
  ~TensorContainer() {
    if (origin_ == nullptr) {
      // owns the memory: release it via the allocator (pseudocode)
      allocator.Free(data);
    }
  }
 private:
  // If this is a view, origin_ points to the original container;
  // otherwise origin_ is nullptr.
  std::shared_ptr<TensorContainer> origin_;
};
// handle type that uses the container
class Tensor {
 public:
  // can be passed to an argument that takes CTensor
  operator CTensor() const {
    return *data_;
  }
  // can be passed to an argument that takes CTensor*
  operator CTensor*() const {
    return data_.get();
  }
 private:
  std::shared_ptr<TensorContainer> data_;
};
@edgarriba I think there are two concepts here. The CTensor serves as a memory handle (like a pointer) and may not have memory management.
Then there is a container type that does memory management, hopefully via C++1x, which wraps the C API. The requirements on dynamic size etc. belong to the container type.
So the op will consume or write data only on the device it finds in the tensor's ctx (device_type/device_id), right?
I think we are not yet very clear on operator specification. In particular, where does the context get set (e.g. cudaSetDevice), and how do we specify context resources (e.g. a cuDNN handle)?
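One possible answer to this open question, sketched purely as an assumption (the `CResources` struct and its fields are hypothetical, not part of any proposal in this thread): pass execution resources explicitly per call, so the library never owns a stream or handle and the framework keeps scheduling control.

```cpp
#include <cstddef>

// Hypothetical per-call resource bundle, framework-owned.
typedef struct {
  void* stream;         // e.g. a cudaStream_t, or nullptr for CPU
  void* vendor_handle;  // e.g. a cublasHandle_t / cudnnHandle_t
  void* workspace;      // framework-provided scratch memory
  size_t workspace_size;
} CResources;

// Every operator would then take the resources as an extra argument,
// e.g. (signature only, as a sketch):
// int Conv2D(CTensor* in, CTensor* filter, CTensor* out, CResources* res);
```

This keeps device selection and handle creation on the framework side, matching the earlier point that the library should not couple to an allocator or scheduler.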
This has been inactive for a few days. I would like to ping back and see if the people involved in this thread are interested in any form of moving forward.
We can start with a neutral repo with the involved project leads as owners (I can do it under dmlc, my own repo, or we can start an org); eventually we can move the repo to a neutral organization.
The first things to be finalized are
I do not have a very good idea of how we should name the project. One possible name could be ctensor (the c indicating common). Please suggest better name candidates you have in mind.
@tqchen It is OK for us (tiny-dnn). I don't know if the naming should be limited to Tensor, since we are already talking about op interfaces.
I think we are not restricted to tensor for the name; I just proposed one because I am kind of running out of ideas for names, so please suggest some.
CTensor sounds good to me
Common DNN API / Core DNN API / Common Kernel API, etc.?
@tqchen I'd be happy to wrap my LibDNN standalone library with the new interface once a reasonable collection of headers is available, to see how well that would work :) And I guess @edgarriba could test out the host-side code in tiny-dnn?
@bhack What do you mean by that? Probably nothing beyond the interface for operators should be shared in this project?
@naibaf7 I think the boundary is the operator API that ingests this common tensor design. I don't think executors/scheduling will be on the table. What is your vision?
See This Link for discussion repo
This discussion started from https://github.com/dmlc/minpy/issues/129 with @soumith. THC is the tensor library that backs Torch. I opened this issue in the MXNet repo so that more developers can see it.
First of all, it is possible to reuse operator libraries between frameworks, for example:
It is always interesting to see interchangeability happen. For example, schedule pytorch operations in mxnet's async engine, or run mxnet's declarative API to directly share data with pytorch's array.
However, there are some engineering obstacles to doing so. I would like to explain what these obstacles are, and hopefully this can motivate the community to move forward and make reuse easier.
Coupled Operator Data Structure Components
An operator can mean many things; here are some basic components of what operators are:
Why does such coupling prevent reuse? There are two reasons:
To resolve this problem, an operator library should provide operators that accept user-managed memory resources. When possible, it should not introduce an allocator or resource management, but instead give hints to the user (CuDNN's workspace requirement eliminates the need for an internal memory allocator).
From this point of view, CuDNN and cuBLAS are good examples. THC is nice, but still encapsulates a memory allocator (which is sometimes needed for dynamic operators).
Lack of Unified Operator Interface
The second obstacle is mainly the lack of a common operator interface. This is a problem of CuDNN and THC that prevents reuse. Take CuDNN for example: each CuDNN API is a C function with its own interface, so to adopt an operator there needs to be one (or multiple) adapting functions per operator.
Consider instead a unified operator interface (the following is a mock design), where each TBlob is a reference to the data fields and shape, and every function is registered to a registry under its name.
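A minimal sketch of what such a mock could look like (the TBlob fields and registry shape here are my assumptions for illustration, not a spec):

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// A reference to data fields and shape; owns nothing.
struct TBlob {
  void* data;
  std::vector<size_t> shape;
  int dtype;
};

// The one unified operator signature.
using Operator = std::function<void(const std::vector<TBlob>& inputs,
                                    std::vector<TBlob>* outputs)>;

// Global name -> operator registry; a framework can enumerate it and
// expose every registered op to its front end through one generic shim.
std::map<std::string, Operator>& OpRegistry() {
  static std::map<std::string, Operator> registry;
  return registry;
}

bool RegisterOp(const std::string& name, Operator op) {
  return OpRegistry().emplace(name, std::move(op)).second;
}
```

Because every op shares one signature, a single adapter function suffices to bind the whole registry into a frontend, instead of one adapter per CuDNN-style C function.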
Then it only takes one function to extract and reuse all the operators and automatically expose them to the front end. In MXNet, this even directly generates the symbolic counterpart from the same imperative operator, if a gradient is provided.
Problem of One Unified Operator Interface
There is always a flip side of the coin. Assume we go with a unified operator interface; as a matter of fact, that is what MXNet, TensorFlow, and Caffe have done. The problem now becomes: what should the interface look like? One trap framework designers always fall into is wanting one interface that rules them all.
Since one interface rules them all, we want to support all possible operators. What about the ones that need runtime memory allocation? Maybe add a memory allocator. What about the ones that are asynchronous? In the end, the interface has to include the memory allocator and scheduling module in some way, and that reintroduces the "Coupled Operator Data Structure Components" problem. The operator interface becomes deeply coupled with the rest of the framework and is no longer reusable.
A Better Solution: A Few Unified Interfaces
Can we get the best of both worlds, having as few data structures and interfaces as possible, while still avoiding coupling to the allocator and scheduler as much as possible? I think the answer is yes, and we need to move away from the ideal of one interface that rules all operators.
I can categorize the operators roughly into three categories:
If we design a general operator interface, the answer will usually look like type 3. However, types 1 and 2 dominate 90%+ of the major operators we use. If we design one operator interface for each type, the problem is solved, and frameworks can pull in and interact with each type in their own way. It is much easier to do things like static memory planning if types 1 and 2 are explicitly introduced. This is one additional layer of wrapping, on top of THC and CuDNN, that is lacking so far.
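A hedged sketch of "a few unified interfaces", one signature per operator category. The concrete type definitions below are my assumptions based on the discussion (the workspace-query style follows the CuDNN example mentioned earlier), not a settled design:

```cpp
#include <cstddef>

struct CTensor;  // the shared tensor struct proposed earlier in the thread

// Type 1 (assumed): pure operators -- outputs depend only on inputs,
// no scratch memory needed. Trivially reusable and easy to plan.
typedef int (*SimpleOp)(CTensor** inputs, int num_in,
                        CTensor** outputs, int num_out);

// Type 2 (assumed): operators that need workspace. The library reports
// how much; the framework allocates it (CuDNN-style), so the library
// never carries an internal allocator.
typedef size_t (*WorkspaceQuery)(CTensor** inputs, int num_in);
typedef int (*WorkspaceOp)(CTensor** inputs, int num_in,
                           CTensor** outputs, int num_out,
                           void* workspace, size_t workspace_size);

// Type 3 (assumed): fully dynamic operators that must call back into
// the framework's allocator -- the rare case that requires coupling.
typedef void* (*AllocFn)(size_t size, void* framework_state);
typedef int (*DynamicOp)(CTensor** inputs, int num_in,
                         CTensor** outputs, int num_out,
                         AllocFn alloc, void* framework_state);
```

Keeping types 1 and 2 separate is what makes static memory planning possible: a framework can precompute all workspace sizes up front and fall back to the allocator-coupled path only for genuine type-3 operators.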
A registry system like NNVM could come in very handy to easily register this information and have it pulled out by the libraries.
The Hope
I have always hoped that there would be a minimal set of operator interface standards in C++ that can be shared across libraries. I think we have a good idea of what the solution looks like. While most systems tend to become opaque and coupled, I think this kind of transparent approach can help the community evolve in a healthy way. That said, it always takes effort to make these things happen: an open discussion on what the interfaces should be, and commitment from framework builders. I would really love to see this happen, and that is why I spent more than an hour writing this.
Unfortunately, most frameworks already have a "good-enough" collection of operators, so a unified operator interface will contribute little to each framework's usability in the short term. Naturally this gets lower priority. That is why commitment is needed to bring this about for the longer-term benefit.