reminisce opened this issue 4 years ago
Hi @reminisce, I have several questions.
To my knowledge, if we use pybind11 or Python's C extension, the interfaces need to be rebuilt to support other versions of Python.
I have an idea: we could build our own custom ctypes system or some other middleware to solve this. We need a unified solution that covers all language bindings. https://docs.microsoft.com/en-us/visualstudio/python/working-with-c-cpp-python-in-visual-studio?view=vs-2019
| Approach | Vintage | Representative user(s) | Pro(s) | Con(s) |
|---|---|---|---|---|
| C/C++ extension modules for CPython | 1991 | Standard Library | Extensive documentation and tutorials. Total control. | Compilation, portability, reference management. High C knowledge. |
| PyBind11 (Recommended for C++) | 2015 | | Lightweight, header-only library for creating Python bindings of existing C++ code. Few dependencies. PyPy compatibility. | Newer, less mature. Heavy use of C++11 features. Short list of supported compilers (Visual Studio is included). |
| Cython (Recommended for C) | 2007 | gevent, kivy | Python-like. Highly mature. High performance. | Compilation, new syntax, new toolchain. |
| Boost.Python | 2002 | | Works with just about every C++ compiler. | Large and complex suite of libraries; contains many workarounds for old compilers. |
| ctypes | 2003 | oscrypto | No compilation, wide availability. | Accessing and mutating C structures cumbersome and error prone. |
| SWIG | 1996 | crfsuite | Generate bindings for many languages at once. | Excessive overhead if Python is the only target. |
| cffi | 2013 | cryptography, pypy | Ease of integration, PyPy compatibility. | Newer, less mature. |
| cppyy | 2017 | | Similar to cffi using C++. | Newer, may have some issues with VS 2017. |
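As a minimal illustration of the ctypes row above (a sketch only, assuming a built libmxnet shared library is on the loader path; `MXGetVersion` is an existing MXNet C API):

```python
import ctypes

# Load the prebuilt shared library at runtime; no binding code needs compiling.
lib = ctypes.CDLL("libmxnet.so")

# int MXGetVersion(int* out) from the MXNet C API.
version = ctypes.c_int()
lib.MXGetVersion(ctypes.byref(version))
print(version.value)
```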
@wkcn that can be done as part of `setup.py` locally on the user's machine, or would you see any problem?
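A minimal sketch of that idea, assuming the cython sources live under a `mxnet/cython/` directory (the paths here are illustrative, not the actual repo layout):

```python
# setup.py (sketch): compile the cython binding on the user's machine at install time.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="mxnet",
    ext_modules=cythonize(["mxnet/cython/*.pyx"]),
)
```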
How much engineering effort does it take to do (1)?
I think we can first replace the code-gen here with pybind11. https://github.com/apache/incubator-mxnet/blob/521c477ad32864d887481abf6c53acae3b717cf6/python/mxnet/ndarray/register.py#L115-L269
Here is another candidate that I highly recommend: adopt TVM's FFI convention.
The historical problem of the MXNet FFI was the ballooning number of C API bindings as we added new features. This creates a huge maintenance burden.
The real problem was not really about which FFI system to adopt (cython and pybind are fine on that front, except for the cost of compilation), but more about the cost of maintaining the FFI. MXNet used to have a fast cython binding, but that was abandoned because, as we kept adding new APIs, we could not keep up with both ctypes and cython.
When developing TVM we learned from that lesson and restricted the API to a limited set of runtime APIs that do not change, with stable cython and ctypes bindings for them. The runtime supports a type-erased function (PackedFunc), which can be called efficiently from any of the frontend languages, and all the APIs are exposed through PackedFuncs. On the python side an additional wrapper is created for better documentation, and it calls into the PackedFunc. See more in https://docs.tvm.ai/dev/runtime.html The system has worked great for a few years now.
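For readers less familiar with the convention, here is a minimal sketch of what calling a registered PackedFunc from Python looks like (the function name below is hypothetical; `tvm.get_global_func` is TVM's lookup entry point):

```python
import tvm

# Every backend API is registered under a string name and shares the same
# type-erased calling convention (PackedFunc).
f = tvm.get_global_func("runtime.example_api")  # hypothetical registered name
# The FFI layer converts the arguments; the call site is uniform for all APIs.
out = f(1, 2.0, "hello")
```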
Of course I understand there have been legacy issues in MXNet; that is why I did not bring this proposal up before. But given that this is a proposal for 2.0, I would encourage everyone to give serious thought to this possibility.
@tqchen thanks for sharing this! Is there any reference benchmark result that stress tests the ffi overhead?
I don't have any benchmarks at hand, but it would be great if someone could help create one with cython enabled.
OK, here is a script to quickly check. Note that it is important to compile TVM with Cython by typing `make cython3` in the root folder. TVM uses cython by default if it is available, but we use the TVM_FFI env variable to make sure that is the case.
```python
import timeit

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop(x, y)')
timer.timeit(1)
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat)
```
Results on my laptop (MacBook Pro 13-inch):

```
$ TVM_FFI=cython python3 test.py
9.749500000000299e-08
$ TVM_FFI=ctypes python3 test.py
7.226789000000011e-06
```
@tqchen can you also copy the results you posted in #14883 to here? This shows convincing benchmark results that tvm ffi is comparable with numpy.
I think interested folks can follow the last post in https://github.com/apache/incubator-mxnet/issues/14883#issuecomment-567726765. The general takeaway seems to be that the overhead is quite close to numpy's.
@tqchen Thanks for sharing the benchmark results. We did consider using TVM FFI as a candidate, and I strongly agree with your suggestion on keeping a limited set of runtime APIs for sustainable maintainability. The Python op API overhead is generally more visible when passing Python native data structures to the backend, such as lists, tuples, etc. I modified your script by passing a Python tuple as an argument, and the overhead is around 19us with Cython enabled, while pybind is normally less than 400ns, and numpy comes with even much lower overhead. Did I make any mistakes in coding up the test script, or is it something that can be addressed in TVM FFI? Thanks.
```python
import timeit

setup = """
import tvm
#x = tvm.nd.array([0])
#y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop((1, 2, 3, 4))')
timer.timeit(1)
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat)
```
First of all, it would really be great if the issue of native data structures had been brought up in the RFC in the first place :)

Every design has a tradeoff, and most of TVM FFI's design started as an artifact of Amdahl's law. In particular, it is interesting to ask how much native python data structure support we need. For most of the fast-path (e.g. add/sub) and common operators, the problem does not involve a tuple as an argument. As a layman example, I don't care if my save-to-json function takes 100us to finish (because that is not going to be the bottleneck of my program).

When tuples and strings are involved, the typical operators (such as conv2d) are usually large. Moreover, when such operators are constructed in a two-phase manner, or passed through a partially specialized annotator that translates the program, the constant tuple is only constructed once rather than in a loop (just like the following program).
```python
class Module:
    def __init__(self):
        # constructed once
        self.conv2d = conv2d(kernel=(2, 2))

    def forward(self, x):
        # run multiple times
        return self.conv2d(x)

# hybridize can transform the code below into the above form.
@hybridize
def myfunc(x):
    y = conv2d(x, kernel=(2, 2))
    return y
```
Of course, if we really say "hey, we want the int tuple case to be fast", just like the case of copy in #14883, we can introduce native tuple objects into the TVM FFI to improve the performance of passing tuples as arguments.
The general design philosophy is that we only need to make the things that need to be fast, fast. The FFI can handle objects that it recognizes efficiently, while having a reasonably slow fallback for cases that are not necessarily the bottleneck.
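A minimal, self-contained sketch of that philosophy in plain Python (all helper names here are made up for illustration, not part of TVM or MXNet):

```python
def _fast_call(func, args):
    # Cheap path: types the FFI recognizes natively need no extra marshaling.
    return func(*args)

def _fallback_call(func, args):
    # Slow fallback: convert unrecognized containers first; acceptable for
    # functions that are not on the hot path (e.g. save-to-json style APIs).
    converted = [tuple(a) if isinstance(a, list) else a for a in args]
    return func(*converted)

def call_op(func, *args):
    if all(isinstance(a, (int, float, str)) for a in args):
        return _fast_call(func, args)
    return _fallback_call(func, args)
```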
As I said at the beginning of the post, the spectrum of "fast arguments" should really be discussed in the RFC in the first place. It would be great if we could have a more constructive discussion about these use cases, and then add technical reasoning, before reaching a verdict.
I think we need to support `tuple` if we are going to adopt the PackedFunc approach. It is used in some functions like `split`, `hsplit`, `reshape`, `transpose`, `a[(1, 3)]`.
@tqchen Thanks for explaining things inside out. Please know that I'm not against the TVM FFI design. In fact, it's great to know that you think passing Python native data structures can be accelerated through engineering in TVM FFI. This is vital for keeping the future MXNet runtime API limited to a small set for scalable and sustainable maintainability.
Putting the design decision aside, I want to share that there has been extremely strong motivation and need for squeezing out the latency of passing Python native data structures in the op interface. Since MXNet is embracing NumPy compatibility, we want to get the op invocation performance on par with NumPy to be appealing to the classic machine learning community. We have compared the performance between MXNet and NumPy using a bunch of classic machine learning models and found that optimizing the passing of Python native data structures is critical for MXNet to be on par with NumPy. Even for deep learning itself, this is also important in some applications. In https://github.com/apache/incubator-mxnet/pull/16716, `reset_arrays` was introduced to avoid passing tuple objects too many times in `mx.nd.zeros`, which increased training performance by 5%.
With that being said, please allow me to summarize what we have reached here. I think we are aligned on exploring TVM FFI to get a clear engineering view of accelerating the passing of Python native data structures as arguments. We can start with tuples, and extend the findings to lists and strings later.
Thanks everyone for a great discussion, and sorry for the late responses since I have been on vacation this week. I didn't expect this PoC task item to become a full-fledged RFC involving so many interested folks. I will be sure to make the post more descriptive and self-explanatory next time. Have a nice holiday! :)
To follow up on this thread about bringing native support to the tvm ffi, I will first discuss ways to address tuples, and then discuss some of the pros and cons.

First of all, we all know that at some point we are going to translate the python data structures into C++. The main question is where that translation can happen. In the pybind case, the translation happens on the c++ side by directly passing the pyobject to c++. In the case of cython, the translation happens at the C API level. In the case of TVM FFI, the translation can happen in a python wrapper.
The following code gives two examples of such translation (myempty0 and myempty1); the support for `myempty1` requires one additional commit in this branch. I also added a str argument, as str support was mentioned.

In the first approach (myempty0), we directly unpack the tuple as positional arguments and encode the data structure as flattened arguments. In the second approach (myempty1), we first create an IntTuple object that the C++ side can recognize and then pass it to the C++ side. Note that at the moment all the operations are done through python; if there is a concern in terms of wrapping, we can certainly bring some of them into cython.
We could introduce a third approach (myempty2), which directly passes a `PyObject*` to the c++ side; the developer can then use additional packages (e.g. pybind or related) to process the tuple. This third approach would achieve the same perf as pybind. However, there are some trade-offs; see the discussion section.
```python
import timeit
import tvm

nop = tvm._api_internal._nop

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop((1,2,1))')
timer.timeit(1)
num_repeat = 1000
print("tvm.nowrap:", timer.timeit(num_repeat) / num_repeat)

setup = """
import numpy as np
"""
timer = timeit.Timer(setup=setup,
                     stmt='np.empty((1,2,1))')
timer.timeit(1)
print("numpy.empty:", timer.timeit(num_repeat) / num_repeat)

def myempty0(shape):
    return nop(*shape)

def myempty1(shape):
    return nop(tvm.container.IntTuple(*shape))

setup = """
import numpy as np
import tvm
from __main__ import myempty0, myempty1
"""
timer = timeit.Timer(setup=setup,
                     stmt='myempty0((1,2,1))')
timer.timeit(1)
print("tvm.myempty0:", timer.timeit(num_repeat) / num_repeat)

timer = timeit.Timer(setup=setup,
                     stmt='myempty1((1,2,1))')
timer.timeit(1)
print("tvm.myempty1:", timer.timeit(num_repeat) / num_repeat)

setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop("mystr")')
timer.timeit(1)
num_repeat = 1000
print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)
```
Here are results on my computer:
```
$ TVM_FFI=cython python test.py
tvm.nowrap: 1.3379121999999994e-05
numpy.empty: 2.849200000000218e-07
tvm.myempty0: 1.7188400000001102e-07
tvm.myempty1: 9.54876000000021e-07
tvm.str_arg: 2.1269199999998656e-07

$ TVM_FFI=ctypes python test.py
tvm.nowrap: 5.141489199999999e-05
numpy.empty: 3.3954099999999874e-07
tvm.myempty0: 7.372958999999985e-06
tvm.myempty1: 1.3045717999999983e-05
tvm.str_arg: 5.962782e-06
```
As we can see, `myempty0` was quite fast. `myempty1` was a bit slower due to the creation of the object (but still within an order of magnitude that we can use). Note that in the case of async exec and gradient taping, we will need to create objects to book-keep the arguments, and myempty1 may not be a bad approach as the data structure is already created at the FFI level.
As explained in the beginning, the real question is where the wrapping should happen.

In the case of TVM, the wrapping usually happens at the native language (python) level, because we know there is a need for a python-side wrapper for better code, type checking and docs. The translation forces the python arguments into arguments that can be recognized by the runtime.

In the case of pybind, the translation happens at the C++ level, by calling into the python C API (the myempty2 approach is similar to this one).

The advantage of exposing PyObject and related operations at the c++ level is certainly the deferred marshaling of data structures. On the other hand, such an approach directly ties the FFI to python. It means other language frontends can no longer benefit from the new FFI. In a similar direction, if we want to package some of the functions into a minimum runtime that is independent of python, we can no longer do that. This is why, while in theory we could bring PyObject (or a related proxy) into the TVM runtime, we have not done so so far.
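As a sketch of the TVM-style placement (translation in a python-level wrapper), with stand-in helpers since the real converter and backend entry point are not shown here:

```python
import numpy as np

def convert_to_runtime_object(x):
    # Stand-in for the converter that builds a runtime-recognized object.
    return x

def _backend_empty(shape_obj, dtype):
    # Stand-in for the registered, type-erased backend function.
    return np.empty(shape_obj, dtype=dtype)

def empty(shape, dtype="float32"):
    """User-facing wrapper: documentation and type checking live on the python side."""
    if not isinstance(shape, (tuple, list)):
        shape = (shape,)
    # Translate python arguments into runtime-recognized forms before the
    # type-erased call; this is the "translation at the python level" above.
    return _backend_empty(convert_to_runtime_object(tuple(shape)), str(dtype))
```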
Of course this is an interesting tradeoff, and everyone is welcome to discuss their thoughts. Here is a summary of points that can be discussed:
Some additional thoughts along the line:
@reminisce Do you have data on the time split between the FFI itself and engine scheduling inside the MXNet backend?
@tqchen For the "fast path" structures that we need to support, I'm considering the following:
```python
import mxnet.numpy as mnp
import numpy as np

a = mnp.array(np.array(1))
b = a + np.array(1)
```
Also, there are two scenarios when list will be involved:
@sxjscience here are some quick thoughts (of course passing pyobject kind of "solves" the problem, so I am discussing the wrapping that can be done through the tvm ffi).
`concat` -> `concat_internal(*args)`
The py_slice is the trickiest case; my guess is that it could be accelerated through a cython layer that translates the slice into a flattened representation. Of course it is not too ideal, and pybind may be better for this case if we only want to handle it through c++.
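A rough sketch of what such a flattening layer could look like in plain Python (the marker codes are made up for illustration; a real implementation would live in cython):

```python
# Marker codes for the flattened index encoding (illustrative only).
KIND_INT, KIND_SLICE, KIND_ELLIPSIS, KIND_NONE = 0, 1, 2, 3

def flatten_index(index):
    """Translate a python indexing tuple into a flat list of ints."""
    items = index if isinstance(index, tuple) else (index,)
    flat = []
    for item in items:
        if isinstance(item, slice):
            start = -1 if item.start is None else item.start
            stop = -1 if item.stop is None else item.stop
            step = 1 if item.step is None else item.step
            flat += [KIND_SLICE, start, stop, step]
        elif item is Ellipsis:
            flat.append(KIND_ELLIPSIS)
        elif item is None:
            flat.append(KIND_NONE)
        else:
            flat += [KIND_INT, int(item)]
    return flat

# Example: a[None, ..., 0:100:2] -> [3, 2, 1, 0, 100, 2]
print(flatten_index((None, Ellipsis, slice(0, 100, 2))))
```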
@ptrendx I benchmarked `mx.nd.zeros((1,), out=out)` in the async engine. The total time including FFI and engine scheduling is 40us, and FFI itself takes 21us.

Ok, so it is about a 50:50 split. Is there also work underway to profile where the time is spent in the engine?
@ptrendx Yes, there is an effort to profile the engine code flow using VTune. We hope the exercise can pinpoint the hotspots that contribute most of the latency. The further time split within the pure C++ part, between setup code (shape/type inference, memory allocation, dependency setup) and op scheduling, is also around 50% vs. 50%.
For the "fast path" data structures, I'm summarizing the items as follows (including the ones suggested by @sxjscience):
tuple
and list
since they can be interchangeable in NumPy semantics to represent shapes and axes.str
because einsum has this parameter and the op can be intensively used in transformer models.py_slice
, Ellipsis
, None
for basic indexing. We can do one step further by moving the whole indexing dispatch logic to backend.mx.context.Context
. One call of mx.cpu()
can be as large as 600ns using ctypes. One thought is do it in the pybind way by creating a Python binding for the backend Context
class.np.dtype
. Similar to Context
Thanks for the discussions so far. To clarify the technical questions and discuss the tradeoffs further:

The following fast paths can be addressed in the TVM FFI:
- `tuple`, `list`: via translation on the python/cython side (see benchmark above)
- `str`: already fast (see benchmark above)
- `Context`: can be quite fast if the object is a tvm object, around the same magnitude as passing an NDArray
- `np.dtype`: can be worked around by str conversion, or by introducing a type structure; TVM FFI supports DLDataType natively

The following items need to be discussed:
Of course, all of the above cases can be solved by pybind, or a mix of pybind and TVM FFI. It would certainly be interesting to discuss the possible engineering path.
The main technical trade-off points that influence the decision are as follows:
After some thought along this direction, I found a better and fun answer to the above question for supporting tuple/ellipsis/slice in the tvm ffi effectively.

I hacked up a POC in https://github.com/tqchen/tvm/tree/poc-pyffi (latest commit) that supports the following benchmark script (disclaimer: it is only a POC, so it is not intended for use and not fully optimized, but it demonstrates all the technical flows necessary to make a fully functioning FFI).
```python
import timeit
import tvm

nop = tvm._api_internal._nop

setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop((None,..., slice(0, 100, 2)))')
timer.timeit(1)
num_repeat = 1000
print("tvm.tuple_slice_ellipsis_combo:", timer.timeit(num_repeat) / num_repeat)

setup = """
import numpy as np
"""
timer = timeit.Timer(setup=setup,
                     stmt='np.empty((1,2,1))')
timer.timeit(1)
print("numpy.empty:", timer.timeit(num_repeat) / num_repeat)

setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop("mystr")')
timer.timeit(1)
num_repeat = 1000
print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)
```
On my laptop (MacBook 13-inch), the results are as follows:

```
$ TVM_FFI=cython python benchmark_ffi.py
tvm.tuple_slice_ellipsis_combo: 4.615739999999924e-07
numpy.empty: 2.7016599999998834e-07
tvm.str_arg: 2.3390799999997714e-07
```
In the POC, we introduced specific objects for Ellipsis and Slice, plus Tuple (already supported via ADT). During a PackedFunc call, a python tuple/ellipsis/slice is converted into the object that is supported by the backend. We implemented a cython version (the previous recursive conversion was in python) to back it up.
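For illustration, the python-side recursive conversion mentioned above could look roughly like this (the backend object constructors here are hypothetical stand-ins for the Ellipsis/Slice/Tuple objects introduced in the POC branch):

```python
# Illustrative stand-ins for the backend objects (not real TVM APIs).
BACKEND_ELLIPSIS = object()

def make_backend_tuple(items):
    return tuple(items)                   # stand-in for the ADT/Tuple object

def make_backend_slice(start, stop, step):
    return ("slice", start, stop, step)   # stand-in for the Slice object

def convert_arg(value):
    """Recursively deep-copy python containers into backend-recognized objects."""
    if isinstance(value, (tuple, list)):
        return make_backend_tuple(convert_arg(v) for v in value)
    if isinstance(value, slice):
        return make_backend_slice(value.start, value.stop, value.step)
    if value is Ellipsis:
        return BACKEND_ELLIPSIS
    return value  # ints, floats, str, NDArray already have FFI type codes
```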
The reason we are able to create Objects on the cython side is that all TVM objects have recently been converted to be POD-C compatible, so an object can be created on the cython side without crossing the DLL boundary and then passed to the c++ backend.

We can see from the benchmark that the cost of such a deep copy is at a reasonable level. We also only used the default memory allocator, so there could be room for further improvements.

Please also see the tradeoff discussions in the last post. As we can see, the main difference here is where to do the conversion, and whether we do a lazy or deep copy:

The laziness certainly avoids a copy in cases where we do not necessarily need to book-keep the created argument. On the other hand, supporting a common data structure on the c++ side means the binding can potentially be reused by other language frontends.
Thank you @tqchen for sharing the PoC code within such a short timeframe. :) The numbers look promising even with Python native objects deeply copied. Pybind performs deep copy by default unless the receiving object on the C++ end is marked as opaque so that the Python object is passed by reference. That is often used for propagating large object changes from C++ to Python. In our op invocation use cases, there has been no urgency to introduce this level of complexity so far, since the Python objects are small and parameter passing is a one-way trip. The 300ns overhead should give us a good start to squeeze the total overhead into the 2us range. If there is really a need to pass `PyObject`s in the future, we can always add that with a compile flag option. I think it's worth following this branch to integrate TVM FFI with the MXNet op invocation flow to get more comprehensive benchmark results.
Following this branch, I made a simple POC on the MXNet side (code here). It turns out that passing a python `tuple` and receiving it as `TShape` takes around 200ns.
@hzfan thanks for implementing a poc :) However, there is a subtle but important difference which is worth discussing here :) I will use cython-ffi to refer to the above poc, and tvm-ffi to refer to tvm's poc.

The difference again boils down to the design point of what constitutes a clean cut for the FFI convention. Ideally, it would be: a stable set of C APIs and object structures that do not change over time.
What's the point of having an API if you type-erase it? Then you might as well have a single-function API with a type-erased callback name to select the function to call. In the end you move the burden away from the API and onto the callers, and from inside the API onto the dispatchers. Instead of going down this route of uber-clever template tricks to generate code, I think it's better to put proper code generation in place for maintainability. Could you provide a bit more detail about the tradeoffs? Everything has tradeoffs; I don't believe in any solution that is sold as a panacea, there's no silver bullet.
Pybind is nice; I used Boost Python many years ago, which I think it is based on. The problem with this is the hourglass C bindings: you have to go from Python to C++ / Pybind, down to C, and then to the engine, which seems like a lot of boilerplate.
@larroy indeed every solution has trade-offs, and these tradeoffs are discussed in the above posts when we compare solutions, and they are backed by benchmarks :) It would be great if you could also suggest potential tradeoffs here.

When you expose an API from a typed language (c++) to a dynamic language (python), you have to type-erase it, given that the python functions don't have the type, and you have to pass the information along.

The only difference is where you do the type checking (that the python type corresponds to the right c++ type) and the translation (translating to the c++ type).

For example, in the case of pybind, the erasure is done implicitly when you call the python function; the checking and translation happen when you call into the c++ function.

In the case of creating a C API for each feature and wrapping things on the python side, the type checking is done on the python side, and the translation as well.

In the case of tvm ffi, the type translation is done on the python/cython side, while the type checking is done in the c++.
To dive deeper into the tradeoffs of the PackedFunc calling convention: the convention erases the type by storing a type code alongside each argument. This brings the additional cost of passing arguments on the heap, as opposed to in registers. So it might not be suited for inline functions that need to run on the order of 1e-9s; however, for API functions that need to run at around the 1e-7s or even 1e-8s level, this convention is pretty good.
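A conceptual sketch of that convention in plain Python (the numeric codes are made up; the real implementation stores C unions plus a type-code array):

```python
# Made-up type codes for illustration.
CODE_INT, CODE_FLOAT, CODE_STR, CODE_HANDLE = 0, 1, 2, 3

def pack_args(*args):
    """Erase types by pairing every argument with a type code."""
    values, codes = [], []
    for a in args:
        if isinstance(a, int):
            values.append(a); codes.append(CODE_INT)
        elif isinstance(a, float):
            values.append(a); codes.append(CODE_FLOAT)
        elif isinstance(a, str):
            values.append(a); codes.append(CODE_STR)
        else:
            values.append(a); codes.append(CODE_HANDLE)  # opaque object handle
    return values, codes

# The callee checks each code in O(1) and casts the value accordingly.
print(pack_args(1, 2.0, "conv2d"))
```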
In terms of the calling cost, it really depends on whether the caller and callee are strongly typed.
- If the caller is strongly typed, then assigning the type code is O(1).
- If the caller is a dynamic type (like python), then we need a dispatcher to select the right type code.
- If the callee is strongly typed, then the cost of checking is O(1): just check that the code is the correct one.
- If the callee is a dynamic type, then dispatching needs to happen, which involves another level of hashtable lookup, O(1).

As we can see, the only place where dispatching is necessary is the dynamic type handling case. Even in these cases, if there is a strong need for specialization, we can directly force the type by running checking on the caller and passing in the right type code (the engineering burden is the same as wrapping the C API). However, the benchmark suggests that the dynamic dispatching cost is reasonable and satisfies the API speed requirement.
Coming back to the tradeoff, the main tradeoff here is the engineering burden of keeping an hourglass design (with a fixed set of APIs) vs efficiency. While my post did not suggest that TVM's ffi is a silver bullet, it does work pretty well for our use cases. Hope it helps.
Thanks for the explanation. I'm not so concerned about the complexity of dispatching. If I understood you correctly, the main benefit you describe for the TVM project is not having to change the C API, but you still need to do type checking on both ends, or at least on the receiving end of the API, correct? I think we have discussed similar things in the past, and we might have different views on strongly typed vs dynamically typed. A priori I prefer to see an API which can be evolved and changed; I find it more explicit and clearer than what I think you do with PackedFunc, which I have looked at briefly but not used extensively. If one is going to call into the C API using pybind, does it make sense to layer a C++ API on top of the C API for this?
Also these microbenchmarks are nice, but we also need to consider the overhead in typical workloads and see if it's still significant.
CFFI is another alternative.
I couldn't access your pointers like:
https://github.com/tqchen/tvm/tree/pyffi
LOL, the last one was my comment, not @szha :-D
Re: the need for explicit type checking code in the TVM FFI.

Actually there is no explicit code for type checking, as it is generated automatically via template expansion (on the receiving end). We also have a "strongly typed" signature that wraps the packed function interface, which gives you compile-time type checking: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/packed_func.h#L191

On the dynamic language side (python), the exposed function is still type-erased (as python is a dynamic language).

Note that the dynamic vs. statically typed language view does not really apply to this case, because the main goal (exposing to python) implies type erasure (as python is dynamically typed). The main question would be how to reduce the number of abstraction layers.
> Also these microbenchmarks are nice, but we also need to consider the overhead in typical workloads and see if it's still significant.

If we apply this reasoning, most API cost is going to be FFI cost + exec cost, and I think the conclusion so far is that we want the FFI cost to be around 1e-7s to 1e-6s, which is the limit in any case.
I created a follow-up design proposal on cwiki. TVM FFI works well with MXNet and the overhead for `np.zeros` gets greatly reduced. Any feedback is appreciated.
Thanks @hzfan. I would also highly recommend taking a close look at TVM's object protocol, and trying to push most things through the Object eventually (creating temporary support for legacy cases like TShape is fine, but eventually pushing things through as objects will give greater uniformity, and brings benefits such as being able to put everything into a container).
@tqchen Thanks for sharing this. I don't know if I understand correctly. For now, arguments other than primitives pass through the FFI via Object (like ADTObj). They are then converted to TShape in the backend, and TShape is not involved in the FFI directly.
As you said, Object allows me to conveniently put various kinds of things into a container (ADTObj), without losing their types. For example, now a tuple of tuples like ((2, 2), (2, 2)) is allowed.
Also sorry for the late reply. I have been on a vacation this week :)
MXNet imperative operator invocation overhead is as large as 30-60us, which is significant compared to the official NumPy operators with ~600ns overhead. This has negatively impacted the performance of applying MXNet to the models where many operators' kernel runtime duration is short, especially in the area of classic machine learning. We plan to address the problem in two steps:
1. Short term: Use pybind11 to replace the Python op API and ctypes/c api. Preliminary experiments show that the pure Python-C++ turnaround time using Pybind is between 400-600ns, while the current Python op API using ctypes/c api costs more than 10us. We believe that with the correct implementation, we can reduce the op invocation overhead to 2us, including the time on FFI and engine.
2. Long term: Adopt Python's C extension interface. NumPy did this by developing its own C API. This provides considerably less overhead compared to other solutions. However, it would cost much more engineering effort to integrate this with our existing operator workflow in C++.
@hzfan @hgt312