Closed tqchen closed 5 years ago
One thing that would be useful (that I think is orthogonal to JIT vs AOT) is considering relaxing the requirement on fully-specified shapes in graph_runtime. That is, we'd like to allow variable dimensions (consider supporting e.g. variable batch size) and allocate (at least some) memory dynamically.
@ajtulloch can you give some specific application examples? For example, dynamic batch size in image modeling, dynamic length in language models, etc. We do support certain forms of type inference with dynamic variables (the current type inference supports a symbolic integer in the batch dimension, so we can know the shape of a tensor is (n, 100, 100)), and likely that would help in certain cases.
@tqchen yes, a simple case would be supporting the use of a graph in graph_runtime
without requiring fully static shapes. For example, say we have a ResNet trained on the ImageNet-1k classes; we'd like to be able to pass an (N, 3, 224, 224) tensor and receive an (N, 1000) tensor, where N can change between calls. This shouldn't require a JIT; we can just partially shape-specialize when compiling.
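To make "partially shape-specialize" concrete, here is a minimal, hypothetical sketch in plain Python (not the actual TVM API): every dimension is fixed at compile time except the symbolic batch axis, which is bound and checked per invocation.

```python
# Hypothetical sketch of partial shape specialization: static dims are
# checked exactly, while string-named dims (e.g. "N") stay symbolic and
# are bound from the actual input shape at call time.

def specialize(input_sig, output_sig):
    """Return a runner that checks inputs against a partially-static signature."""
    def run(graph_fn, tensor_shape):
        if len(tensor_shape) != len(input_sig):
            raise ValueError("rank mismatch")
        bindings = {}
        for actual, expected in zip(tensor_shape, input_sig):
            if isinstance(expected, str):            # symbolic dim, e.g. "N"
                bindings.setdefault(expected, actual)
                if bindings[expected] != actual:
                    raise ValueError("inconsistent symbolic dim")
            elif actual != expected:                 # static dim must match exactly
                raise ValueError("static dim mismatch")
        # Output shape is derived by substituting the bound symbols.
        out_shape = tuple(bindings[d] if isinstance(d, str) else d
                          for d in output_sig)
        return graph_fn(tensor_shape), out_shape
    return run

# A ResNet-like signature: (N, 3, 224, 224) -> (N, 1000), N free per call.
runner = specialize(("N", 3, 224, 224), ("N", 1000))
```

The compiled kernels would then only need to be generic over N, not over every dimension.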
I agree with @ajtulloch.
More dynamism is necessary for future applications. We should list a few tasks for each field, including CV/NLP/RL/ASR/TTS, etc., and check them one by one to see what kinds of features will be necessary and what the trends are. In this phase, domain experts' input will be very helpful.
For embedded applications it is often important not to use dynamic memory. For example, some tiny systems may lack a memory manager entirely (stubs in place of malloc/free in libc). In my opinion, the user should always have the option to generate static-memory code, or code whose upper bound on memory usage is well-defined.
@ajtulloch Is there a possibility to support varying shapes but still use static memory allocations? For example, could we generate static memory code, which would allow users to vary one dimension of one input tensor?
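One way such a scheme could work, sketched here with hypothetical names and a made-up upper bound: the compiler fixes a maximum for the varying dimension, buffers are sized once for that bound, and each invocation uses only a prefix, so no allocation happens at runtime.

```python
# Sketch (hypothetical): static allocation sized for a compile-time upper
# bound on the batch dimension; each call uses only a prefix of the
# buffer, so no malloc/free is needed at runtime.

N_MAX = 32                       # upper bound fixed when the model is compiled
FEATURES = 1000

# "Static" arena: allocated once, never resized (stand-in for a .bss buffer).
output_arena = bytearray(N_MAX * FEATURES * 4)   # 4 bytes per float32

def infer(batch_size):
    if batch_size > N_MAX:
        raise ValueError("batch exceeds the static upper bound")
    # A real runtime would run kernels writing into the arena prefix;
    # here we just return the view the kernels would use.
    return memoryview(output_arena)[: batch_size * FEATURES * 4]
```

The memory bound stays well-defined (N_MAX × per-sample footprint), which seems compatible with the no-malloc embedded scenario above.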
@grwlf I don't envision this would require dynamic memory allocations, more that it would make it possible. If you know all shapes statically then of course you can just statically allocate. This is more enabling new cases where you don't know all dimensions for all tensors statically.
FWIW here are some thoughts on possible usages for the NNVMv2 runtime:
1) Simplest possible static interpreter: ideally all control logic would be compiled entirely into the model object code, and the 'interpreter' can simply statically allocate the requested memory (i.e. from the StoragePlan), then just dlopen/dlsym the code and jump directly to the entry point. This would allow an incredibly minimal interpreter (O(10kb) of object code, since it's basically just doing malloc + dlopen + dlsym), which is very useful for certain code-size-constrained CV applications in embedded systems.
2) More complex interpreter with support for dynamic allocation. You can imagine this as something used underneath the existing APIs of the various platforms' {mxnet,c2,tf}::Predictor
objects (e.g. https://github.com/apache/incubator-mxnet/blob/master/scala-package/infer/src/main/scala/org/apache/mxnet/infer/Predictor.scala, https://github.com/apache/incubator-mxnet/blob/master/amalgamation/python/mxnet_predict.py, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/predictor, https://github.com/pytorch/pytorch/blob/master/caffe2/predictor/predictor.h, etc). This would probably need to support dynamic memory allocation (as e.g. input dimensions like batch sizes may be different from invocation to invocation), possibly JIT-specializing to different shapes at runtime, etc. Another possible usage model could be as a backend for e.g. https://github.com/pytorch/pytorch/tree/master/torch/csrc/jit/fuser which would introduce some more constraints.
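The "simplest possible static interpreter" in usage 1) can be sketched in a few lines of Python via ctypes, whose CDLL and attribute lookup wrap dlopen/dlsym underneath. The model library path and entry-symbol name below are hypothetical; for demonstration, a symbol is resolved from the process's own loaded C library instead.

```python
import ctypes

# Sketch of the minimal static interpreter: open a compiled model
# library, resolve its entry symbol, and jump to it.
def load_entry(lib_path, symbol):
    lib = ctypes.CDLL(lib_path)      # dlopen
    return getattr(lib, symbol)      # dlsym

# A real deployment would do something like (names hypothetical):
#   entry = load_entry("./model.so", "model_main")
#   entry(input_buf, output_buf)
# For demonstration, resolve a symbol already loaded into the process:
entry = load_entry(None, "abs")      # CDLL(None): the process's own symbols
```

Everything beyond this (buffers from the StoragePlan, the jump itself) is plain pointer plumbing, which is why the interpreter can stay so small.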
Is training in the scope of discussion?
Another thing that could be useful in the runtime is support for graph partitioning, in case certain operators are not supported by accelerators or the resources at runtime don't permit them.
I would list the three most in-demand features of our runtime: 1) handling inputs with dynamic shapes, 2) handling execution on multiple compute resources (CPU, GPU, DSP, NPU, etc.), and 3) enabling operator-level parallelism.
I think having multiple runtimes makes sense, as the scenarios of server and edge are vastly different. The server side is relatively flexible due to its sufficient compute resources and weaker power-consumption constraints. For example, we can use a JIT to handle dynamic shapes, or execute sample runs to determine resource availability beforehand.
The critical part is how to design the runtime within the constraints of edge devices. I think the minimal static interpreter @ajtulloch suggested makes sense. However, an edge device may not have much space to store many pre-compiled object files.
Several concerns:
1) Minimizing data movement through laziness. In a word: just dumping data anywhere in global memory is fine; do not waste time concatenating, reordering, or moving data around before it is actually needed in real computation. For example, in concat, can we avoid the memory allocation, kernel launching, etc. that just move data somewhere, even though there is no actual computation? We could maintain discontiguous memory chunks lazily, and only index into them when real computation happens. Although the memory access cost inside that kernel possibly increases, it does eliminate memory accesses overall. I guess this could be implemented as an extra pass.
2) Dynamic batching. If 1) is enabled, dynamic batching could be realized very easily. I do believe dynamic batching would be a nice feature, so that NLP folks could focus on the model itself rather than on batching data, while getting performance gains automatically. I think we have some use cases in industry as well @yidawang .
3) Fine-grained streaming. It is possible to stream data on a CUDA GPU, but it is kind of complicated. We do hear about use cases in heterogeneous FPGA-CUDA acceleration where streaming row by row from the GPU to the FPGA helps a lot in reducing latency to nearly zero.
4) The ability to get rid of a runtime, i.e. pure AOT. Having a heavy runtime repeats the mistakes of the previous generation of deep learning frameworks. What's more, any extra dependency means extra pain on edge devices. I think it would be great to just dump a binary that lets us call a subset of the computation (one containing no control flow, only dataflow), and let users themselves assemble these binaries when branches, loops, etc. are involved.
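The lazy-concat idea in 1) can be illustrated with a small, hypothetical sketch in plain Python: keep the input chunks discontiguous, translate indices on access, and only pay the copy cost when a kernel genuinely needs contiguous memory.

```python
import bisect

class LazyConcat:
    """Concat without copying: keep the input chunks and translate indices
    lazily; only materialize a contiguous buffer when really required."""

    def __init__(self, chunks):
        self.chunks = chunks
        # Prefix sums of chunk lengths, used to locate an index's chunk.
        self.offsets = []
        total = 0
        for c in chunks:
            total += len(c)
            self.offsets.append(total)

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, i):
        k = bisect.bisect_right(self.offsets, i)   # which chunk holds i
        prev = self.offsets[k - 1] if k else 0
        return self.chunks[k][i - prev]

    def materialize(self):
        # Pay the copy cost only when contiguous memory is truly needed.
        return [x for c in self.chunks for x in c]
```

The per-element indexing is slightly more expensive, but the concat itself launches no kernel and moves no data, matching the laziness argument above.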
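The dynamic batching in 2) could be sketched like this (a toy model, with hypothetical names): requests arrive one by one, the runtime groups those with compatible shapes, and each group is served by a single batched kernel launch.

```python
from collections import defaultdict

# Toy sketch of dynamic batching: requests arrive individually, and the
# runtime groups compatible shapes so one launch serves many samples.
def dynamic_batch(requests, batched_kernel):
    """requests: list of (shape, payload); kernel runs once per shape group."""
    groups = defaultdict(list)
    for shape, payload in requests:
        groups[shape].append(payload)
    results = {}
    for shape, payloads in groups.items():
        # One kernel launch per group instead of one per request.
        results[shape] = batched_kernel(shape, payloads)
    return results
```

With lazy concat from 1), forming each group needs no data movement at all until the batched kernel actually runs.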
I agree with @szha that we should set a scope for the discussion. In this thread, should we talk about future integration into upstream deep learning frameworks (e.g. MXNet)? Should we talk about inference vs. training? Should we consider distributed execution, and should we consider thread safety? These are all concerns at different scopes.
To limit the scope of the discussion, let us first focus on the low-resource, pure AOT scenario. Multiple-target execution is already supported in the current graph runtime as of #1695; we only need to build compiler support for it. My guess is that JIT and training are each their own beast and deserve another thread.
Is training in the scope of discussion?
@szha We would be glad to continue discussion related to training in another thread: https://github.com/dmlc/tvm/issues/1996 (please cc @sgrechanik-h)
@grwlf This sounds pretty cool! So how could we do manual (or automatic) scheduling on an automatically generated backprop?
CC: @were seems to be interested in this as well.
@grwlf I don't envision this would require dynamic memory allocations, more that it would make it possible. If you know all shapes statically then of course you can just statically allocate.
@ajtulloch I meant the case where we have to allocate memory statically, but still want to vary one of the input dimensions. With this feature implemented, TVM may gain ground in the domain of resource-constrained embedded applications. If I am correct, in dynamic batching, as mentioned by @junrushao1994, we typically vary the batch-size dimension; that is probably a good example of such a case.
+1 to assigning graph partitions to threads instead of axes. There are some convincing benchmarks and discussions which suggest that cache locality is a primary performance booster.
Normally, if the workload is big enough, our past experience with MXNet suggests that parallelization within an op has better potential than binding parts of the graph to threads. Pipeline partitioning would be useful for small ops, though.
In inference, graph-level parallelism does not help that much (at least on CPU/GPU), because an operator is normally big enough to occupy all CPU threads or GPU streaming multiprocessors. About small workloads: @tqchen @nhynes do you have any specific examples of such workloads, i.e. ones that could be accelerated by multiple issuing but couldn't be fused into a single kernel?
Glow is a framework that does not support multi-threading (at least at the time of writing this post), i.e. all its operators are single-threaded. This could somewhat explain why multiple issuing helps in Glow (imho).
Glow [...] does not support multi-threading
Right, but that's what they're planning on doing now. They're going the graph partitioning route and have expressed that cache locality makes this approach more efficient than op-parallelism.
I can't quite think of an example of a "small" op in the world of fusion, though.
Right, but that's what they're planning on doing now. They're going the graph partitioning route and have expressed that cache locality makes this approach more efficient than op-parallelism.
Looks very interesting. @were do you have some bandwidth looking into this?
related #2810
As we move to NNVMv2 (Relay), there is a clear separation between compiler and runtime. The compiler IR is maximally flexible, while the runtime is a virtual machine (interpreter) that executes the code the compiler generates.
This is an RFC to discuss what kinds of runtimes we should have. Given the separation of compiler and runtime, it might make sense to have multiple runtimes, since there is a tradeoff between how rich a feature set we want to support (e.g. JIT) and the minimalism we need on an embedded device. Likely we will need both an AOT runtime and a JIT one.
There are also questions about what data structures we want to expose in the runtime. Likely the TVM runtime and PackedFunc are already our friends, but we may need a few more things to accommodate control flow and other applications.
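The appeal of a PackedFunc-style convention can be sketched in plain Python (the registry names below are hypothetical, not TVM's actual C++ API): every callable sits behind one type-erased signature, so the runtime can invoke compiler-generated kernels, control-flow helpers, and user hooks uniformly.

```python
# Plain-Python analogy of a PackedFunc-style registry: everything is
# stored behind the single signature (*args) -> value, so callers need
# no per-function type information.

_registry = {}

def register(name):
    def wrap(fn):
        # Erase the signature: all registered functions look the same.
        _registry[name] = lambda *args: fn(*args)
        return fn
    return wrap

def call_packed(name, *args):
    # Uniform dispatch by name, regardless of what the function does.
    return _registry[name](*args)

@register("add")
def _add(a, b):
    return a + b
```

A control-flow helper (say, a loop body) would register and be called the same way, which is what makes the convention attractive as the runtime's common currency.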