Closed tqchen closed 5 years ago
One thing that would be useful (that I think is orthogonal to JIT vs AOT) is considering relaxing the requirement on fully-specified shapes in graph_runtime. That is, we'd like to allow variable dimensions (consider supporting e.g. variable batch size) and allocate (at least some) memory dynamically.
@ajtulloch can you give some specific application examples? For example, dynamic batch size in image modeling, dynamic length in language models, etc. We do support certain forms of type inference with dynamic variables (the current type inference supports a symbolic integer in the batch dimension, so we can know the shape of a tensor is (n, 100, 100)), and likely that would help in certain cases.
@tqchen yes, a simple case would be supporting the use of a graph in graph_runtime
without requiring fully static shapes. For example, say we have a ResNet trained on the ImageNet-1k classes; we'd like to be able to pass an (N, 3, 224, 224) tensor and receive an (N, 1000) tensor, where N can change between calls. This shouldn't require a JIT; we can just partially shape-specialize when compiling.
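To make "partially shape-specialize" concrete, here is a minimal, hypothetical sketch in plain Python (not the actual TVM API): every dimension is fixed at compile time except the symbolic batch axis, which is bound and checked per invocation.

```python
# Hypothetical sketch of partial shape specialization: static dims are
# checked exactly, while string-named dims (e.g. "N") stay symbolic and
# are bound from the actual input shape at call time.

def specialize(input_sig, output_sig):
    """Return a runner that checks inputs against a partially-static signature."""
    def run(graph_fn, tensor_shape):
        if len(tensor_shape) != len(input_sig):
            raise ValueError("rank mismatch")
        bindings = {}
        for actual, expected in zip(tensor_shape, input_sig):
            if isinstance(expected, str):            # symbolic dim, e.g. "N"
                bindings.setdefault(expected, actual)
                if bindings[expected] != actual:
                    raise ValueError("inconsistent symbolic dim")
            elif actual != expected:                 # static dim must match exactly
                raise ValueError("static dim mismatch")
        # Output shape is derived by substituting the bound symbols.
        out_shape = tuple(bindings[d] if isinstance(d, str) else d
                          for d in output_sig)
        return graph_fn(tensor_shape), out_shape
    return run

# A ResNet-like signature: (N, 3, 224, 224) -> (N, 1000), N free per call.
runner = specialize(("N", 3, 224, 224), ("N", 1000))
```

The compiled kernels would then only need to be generic over N, not over every dimension.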
I agree with @ajtulloch.
More dynamism is necessary for future applications. We should list a few tasks for each field, including CV/NLP/RL/ASR/TTS, etc., and check them one by one to see what kinds of features will be necessary and what the trends are. In this phase, domain experts' input will be very helpful.
For embedded applications it is often important not to use dynamic memory. For example, some tiny systems may lack a memory manager entirely (stubs in place of malloc/free in libc). In my opinion, the user should always have the option to generate static-memory code, or code whose upper bound on memory usage is well-defined.
@ajtulloch Is there a possibility to support varying shapes but still use static memory allocations? For example, could we generate static memory code, which would allow users to vary one dimension of one input tensor?
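One way such a scheme could work, sketched here with hypothetical names and a made-up upper bound: the compiler fixes a maximum for the varying dimension, buffers are sized once for that bound, and each invocation uses only a prefix, so no allocation happens at runtime.

```python
# Sketch (hypothetical): static allocation sized for a compile-time upper
# bound on the batch dimension; each call uses only a prefix of the
# buffer, so no malloc/free is needed at runtime.

N_MAX = 32                       # upper bound fixed when the model is compiled
FEATURES = 1000

# "Static" arena: allocated once, never resized (stand-in for a .bss buffer).
output_arena = bytearray(N_MAX * FEATURES * 4)   # 4 bytes per float32

def infer(batch_size):
    if batch_size > N_MAX:
        raise ValueError("batch exceeds the static upper bound")
    # A real runtime would run kernels writing into the arena prefix;
    # here we just return the view the kernels would use.
    return memoryview(output_arena)[: batch_size * FEATURES * 4]
```

The memory bound stays well-defined (N_MAX × per-sample footprint), which seems compatible with the no-malloc embedded scenario above.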
@grwlf I don't envision this would require dynamic memory allocations, more that it would make it possible. If you know all shapes statically then of course you can just statically allocate. This is more enabling new cases where you don't know all dimensions for all tensors statically.
FWIW here are some thoughts on possible usages for the NNVMv2 runtime:
1) Simplest possible static interpreter: ideally all control logic would be compiled entirely into the model object code, and the 'interpreter' can simply statically allocate the requested memory (i.e. from the StoragePlan), then just dlopen/dlsym the code and jump directly to the entry point. This would allow an incredibly minimal interpreter (O(10kb) of object code, since it's basically just doing malloc + dlopen + dlsym), which is very useful for certain code-size-constrained CV applications in embedded systems.
2) More complex interpreter with support for dynamic allocation. You can imagine this as something used underneath the existing APIs of the various platforms' {mxnet,c2,tf}::Predictor
objects (e.g. https://github.com/apache/incubator-mxnet/blob/master/scala-package/infer/src/main/scala/org/apache/mxnet/infer/Predictor.scala, https://github.com/apache/incubator-mxnet/blob/master/amalgamation/python/mxnet_predict.py, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/predictor, https://github.com/pytorch/pytorch/blob/master/caffe2/predictor/predictor.h, etc). This would probably need to support dynamic memory allocation (as e.g. input dimensions like batch sizes may be different from invocation to invocation), possibly JIT-specializing to different shapes at runtime, etc. Another possible usage model could be as a backend for e.g. https://github.com/pytorch/pytorch/tree/master/torch/csrc/jit/fuser which would introduce some more constraints.
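The "simplest possible static interpreter" in usage 1) can be sketched in a few lines of Python via ctypes, whose CDLL and attribute lookup wrap dlopen/dlsym underneath. The model library path and entry-symbol name below are hypothetical; for demonstration, a symbol is resolved from the process's own loaded C library instead.

```python
import ctypes

# Sketch of the minimal static interpreter: open a compiled model
# library, resolve its entry symbol, and jump to it.
def load_entry(lib_path, symbol):
    lib = ctypes.CDLL(lib_path)      # dlopen
    return getattr(lib, symbol)      # dlsym

# A real deployment would do something like (names hypothetical):
#   entry = load_entry("./model.so", "model_main")
#   entry(input_buf, output_buf)
# For demonstration, resolve a symbol already loaded into the process:
entry = load_entry(None, "abs")      # CDLL(None): the process's own symbols
```

Everything beyond this (buffers from the StoragePlan, the jump itself) is plain pointer plumbing, which is why the interpreter can stay so small.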
Is training in the scope of discussion?
Another thing that could be useful in the runtime is support for graph partitioning, in case certain operators are not supported by accelerators or the resources at runtime don't permit them.
I would list the three most in-demand features of our runtime: 1) handling inputs with dynamic shapes, 2) handling execution on multiple compute resources (CPU, GPU, DSP, NPU, etc.), and 3) enabling operator-level parallelism.
I think having multiple runtimes makes sense, as the scenarios of server and edge are vastly different. The server side is relatively flexible due to its sufficient compute resources and weaker power-consumption constraints. For example, we can use a JIT to handle dynamic shapes, or execute sample runs to determine resource availability beforehand.
The critical part is how to design the runtime within the constraints of edge devices. I think the minimal static interpreter @ajtulloch suggested makes sense. However, an edge device may not have much space to store many pre-compiled object files.
Several concerns:
1) Minimizing data movement through laziness. In a word: just dumping data anywhere in global memory is fine; do not waste time concatenating, reordering, or moving data around before it is actually needed in real computation. For example, in concat, can we avoid the memory allocation, kernel launching, etc. that just move data somewhere, even though there is no actual computation? We could maintain discontiguous memory chunks lazily, and only index into them when real computation happens. Although the memory access cost inside that kernel possibly increases, it does eliminate memory accesses overall. I guess this could be implemented as an extra pass.
2) Dynamic batching. If 1) is enabled, dynamic batching could be realized very easily. I do believe dynamic batching would be a nice feature, so that NLP folks could focus on the model itself rather than on batching data, while getting performance gains automatically. I think we have some use cases in industry as well @yidawang .
3) Fine-grained streaming. It is possible to stream data on a CUDA GPU, but it is kind of complicated. We do hear about use cases in heterogeneous FPGA-CUDA acceleration where streaming row by row from the GPU to the FPGA helps a lot in reducing latency to nearly zero.
4) The ability to get rid of a runtime, i.e. pure AOT. Having a heavy runtime repeats the mistakes of the previous generation of deep learning frameworks. What's more, any extra dependency means extra pain on edge devices. I think it would be great to just dump a binary that lets us call a subset of the computation (one containing no control flow, only dataflow), and let users themselves assemble these binaries when branches, loops, etc. are involved.
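The lazy-concat idea in 1) can be illustrated with a small, hypothetical sketch in plain Python: keep the input chunks discontiguous, translate indices on access, and only pay the copy cost when a kernel genuinely needs contiguous memory.

```python
import bisect

class LazyConcat:
    """Concat without copying: keep the input chunks and translate indices
    lazily; only materialize a contiguous buffer when really required."""

    def __init__(self, chunks):
        self.chunks = chunks
        # Prefix sums of chunk lengths, used to locate an index's chunk.
        self.offsets = []
        total = 0
        for c in chunks:
            total += len(c)
            self.offsets.append(total)

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, i):
        k = bisect.bisect_right(self.offsets, i)   # which chunk holds i
        prev = self.offsets[k - 1] if k else 0
        return self.chunks[k][i - prev]

    def materialize(self):
        # Pay the copy cost only when contiguous memory is truly needed.
        return [x for c in self.chunks for x in c]
```

The per-element indexing is slightly more expensive, but the concat itself launches no kernel and moves no data, matching the laziness argument above.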
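The dynamic batching in 2) could be sketched like this (a toy model, with hypothetical names): requests arrive one by one, the runtime groups those with compatible shapes, and each group is served by a single batched kernel launch.

```python
from collections import defaultdict

# Toy sketch of dynamic batching: requests arrive individually, and the
# runtime groups compatible shapes so one launch serves many samples.
def dynamic_batch(requests, batched_kernel):
    """requests: list of (shape, payload); kernel runs once per shape group."""
    groups = defaultdict(list)
    for shape, payload in requests:
        groups[shape].append(payload)
    results = {}
    for shape, payloads in groups.items():
        # One kernel launch per group instead of one per request.
        results[shape] = batched_kernel(shape, payloads)
    return results
```

With lazy concat from 1), forming each group needs no data movement at all until the batched kernel actually runs.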
I agree with @szha that we should set a scope for the discussion. In this thread, should we talk about future integration into upstream deep learning frameworks (e.g. MXNet)? Should we talk about inference vs. training? Should we consider distributed execution, and should we consider thread safety? These are all concerns at different scopes.
To limit the scope of the discussion, let us first focus on the low-resource, pure AOT scenario. Multiple-target execution is already supported in the current graph runtime as of #1695; we only need to build compiler support for it. My guess is that JIT and training are each their own beast and deserve another thread.
Is training in the scope of discussion?
@szha We would be glad to continue discussion related to training in another thread: https://github.com/dmlc/tvm/issues/1996 (please cc @sgrechanik-h)
@grwlf This sounds pretty cool! So how could we do manual (or automatic) scheduling on an automatically generated backprop?
CC: @were seems to be interested in this as well.
@grwlf I don't envision this would require dynamic memory allocations, more that it would make it possible. If you know all shapes statically then of course you can just statically allocate.
@ajtulloch I meant the case where we have to allocate memory statically, but still want to vary one of the input dimensions. With this feature implemented, TVM may gain ground in the domain of resource-constrained embedded applications. If I am correct, in dynamic batching, as mentioned by @junrushao1994, we typically vary the batch-size dimension; that is probably a good example of such a case.
+1 to assigning graph partitions to threads instead of axes. There are some convincing benchmarks and discussions which suggest that cache locality is a primary performance booster.
Normally, if the workload is big enough, our past experience with MXNet suggests that parallelization within an op has better potential than binding parts of the graph to threads. Pipeline partitioning would be useful for small ops, though.
In inference, graph-level parallelism does not help that much (at least on CPU/GPU), because an operator is normally big enough to occupy all CPU threads or GPU streaming multiprocessors. About small workloads: @tqchen @nhynes do you have any specific examples of such workloads, i.e. ones that could be accelerated by multiple issuing but couldn't be fused into a single kernel?
Glow is a framework that does not support multi-threading (at least at the time of writing this post), i.e. all its operators are single-threaded. This could somewhat explain why multiple issuing helps in Glow (imho).
Glow [...] does not support multi-threading
Right, but that's what they're planning on doing now. They're going the graph partitioning route and have expressed that cache locality makes this approach more efficient than op-parallelism.
I can't quite think of an example of a "small" op in the world of fusion, though.
Right, but that's what they're planning on doing now. They're going the graph partitioning route and have expressed that cache locality makes this approach more efficient than op-parallelism.
Looks very interesting. @were do you have some bandwidth looking into this?
related #2810
As we move to NNVMv2 (Relay), there is a clear separation between compiler and runtime. The compiler IR is maximally flexible, while the runtime is a virtual machine (interpreter) that executes the code the compiler generates.
This is an RFC to discuss what kinds of runtimes we should have. Given the separation of compiler and runtime, it might make sense to have multiple runtimes, since there is a tradeoff between how rich a feature set we want to support (e.g. JIT) and the minimalism we need on an embedded device. Likely we will need both an AOT runtime and a JIT one.
There are also questions about what data structures we want to expose in the runtime. Likely the TVM runtime and PackedFunc are already our friends, but we may need a few more things to accommodate control flow and other applications.
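The appeal of a PackedFunc-style convention can be sketched in plain Python (the registry names below are hypothetical, not TVM's actual C++ API): every callable sits behind one type-erased signature, so the runtime can invoke compiler-generated kernels, control-flow helpers, and user hooks uniformly.

```python
# Plain-Python analogy of a PackedFunc-style registry: everything is
# stored behind the single signature (*args) -> value, so callers need
# no per-function type information.

_registry = {}

def register(name):
    def wrap(fn):
        # Erase the signature: all registered functions look the same.
        _registry[name] = lambda *args: fn(*args)
        return fn
    return wrap

def call_packed(name, *args):
    # Uniform dispatch by name, regardless of what the function does.
    return _registry[name](*args)

@register("add")
def _add(a, b):
    return a + b
```

A control-flow helper (say, a loop body) would register and be called the same way, which is what makes the convention attractive as the runtime's common currency.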