apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[RFC] Relay Dynamic Runtime #2810

Closed jroesch closed 5 years ago

jroesch commented 5 years ago

Putting the VM in TVM: The Relay Virtual Machine

We have previously introduced Relay, a new program representation, which is able to represent and optimize a greater breadth of machine learning programs. Unfortunately, by supporting a more expressive set of programs we introduced several new execution challenges. So far we have not provided a production-ready solution for executing full Relay programs, i.e. those containing control flow, abstraction (both data and functional), and dynamism.

We introduced a “debug” interpreter which performs naive AST traversal to execute the program. This approach is conceptually simple but requires traversal of the program for each evaluation. The program is stored as a tree which makes heavy use of indirection and leads to inefficient execution. There are also open questions around dynamism, such as dynamic scheduling and allocation, fully dynamic tensor shapes, and control flow. The interpreter has simple answers for these, but none of them is compelling or optimized.

The second execution mechanism is the existing graph runtime: to target Relay programs to it, we translate a small subset of them into the old graph format and execute them on the runtime. This provides a solid execution experience, but only for a very limited subset of Relay programs.

Finally we have developed an experimental ahead-of-time compiler, which transforms a Relay program into a shared library which implements it. This final approach provides compelling performance but is hard to extend with new approaches to handling dynamic programs.

This RFC proposes a new virtual machine for Relay programs. The virtual machine is designed to strike a balance between performance and flexibility when deploying and executing Relay programs, without giving up the benefits of TVM.

Virtual machine (VM) design is a well-studied area in programming languages and systems, and there have been various virtual machine designs for both full-fledged and embedded programming languages. Previous language VM designs have been heavily tailored to the execution profile of traditional programs. Traditional programs manipulate small scalar values and consist of a large number of low-level instructions. The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient.

In the context of machine learning we manipulate primarily tensor values, using a (relatively) small number of high-level instructions. An ML program's cost centers are expensive operator invocations, such as GEMM or convolution, over large inputs. Due to the execution profile exhibited by ML programs, the micro-optimizations present in scalar VMs are dramatically less important. A model's runtime will be dominated by executing expensive operators on large inputs.

TVM has provided strong support for vision models, but we want to grow to support a wider variety of models. The graph runtime is able to utilize the fully static nature of the input graphs to perform aggressive optimizations such as fully static allocation and optimal memory reuse. When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic allocation, we must change how execution works.

The rest of this design document focuses on explaining a VM which addresses these challenges and explores the design decisions which remain.

Proposed Design

I have been experimenting with different designs and discussing how to solve this problem with members of the community for the past few months.

Our belief is that the most important design aspect will be optimizing for cheap “allocation” of objects (by trying to avoid real allocation), reuse of static fragments, and the ability to handle dynamic shapes (i.e. jagged tensors).

Instruction Set

The critical design choice of a VM is the instruction set and its representation. The current representation of the instructions is a tagged union containing the op-code and the data payload. An important design decision is the level of abstraction of the instructions and how they take their data, that is, RISC vs. CISC and fixed-width vs. variable-length instruction encoding. The current version is closer to CISC, with complex instructions like AllocTensor, and is variable-length due to the inclusion of the shape as part of the instruction. The current instruction set is very high level and corresponds roughly to high-level operations in Relay.
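For intuition, a minimal sketch of what such a tagged-union encoding could look like. The op-code names mirror the instructions listed below, but the field names, types, and layout here are illustrative assumptions, not the actual definitions in vm.h.

#include <cstddef>
#include <cstdint>
#include <vector>
#include <dlpack/dlpack.h>  // DLDataType

// Illustrative op-codes, one per instruction described below.
enum class Opcode {
  Push, Pop, Ret, InvokePacked, AllocTensor, AllocDatatype, AllocClosure,
  GetField, If, Goto, Invoke, InvokeClosure, LoadConst
};

// Payloads for a few representative instructions.
struct PushArgs { size_t stack_index; };
struct InvokePackedArgs { size_t packed_index; size_t arity; size_t output_size; };
struct IfArgs { size_t true_branch; size_t false_branch; };
struct GotoArgs { size_t pc_offset; };

// A CISC-style instruction: the op-code tags which payload is meaningful.
struct Instruction {
  Opcode op;
  union {
    PushArgs push;
    InvokePackedArgs invoke_packed;
    IfArgs if_branch;
    GotoArgs jump;
    // ... payloads for the remaining op-codes
  };
  // Variable-length part, e.g. the shape and dtype carried by AllocTensor,
  // kept outside the union because it is not trivially copyable.
  std::vector<int64_t> shape;
  DLDataType dtype;
};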

Push

Arguments: size_t stack_index

Reads the value at base pointer + stack_index and pushes it onto the stack.

Pop

Arguments: size_t pop_count

Pops pop_count entries off the top of the stack.

Ret

Arguments: None

Returns from the current function call, popping off the last value of the frame stack and restoring the VM state that was recorded at the last call site. The last value on the stack is interpreted as the return value.

InvokePacked

Arguments: size_t packed_index, size_t arity, size_t output_size

Invoke the packed function denoted by packed_index. The arity and output size are used to inform the VM how many inputs and outputs to expect.

AllocTensor

Arguments: const std::vector<int64_t>& shape, DLDataType dtype

Allocate a tensor value of the appropriate shape and dtype.

AllocDatatype

Arguments: size_t tag, size_t num_fields

Allocate a data type with the tag tag using the top num_fields entries on the stack as its fields.

AllocClosure

Arguments: size_t func_index, size_t num_freevar

Allocate a closure with the VMFunction at func_index as its code, and the num_freevar entries on the stack as its free variables.

GetField

Arguments:

If

Arguments: size_t true_branch, size_t false_branch

Check whether the top element on the stack is true or false. If true, relative jump by true_branch; otherwise, relative jump by false_branch.

Goto

Arguments: size_t pc_offset

Relative unconditional jump by pc_offset.

Invoke

Arguments: size_t func_index

Invoke the function at func_index, consuming the number of arguments given by the VMFunction's arity field, and place the return value on top of the stack.

InvokeClosure

Arguments: None

Expects the top value on the stack to be a closure. Invokes the closure, consuming the number of arguments declared in the closure's VMFunction, and places the return value on the stack.

LoadConst

Arguments: size_t const_index

Load the constant at const_index from the constant pool.

Object Representation

We use a simple object representation that uses shared pointers and tagging. There is a huge space of object representations we could trade off here, but we believe micro-optimizing this code has little to no effect on the end-to-end performance.

struct VMObjectCell {
  VMObjectTag tag;
  ...
};

struct VMObject {
  std::shared_ptr<VMObjectCell> ptr;
  ... 
};

See vm.h for more details.

Currently we support three types of objects: tensors, datatypes, and closures.

VMObject VMTensor(const tvm::runtime::NDArray& data);
VMObject VMDatatype(size_t tag, const std::vector<VMObject>& fields);
VMObject VMClosure(size_t func_index, std::vector<VMObject> free_vars);
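As a usage sketch, assuming an existing tvm::runtime::NDArray named data; the tag and index values are made up for illustration:

// Illustrative only: wrap a tensor, bundle it into a datatype value,
// and capture that value as a closure's free variable.
VMObject tensor  = VMTensor(data);
VMObject pair    = VMDatatype(/*tag=*/0, {tensor, tensor});
VMObject closure = VMClosure(/*func_index=*/3, {pair});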

Stack and State

The Relay VM consists of two important stacks: the value stack, which acts like the normal call stack used in C/C++/Java/etc., and the frame stack, which contains information about how to resume the previous call.

A stack machine is straightforward to implement; register-based virtual machines are more efficient in the scalar world, but we believe that in the tensor world the complexity-to-performance-gain tradeoff is not worth it.

For each call we keep track of the Relay function we have invoked, a pointer into its bytecode, an offset into the bytecode known as the program counter, and an offset into the value stack which tells us where the stack frame begins, known as the base pointer.

struct VirtualMachine {
    ...
    std::vector<VMFrame> frames;
    std::vector<VMObject> stack;
    ... 
    // Current function.
    size_t func_index;
    // Pointer into the current function's instructions.
    const Instruction* code;
    // Current program counter relative to the code pointer.
    size_t pc;
    // The current base pointer.
    size_t bp;
    ... 
};
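The frame entries themselves are not shown above. A minimal sketch of what a VMFrame might record, assuming it simply mirrors the description in the previous paragraph (the actual definition is in vm.h):

// Illustrative only: the state saved at a call site so Ret can restore it.
struct VMFrame {
  size_t func_index;        // the caller's function
  const Instruction* code;  // the caller's bytecode
  size_t pc;                // where to resume in that bytecode
  size_t bp;                // the caller's base pointer into the value stack
};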

Dispatch Loop

A critical piece of a VM is the dispatch loop; usually this dominates the execution time of a virtual machine, but experimentally we have found the performance of the loop to not matter much. We have implemented a simple switch/goto dispatch loop which dispatches based on the instruction op-code.

This loop is implemented by VirtualMachine::Run().

It is my belief that this code is not as important to end-to-end performance as allocation and memory reuse.
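For intuition, a skeleton of such a loop, written against the illustrative Instruction encoding sketched earlier and the VirtualMachine fields above; PopFrame and StackTopIsTrue are hypothetical helpers, and the real VirtualMachine::Run() handles every op-code plus error cases.

// Illustrative dispatch loop: fetch the instruction at pc and switch on its op-code.
void VirtualMachine::Run() {
  while (true) {
    const Instruction& instr = code[pc];
    switch (instr.op) {
      case Opcode::Goto:
        pc += instr.jump.pc_offset;        // relative unconditional jump
        break;
      case Opcode::If: {
        bool cond = StackTopIsTrue();      // hypothetical: read the top of the value stack
        pc += cond ? instr.if_branch.true_branch : instr.if_branch.false_branch;
        break;
      }
      case Opcode::Ret:
        if (frames.empty()) return;        // returning from the entry function
        PopFrame();                        // hypothetical: restore code, pc, bp, func_index
        break;
      default:
        // Non-control-flow op-codes (Push, Pop, AllocTensor, InvokePacked, ...)
        // perform their effect (omitted here) and fall through to the next instruction.
        ++pc;
        break;
    }
  }
}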

VM Compiler

An important part of this infrastructure is a compiler from Relay's full IR into a sequence of bytecode. The VM compiler transforms a tvm::relay::Module into a tvm::relay::vm::VirtualMachine. The virtual machine contains a set of compiled functions, each represented by a tvm::relay::vm::Function. The functions contain metadata about the function as well as its compiled bytecode. For the full definition of the data structures see vm.h.
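A rough sketch of the shape of such a compiled function; the field names here are assumptions, and the authoritative definition is in vm.h:

// Illustrative only: a compiled Relay function inside the VM.
struct VMFunction {
  std::string name;                        // metadata: the source function's name
  size_t params;                           // metadata: number of parameters
  std::vector<Instruction> instructions;   // the compiled bytecode
};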

Optimizations

There are quite a few optimizations required by the VM compiler.

We have implemented them in the old pass style, but plan to port them to the new pass manager (#2546) before merging.

Serialization

A final and yet-to-be-implemented part of the VM design is serialization. The accompanying PR will introduce both the bytecode and its serialization, as well as VM-level serialization. The idea is that a VM can be efficiently stored to disk and resumed at a later time. This would also allow us to efficiently schedule many models onto a single machine in order to obtain good utilization.

Unresolved Questions

How do we handle dynamic shapes?

I have another prototype extension to Relay which adds initial support for compiling and executing programs containing fully dynamic shapes. I will post an RFC and prototype PR on this subject soon.

How can we modify the VM to support JIT compilation of certain code paths?

In the code generation space there are still many tradeoffs to be analyzed and the VM is designed to be very flexible so we can modify it for future experiments.

How do we support heterogeneous execution?

Heterogeneous execution should work out of the box assuming we have annotated the appropriate device copies. In order to do this properly we need to run the device annotation and copying passes. We foresee nothing too complex in this work.

jroesch commented 5 years ago

cc @tqchen @icemelon9 @wweic @zhiics @MarisaKirisame @junrushao1994 @abergeron @ajtulloch @antinucleon

Ravenwater commented 5 years ago

I am not a big fan of stack machines because they create a self-inflicted serialization constraint on the operands. So, why shoot yourself in the foot before you begin? I would rather see a large register abstraction for tensor operands, so that concurrent execution is readily analyzable, as read/write hazards and reordering are directly visible to the scheduler.

Stack machines are unpleasant to do in hardware, and history shows that all hw engines for stack machines first unpack the stack and cast it back into register-based resource structures before they start executing.

My vote is against a stack-based VM.

junrushao commented 5 years ago

@Ravenwater What else do you propose if you don’t like stack-based VM?

tqchen commented 5 years ago

Good discussions so far. I think we want to de-couple the RFC into two components:

I would recommend we focus on discussing and agreeing on the item (1) since this is the most important thing in the short term. Having a stack-vm as a starting point is fine due to its simplicity as long as we have some good thoughts on scheduling.

@jroesch it would also be nice if you can separate the RFC into the three perspectives. For example, all stack-unrelated instructions can go to (1)

icemelon commented 5 years ago

@tqchen It's not trivial to separate instructions into stack-unrelated and stack-related ones because this affects the calling convention. For example, in a stack VM, the invoke instruction assumes all the input tensors and return tensors are already pushed on the stack.

But I agree that we should have a debate on whether to go with stack vm vs register vm first.

Ravenwater commented 5 years ago

@jroesch @junrushao1994 To clarify the statement: don't use a stack to manage the operands to work on. The stack manipulations needed to order operands and results for the instruction set are the culprit: they create a serialization that has nothing to do with the algorithm to be executed, but everything to do with the resource constraints of the stack operators and virtual instruction set. To extract concurrency, you need to 'unroll' all these stack operations to remove the artificial read/write hazards the stack machine creates.

Ravenwater commented 5 years ago

@icemelon9 my vote is to take the operand slots in the stack frame and map them to an abstract register file and operate on them with a register-based instruction set. This will make it easier to map the execution onto a traditional stored program machine with registers, direct execution pipelines, systolic arrays, and distributed data flow engines.

tqchen commented 5 years ago

I meant that the calling convention (whether arguments should be on the stack) could differ between a register-based and a stack-based VM. Anyway, we can also see if we can get a concrete proposal for a more "register-based" version. Regardless of which way we take, we still need a stack frame to handle function calls and need to think about tail recursion under each setting.

Ravenwater commented 5 years ago

Interpretable-SSA-virtual-machine.pdf

ajtulloch commented 5 years ago

At a glance, I think this approach will work very well. It's very similar to what is used in libraries like PyTorch JIT, which try to solve a very similar problem, e.g. https://github.com/pytorch/pytorch/blob/47bf30661fb45f106e2b609495d82af463513b99/torch/csrc/jit/interpreter.cpp#L700-L769, etc.

zhiics commented 5 years ago

@Ravenwater I understand the difference between stack- and register-based VMs. But could you please be more specific about scheduling and parallelism in this context? I might be missing something here. IMO, most of the parallelism we could get should come from the way that we invoke the packed primitive functions, but that is okay for the current design.

Ravenwater commented 5 years ago

@zhiics A stack-based ISA defines the operands of its instructions relative to the position of the top of stack. This simplifies the ISA because you don't have the addressing space/assignment of a register file. However, you also lose the ability to exploit concurrency between operators, as they are all serialized through the top two or three positions of the stack.

When you think of many computational graphs they tend to be trees of 'stuff', where the leaf operands are independent. All of that represents opportunities for the hardware to execute operators concurrently.

If you look at what a JVM JIT needs to do, it basically executes the stack program, extracts the operands, assigns them to registers, and issues the associated instructions. To extract any concurrency, the JIT needs to unroll the stack instructions, which are a serialization constraint created purely by the VM: they are not present in the computational graph of the program.

kevinthesun commented 5 years ago

@Ravenwater Do you mean the current design has difficulty handling operator-level parallel execution, which is important for networks such as RNNs?

tqchen commented 5 years ago

Some followup comments:

First of all, a stack VM implementation does not necessarily tie itself to serial execution (see my comment on the separation of scheduling and the VM). Most deep learning frameworks have an extra asynchronous scheduling engine that executes these instructions asynchronously based on dependencies. This is essentially an "unrolling" approach. In the case of a traditional VM, because the operational unit is a scalar, performing this kind of execution can cause performance regressions. In our case such cost can be mitigated because each operation is a tensor op.

Regardless of stack vs register based implementation, we always need to have a call stack for function calls, and because we need tail recursion optimizations in our case, perhaps stack vm's abstraction makes it relatively easy to do so.

One important benefit of the stack VM is that it is simple to implement and extend; such simplicity could outweigh a lot of things.

Finally, since our high-level program is always relay, we do not have to perform all the optimizations like parallelization detection and distributed mapping at the level of the VM. Relay by itself is much closer to a computational graph form, and can be used for such high level analysis and transformation.

wweic commented 5 years ago

I think we need to be specific about what workloads we would like to schedule for Relay and design for them. From my experience looking at the opcodes generated by the proof-of-concept Relay VM, the heaviest-lifting opcode is InvokePacked, which invokes a packed function that does some extensive tensor computation on some device. The majority of the remaining opcodes are for control flow, object allocation and deallocation, etc., which I don't think would benefit much from concurrent execution. What we really need is probably to hook an async execution engine into InvokePacked, as @tqchen mentioned.

tqchen commented 5 years ago

To be clear, we do want to think about how concurrency can be handled in each type of implementation. Concurrent execution does not have to be implemented now but should be considered in design scaffolding.

My last comment seems to mainly favor stack VM, given that there is also some amount of interest in register-based VM implementation. I would recommend that we also have an alternative design proposal for register based VM.

In particular, pytorch's JIT seems to use a resizable register file that avoids spilling (which makes sense) and might also make things simpler to implement.

So regardless of which type of VM we implement, let us focus on how we can handle generic control flow and loops effectively (which is the main reason why the VM is introduced instead of using the graph executor). In particular:

It would be really helpful for all of us to discuss these aspects in the context of both (register and stack) VMs

FrozenGene commented 5 years ago

I have done a compiler translation from register-based to stack-based, and also from stack-based to register-based. They differ in some ways, but I must say, from my experience, I don't think this is the most important factor we should care about for this RFC. And for TVM, not everything is done in this VM (if I don't misunderstand) and we also have other scheduling engines.

For me, this is generally good and I care about one thing: how do we handle memory allocation efficiently? For dynamic shapes, if we allocate every time, we lose all the advantages.

Another thing I care about: how do we support other inference engines (like TensorRT / NPU frameworks) if we have the VM? I know it maybe goes somewhat against TVM's design.

Previously, we could do it like this:

                     FE
                     |
              NNVM / Relay
               |       |
       NNVM graph   NPU-supported op graph
                  |
                 TVM  ---->   TVM Runtime (+ libTensorRT | libNPUInferEngine)

How could we handle this situation after we have the VM? Many NPU inference engines / TensorRT only expose a high-level interface (at the level of the framework, such as addConvolution, addRelu, createInfereEngine and so on).

Ravenwater commented 5 years ago

For (parallel) hardware accelerators, and in particular, parallel tensor processors, the hardware will have to do most of the scheduling. Memory allocation constraints are mostly for sequential stored program machines. For distributed data flow and PIM architectures the allocation must occur during execution typically just-in-time underneath the execution wavefront as it emerges in the machine.

In the Mathematics of Arrays, this resource allocation is provably optimal, and like any scheduling algorithm, the more information you have about the graph the better job you can do. Looking at the trend forward, the scheduler will disappear completely into the hardware, as that is where the queuing and resource constraints are known.

It is with this simple observation that stack-based ISAs become unpleasant, as the information the hw scheduler can see is minimal. It will need to execute the actual program to get to a representation that can feed into an optimizing scheduler for tensor compositions.

The ultimate NPU will control and manage the stack frames as well, so it can take advantage of cross function call boundary concurrency. The reason that this becomes essential for parallel hardware is that when the concurrency grows filling and draining the pipeline (think 1M+ ALUs) becomes a major limit on the execution efficiency of the machine. The hw scheduler needs to have all the information to avoid draining the execution resources.

A lot of the discussion appears to be governed by how to schedule for a sequential machine. But DL is one of those applications where the sequential machine is irrelevant as the computational demands require extreme concurrency. I totally agree with @wweic that organizing this discussion by the demands for specific models is going to be a worthwhile experiment.

icemelon commented 5 years ago

Though most of the parallelism happens in the packed functions, it is still possible to execute two packed functions in parallel, especially for inference in the cloud. However, to some extent, a stack-based VM enforces certain orders when performing stack instructions. It potentially makes it harder to design an async executor.

One important goal for vm design and hardware-independent instructions is to provide more flexibility to write different executors given different requirements. My feeling is that converting from a register vm to a stack vm is easier than vice versa. @FrozenGene you can chime in since you've done both.

In terms of the optimizations mentioned by @tqchen, such as tail recursion, memory allocation, and liveness analysis, they could be orthogonal to whether it's a register VM or a stack VM. I believe you can write optimization passes for both.

Ravenwater commented 5 years ago

@icemelon9 and to continue speaking in generalities to create simple dichotomies, recursion is bad for building parallel hardware execution engines, again due to the sequentialization semantics of the recursion.

For good scheduling of tensor compositions, you want the whole composition expression so that all the expansions (Kronecker products etc.) and contractions (dot, matvec, matmul) are visible.

jroesch commented 5 years ago

My personal opinion is we need to abandon modeling VMs as high-level versions of an ISA, and their implementation as micro-architecture. We are in a fundamentally different world for executing Relay programs. The VM is only the control plane; the data plane is the computation contained in highly optimized kernels produced by TVM.

We no longer compute over scalars, so the lessons learned from building scalar VMs and ISAs don't really apply.

The program in question is a high-level dataflow program where an order has been selected. One simple implementation of this instruction set changes each function call to place a future on the stack, which will eventually be completed by executing the function in parallel.
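A rough sketch of that idea, purely illustrative and not part of the prototype: the stack slot holds a future instead of a finished VMObject, and consumers force it only at real data dependencies. PackedCall, InvokePackedAsync, and ReadSlot are hypothetical names introduced here.

#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical: a stack slot that is either a ready value or a pending result.
using PackedCall = std::function<VMObject(const std::vector<VMObject>&)>;

struct LazySlot {
  std::shared_future<VMObject> value;
};

// Instead of pushing a completed value, InvokePacked launches the kernel
// asynchronously and pushes a future for its result.
void InvokePackedAsync(std::vector<LazySlot>& stack,
                       PackedCall kernel, std::vector<VMObject> args) {
  stack.push_back({std::async(std::launch::async,
                              [kernel, args] { return kernel(args); }).share()});
}

// Consumers force the future, so synchronization happens only at true
// data dependencies between operator invocations.
VMObject ReadSlot(const std::vector<LazySlot>& stack, std::size_t index) {
  return stack[index].value.get();
}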

This design enables asynchronous and out of order execution while being completely opaque to the VM's compiler and end-users. If we want to enable higher level forms of speculation and parallelism we can borrow ideas from parallel functional languages to change the program before scheduling on the VM (see https://simonmar.github.io/bib/papers/monad-par.pdf).

Simon has quite a few different designs in place and has used some of them to build a highly concurrent system now deployed at Facebook; the core of the library is here (https://github.com/facebook/Haxl).

On the topic of recursion, iteration and recursion are isomorphic. There is no fundamental difference: any tail-recursive function can be converted to a loop and vice versa. In this case we perform tail-call optimization, which removes the recursive call, leaving only a goto instruction.

If we wanted to perform instruction-level scheduling, this would be no worse than the code generated by a loop, but it maintains the high-level properties enjoyed by Relay today.

One feature that may be more challenging is speculative execution. Speculative execution at the instruction level isn't particularly interesting in my opinion. The only meaningful speculation is at control boundaries such as loops or conditions such as TensorFlow does (https://www.tensorflow.org/api_docs/python/tf/while_loop, see parallel iterations).

If we want to provide non-strict semantics we should just transform the source program and use a simple mechanism like the one exposed in the Par paper described above.

A final argument is that the current graph runtime is effectively a stack, with each node having pointers into it; the only difference is that it cannot grow or shrink.

The most important concern in my opinion is memory optimization. The current prototype has a new allocator API which we intend to leverage. We can perform dynamic allocation with a specialized allocation cache which will avoid repeatedly allocating memory from the system allocator.

The next step is to perform an optimization where we group allocations into blocks, and then request a dynamic block of the correct size. In this case code like loops which release, and re-request an identical size block each iteration will end up reusing the same block.
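A rough sketch of that caching idea (purely illustrative, not the allocator API in the prototype): freed blocks are kept in a free list keyed by size, so a loop that releases and re-requests an identically sized block keeps hitting the cache.

#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical allocation cache: recycles raw blocks of identical size.
class BlockCache {
 public:
  void* Alloc(std::size_t nbytes) {
    auto& pool = free_[nbytes];
    if (!pool.empty()) {            // reuse a previously released block
      void* block = pool.back();
      pool.pop_back();
      return block;
    }
    return std::malloc(nbytes);     // otherwise fall back to the system allocator
  }

  void Free(void* block, std::size_t nbytes) {
    free_[nbytes].push_back(block); // keep the block around for the next request
  }

 private:
  std::unordered_map<std::size_t, std::vector<void*>> free_;
};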

A final optimization is applying a static analysis, like the current memory planner, where we recover static allocation sizes. Note that we can no longer do this in all cases due to dynamic behavior (loops), and when we introduce dynamic tensors the shapes will not be known until runtime.

Ravenwater commented 5 years ago

@jroesch that is a good argument, and I can get behind that. I think I can see a path from the proposed VM structure to our domain flow world. (Domain flow is distributed data flow)

I do want to point out that in a distributed data flow environment there are no goto's as there isn't a globally addressable resource. Similarly, memory allocation transforms to resource reservation guarantees that occur throughout a distributed machine and become implicit during execution. To implement recursion in a domain flow architecture requires complete unrolling of the program graph and mapping it on a physical network. So, recursion on a random access machine may be cute, on a domain flow machine it is still an open research problem. We'll have to punt on any recursive algorithms, or depend on a rewrite.

jroesch commented 5 years ago

I think it is important to remember this doesn't have to be the only way to execute Relay programs. Users are free to design and implement others. It is hard to build a single abstraction that maps well to all possible hardware platforms.

Relay is still relatively high level and can be transformed into an IR subset free of a device's particular challenges, i.e. by performing loop unrolling (which is also possible with recursion).

A key design goal here is to provide a generic VM that satisfies common general-purpose devices (i.e. x86, ARM, GPU), and integrates with a class of accelerator designs. For example, TPU-like devices can easily be integrated into this design as packed functions which offload a subset of the network to the device.

There are classes of models which are fundamentally recursive, and they will not map to hardware devices which cannot support dynamic, unbounded iteration/recursion. Many accelerators will never support executing entire models of this class, and heterogeneous execution is required.

A goal of Relay is to optimize and execute arbitrary models without restricting the program representation or runtime system by the semantics of the minimally expressive device. Devices with limited semantics will be unable to execute some programs and there is no way around that.

If your architecture can map dataflow style token passing to your device, i.e TensorFlow loops, then by definition you can execute a large set of recursive programs just by performing a program transform. This transformation is the inverse of the recent TF loop decompilation, and the transformation from Relay to TF is far easier.

icemelon commented 5 years ago

@jroesch I agree with your argument that the VM design should be generic and allow different ways to implement an executor. But I have a different perspective on stack VM vs register VM. I think register VM is more general than stack VM in this sense.

For example, as @Ravenwater mentioned, it's easier to implement a dataflow executor from a register VM since registers naturally reveal the data dependencies. In comparison, you need more analysis to track such dependencies in a stack VM. I quickly scanned the monad par paper, and I feel that it's overkill and adds more complexity to the VM's compiler. Using futures and callbacks might cause more overhead imho.

Now compare the complexity of a register VM and a stack VM. I think they're similar. Both VM designs need optimization passes like tail recursion, liveness analysis, and memory planning. We can also have AllocTensor instructions in the register VM such that we can use the memory manager to make memory sharing opaque to the VM's compiler. And we don't need to worry about register spilling if we support a resizable register file.

Ravenwater commented 5 years ago

@jroesch Domain Flow is not Data Flow. Data Flow has no concept of space and time, just dependencies. Domain flow has a relative concept of space and time. Mapping onto a Domain Flow Architecture is akin to place-and-route, although computational graphs tend to be simpler and more structured than circuit graphs. The relationship between these two models of computation is that if you expand the computational graph to the bit-level operators you end up with a circuit graph, on which synthesis operations such as retiming and path length alignments are possible.

For tensor math and in particular tensor compositions, we want to take advantage of the structure of the tensors and tensor operators, so we will stop well short of the circuit graph and synthesize a Domain Flow program instead. But true gate synthesizers may not want to stop there and they may want to generate the full bit-level operator graph and the associated circuit graph to place-and-route on future 100BGate FPGAs. Recursion is a Random Access Memory concept that only understands sequence dependencies, and thus will never be a good abstraction for parallel hardware.

junrushao commented 5 years ago

@Ravenwater I am fine with either a stack VM or a register VM, but sorry, I am not an expert in FPGA stuff and could not understand many of your concerns. In particular, I have several questions.

First, what on earth is domain flow, could you define it in a more formal language? What do you mean by "a relative concept of space and time"?

Second, about your concern about recursion. From my point of view, recursion is about expressiveness, as modern neural networks (and differentiable programming) require Turing-completeness. Sacrificing expressiveness is to sacrifice all future opportunities for deep learning models. Are there any alternatives that are developer-friendly, PL-correct, and Turing-complete without recursion?

Ravenwater commented 5 years ago

@junrushao1994

Here is the academic research background: Domain Flow and Streaming Architectures: https://ieeexplore.ieee.org/document/145479

Domain Flow Architectures are basically direct execution structures for systems of affine recurrence equations after the seminal work by Karp, Miller, and Winograd on the characterization of concurrency of parallel algorithms based on uniform and affine dependencies.

The Organization of Computations for Uniform Recurrence Equations: https://dl.acm.org/citation.cfm?id=321418

That is the basic research that has led to the polyhedral compilation framework, which provides the mathematics behind Halide and Tiramisu (https://www.csail.mit.edu/research/tiramisu-compiler).

Ravenwater commented 5 years ago

@junrushao1994 There is also a new strain of tensor composition scheduling that is based on the Mathematics of Arrays (MoA) from Lenore Mullen

https://arxiv.org/abs/0907.0796 Tensors and n-d Arrays:A Mathematics of Arrays (MoA), psi-Calculus and the Composition of Tensor and Array Operations

The Python Numpy folks are working on an integration into Numpy.

This research comes from the IoT space where the problem is slightly different. There they have very high-dimensional data spaces that they would like to slice-and-dice and apply dimensionality reduction techniques to. These slice-and-dice operators are expressed as tensor compositions, as are reduction algorithms such as HD-SVD and its cohorts.

jroesch commented 5 years ago

The implementation of the hardware is not important to the high-level Relay program because all Tensor-to-Tensor functions are black boxes. They may be implemented any way you want: in C++, in TVM, or as a hardware accelerator primitive. In order to map a subset of the program down to hardware you will have to unroll it, which is required for most fixed-function hardware. You can then replace the unrolled program as a new operation and rewrite the program to use this instead.

Hardware that does not support arbitrary and dynamic program sizes cannot execute all models of interest; they fundamentally don't fit into Halide/TVM-style DSLs. The deep learning community has focused on optimizing a small subset of models with very regular behavior, but the next wave of models invalidates assumptions such as statically known dimensions or the static control flow required by polyhedral optimizers. The point of the Relay VM is to coordinate at a higher level, where you need iteration, dynamic allocation, and communication.

I have thought further about a register-based VM and see no strong argument for why registers are better than stacks. Most of the research on dynamic VMs focuses on this distinction in order to reduce memory movement and dispatch overhead while executing the application. Packed functions will dominate execution time, and optimizing for dispatch is an incredibly premature optimization.

The other argument for register-based VMs is instruction-level parallelism. Again, instructions don't matter much here; meaningful parallelism happens at data dependencies between operators, and inside the operators themselves (i.e. parallel matrix multiplication).

The point of the parallel monad paper is not to use their technique for the source language, but use the execution model to get parallelism between operator invocations. We can view the future graph as the data dependency graph and do graph reduction over it.

For example if I depend on a sequence of function calls it is valid to evaluate them in parallel while evaluating a future computation that may depend on the results. The amount of synchronization needed here is very small, and the real opportunity for parallelism is inside operators. We don't need to worry about where the results are stored, we essentially give it a register name when we push a future into stack position n.

In a sense we already have an infinite register because we can address any stack position. In this case we can easily address a future result by referencing position n. The only difference is the location where operations look for their result. We need a call stack for functions, and function calls are the primary operation based on observations of current workloads.

Furthermore the current approach makes the VMCompiler far simpler, and easier to extend.

I personally value simplicity, and we have zero evidence that the current approach is slow; in fact we have evidence to the contrary. The initial prototype is already faster than MXNet's executor, which is used in production at AWS.

jroesch commented 5 years ago

I posted a draft PR of the VM so everyone can check it out, provide feedback, and play with the prototype I discussed. See #2889. I'll be polishing it over the next few days.

Ravenwater commented 5 years ago

@jroesch your argument is better than my argument, and I certainly subscribe to the power of simplicity. Hence, I will adapt and learn from your stack-based VM. Thank you for the thoughtful articulation of the pros and cons, and recording it in the development stream. This is excellent work.

tqchen commented 5 years ago

Close for now as we converge on the VM design