merrymercy commented 5 years ago

Update (Dec. 25, 2020): This RFC is deprecated. We started another project, "Ansor", to bring an auto-scheduler to TVM. Ansor is integrated as the tvm.auto_scheduler package in the current code base. You can see the new RFC and tutorials.

Auto-Scheduler

TVM decouples kernel implementation into compute and schedule. The compute part is a friendly DSL that can describe algorithms intuitively. However, the schedule part still requires strong expert knowledge and time-consuming tuning to provide decent performance. The tuning process is partially automated by the existing autotvm package, but a human-engineered template is still required.

This RFC proposes a "real" autotvm, which we can call the auto-scheduler. It aims to remove all human effort from the schedule part.

Proposed Design

The auto-scheduler is built on the existing autotvm package. It generates a template from the compute declaration. This template can then be either:

Statically filled by heuristic rules and cost functions to provide reasonable performance, or
Dynamically tuned by autotvm to provide better performance with some time budget
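The two ways of using a generated template can be illustrated with a small, self-contained sketch. Everything in it is hypothetical and not taken from the branch: the knob names, the heuristic rule, and the `fake_measure` cost function are assumptions that only show the difference between filling knobs from rules and searching over them within a budget.

```python
import itertools

# Hypothetical template: each knob maps to its candidate values.
template = {
    "tile_size": [4, 8, 16, 32],
    "vectorize": [False, True],
}

def fill_statically(template, vec_size):
    """Pick one value per knob from heuristic rules, without any measurement."""
    tiles = template["tile_size"]
    # Assumed rule: largest tile that is at most twice the vector width.
    tile = max((t for t in tiles if t <= 2 * vec_size), default=min(tiles))
    return {"tile_size": tile, "vectorize": vec_size > 1}

def tune_dynamically(template, measure, budget):
    """Measure up to `budget` configurations and keep the cheapest one."""
    names = list(template)
    candidates = itertools.product(*(template[n] for n in names))
    best_cost, best_config = float("inf"), None
    for values in itertools.islice(candidates, budget):
        config = dict(zip(names, values))
        cost = measure(config)
        if cost < best_cost:
            best_cost, best_config = cost, config
    return best_config

def fake_measure(config):
    # Stand-in for running on a real device; lower is better.
    cost = abs(config["tile_size"] - 16)
    return cost if config["vectorize"] else cost + 10

static_config = fill_statically(template, vec_size=8)
tuned_config = tune_dynamically(template, fake_measure, budget=8)
```

In a real setting the static path costs nothing at compile time, while the tuned path trades a time budget for measurements on the target device.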

The auto-scheduler takes a computation graph described in the TVM DSL as input, then classifies the read/write patterns and the type of computation of each node. It dispatches the nodes in the DAG to different "meta templates". The meta templates generate autotvm templates from the compute declaration. There are four types of meta templates: simple reduction, complex reduction, direct compute, and location-tunable compute. The auto-scheduler performs parallelization, vectorization, tiling, and operator fusion.

The code is available on my branch. The current implementation is in pure Python because autotvm is mainly written in Python. Moving the whole autotvm package to C++ is part of the long-term plan. The code is organized as follows.

Analysis on access pattern: python/tvm/autotvm/auto_schedule/stage_analysis.py
CPU backend: python/tvm/autotvm/auto_schedule/backend/cpu.py
GPU backend: python/tvm/autotvm/auto_schedule/backend/gpu.py
Configuration for the auto-scheduler: python/tvm/autotvm/auto_schedule/common.py
Experimental auto-packing for optimizing vectorization and locality: python/tvm/autotvm/auto_schedule/auto_pack.py
Test cases: tests/python/unittest/test_auto_scheduler.py

API

There are only two user-facing API calls:

autotvm.AutoSchedulerOptions(**kwargs): This is used to configure the auto-scheduler. The arguments include hardware configurations (vector lanes, number of threads, size of shared memory, etc.) and tuning configurations (how many tuning knobs to generate).
autotvm.create_schedule(tensors): This is similar to tvm.create_schedule, but returns an already optimized schedule.

import tvm
from tvm import autotvm

# Declare the computation: C[i] = A[i] + B[i] * 2
A = tvm.placeholder((128,), name='A')
B = tvm.placeholder((128,), name='B')
C = tvm.compute((128,), lambda i: A[i] + B[i] * 2)

with tvm.target.create('llvm'):
    # Hardware parameters used by the heuristic rules for static scheduling
    with autotvm.AutoSchedulerOptions(vec_size=8, num_threads=16):
        s, bufs = autotvm.create_schedule([A, B, C])

# NO SCHEDULE REQUIRED

func = tvm.build(s, bufs)

Examples

Tutorial: This is a tutorial on how to use the auto-scheduler statically or auto-tune it.
Schedule a whole network: This example is adapted from #2498. It is a LeNet-like convolutional neural network written purely in TVM (without graph IR). The auto-scheduler also provides basic operator fusion for it. Right now we can only run the forward pass. I am working on fixing the backward pass.

Performance

One reachable performance goal is to replace more than 90% of the schedule code in the existing TOPI with this auto-scheduler. I haven't done the experiments, but I believe the generated templates can cover the existing search space for most operators (including conv2d, reduction, ...).

Another part of the goal is to provide reasonable static performance. In the "Schedule a whole network" example, for the batched forward pass, the current performance is 1.2x slower than out-of-the-box TF + Keras, and 10x faster than a naive schedule (fuse and parallelize outer loops) on an Intel i7-8750H. For static usage, the inputs to the auto-scheduler are parameters for heuristic rules and hardware configurations. We will gather all inputs into a global config, so users can still do some quick "tuning".

Todo List

[ ] Performance test and improvement to cover more than 90% of the schedule code in TOPI: Improve the heuristic rules to provide better static performance, and run tests to make sure we cover the search space of existing templates.
[ ] Improve tuning speed: The current implementation does analysis and generates the template on the fly, which is expensive and redundant during batched tuning. We should decouple template generation from template tuning, and explicitly cache the template.
[ ] (long-term) Move all autotvm-related code to C++.
[ ] Improve loop partition to better handle partial tiles and vectorization.

kevinthesun commented 5 years ago

Thank you for opening this RFC! I have a question regarding the user API. Is the hardware information needed for autotvm.AutoSchedulerOptions(**kwargs) predefined for different hardware architectures? If so, how much more information does a user need to provide to differentiate between minor types of the same device target, such as Intel Xeon Platinum vs Xeon Haswell, or NVIDIA K80 vs V100? Today we have a single template for minor device types. Will the auto-scheduler provide different templates?

jroesch commented 5 years ago

@merrymercy How much work is there per backend? I'm looking over the code now and will follow up with more questions later.

yzhliu commented 5 years ago

@merrymercy Could you elaborate a bit on the 4 types (simple reduction, complex reduction, direct compute, and location-tunable compute)? Also, it would be helpful if you could give an example of what the DAG looks like.

tmoreau89 commented 5 years ago

Thanks @merrymercy, this is really awesome work. I second Jared's comment on work involved in adding a backend. I'd be happy to chat some more about how one would add automated compilation to different hardware accelerators including VTA.

merrymercy commented 5 years ago

@kevinthesun The hardware parameters for the auto-scheduler are very coarse-grained. These parameters are mostly used in static scheduling, so it won't even distinguish between an ARM CPU and an Intel CPU. If you want to fit a specific target device, you still need to do auto-tuning on real devices.

@jroesch Currently, it is about 500 loc per backend. I am working on improvements so it may increase.

@yzhliu

simple reduction: reduction ops that do not have reuse opportunity (e.g. softmax, argmin)
complex reduction: reduction ops that have reuse opportunity (e.g. matmul, conv2d)
direct compute: broadcast, elemwise, stencil computation, (e.g. relu, add)
location-tunable compute: the same as above. The difference is that direct compute is computed at root, while location-tunable compute can be computed at other nodes to increase locality.
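As a rough illustration of the four categories above, the dispatch could look like the following sketch. The `StageInfo` fields and the `classify` function are assumptions made for this example; the actual analysis on the branch works on TVM schedule stages, and the "bias_add" entry is only a made-up example of a movable stage.

```python
from dataclasses import dataclass

@dataclass
class StageInfo:
    """Hypothetical summary of one node in the compute DAG."""
    name: str
    has_reduce_axis: bool   # the op reduces over at least one axis
    has_data_reuse: bool    # the same inputs are read by many outputs
    movable: bool = False   # may be computed at another node instead of at root

def classify(stage):
    """Map a stage to one of the four meta template types."""
    if stage.has_reduce_axis:
        return "complex_reduction" if stage.has_data_reuse else "simple_reduction"
    return "location_tunable_compute" if stage.movable else "direct_compute"

stages = [
    StageInfo("softmax", has_reduce_axis=True, has_data_reuse=False),
    StageInfo("matmul", has_reduce_axis=True, has_data_reuse=True),
    StageInfo("relu", has_reduce_axis=False, has_data_reuse=False),
    StageInfo("bias_add", has_reduce_axis=False, has_data_reuse=False, movable=True),
]
labels = {s.name: classify(s) for s in stages}
```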

@tmoreau89 This is doable. The problem with accelerators is that if we want the auto-scheduler to take in a hardware-independent description, then we need a special pack pass to transform the layout.

jroesch commented 5 years ago

@merrymercy I'm less interested in LOC and more how much conceptual burden there is. My question is more: What are the key pieces that make up a backend description?

I looked over the code but was at SysML and have two deadlines this week so I haven't had a chance to really look it over. Look forward to landing this stuff.

One idea I've been thinking about is a combined TVM + Relay language where we can auto-extract chunks that can be lowered to the compute language, auto-schedule, then auto-tune for end-to-end perf.

kevinthesun commented 5 years ago

@merrymercy The auto-scheduler will create another search space consisting of schedule templates. For a given set of hardware parameters, it will try various schedule templates and, for each template, do some auto-tuning on a real device. This means that for each minor device type, we need to do all these steps. Do I understand it correctly?

yzhliu commented 5 years ago

@merrymercy Do you think this analysis design can be easily extended to work on the TVM Tensor AST (HalideIR) instead of ScheduleStage? Not urgent, but I think eventually we will make schedule primitives work on HalideIR, so that we can unify the underlying data structure of schedules and other passes.

tqchen commented 5 years ago

Good discussions. I think in general we can move to summarize the common patterns and make things work for specific hardware backends. As for the point brought up by @yzhliu (unifying schedule with pass), eventually ScheduleStage itself (or another IR structure) can be viewed as a dialect of the IR, and we can do so after we push for such unification.

merrymercy commented 5 years ago

@jroesch There is no easy description for a backend. Currently these meta-templates are mainly based on a summary of the existing hand-written schedule code in TOPI, so adding a new backend is still hard. What can be reused is the classification of compute type.

@kevinthesun There is only one template for one specific op. The auto-scheduler first creates this template. Then, for static usage, it fills the knobs in the template according to hardware parameters. The API example shown above falls into this category. For tuning usage, the auto-scheduler won't use hardware parameters. Instead, it relies on real tuning. In this case, you need to explicitly create autotvm.Task and autotvm.Tuner, as we do currently. An example is shown in the tutorial.

@yzhliu The tvm.compute dsl is much easier to analyze than general Halide IR, because of its clean dependency relations and simple loop structures.

eqy commented 5 years ago

Minor question: do we consider "injective" as a special case of "simple reduction?"

eqy commented 5 years ago

@merrymercy Do you think that this is a good time to also make schedules serializable/package them with autotvm style configs? In the past we have had issues where we did not want to merge in changes to schedules because they would break compatibility with tophub, and now it seems that the variety of schedules may also change quickly as auto-schedule is changed. Instead of forcing schedules to be schedule, we can maybe side-step this by packaging schedules together with autotvm configs.

merrymercy commented 5 years ago

@eqy "injective" is considered "direct compute". Typically they will be inlined.

Serializable Template + Serializable Config seems to be a good direction to go.
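One possible shape for such a package is sketched here with plain JSON. The record layout, the "version" field, and the knob names are all hypothetical; this is not the autotvm log format, only an illustration of storing a template next to the config chosen for it.

```python
import json

def save_record(template, config):
    """Serialize a template together with the config chosen for it."""
    return json.dumps({"template": template, "config": config}, sort_keys=True)

def load_record(text):
    """Restore the pair and check that the config fits the template's knobs."""
    record = json.loads(text)
    template, config = record["template"], record["config"]
    for knob, value in config.items():
        if value not in template["knobs"].get(knob, []):
            raise ValueError(f"config value {value!r} is not valid for knob {knob!r}")
    return template, config

# Hypothetical template and config for illustration only.
template = {"op": "conv2d", "version": 1,
            "knobs": {"tile_size": [4, 8, 16], "unroll": [0, 1]}}
config = {"tile_size": 8, "unroll": 1}

text = save_record(template, config)
restored_template, restored_config = load_record(text)
```

Because the template travels with the config, a later change to the template generator would not invalidate previously stored records.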

yangjunpro commented 4 years ago

@merrymercy Hi Lianmin,

Thanks for the nice proposal. May I know the latest progress of the auto-scheduling work? It looks like there hasn't been a status update for a long time.

Regards Jun

hello-hzb commented 4 years ago

@merrymercy Hi Zheng, I have been paying attention to your auto-scheduler work for a few days. There has been no update for a few months. How is it going these days? Why don't you merge the auto-scheduler into the master branch of TVM?

merrymercy commented 4 years ago

Hi @yangjunpro @hello-hzb , This project has been suspended for several months. I won't continue my work on the original branch. However, the push for an auto-scheduler is still interesting to a lot of people. I might work on auto-scheduler again with some Berkeley students. We'd like to try different approaches, so we won't start from my old branch.

yzhliu commented 4 years ago

@merrymercy Would you mind summarizing the drawbacks of the original implementation, so we can learn from it?

yangjunpro commented 4 years ago

Quoting @merrymercy: "Hi @yangjunpro @hello-hzb , This project has been suspended for several months. I won't continue my work on the original branch. However, the push for an auto-scheduler is still interesting to a lot of people. I might work on auto-scheduler again with some Berkeley students. We'd like to try different approaches, so we won't start from my old branch."

Sure, I think Zhao has already contacted you and also involved two of my colleagues, Minmin and Chenfan. Looking forward to further collaboration.

tqchen commented 4 years ago

close as per ansor update

tqchen commented 4 years ago

https://discuss.tvm.ai/t/rfc-ansor-an-auto-scheduler-for-tvm-autotvm-v2-0/7005

apache / tvm

[RFC][AUTOTVM] Auto-Schedule from Compute Declaration #2954