apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[RFC][AutoTVM] Selective Tuning #4188

Closed comaniac closed 4 years ago

comaniac commented 4 years ago

Overview

When a user wants to use AutoTVM to tune a model, she often lets AutoTVM tune every task extracted from the model sequentially. Assuming each task requires 1 hour or so, tuning a model with 10 to 100+ tasks requires days. This RFC proposes a lightweight solution to reduce tuning time for a model but still achieve a decent model performance.

We achieve this goal by proposing a selective tuning mechanism for AutoTVM. It is mainly composed of two parts: task selection and schedule sharing.

Task Selection

Task selection picks a few representative tasks from all tasks in a model. The idea behind it is that if a schedule works well for one conv2d on a certain platform, it may also work for another conv2d on the same platform. This implies that we can identify a few representative tasks and apply their top schedules to the other tasks, saving the tuning time of those other tasks.

The task selection algorithm is based on the similarity rate (SM), defined as the ratio of tuning space overlap between two tasks. By computing the similarity rate between every pair of tasks, we obtain a pairwise similarity matrix (PSM). From the PSM, we build a graph with tasks as nodes and SM values as edge weights, and then find all maximum cliques in the graph to cluster the tasks. For each cluster, we select the task with the highest sum of edge weights to all other tasks in the same cluster as the representative.
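To make the clustering step concrete, here is a minimal sketch of how the PSM and clique-based grouping could be computed with networkx. The similarity_rate helper, its use of config_space.space_map, and the 0.8 edge threshold are illustrative assumptions rather than the exact implementation in the PR.

import itertools
import networkx as nx

def similarity_rate(task_a, task_b):
    # Hypothetical similarity metric: ratio of overlapping knob entries
    # between the two tasks' tuning spaces (the real metric may differ).
    space_a = set(str(e) for e in task_a.config_space.space_map.items())
    space_b = set(str(e) for e in task_b.config_space.space_map.items())
    if not space_a or not space_b:
        return 0.0
    return len(space_a & space_b) / max(len(space_a), len(space_b))

def cluster_tasks(tasks, threshold=0.8):
    # Build a graph with tasks as nodes and SM values as edge weights.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(tasks)))
    for i, j in itertools.combinations(range(len(tasks)), 2):
        sm = similarity_rate(tasks[i], tasks[j])
        if sm >= threshold:
            graph.add_edge(i, j, weight=sm)

    # Cluster tasks by the cliques returned from nx.find_cliques; within each
    # clique, pick the task with the highest total edge weight to the rest of
    # the clique as the representative.
    representatives = {}
    for clique in nx.find_cliques(graph):
        rep = max(clique, key=lambda n: sum(
            graph[n][m]["weight"] for m in clique if m != n))
        for n in clique:
            representatives[n] = rep
    # A mark_depend-like step would then set tasks[i].depend = tasks[rep]
    # for every entry in `representatives` (hypothetical wiring).
    return representatives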

The API for using task selection is straightforward. A user only needs to call mark_depend after tasks have been created:

tasks = autotvm.task.extract_from_program(...)
autotvm.task.mark_depend(tasks)

After the function call, each unselected task will have an attribute depend referring to its selected (representative) task.

Schedule Sharing

We follow the current AutoTVM process to tune the representative tasks first, and then tune the other tasks. When tuning the other tasks, a user can specify how many schedules should be shared from the dependent task. For example, top3 means sharing the best three schedules, while 5% means sharing the best 5% of schedules.

The following example tunes a set of tasks when some of them have been selected:

# `tuner`, `n_trial`, `early_stopping`, `measure_option`, and `log_file` are
# assumed to be defined as in a standard AutoTVM tuning script, and
# `create_tuner` constructs the tuner object (e.g., an XGBTuner).
sel_tasks = [t for t in tasks if t.depend == t]
other_tasks = [t for t in tasks if t.depend != t]

# the first pass tunes the representative (selected) tasks
for i, tsk in enumerate(reversed(sel_tasks)):
    prefix = "[Sel Task %2d/%2d] " % (i + 1, len(sel_tasks))
    tuner_obj = create_tuner(tuner)

    # do tuning
    curr_trial = min(n_trial, len(tsk.config_space))
    tuner_obj.tune(n_trial=curr_trial,
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(curr_trial, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])

# the second pass tunes the remaining (unselected) tasks
for i, tsk in enumerate(reversed(other_tasks)):
    prefix = "[Other Task %2d/%2d] " % (i + 1, len(other_tasks))
    tuner_obj = create_tuner(tuner)

    # do tuning; with depend_mode='top10' the effective search space is only
    # the 10 schedules shared from the dependent task
    curr_trial = n_trial
    tuner_obj.tune(n_trial=curr_trial,
                   depend_mode='top10', # share the 10 best schedules
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(10, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])

In the above example, the second loop tunes the unselected tasks. Since the tuned schedules are cached in the selected tasks, the tuner uses those schedules as the tuning space, whose size is 10 in this example.
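As a rough illustration of where those shared candidates could come from, the sketch below pulls the best k measured configurations of a dependent task out of a tuning log. The helper name and the log-scanning approach are assumptions made for illustration; the actual implementation caches the tuned schedules on the selected Task object rather than re-reading the log.

import numpy as np
from tvm import autotvm

def top_k_configs(log_file, task, k=10):
    # Collect successful records for `task` from the log and return the k
    # configurations with the lowest mean measured latency.
    records = []
    for inp, res in autotvm.record.load_from_file(log_file):
        if inp.task.workload == task.workload and res.error_no == 0:
            records.append((np.mean(res.costs), inp.config))
    records.sort(key=lambda r: r[0])
    return [cfg for _, cfg in records[:k]]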

There are two important advantages to this mechanism:

  1. It is fully compatible with the current AutoTVM use case. Existing AutoTVM scripts still work without any issue.

  2. Even if an unselected task fails to achieve decent performance with the shared schedules, users can still re-tune it following the normal AutoTVM process. This does not hurt because the time spent on the shared schedules is only minutes.

Results

Here are the experimental results of using selective tuning for a set of models on an EC2 P3 instance. We evaluated 7 models from the Gluon CV model zoo; the results are shown in the table below. Selected tasks were tuned for 3,000 trials. Selective tuning achieves on average 84% of the baseline performance while using only 28% of the tuning time. We believe the cases where it outperforms the original AutoTVM occur because the generated tuning space is not general enough to cover better schedules with non-factor tile sizes.

| Model | Tuning Time w/o Selection (mins) | Perf. w/o Selection (ms) | Tuning Time w. Selection (mins) | Perf. w. Selection (ms) | Used Time | Achieved Perf. |
|---|---|---|---|---|---|---|
| MobileNet V2 1.0 | 1185 | 0.74 | 404 | 0.78 | 34% | 95% |
| ResNet 50 V1 | 1666 | 2.27 | 358 | 3.7 | 21% | 61% |
| VGG 19 BN | 479 | 5.08 | 169 | 6.36 | 35% | 80% |
| SqueezeNet 1.1 | 574 | 0.54 | 167 | 0.5 | 29% | 108% |
| DenseNet 121 | 2670 | 2.99 | 377 | 3.02 | 14% | 99% |
| Yolo3 MobileNet1.0 voc | 2784 | 5.4 | 774 | 7.16 | 28% | 75% |
| SSD512 ResNet50 V1 voc | 3426 | 8.47 | 1150 | 5.65 | 34% | 67% |

Note

  1. The task selection implementation uses the Python graph package networkx, which introduces a new dependency for AutoTVM.

  2. This mechanism works well on GPU but not on CPU, because the NCHWc layout limits the choices of C.

Tasks

cc @kevinthesun @icemelon9 @tqchen

kevinthesun commented 4 years ago

Thanks for this proposal! For YOLO and SSD, does the performance advantage mainly come from the larger tuning space? If so, I suggest we also do full auto-tuning with the expanded tuning space, so that we have an apples-to-apples comparison and a clearer picture of the tuning time vs. performance trade-off. It would also be interesting to see how this method can be used to select representative schedules for dynamic shape workloads.

comaniac commented 4 years ago

The leftmost two columns in the table are the total tuning time of 2,000 trials per op and the final inference latency, respectively. With the XGBoost tuner, I suppose the result after 2,000 trials is sufficient to illustrate the usability of selective tuning. Comparing with the full auto-tuning result would definitely be interesting, but it is time-consuming since the tuning space is on the order of 10^5 for each op. I'll study the tuning space coverage in the near future for sure, along with dynamic shape codegen support. (In fact, this RFC is more of a side application of the dynamic shape codegen investigation 😄)

tqchen commented 4 years ago

Thanks for the proposal. I think overall this can be categorized as scheduling of the tasks. Also cc @eqy. It would be nice if we can come up with abstractions for a generic task schedule strategy and add it as a modular plugin.

The networkx dependency is something worth discussing. Ideally we want to minimize the dependencies of the core autotvm package.

comaniac commented 4 years ago

@tqchen Thanks for the comments, and you're right. One important message behind this investigation is that a schedule can be shared across ops with different attributes.

For the networkx dependency, I have the same concern. I used it to build a graph and find maximum cliques in it. Unfortunately, I didn't find any existing TVM dependency that achieves the desired functionality. On the other hand, we could also use a different algorithm such as K-Means to cluster the tasks, so that the existing dependency, sklearn, would be sufficient, although the result of using K-Means is not as good as the one I showed. I would be happy to take any suggestions.
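For reference, a K-Means clustering variant along these lines might look like the following sketch; the featurization of a task's workload and the fixed cluster count are assumptions made purely for illustration, not the variant that was actually evaluated.

import numpy as np
from sklearn.cluster import KMeans

def workload_features(workload, length=16):
    # Flatten the nested workload tuple and keep only the numeric entries,
    # padding or truncating to a fixed-length feature vector (illustrative).
    nums = []
    def walk(obj):
        if isinstance(obj, (list, tuple)):
            for item in obj:
                walk(item)
        elif isinstance(obj, (int, float)):
            nums.append(float(obj))
    walk(workload)
    return (nums + [0.0] * length)[:length]

def cluster_tasks_kmeans(tasks, n_clusters=4):
    # Group tasks by K-Means on their workload features; each cluster would
    # then pick a representative task, as in the clique-based approach.
    feats = np.array([workload_features(t.workload) for t in tasks])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)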

tqchen commented 4 years ago

If we make the code modular enough via a high-level scheduling (planning) API, the networkx dependency could be an optional one for those who want to use the component, which might help alleviate the issue.

comaniac commented 4 years ago

Thanks for the suggestion. networkx is now imported only when the selective tuning API is invoked. The implementation is here. Is this what you meant?

tqchen commented 4 years ago

Yah, if that is the case, I think we should be fine for now. Let us focus the discussion on the high-level API design of the order schedule (planning) API so that it works for other generic strategies.

comaniac commented 4 years ago

All tasks are done, along with a tutorial. @tqchen @kevinthesun @eqy @icemelon9, please review the PR, and we can discuss if you have any suggestions for the API or design.

Thanks.

tqchen commented 4 years ago

Thanks @comaniac.

It would be great if people could discuss the API choices a bit more. I think we achieved the proposed goals.

Specifically, here are some "complaints" I have from quickly looking into the API:

I do not have time to think too deeply about this, but perhaps others can do a review and discuss the possible candidates?

kevinthesun commented 4 years ago

I agree that we can think a bit more about the names "select" and "depend". These two names have rich semantics in different contexts. Maybe we would like API names more tightly bound to autotvm specifically.

kevinthesun commented 4 years ago

Also, after further discussion with @comaniac, we think it is worth digging a bit further into why full tuning for SSD and YOLO only achieves 60%-70% of the performance of selective tuning. Maybe increasing the number of trials for full tuning can give us a clearer picture.

comaniac commented 4 years ago

While I am testing whether tuning more trials makes the result more intuitive, I would first like to ask for feedback about the naming. Here are my thoughts based on Tianqi's comments.

@kevinthesun @eqy @icemelon9 please let me know what you guys think about these names, and you're welcome to propose better ones. Thanks.

comaniac commented 4 years ago

@kevinthesun Your assumption was correct. After increasing the trial number, selective tuning achieves 61% (ResNet 50), 75% (YOLO3), and 67% (SSD) of the all-tuning version. This is also a good motivation for improving the search algorithm, but we can open another RFC for that. For now, I think we should focus on finishing selective tuning. Thanks.

yzhliu commented 4 years ago

Can the user specify the similarity distance function?

Would it be better to call it "share" rather than "ref"? "ref" reminds me of "reference" in programming languages.

comaniac commented 4 years ago

Thanks for the comments! Please see my responses and let me know what you think.

  1. If we define a distance function as a function that returns the distance (in any metric) between any two tasks (callback(tasks: List[Task]) -> List[List[float]]), it is still unclear how we should group them.

  2. If we directly let the user pass a callback function to group tasks (callback(tasks: List[Task]) -> None), it functions exactly the same as mark_depend, so it seems redundant.

Based on these concerns, I prefer the current solution, and we can revisit it in the future when we have a more concrete idea about task passes.
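For concreteness, the two hypothetical callback shapes compared above might be typed roughly as follows; neither exists in the current PR, and the names are made up for illustration.

from typing import Callable, List
from tvm.autotvm.task import Task

# Option 1 (hypothetical): a pure distance callback. AutoTVM would still need
# its own policy for turning the returned matrix into task clusters.
DistanceFn = Callable[[List[Task]], List[List[float]]]

# Option 2 (hypothetical): a grouping callback that annotates tasks directly,
# which is functionally the same as what mark_depend already does.
GroupFn = Callable[[List[Task]], None]

def mark_depend_with(tasks: List[Task], group: GroupFn) -> None:
    # Hypothetical variant of mark_depend that delegates grouping entirely
    # to the user callback, illustrating why it would be redundant.
    group(tasks)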

kevinthesun commented 4 years ago

@comaniac Thank you for these data. Can you update them in the main thread?

comaniac commented 4 years ago

Updated.

Edwardmark commented 4 years ago

@comaniac Hi, for the time cost of the detection networks, did you include the NMS post-processing step? And how did you implement NMS? Using other frameworks or TVM tensor operations? Thank you very much for your great work. Best, Edward

comaniac commented 4 years ago

The tuning time posted here is the total time of tuning each task. In AutoTVM, one task means one op. Since we don't have a tunable template for NMS yet, the tuning time should include it. For the network, I directly get the definition from the Gluon CV model zoo.

tqchen commented 4 years ago

Closing for now as we are moving towards a new auto-scheduler.