apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[RFC][AutoTVM] Selective Tuning #4188

Closed comaniac closed 4 years ago

comaniac commented 4 years ago

Overview

When a user wants to use AutoTVM to tune a model, she often lets AutoTVM tune every task extracted from the model sequentially. Assuming each task requires 1 hour or so, tuning a model with 10 to 100+ tasks requires days. This RFC proposes a lightweight solution to reduce tuning time for a model but still achieve a decent model performance.

We achieve this goal by proposing a selective tuning mechanism for AutoTVM. It is mainly composed of two parts: task selection and schedule sharing.

Task Selection

Task selection picks a few representative tasks from all tasks in a model. The idea behind it is that if a schedule works well for one conv2d on a certain platform, it may also work for another conv2d on the same platform. This implies that we can identify a few representative tasks and apply their top schedules to the other tasks, saving the tuning time of those other tasks.

The task selection algorithm is based on the similarity rate (SM), defined as the ratio of tuning space overlap between two tasks. By computing the similarity rate between every pair of tasks, we obtain a pairwise similarity matrix (PSM). From the PSM, we build a graph with tasks as nodes and SM values as edge weights, and then find all maximum cliques in the graph to cluster the tasks. For each cluster, we select the task with the highest sum of edge weights to all other tasks in the same cluster as the representative.
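To make the clustering step concrete, here is a minimal sketch of how the PSM and clique-based grouping could be computed with networkx. The similarity_rate helper, its use of config_space.space_map, and the 0.8 edge threshold are illustrative assumptions rather than the exact implementation in the PR.

import itertools
import networkx as nx

def similarity_rate(task_a, task_b):
    # Hypothetical similarity metric: ratio of overlapping knob entries
    # between the two tasks' tuning spaces (the real metric may differ).
    space_a = set(str(e) for e in task_a.config_space.space_map.items())
    space_b = set(str(e) for e in task_b.config_space.space_map.items())
    if not space_a or not space_b:
        return 0.0
    return len(space_a & space_b) / max(len(space_a), len(space_b))

def cluster_tasks(tasks, threshold=0.8):
    # Build a graph with tasks as nodes and SM values as edge weights.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(tasks)))
    for i, j in itertools.combinations(range(len(tasks)), 2):
        sm = similarity_rate(tasks[i], tasks[j])
        if sm >= threshold:
            graph.add_edge(i, j, weight=sm)

    # Cluster tasks by the cliques returned from nx.find_cliques; within each
    # clique, pick the task with the highest total edge weight to the rest of
    # the clique as the representative.
    representatives = {}
    for clique in nx.find_cliques(graph):
        rep = max(clique, key=lambda n: sum(
            graph[n][m]["weight"] for m in clique if m != n))
        for n in clique:
            representatives[n] = rep
    # A mark_depend-like step would then set tasks[i].depend = tasks[rep]
    # for every entry in `representatives` (hypothetical wiring).
    return representatives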

The API for using task selection is straightforward. A user only needs to call mark_depend after tasks have been created:

tasks = autotvm.task.extract_from_program(...)
autotvm.task.mark_depend(tasks)

After the function call, each unselected task will have an attribute depend referring to its selected (representative) task.

Schedule Sharing

We follow the current AutoTVM process to tune the representative tasks first, and then tune the other tasks. When tuning the other tasks, a user can specify how many schedules should be shared from the dependent task. For example, top3 means sharing the best three schedules, while 5% means sharing the best 5% of schedules.

The following example tunes a set of tasks when some of them have been selected:

# `tuner`, `n_trial`, `early_stopping`, `measure_option`, and `log_file` are
# assumed to be defined as in a standard AutoTVM tuning script, and
# `create_tuner` constructs the tuner object (e.g., an XGBTuner).
sel_tasks = [t for t in tasks if t.depend == t]
other_tasks = [t for t in tasks if t.depend != t]

# the first pass tunes the representative (selected) tasks
for i, tsk in enumerate(reversed(sel_tasks)):
    prefix = "[Sel Task %2d/%2d] " % (i + 1, len(sel_tasks))
    tuner_obj = create_tuner(tuner)

    # do tuning
    curr_trial = min(n_trial, len(tsk.config_space))
    tuner_obj.tune(n_trial=curr_trial,
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(curr_trial, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])

# the second pass tunes the remaining (unselected) tasks
for i, tsk in enumerate(reversed(other_tasks)):
    prefix = "[Other Task %2d/%2d] " % (i + 1, len(other_tasks))
    tuner_obj = create_tuner(tuner)

    # do tuning; with depend_mode='top10' the effective search space is only
    # the 10 schedules shared from the dependent task
    curr_trial = n_trial
    tuner_obj.tune(n_trial=curr_trial,
                   depend_mode='top10', # share the 10 best schedules
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(10, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])

In the above example, the second loop tunes the unselected tasks. Since the tuned schedules are cached in the selected tasks, the tuner uses those schedules as the tuning space, whose size is 10 in this example.
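As a rough illustration of where those shared candidates could come from, the sketch below pulls the best k measured configurations of a dependent task out of a tuning log. The helper name and the log-scanning approach are assumptions made for illustration; the actual implementation caches the tuned schedules on the selected Task object rather than re-reading the log.

import numpy as np
from tvm import autotvm

def top_k_configs(log_file, task, k=10):
    # Collect successful records for `task` from the log and return the k
    # configurations with the lowest mean measured latency.
    records = []
    for inp, res in autotvm.record.load_from_file(log_file):
        if inp.task.workload == task.workload and res.error_no == 0:
            records.append((np.mean(res.costs), inp.config))
    records.sort(key=lambda r: r[0])
    return [cfg for _, cfg in records[:k]]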

There are two important advantages to this mechanism:

  1. It is fully compatible with the current AutoTVM use case. Existing AutoTVM scripts still work without any issue.

  2. Even if an unselected task fails to achieve decent performance with the shared schedules, users can still re-tune it following the normal AutoTVM process. This does not hurt because the time spent on the shared schedules is only minutes.

Results

Here are the experimental results of using selective tuning for a set of models on an EC2 P3 instance. We evaluated 7 models from the Gluon CV model zoo; the results are shown in the table below. Selected tasks were tuned for 3,000 trials. Selective tuning achieves on average 84% of the baseline performance while using only 28% of the tuning time. We believe the cases where it outperforms the original AutoTVM occur because the generated tuning space is not general enough to cover better schedules with non-factor tile sizes.

| Model | Tuning Time w/o Selection (mins) | Perf. w/o Selection (ms) | Tuning Time w. Selection (mins) | Perf. w. Selection (ms) | Used Time | Achieved Perf. |
|---|---|---|---|---|---|---|
| MobileNet V2 1.0 | 1185 | 0.74 | 404 | 0.78 | 34% | 95% |
| ResNet 50 V1 | 1666 | 2.27 | 358 | 3.7 | 21% | 61% |
| VGG 19 BN | 479 | 5.08 | 169 | 6.36 | 35% | 80% |
| SqueezeNet 1.1 | 574 | 0.54 | 167 | 0.5 | 29% | 108% |
| DenseNet 121 | 2670 | 2.99 | 377 | 3.02 | 14% | 99% |
| Yolo3 MobileNet1.0 voc | 2784 | 5.4 | 774 | 7.16 | 28% | 75% |
| SSD512 ResNet50 V1 voc | 3426 | 8.47 | 1150 | 5.65 | 34% | 67% |

Note

  1. The task selection implementation uses the Python graph package networkx, which introduces a new dependency for AutoTVM.

  2. This mechanism works well on GPU but not on CPU, because the NCHWc layout limits the choices of C.

Tasks

cc @kevinthesun @icemelon9 @tqchen

kevinthesun commented 4 years ago

Thanks for this proposal! For YOLO and SSD, does the performance advantage mainly come from the larger tuning space? If so, I suggest we also do full auto-tuning with the expanded tuning space, so that we have an apples-to-apples comparison and a clearer picture of the tuning time vs. performance trade-off. It would also be interesting to see how this method can be used to select representative schedules for dynamic shape workloads.

comaniac commented 4 years ago

The leftmost two columns in the table are the total tuning time of 2,000 trials per op and the final inference latency, respectively. With the XGBoost tuner, I suppose the result after 2,000 trials is sufficient to illustrate the usability of selective tuning. Comparing with the full auto-tuning result would definitely be interesting, but it is time-consuming since the tuning space is on the order of 10^5 for each op. I'll study the tuning space coverage in the near future for sure, along with dynamic shape codegen support. (In fact, this RFC is more of a side application of the dynamic shape codegen investigation 😄)

tqchen commented 4 years ago

Thanks for the proposal. I think overall this can be categorized as scheduling of the tasks. Also cc @eqy. It would be nice if we can come up with abstractions for a generic task schedule strategy and add it as a modular plugin.

The networkx dependency is something worth discussing. Ideally we want to minimize the dependencies of the core autotvm package.

comaniac commented 4 years ago

@tqchen Thanks for the comments, and you're right. One important message behind this investigation is that a schedule can be shared across ops with different attributes.

For the networkx dependency, I have the same concern. I used it to build a graph and find maximum cliques in it. Unfortunately, I didn't find any existing TVM dependency that achieves the desired functionality. On the other hand, we could also use a different algorithm such as K-Means to cluster the tasks, so that the existing dependency, sklearn, would be sufficient, although the result of using K-Means is not as good as the one I showed. I would be happy to take any suggestions.
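For reference, a K-Means clustering variant along these lines might look like the following sketch; the featurization of a task's workload and the fixed cluster count are assumptions made purely for illustration, not the variant that was actually evaluated.

import numpy as np
from sklearn.cluster import KMeans

def workload_features(workload, length=16):
    # Flatten the nested workload tuple and keep only the numeric entries,
    # padding or truncating to a fixed-length feature vector (illustrative).
    nums = []
    def walk(obj):
        if isinstance(obj, (list, tuple)):
            for item in obj:
                walk(item)
        elif isinstance(obj, (int, float)):
            nums.append(float(obj))
    walk(workload)
    return (nums + [0.0] * length)[:length]

def cluster_tasks_kmeans(tasks, n_clusters=4):
    # Group tasks by K-Means on their workload features; each cluster would
    # then pick a representative task, as in the clique-based approach.
    feats = np.array([workload_features(t.workload) for t in tasks])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)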

tqchen commented 4 years ago

If we make the code modular enough via a high-level scheduling (planning) API, the networkx dependency could be an optional one for those who want to use the component, which might help alleviate the issue.

comaniac commented 4 years ago

Thanks for the suggestion. networkx is now imported only when the selective tuning API is invoked. The implementation is here. Is this what you meant?

tqchen commented 4 years ago

Yah, if that is the case, I think we should be fine for now. Let us focus the discussion on the high-level API design of the order schedule (planning) API so that it works for other generic strategies.

comaniac commented 4 years ago

All tasks are done, along with a tutorial. @tqchen @kevinthesun @eqy @icemelon9, please review the PR, and we can discuss if you have any suggestions for the API or design.

Thanks.

tqchen commented 4 years ago

Thanks @comaniac.

It would be great if people could discuss the API choices a bit more. I think we achieved the proposed goals.

Specifically, here are some "complaints" I have from quickly looking into the API:

I do not have time to think too deeply about this, but perhaps others can do a review and discuss the possible candidates?

kevinthesun commented 4 years ago

I agree that we can think a bit more about the names "select" and "depend". These two names have rich semantics in different contexts. Maybe we would like API names more tightly bound to autotvm specifically.

kevinthesun commented 4 years ago

Also, after further discussion with @comaniac, we think it is worth digging a bit further into why full tuning for SSD and YOLO only achieves 60%-70% of the performance of selective tuning. Maybe increasing the number of trials for full tuning can give us a clearer picture.

comaniac commented 4 years ago

While I am testing whether tuning more trials makes the result more intuitive, I would first like to ask for feedback about the naming. Here are my thoughts based on Tianqi's comments.

@kevinthesun @eqy @icemelon9 please let me know what you guys think about these names, and you're welcome to propose better ones. Thanks.

comaniac commented 4 years ago

@kevinthesun Your assumption was correct. After increasing the trial number, selective tuning achieves 61% (ResNet 50), 75% (YOLO3), and 67% (SSD) of the all-tuning version. This is also a good motivation for improving the search algorithm, but we can open another RFC for that. For now, I think we should focus on finishing selective tuning. Thanks.

yzhliu commented 4 years ago

Can the user specify the similarity distance function?

Would it be better to call it "share" rather than "ref"? "ref" reminds me of "reference" in programming languages.

comaniac commented 4 years ago

Thanks for the comments! Please see my responses and let me know what you think.

  1. If we define a distance function as a function that returns the distance (in any metric) between any two tasks (callback(tasks: List[Task]) -> List[List[float]]), it is still unclear how we should group them.

  2. If we directly let the user pass a callback function to group tasks (callback(tasks: List[Task]) -> None), it functions exactly the same as mark_depend, so it seems redundant.

Based on these concerns, I prefer the current solution, and we can revisit it in the future when we have a more concrete idea about task passes.
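For concreteness, the two hypothetical callback shapes compared above might be typed roughly as follows; neither exists in the current PR, and the names are made up for illustration.

from typing import Callable, List
from tvm.autotvm.task import Task

# Option 1 (hypothetical): a pure distance callback. AutoTVM would still need
# its own policy for turning the returned matrix into task clusters.
DistanceFn = Callable[[List[Task]], List[List[float]]]

# Option 2 (hypothetical): a grouping callback that annotates tasks directly,
# which is functionally the same as what mark_depend already does.
GroupFn = Callable[[List[Task]], None]

def mark_depend_with(tasks: List[Task], group: GroupFn) -> None:
    # Hypothetical variant of mark_depend that delegates grouping entirely
    # to the user callback, illustrating why it would be redundant.
    group(tasks)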

kevinthesun commented 4 years ago

@comaniac Thank you for these data. Can you update them in the main thread?

comaniac commented 4 years ago

Updated.

Edwardmark commented 4 years ago

@comaniac Hi, for the time cost of the detection networks, did you include the NMS post-processing step? And how did you implement NMS? Using other frameworks or TVM tensor operations? Thank you very much for your great work. Best, Edward

comaniac commented 4 years ago

The tuning time posted here is the total time of tuning each task. In AutoTVM, one task means one op. Since we don't have a tunable template for NMS yet, the tuning time should include it. For the network, I directly get the definition from the Gluon CV model zoo.

tqchen commented 4 years ago

Closing for now as we are moving towards a new auto-scheduler.