apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Bug] [MetaSchedule] Mismatched tasks due to wrong enumeration of constants in module #15548

Open PhilippvK opened 11 months ago

PhilippvK commented 11 months ago

In e030b14, the task extraction mechanism for MetaSchedule was adapted to use a Module-scope NameSupply for naming constants. However, because of the use of PostOrderVisit in https://github.com/apache/tvm/blob/927df5966237f10978319044716d93c90bf8843c/src/relay/backend/task_extraction.cc#L80C10-L80C10, the generation of constant names seems to start with the last fused function (layer/task) in the model, while in the normal compilation flow it seems to be the other way around.

This leads to a few problems/limitations when matching tasks, regardless of which ModuleEquality implementation is used:

  1. The names of the constants in a single task depend on all the preceding layers in the model, which makes it impossible to reuse tuning results for a single layer that also exists in a different model.
  2. Due to the swapped ordering of constant names, I get mismatches when trying to use a previously tuned database on the same TFLite model.
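
Both points can be reproduced with a toy model that does not use TVM at all. The sketch below is only an illustration: `ToyNameSupply`, `name_constants`, the layer names, and the `fused_constant` prefix with its `_N` suffix scheme are hypothetical stand-ins, not the actual TVM implementation. It only shows that a single module-wide counter makes each constant's name depend on the visit order and on every layer visited before it.

```python
class ToyNameSupply:
    """Hypothetical stand-in for a module-scope name supply.

    Hands out prefix, prefix_1, prefix_2, ... in the order names are requested.
    """

    def __init__(self):
        self._next_index = {}

    def fresh(self, prefix):
        index = self._next_index.get(prefix, 0)
        self._next_index[prefix] = index + 1
        return prefix if index == 0 else f"{prefix}_{index}"


def name_constants(layers):
    """Name the constants of each layer in visit order, using one shared supply."""
    supply = ToyNameSupply()
    return {
        layer: [supply.fresh("fused_constant") for _ in range(num_constants)]
        for layer, num_constants in layers
    }


# (layer name, number of constants); all values are made up for illustration.
model_a = [("conv_0", 2), ("conv_1", 2), ("dense", 1)]
model_b = [("conv_1", 2), ("dense", 1)]  # shares conv_1 and dense with model_a

forward = name_constants(model_a)                   # first layer visited first
backward = name_constants(list(reversed(model_a)))  # last layer visited first
other_model = name_constants(model_b)

print(forward["conv_1"])      # ['fused_constant_2', 'fused_constant_3']
print(backward["conv_1"])     # ['fused_constant_1', 'fused_constant_2']
print(other_model["conv_1"])  # ['fused_constant', 'fused_constant_1']
```

The same `conv_1` layer ends up with three different sets of constant names, so any comparison that takes the names into account treats the three as different tasks.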

Expected behavior

The tuned records for the MLPerf Tiny KWS model should be used during compilation.

Actual behavior

Warning about failed lookup for tasks in JSON database:

[11:46:25] .../llvm-gen/tvm/src/relay/backend/te_compiler_cache.cc:676: Warning: Cannot find workload: tvmgen_default_fused_nn_conv2d_subtract_add_fixed_point_multiply_per_axis_add_clip_cast_7

Here is the diff between the TIR used in TVM compilation pipeline (green) vs. the one in the MetaScheduler Tuning database:

[Image: tvm_meta_bug]

The data of the NDArrays (omitted here to shorten the diff) matches (and would not even matter when using the ignore-ndarray ModuleEquality), hence the only changes in the diff are the names of the constants.
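
One way to check that two printed TIR scripts differ only in constant names is to renumber the names by first appearance before comparing. This is a minimal sketch: the `fused_constant_N` pattern, the helper `canonicalize_constant_names`, and the two one-line snippets are assumptions made for illustration and are not real TIR output; the helper works on plain text and does not look at NDArray contents.

```python
import re

# Assumed naming pattern for constants; adjust to the names seen in the real diff.
CONSTANT_NAME = re.compile(r"\bfused_constant(?:_\d+)?\b")


def canonicalize_constant_names(tir_text):
    """Rename constants to const_0, const_1, ... in order of first appearance."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"const_{len(mapping)}"
        return mapping[name]

    return CONSTANT_NAME.sub(rename, tir_text)


# Hypothetical excerpts standing in for the two sides of the diff.
compiled = "conv2d(data, fused_constant_14); add(out, fused_constant_15)"
tuned = "conv2d(data, fused_constant_2); add(out, fused_constant_3)"

print(compiled == tuned)  # False
print(canonicalize_constant_names(compiled) == canonicalize_constant_names(tuned))  # True
```

If the canonicalized texts are equal, the remaining difference between the two modules is limited to how the constants were numbered.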

Environment

Steps to reproduce

Will follow up with a script later today!

Triage

CC @masahi @mbs-octoml

cc @ibsidorenko

masahi commented 11 months ago

I don't remember the context of the PR well. But at that time I was working on MS tuning for Hexagon with link_params = True, so I don't know why I didn't hit this problem.

There is no deep reason to use PostOrderVisit there. If you can come up with a way to visit the expr in the same way that the TE compiler does (and make sure to keep them in sync), feel free to make such a change.

In principle, the order of returned tasks shouldn't matter. But it's unfortunate that things are not as straightforward as they should be.