Please leave any comments or edit this issue directly to adjust the release notes! Also see the rc0 vote thread in #12103.

Introduction

The TVM community has worked since the v0.8 release to deliver many exciting features and improvements. v0.9.0 is the first release on the new quarterly release schedule and includes many highlights, such as:

MetaSchedule's full implementation
Arm cascading scheduler for Arm Ethos(TM)-U NPUs
Collage which brings tuning to BYOC
Several microTVM improvements
New tvm.relay.build parameters: runtime= and executor=
AOT: support for the C++ runtime (with llvm and c targets only) and support for host-driven AOT in the C runtime
Hexagon RPC support
- Testing via Hexagon SDK simulator and on device via Snapdragon-based HDK boards and phones
- AOT and USMP support
- Threading
- Initial op support
MLF: support for multiple modules in a single MLF artifact
Several TIR schedule primitives and transforms including (abridged):
- schedule.transform_layout - Applies a layout transformation to a buffer as specified by an IndexMap.
- schedule.transform_block_layout - Applies a schedule transformation to a block as specified by an IndexMap.
- schedule.set_axis_separators - Sets axis separators in a buffer to lower to multi-dimensional memory (e.g. texture memory).
- transform.InjectSoftwarePipeline - Transforms annotated loop nest into a pipeline prologue, body and epilogue where producers and consumers are overlapped.
- transform.CommonSubexprElimTIR - Implements common-subexpression elimination for TIR.
- transform.InjectPTXAsyncCopy - Rewrites global to shared memory copies in CUDA as async copies when they are annotated with tir::attr::async_scope.
- transform.LowerCrossThreadReduction - Enables support for reductions across threads on GPUs.
And many more! See the list of RFCs and PRs included in v0.9.0 for a complete list, as well as the full change list.
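To make the layout-transformation entries above more concrete, here is a minimal sketch of what applying an index map to a buffer means. It is plain Python, not the TVM API: `index_map`, `transform_layout`, and the 8x3 example buffer are hypothetical names and values chosen only for illustration, and the real schedule.transform_layout primitive rewrites TIR buffers rather than Python lists.

```python
from itertools import product


def index_map(i, j):
    """Hypothetical index map: split the row axis into blocks of 4 rows."""
    return (i // 4, j, i % 4)


def transform_layout(buf, rows, cols, fmap):
    """Return a dict keyed by the remapped indices (illustration only)."""
    return {fmap(i, j): buf[i][j] for i, j in product(range(rows), range(cols))}


rows, cols = 8, 3
# Fill the original 2-D buffer with distinct values so moves are easy to track.
original = [[i * cols + j for j in range(cols)] for i in range(rows)]
transformed = transform_layout(original, rows, cols, index_map)

# Element (5, 2) of the original buffer now lives at index (1, 2, 1).
assert transformed[(1, 2, 1)] == original[5][2]
```

Because this particular map is one-to-one, every element of the original buffer has exactly one slot in the transformed layout, so no data is lost or duplicated.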

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

What's Changed

Note that this list is not comprehensive of all PRs and discussions since v0.8. Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.8.0...v0.9.0.rc0.

AOT

11208 - Calculate used memory at the callsite of primitive functions
11365 - Fix function number datatype from char to uint16_t
11091 - Enable A-Normal Form in the AOT executor
10753 - Support LLVM backend with C++ runtime
10518 - Use python temporary directory for AOT tests
10337 - BugFix of workspace calculation
10282 - [runtime] Add Metadata classes for AOTExecutor
9501 - [3/3][DeviceAPI] Wire up cpacked Device API context
9500 - [2/3][DeviceAPI] Add Hooks for Activate/Deactivate/Open/Close
9395 - [1/3][DeviceAPI] Connecting devices structure to relevant operators

BYOC

11474 - Two helper passes for external codegen using RelayToTIR custom pass machinery
11144 - Remove support for run-time linked-params from codegen
10590 - Add order to functions in C Codegen
11638 - [DNNL][CBLAS]Unifies all MKLDNN/DNNL to DNNL
11619 - RelayToTIR custom codegen passes can still depend on dynamic shape functions
DNNL - #11902, #11642, #11513, #11571, #11560, #11345, #11111, #10837, #10421, #9995, #9797
TensorRT - #11923, #11203, #10759, #10772, #10388
CMSIS-NN - #11732, #11625, #10939, #11013, #10817, #10563, #10224, #10148, #10100, #9338, #9531, #9409, #9331
OpenCLML - #10243
CUTLASS - #11631, #10185, #10177, #10110, #10036, #9899, #9820, #9800, #9795, #9746, #9737, #9698, #9595, #9571
CUDNN - #10997, #9986, #9948
ACL - #10801
PTX - #10855, #10339, #9909
CUBLAS - #10826, #10820

CI

11313 - Refactor of tvm.testing.requires_* annotations
11666 - Enable pylint for tests/python/ci
11657 - Apply linting rules to AOT tests
11380 - Restructure Jenkinsfile
Automation - #11813, #11775, #11480, #11437, #10833, #10056, #9973, #9934
User experience improvements - #11470, #11329, #11553, #11497, #11051, #10933, #10960, #10525, #10425, #10322, #10121, #9971, #9554, #9752, #9556
Reduce CI runtime - #11402, #11349, #11258, #11132, #10946, #10743, #10359
Code cleanups - #10968, #10740

Frontends

PaddlePaddle - #11537, #9724, #9564
TFLite - #10915, #10566
Oneflow - #11321, #11036, #8790
PyTorch - #11190, #10504, #10184, #10091
ONNX - #10949, #9438, #9186, #9493, #9475
Keras - #7006

Hexagon

11549 - Initial clip operator for Hexagon
11834 - Add op resize2d for hexagon
11559 - Softmax slice op initial version
11529 - Slice ops added - add, subtract, multiply
11720 - [testing] add max_pool2d benchmark
11417 - Implement avg_pool2d slice op
11653 - Add HexagonThreadManager
11547 - Run single RPC server on Android in each testing session
11490 - [testing] add TVMScript elemwise-add
11400 - [testing] refactor benchmark-table code
11277 - moves conftest.py to tvm.contrib.hexagon so outside repos can access the testing fixtures
11319 - Add unit tests for Hexagon Device API
11279 - Add USMP tests
11283 - Update Readme
11239 - capture gtest output and return over FFI
11175 - Add schedule and test for conv2d_transpose_nchw
11018 - [Runtime] Add QuRT thread pool backend
11145 - Add support for on-device unit testing using gtest
11138 - Add test for depthwise conv2d schedule
11016 - Add test for registered schedules
11104 - Add mobilenet test
11090 - Delete offload runtime, move files to right places
11065 - AoT with LLVM Codegen on Hexagon
11025 - Deprecate USE_HEXAGON_DEVICE, introduce USE_HEXAGON
10604 - HVX scheduling and benchmarking of TE element-wise add
10905 - [LLVM] Enable/test tensorized Hexagon DMA on 2d transformed layout
10907 - Move aot/graph_executor interactions into launcher
10919 - Register basic strategies and schedules for common operators
10904 - Add unit tests executing 2-d VTCM usage
10910 - Refactor to keep HexagonBuffer private to the device api
10908 - [LLVM][CodeGen] Make CodeGenHexagon a subclass of CodeGenCPU
10878 - Generalized HexagonBuffer::CopyTo/CopyFrom
10846 - Support both 1-d and 2-d VTCM allocations
10581 - Improved ergonomics of HexagonLauncher in unit tests.
10616 - Refactor tvm.contrib.hexagon, NFC
10612 - Deprecate SDK 3.x, rewrite HexagonSDK.cmake
10586 - Codegen for 2d Load/Store
10558 - Generalize builtin for Nd memory alloc with storage scope and add lowering for VTCM / Hexagon
10543 - [Runtime][PipelineExecutor] Add the pipeline internal forwarding logic.
10507 - Add doc on TVM - Hexagon RPC flow
10520 - Resolve breakage in test_hexagon/test_cache_read_write
10311 - [runtime]AOTExecutor implementation for C Codegen
10454 - Allow execution on target or simulator from HexagonLauncher
10365 - Lower cache_read and cache_write to Hexagon DMA via tensorize
10361 - RPC server/client for simulator
10302 - [CI]Add Hexagon Tests to pipeline
10263 - [Docker]Add docker file and scripts
10227 - Refactor Hexagon.cmake
10217 - Adding support for Hexagon User DMA Engine
10068 - Update hexagon API build instruction and cleanup hexagon_proxy_rpc
9970 - Do not auto-build apps when building TVM
9736 - Add unit tests for HexagonBuffer
9525 - Add Hexagon VTCM and discontiguous allocation support
9631 - Add RPC Mechanism for Hexagon
9473 - cleanup Hexagon conv2d tests

MetaSchedule

11884 - Postproc: Rewrite-Layout
11848 - [OpStrategy] Support MetaSchedule Layout
11845 - [Relay][Pass] Meta-Schedule-Layout-Rewrite
11758 - [Runtime] Enhance Runner RandomFill
11683 - Distributed Measurement
11751 - [Minor] Organize Testing Scripts
11735 - Modify Profiler Timers
11727 - Developer Ergonomics Enhancement II
11692 - Apply-History-Best Task Filtering
11486 - Add Profiler Support For Tuning Efficiency Optimization
11680 - JSONDatabase Utilities
11641 - Generate MetaSchedule Dataset
11622 - Developer Ergonomics Enhancement
11604 - Resolve dependencies between header files
11587 - Add Testing Script with ONNX Support
11590 - Evo Independence from TaskScheduler
11534 - No explicit unrolling for spatial PrimFunc
11512 - Enable Task Filtering
11177 - AutoBind rule and MutateThreadBinding
11157 - Logging Interface Unification
11088 - Auto tensorization for CPU / GPU dot product
10986 - [Refactor] Introduce TuneConfig
11020 - [Metaschedule, Refactor] Move MultiLevelTilingNode decl to a header
10927 - [Refactor] Clarify Integration Logic
10876 - Add utility API to ease using manual schedules
10885 - [BugFix] Fix skipped tests
10366 - Add Gradient Based Task Scheduler
10823 - Fine-Grained Rewrite Unbound Block
10793 - Add demonstration of selectively tuning relay ops with TIR schedules
10811 - Support grouping in the cost model
10810 - Extract task weights during task extraction
10782 - [TIR]Estimate TIR FLOPs
10776 - Misc updates for tuning end-to-end workloads
10689 - Upstream the leftover changes
10648 - [Meta Schedule] Refactor meta schedule testing utils
10578 - New relay backend for meta schedule task extraction
10534 - Bug Fix for Relay Integration
10501 - Update scripts for subgraph tuning
10497 - Refactor testing workloads
10461 - Enable AutoTVM-style template-based search space
10368 - Fix Cyclic Dependency in PyClass Family
10403 - Arithmetic analysis
10367 - Update Tuning Interfaces.
10079 - [M4a] User-API: Tune-TE/TIR/Relay
10081 - [M4a] Rewrite-Cooperative-Fetch
10055 - [M4b] Testcases for TensorRT builder/runner
10092 - [M4a] Mutator: Mutate-Tile-Size
10096 - [M4a] Mutator: Mutate Parallel
10071 - [M4a] PostProcessor: Rewrite-Parallel-Vectorize-Unroll
10043 - [M4a] Schedule Rule: Multi-Level-Tiling
10045 - Mutator: Mutate-Unroll
10033 - [M4a] Schedule Rule: Parallelize-Vectorize-Unroll
10027 - [M4a] PostProcessor: Rewrite-Unbound-Block
10028 - Mutator: Mutate-Compute-Location
9997 - [M4a] PostProcessor: Disallow-Dynamic-Loop
9994 - [M4a] Schedule Rule: Cross-Thread-Reduction
10013 - [M4a] PostProcessor: Rewrite Reduction Block
9975 - [M4a] Schedule Rule: Add-RFactor
9945 - [M4a] PostProcessor: Verify-GPU-Code
9940 - [M4a] Schedule Rule: Random-Compute-Location
9943 - [M4a] Schedule Rule: Auto-Inline
9860 - [M3c] Add Per-Store-Feature
9859 - [M3c] XGB-based Cost Model
9836 - [M4a] Add EvolutionarySearch Search Strategy
9799 - [M4a] Add ReplayFunc Search Strategy
9789 - [M3c] Update TuneContext, TaskScheduler & Search Strategy Design
9780 - [M3c] Add More Measure Callbacks
9761 - [M4a] Add ScheduleRule class & PostOrderApply space generator
9760 - [M3c] Random Feature Extractor

MicroTVM

11741 - Refactor RVM scripts and fix DNS network issue
11472 - [ARM]Add tests for arm schedules
11634 - Update pyproject to python3.7
Zephyr support - #11650
RPC - #11227, #10967

Relay

11825 - [relay][pass]add split infer shape with convert op layout pass
11674 - Finish implementations of WithFields
11481 - IndexedGraph improvements in preparation for Collage
11432 - Plumb external codegen target via Target.current()
11494 - [Pass] Add MaxPool, AvgPool to FoldExplicitPadding
11183 - Add unidirectional sequence lstm
11442 - Add 'static_library' runtime::Module
11413 - [Topi]Support for FP16 ERF on CPU.
11382 - Finish support for list-of-targets
11386 - [Tests] Replace the Relay interpreter with the VM in the op tests
11224 - Support i16, f16 scalars in Relay text
11337 - Fix eltwise alter op layout for broadcast axis
11199 - Flexible shape dispatch transformation
11173 - Support 'external codegen targets'.
10996 - Add FlattenAtrousConv transformation
10871 - [CUDNN] Add cuDNN as a Relay partitioning target (BYOC)
10787 - [Pass][Bugfix] Disable re-use of non-flat buffers in StorageRewrite.
10378 - [FQ2I] Add leaky relu to FQ2I
10400 - RelayViz graphviz renderer
10352 - [VIRTUALDEVICE] Change syntax for device planning and store parameter virtual devices in virtualdevice field
10310 - [ARM_CPU] Conv2d int8 intrinsic for cortex-A72
10085 - RelayViz interface and terminal ast-dump
10239 - Add a conversion of individual operations in FQ2I pass.
10236 - [Refactor] Clean up type relations that are declared as template for no reason
10156 - Fix broadcast InferCorrectLayout
10026 - [VM] Relay VM memory liveness/lifetime analysis
10089 - [Pass] Add a relay pass to extract fake quantized ops
9690 - Change function constructors to WithFields
10069 - [DefuseOps pass] bug fix: To support function body types other…
9954 - Add conv2d_backward_weight op (without topi)
9838 - [FoldScaleAxis] Support dense and bias_add op in fold scale axis
9816 - Add sliding_window operator
9874 - Add a JSON converter for 0.7 -> 0.8 and 0.8 -> 0.9
9735 - [AMP][Pass][Typing] Add faster type inference
9723 - [Frontend] Add Span filling for frontends to Relay
9749 - Fix invalid shape function for "copy" operator
9759 - s/SEScope/VirtualDevice/g
9734 - Support large constants saved/loaded outside of VM executable
9613 - Re-run PlanDevices after LowerTE to flow new memory scope constraints.
9693 - PlanDevices supports 'free' on_device annotations
9641 - [AST] Add virtual_device as a first class field in Relay
9483 - Switch the VM to use the LowerTE pass instead of TECompiler::{Lower,LowerShapeFunc}.
9569 - WithFields method for Call, Function, Var, TupleGetItem, If, Let, RefCreate, RefRead, RefWrite, Match, and Clause
9533 - WithFields for Tuples
9550 - Prepare for switching VM to LowerTEPass.
9542 - Prepare DeadCodeElimination for running post LowerTEPass/ManifestAlloc.
9352 - [TVMC]Introduce executor and runtime parameters
9457 - Add the Arm(R) Ethos(TM)-U NPU identity operator
9326 - Switch PlanDevices pass to be w.r.t. SEScopes instead of DLDeviceTypes.
QNN - #11228, #10718, #10086, #10053, #9637, #9982

Runtime

11334 - [PipelineExecutor] Add graph manually splitting logic into the unit test.
11133 - [PipelineExecutor] Refactor PipelineExecutor.py and Add cross compile support for pipeline executor.
11172 - Move WrapTimeEvaluator from RPC to profiling, NFC
10990 - [PipelineExecutor]Add forwarding queue logic for set input.
10953 - [Vulkan] Add RGP support to TVM for vulkan device
10723 - [PipelineExecutor] Getting the asynchronous output
10283 - AOTExecutor implementation and c target code-generator
9802 - [ThreadPool]Refactor affinity function and support CPU affinity list setting.
10234 - [Pipeline Executor] multiple threads management and the data forwarding notification mechanism.
10326 - Improved log information with function signature
10032 - [PackedFunc] Bring PackedFunc into TVM Object System
10082 - [PipelineExecutor] Pipeline Executor Sequential execution
10010 - [PipelineExecutor] Add Pipeline Executor Interface
9846 - [Pipeline executor] Global parameters group name and runtime modules parameters map.
9889 - [GraphExecutor] Add API get_input_info to graph_executor
9751 - [Pipeline Executor] Add the map logic of global input and subgraph input.

TE

11589 - Support schedulable TIR compute definitions in TOPI
11341 - Optimized version of concatenation layer
10561 - [TECompiler] Decouple TE compute and schedule lowering in ScheduleBuilder

TIR

11592 - HoistExpression, generalization of HoistIfThenElse
11870 - [Pass] Remove-Weight-Layout-Rewrite-Block
11740 - [TIR, analysis] Add GetAutoTensorizeMappingInfo to generate transforms for auto tensorization
11585 - Add preserve-unit-iters
11677 - Register CUDA WMMA tensor intrinsics
11658 - [TIR, CUDA] Add pass to replace global to shared memory copy with cp.async
11624 - [Schedule] Allow named block and buffer arguments in Schedule
11628 - [PASS] Refactor a couple of TIR passes - BindTarget, AnnotateEntryFunc, Filter, LowerInitBlock
11574 - CSE pass : Restrict the equivalence to be decided by a normal form - avoids comparison of terms
11575 - Schedule Primitive: Add-Unit-Loop
11515 - Add schedule primitive ReIndex
11524 - [Arith] Additional Simplifications Inside Conditionals
11485 - Add schedule primitive TransformBlockLayout
11495 - [Software pipeline] Fix hardcoded index in access_ptr rewriting, add a GPU test with depth 4
11269 - [Schedule] Transform layout quality of life
11355 - Support tensorization using ldmatrix + MMA
11289 - [Schedule] Allowed typing.Tuple in tir.schedule._type_checker
11317 - Support affine expressions as indices in reverse compute inline
11235 - [Arith] Implemented padded inverses in IndexMap
11238 - [ROOFLINE] Calculate roofline from existing TIR PrimFunc
11225 - Add schedule primitive SetAxisSeparator
11110 - Get read/write access precisely for opaque access.
11106 - Enhance software pipeline validation and fix predicate of epilogue
10843 - StmtFunctor RenewDefs
11075 - Add function to tile a block according to a given tensor intrinsic
11050 - Utility function to decide loop mapping for auto tensorization
11009 - [ROCM] DP4A intrinsic support for TE/TIR
10925 - VNNI and ARM dot product intrinsic for tensorization
10887 - [Schedule] Relax reorder primitive's affine binding check
10732 - [Analysis] Add SuggestIndexMap for layout rewriting
10538 - [Schedule] Transform layout
10638 - Change the behavior of read/write region analysis for reduction blocks.
10705 - Use local complete block and local reduction block to identify compact dataflow
10671 - Tuple Reduction Support in CreatePrimFunc
9727 - [TE]Implement layout transformations, non-flat memory buffers
10405 - [TensorIR] Update VerifyGPU
10401 - [TensorIR] Renormalize split pattern
10112 - [TIR, Relay] improve bfloat16 support
8509 - Tir constants integration into compilation pipeline
9996 - add support for multi-blocking layout and their transformation
10066 - Add software pipelining
10207 - Support sub warp reduction for CUDA target.
9482 - Implementation of Common Subexpression Elimination for TIR
9527 - Allow compute_at create block predicate for non-trivial bounds and support floordiv pattern
10158 - [Schedule] Update compact_dataflow constraint
9871 - [Schedule] Blockize and Tensorize
10016 - [BugFix]Fix cross-thread reduction when single reduction loop with predicate
9880 - Encode conditional accesses info into block read/write regions
9699 - Affine utility support iter lowerbound and diagnostics
9742 - [Schedule] Add Annotate/Unannotate primitive
9738 - [TensorIR] Primitive "SetScope"
9743 - [Schedule] Analysis functions to check if compute_inline and com…
9689 - Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs
9559 - [TensorIR][UX] Type annotation-based runtime type checking
9444 - Add a 'rolling_buffer' scheduling primitive
9360 - [TensorIR] Cross-Thread Reduction

TOPI

11531 - TE implementation of LSTM using scan
11161 - Add Adreno GPU target and topi supporting textures with dynamically allocated textures
10332 - VNNI support for batch matmul
9873 - Add support for grouped conv3d
10230 - VNNI support for int8 dense
10098 - [Op]5 ops can accept unsigned integers as indices
9832 - Support grouped conv1d
9694 - Add generic batch norm
9233 - Cortex-M DSP support

TVMScript

11308 - Represent ramp as index slice
10099 - Support T.buffer_decl using data pointer from Let/Allocate
9680 - Improve printer for TIR syntax sugar
9492 - Add syntax sugar for T.handle and T.match_buffer
9620 - Add for loop syntax sugar
9543 - Misc error message improvements
9505 - [Fix] Add type hints for more uncovered cases

USMP

11015 - U3 use case
10189 - Adding support for U1 usecase for constant pools
10785 - Adding support for U4 usecase
10193 - adding support for U2 and U3 usecases
10005 - Add performance characteristics to PoolInfo
9565 - [TIR]Integrating USMP to AoT Executor
9704 - Hill Climb allocator
9418 - [TIR]adding the pass to convert to pool offsets
9649 - [TIR]Augmenting the algo interface with memory pressure
9214 - [TIR]Greedy memory planning algorithm
8468 - [TIR]Added buffer info extraction pass

microNPU

11468 - Optimize separate padding operation for conv2d
11453 - Add transform matrices and part matcher to identity op
11410 - Add E2E tests with cascader without striping
11288 - Expose compute cycle annotations to TIR lowering
10959 - Add a pass to reorder copy and compute nodes
10509 - Add various options to the cascader
11263 - Adding an option to enable striping
10251 - Add support for conv2d running on two cores on U65
10862 - Integrate the cascader
10344 - Integrate rolling buffers in Arm(R) Ethos(TM)-U
10824 - Some housekeeping in the test_ethosu folder
10763 - Tweak a layout transform matrix
10725 - Add a pass to move allocate nodes to the outer scope
10695 - Determine block configs using the cascader
10599 - Refactor Relay to TIR hook
10508 - Improve cascader memory transfer estimates
10345 - Add support for TFLite FULLY_CONNECTED
10254 - Introduce a pass to remove redundant identity operations
10062 - [5] Convert Proposals to te.Schedules
9959 - [4] Add the cascader Proposal generator
10022 - enable USMP
10127 - Add support for LeakyReLU
10004 - Add FreeRTOS variant of NPU demo
10060 - Refactor type inference data type checks
9960 - Add support for pack and unpack
10143 - Fix layout assignment in layout optimizer pass
9890 - [3] Plan generation for the cascader
9855 - Add support for transpose convolution
9841 - Add support for nearest neighbor and bilinear upsampling
9951 - Removing constant args from PrimFunc
9929 - Refactor base address determination to codegen
9910 - Add support for requantize
9831 - Move optimization passes to be a module pass and ensure they are running
9785 - [2d] Add more Part matchers to cascader
9778 - [2c] Add performance modelling to cascader
9471 - [2b] Create CascaderGraphs from TE graphs
9469 - [2a] Add CascaderGraph for cascading analysis
9621 - Add support for SPLIT and SPLIT_V
9508 - Update Conv2D Tests to Use TF API to Gen Test Cases
9627 - Add support for SIGMOID
9589 - Add support for TFLite concatenate
9623 - Refactor codegen tests
9561 - Add NHWC -> NHCWB16 layout transformation pass
9576 - Mean legalization support
9597 - Move the compilation to use Target Hooks.
9458 - [1] Add affine analysis structures for the cascader
9547 - Add the infrastructure for lookup table and TANH
9521 - Support binary elementwise with non-4D inputs
9560 - Fix incorrectly calculated stride when converting NHWC to NHCWB16
9530 - Add unary elementwise operator infrastructure with ABS
9514 - Adding rounding mode attribute to operators
9515 - Allow constants to be given as input to an operator

microTVM

11250 - [ARM] Add Relay tests for conv2d registered schedules
11232 - [rpc] Implemented rpc logging
11044 - Add support for host-driven AoT Executor
11043 - Better version handling for Arduino
10555 - Enable micro tvmc tutorial testing in CI
10194 - [RVM] Add scripts for automated build and testing
10144 - TVMCon 2021 Zephyr Demo with CMSIS-NN
10024 - [tvmc] Add TVMC Micro tutorial for Zephyr
9684 - Fix zephyr/test_zephyr_armv7m test
9584 - [TVMC] Add TVMC test for Arduino and Zephyr
9526 - Add minimal forwarding RPC server for host driven python execution on Hexagon
Zephyr support - #11362, #10138

Misc

11465 - Add cooldown interval logic for the profiling functional
11888 - [LLVM] Include LLVM headers in files that use them, not in llvm_common.h
11646 - [Arith] Simplification of ceil, log2, and left_shift
11464 - [MLF] Add support for multiple modules in Model Library Format
11632 - [AutoTVM][Autoscheduler] Default build funcs inherit PassContext
11543 - [OpenCL] Implement conv2d_winograd algorithm for Adreno
11287 - [Arith] Merge surjective/non-surjective iter mapping detections
11393 - Add utility to replace direct call to pytest.main
11252 - [ROOFLINE] Roofline analysis over RPC
11000 - [Graph Debugger] Expose way to benchmark individual nodes.
10794 - bump PyTorch version to 1.11
10821 - [REFACTOR] Remove legacy nnvm folder
10798 - [Arith] Remove diagnostic ctx argument from DetectIterMap
10567 - [Refactor] Reduced repetition in CodeGenLLVM's buffer access
10455 - [AUTO_SCHEDULER] Add feature extraction directly from PrimFunc
7401 - RFC: initial stab at TorchScript fallback
10391 - [vulkan] Add integer dot product (4xint8, 4xuint8) tensorization for the vulkan SPIR-V target.
10293 - [VirtualMachine] new method allowing to set one input tensor by its index or name
10191 - Generate correct output tensor names in C Interface API
9276 - Parameterize test_link_params
9808 - [Rust] Update Rust bindings
9553 - [PROFILING] Add ability to profile a single function_profiling
9611 - [CMAKE] Automatically detect newly added source files
9544 - [Target] enable -arch=sm_xx for assigning cuda target arch and deprecate autotvm.measure.set_cuda_target_arch api
Profiler - #11530, #11066
Docs - #10921, #11403, #10774, #10912, #9633, #9906, #9534, #9307, #9654, #9580
Android - #11241
ETHOSN - #11261, #10486, #10018, #9596
TVMC - #11012, #10962, #10722, #9817, #9529, #9229

TVM v0.9.0.rc0 Release Candidate Notes #12102
