apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[RFC] Proposal to Graduate MXNet’s ONNX Support #20063

Open Zha0q1 opened 3 years ago

Zha0q1 commented 3 years ago

What is ONNX and Why It Matters

ONNX, or Open Neural Network Exchange, is an open-source deep learning model format that acts as a framework-neutral graph representation between DL frameworks, or between training and inference. With the ability to export models to the ONNX format, MXNet users can enjoy faster inference and a wider range of deployment device choices, including edge and mobile devices where installing MXNet may be impractical. Popular hardware-accelerated and/or cross-platform ONNX runtime frameworks include NVIDIA TensorRT, Microsoft ONNXRuntime, Apple Core ML, and TVM.

There is a huge ecosystem revolving around ONNX. More than 1,000 projects are built on top of ONNXRuntime according to this GitHub dependency page. It is crucial to make MXNet an active participant in this thriving community.

The “Before” of ONNX Support in MXNet

ONNX support for MXNet was first introduced in 2017-2018, with modules both to export to and import from the ONNX format. However, we have not kept up with the latest ONNX format since then, and the format has iterated through several new op sets. While there has always been demand from the MXNet community, the current ONNX support (shipped in MXNet 1.6, 1.7, and 1.8) is outdated and of limited use: most MXNet users rely on the GluonCV and GluonNLP toolkits to train and fine-tune models or to load pretrained models from the model zoo, but the majority of these models cannot be exported with the current ONNX support.

(The current model support can be found here, at the bottom of the page.)
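For illustration, the contrib-era export path looks roughly like this (a minimal sketch; the checkpoint file names are hypothetical and the exact keyword signature varies slightly across 1.x releases):

```python
import numpy as np
from mxnet.contrib import onnx as onnx_mxnet

# Export a symbol/params checkpoint to ONNX via the current contrib API.
# Positional arguments: symbol file, params file, input shapes, input dtype,
# and the output .onnx path.
onnx_file = onnx_mxnet.export_model(
    'resnet-symbol.json', 'resnet-0000.params',
    [(1, 3, 224, 224)], np.float32, 'resnet.onnx')
print('Exported to:', onnx_file)
```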

The New Development on ONNX Support

Lately, we (Joe @josephevans, Wei @waytrue17, and I @Zha0q1) have been working on restoring and improving ONNX support by enabling the export of the most popular and state-of-the-art models. We are currently working on the MXNet v1.x branch, which is compatible with the latest few GluonCV/NLP releases that most users have installed. So far, we support 90% of all (180+) GluonCV pretrained models, and 90% of the exports have also been verified to produce outputs consistent with those of native MXNet. For GluonNLP, we have added export support for RNN, BERT (and similar), GPT, and Transformer models. We have also worked on highly requested features such as dynamic input shapes and graph optimization. Our overall goal is a “train on Apache MXNet and deploy anywhere" user experience that offers maximum inference flexibility.
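As a sketch of this workflow (assuming a GluonCV installation; the model and file names are illustrative):

```python
import numpy as np
import mxnet as mx
from gluoncv import model_zoo
from mxnet.contrib import onnx as onnx_mxnet

# Serialize a pretrained GluonCV model to symbol/params files, which the
# ONNX exporter consumes.
net = model_zoo.get_model('resnet18_v1', pretrained=True)
net.hybridize()
net(mx.nd.zeros((1, 3, 224, 224)))  # one forward pass to build the graph
net.export('resnet18_v1')           # writes -symbol.json and -0000.params

onnx_mxnet.export_model('resnet18_v1-symbol.json', 'resnet18_v1-0000.params',
                        [(1, 3, 224, 224)], np.float32, 'resnet18_v1.onnx')
```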

We plan to release the new ONNX support in the next v1.x (1.9) version.

Proposal Summary

Now that the ONNX support is mature, we would like to graduate the ONNX support from the mx.contrib namespace into an official and stable MXNet feature. We believe this will help us best publicize the work and serve the needs of interested users. Below is a summary of the graduation tasks, and each point is explained in detail in the following paragraphs.

  1. Graduate the MX2ONNX module (exporting to ONNX format) from the mx.contrib.onnx namespace to a regular and shallower directory (such as mxnet.mx2onnx).
  2. Deprecate the ONNX2MX module (importing from ONNX format).
  3. Only support ONNX 1.7, 1.8, and future releases for MX2ONNX.
  4. Add a setup.py for MX2ONNX.
  5. Provide up-to-date, accurate, and improved documentation.

Graduate MX2ONNX from contrib

Currently MX2ONNX is buried deep in the mxnet.contrib namespace. As the ONNX export support is maturing and stabilizing quickly, we should graduate it from the experimental “contrib” folder into a shallower namespace (such as mxnet.mx2onnx). This way we can better promote it as an official feature, and users will try it out with more confidence. We can also add readme/doc files to the new MX2ONNX directory explaining ONNX compatibility, APIs, operator support, model support, etc. This documentation can be updated with each feature or bug-fix commit and always stay current. We can easily point MXNet users interested in ONNX to this new directory, or they can find it through a web search. For reference, PyTorch’s ONNX support lives in torch.onnx and tf2onnx has its own repository. Both are the first search result on Google, while MXNet’s ONNX support does not appear in the first 3 pages (keywords: [framework name], ONNX, GitHub).

We should graduate MX2ONNX in the next v1.x (1.9) release. Because the “contrib” namespace generally makes no backward compatibility promise, we propose to move the MX2ONNX files entirely to the new directory incubator-mxnet/python/mx2onnx. We can keep a dummy Python file in the old mxnet.contrib.onnx.mx2onnx directory with only the API definitions. When users call the APIs through the old path, they will get an error explaining that the path is deprecated and pointing to the new location. We can then remove this dummy file in the release after that (1.10).
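The dummy file could look roughly like this (a hypothetical sketch; the message wording and function set are illustrative):

```python
# Deprecation stub left in the old mxnet.contrib.onnx.mx2onnx location,
# kept for the 1.9 release only (sketch).

def export_model(*args, **kwargs):
    """Stub that redirects callers from the old contrib path."""
    raise RuntimeError(
        'mxnet.contrib.onnx.mx2onnx has moved out of contrib. '
        'Please call the API from its new location, e.g. '
        'mxnet.mx2onnx.export_model(...). This stub will be '
        'removed in MXNet 1.10.')
```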

Deprecate the ONNX2MX Module

The ONNX2MX module (importing from ONNX format) was introduced in 2017, when MXNet had more performant deployment solutions than other frameworks, a better distributed training story, and broader language support. However, this has changed since then, and importing from ONNX is no longer a requested use case. As we focus on making exporting to ONNX for deployment a smooth experience, we should deprecate the unrequested and under-invested ONNX2MX module. The effort should include removing the Python files, test cases, and relevant documentation and tutorials.

We propose to: 1) keep the ONNX2MX module in the next (1.9) release and add a deprecation warning to its APIs, and 2) remove all ONNX2MX-related files, as mentioned above, in the release after that (1.10).
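For step 1, the warning could be a standard DeprecationWarning (a sketch; the message text is illustrative):

```python
import warnings

def import_model(model_file):
    """ONNX2MX entry point, kept for 1.9 with a deprecation warning."""
    warnings.warn(
        'Importing from ONNX (ONNX2MX) is deprecated and will be removed '
        'in MXNet 1.10; exporting to ONNX (MX2ONNX) remains supported.',
        DeprecationWarning, stacklevel=2)
    # ... existing import logic, unchanged for the 1.9 release ...
```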

Only Support ONNX 1.7 and Onward for MX2ONNX

ONNX generally releases twice a year: ONNX 1.7 was released in May 2020 and 1.8 in November 2020. Each new ONNX version ships a new op set that adds new operators or revises the specification of existing ones. We propose to support only 1.7 and onward to keep development focused (this way we do not need to spend extra time on implementations of the same MXNet operator for earlier ONNX op sets). This will not block model deployment, as inference frameworks generally add support for the latest ONNX very quickly: ONNXRuntime always supports the latest ONNX right away, and the latest TensorRT currently supports up to ONNX 1.7.

We plan to continue supporting new ONNX versions after 1.8. Existing operator and model tests can help validate the new implementations of operators whose specification is updated in a new op set. Users can choose to upgrade to the new ONNX version, or they can stay on the current version if there is no need to upgrade. ONNX runtime frameworks are generally backward compatible with all previous op sets and the models generated against them.
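Users can check which op set an exported model targets with the onnx package itself (a minimal sketch using standard onnx APIs; the file name is illustrative):

```python
import onnx

# Load an exported model and print the op set(s) it declares; runtimes
# are generally backward compatible with earlier op sets.
model = onnx.load('resnet18_v1.onnx')
for opset in model.opset_import:
    print(opset.domain or 'ai.onnx', opset.version)

# Validate the graph against the installed ONNX release.
onnx.checker.check_model(model)
```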

Add a setup.py to MX2ONNX

We propose to create a setup.py so that users of earlier MXNet versions, especially those who cannot easily upgrade to the newest MXNet, can also enjoy the latest ONNX support by pulling the next release branch (v1.9) and doing a pip install locally. After installing through setup.py, users should be able to do import mx2onnx and make API calls through the mx2onnx name. Because MX2ONNX relies on MXNet only for type and shape inference, this separate installation should work with any MXNet version, as long as the model itself is compatible with that version.
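A minimal setup.py could look like this (a sketch; the package name, version, and layout are illustrative assumptions):

```python
from setuptools import setup, find_packages

setup(
    name='mx2onnx',
    version='0.1.0',
    description='Export Apache MXNet models to the ONNX format',
    packages=find_packages(include=['mx2onnx', 'mx2onnx.*']),
    # MXNet itself is assumed to be installed already, in whatever
    # version the user's model requires.
    install_requires=['onnx>=1.7'],
    python_requires='>=3.6',
)
```

Users could then run pip install . from the checkout and import mx2onnx directly.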

On our website, we should instruct users to always pull from the release branches (v1.9 and onward). Users should be told to use discretion when pulling from development branches (v1.x) for the latest, unreleased ONNX support.

Better Documentation

We will need to retire the existing documentation and tutorials, as they are outdated. New documents on the MXNet website should include:

  1. MX2ONNX APIs
  2. Tutorial on exporting to ONNX
  3. Tutorial on getting the ONNX model to work with ONNXRuntime and TensorRT (see the ONNXRuntime sketch below)
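For the ONNXRuntime part, the tutorial could center on a snippet like this (a minimal sketch; the model file name is illustrative):

```python
import numpy as np
import onnxruntime as ort

# Run the exported model on the CPU execution provider.
sess = ort.InferenceSession('resnet18_v1.onnx',
                            providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name
dummy = np.random.uniform(size=(1, 3, 224, 224)).astype(np.float32)
outputs = sess.run(None, {input_name: dummy})
print(outputs[0].shape)
```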

In the MX2ONNX directory, we will need readme files covering:

  1. Compatible ONNX and ONNXRuntime versions
  2. MX2ONNX APIs
  3. Operator support matrix
  4. Gluon CV/NLP model zoo support matrix
  5. Tutorial for pip installing MX2ONNX

Forward-Port the Same Changes to MXNet 2.0

At this time we are prioritizing the needs of current MXNet and Gluon CV/NLP users. When the MXNet 2.0-compatible Gluon CV/NLP releases stabilize, we will forward-port the changes proposed above.

szha commented 3 years ago

Nice work! I think we can still keep an alias of the module in the original namespace just so that whoever is using it won't be broken.

Also, name-wise, mx.onnx seems more succinct to me, though it's not a strong preference.

Zha0q1 commented 3 years ago

> Nice work! I think we can still keep an alias of the module in the original namespace just so that whoever is using it won't be broken.
>
> Also, name-wise, mx.onnx seems more succinct to me, though it's not a strong preference.

Thanks, I think both are good suggestions!

waytrue17 commented 2 years ago

One thing to note: the old mx-onnx module was tested against ONNX 1.3. Thus, one should use ONNX 1.3 with MXNet versions before 1.9, especially for the mx-onnx import functionality.