Description
In GluonNLP, we introduced the benchmarking script in https://github.com/dmlc/gluon-nlp/tree/master/scripts/benchmarks.
The goal is to track the training + inference latency of common NLP backbones so that we can choose the appropriate ones for our task. This will help users train + deploy models with AWS.
So far, we have covered:

- HuggingFace/Transformers-based backbones with FP32 + FP16 training / inference. For FP16 training, we are not profiling against the AMP-based solution, so the comparison gives PyTorch an edge; we need to fix this.
- MXNet 2.0 nightly builds (community use only) + GluonNLP 1.0 with FP32 + FP16 (AMP) training / inference.
- TVM FP32 inference. Due to a recent upgrade of the code base, this benchmark is currently broken.
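To keep the latency numbers comparable across backbones, the measurement needs warmup iterations and explicit device synchronization before reading the clock. Below is a minimal sketch of how inference latency can be measured for a HuggingFace backbone; the model name, batch size, and iteration counts are illustrative assumptions, not what the benchmark script actually uses.

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical backbone and batch, for illustration only.
model = AutoModel.from_pretrained("bert-base-uncased").cuda().eval()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["hello world"] * 8, padding=True, return_tensors="pt").to("cuda")

with torch.no_grad():
    # Warmup: the first iterations pay one-time costs (cuDNN autotuning,
    # memory-pool growth) and would otherwise skew the mean latency.
    for _ in range(10):
        model(**batch)
    torch.cuda.synchronize()

    n_runs = 100
    start = time.perf_counter()
    for _ in range(n_runs):
        model(**batch)
    # CUDA kernel launches are asynchronous; synchronize before stopping the clock.
    torch.cuda.synchronize()
    print(f"Mean inference latency: {(time.perf_counter() - start) / n_runs * 1e3:.2f} ms")
```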
I will share the following action items that I feel are worth doing:
Short-term Bug-fix + Improvement

- [ ] Fix the FP16 training benchmark in HuggingFace/Transformers to use AMP in PyTorch (a sketch of the AMP training step follows below)
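For reference, here is a minimal sketch of an AMP training step in PyTorch using torch.cuda.amp (available since PyTorch 1.6). The model, optimizer, and dummy batches below are hypothetical stand-ins, not the benchmark's actual setup.

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

# Hypothetical stand-ins for the benchmarked backbone and data.
model = nn.Linear(128, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()

for _ in range(10):  # dummy training batches
    inputs = torch.randn(8, 128, device="cuda")
    labels = torch.randint(0, 2, (8,), device="cuda")
    optimizer.zero_grad()
    # autocast runs the forward pass in mixed precision: FP16 kernels
    # where they are numerically safe, FP32 elsewhere.
    with autocast():
        loss = criterion(model(inputs), labels)
    # GradScaler scales the loss to avoid FP16 gradient underflow and
    # unscales the gradients before the optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Benchmarking this path, rather than a naive FP16 loop, would put the PyTorch FP16 training numbers on the same footing as the GluonNLP AMP results.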
Automation + Visualization
Longer-term Backbone Benchmarking Effort
Other longer-term efforts
@dmlc/gluon-nlp-committers