Seems like #2300 will aid in accuracy for these runs.
Reopening; need to rerun testing with these changes + an updated ORT EP.
Resnet50 runs; need to do further analysis to compare against fp16 through the same pipeline.
Plan of attack
Added models will leverage existing e2e code; may need to write/borrow code for the benchmark.py that's used in parity checks.
Doing this as part of QA validation for the resnet50 pipeline added in onnxruntime-inference-examples.
Got DLM changes for benchmark.py, but they're failing off 6.0; sorting out issues with the run scripts.
Tracked by JIRA: https://ontrack-internal.amd.com/browse/SWDEV-410597
Able to get bert-large, bert-base-cased and distilgpt2 model runs into DLM by reusing existing runs. Missing gpt2 as referenced in #1905. Will need to add gpt2 + the equivalent int8 run.
We're letting onnxruntime do the quantization of the model before we do a run through the MIGraphX EP right now.
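A minimal sketch of that quantization step, assuming the standard ORT dynamic-quantization path (benchmark.py appears to drive this itself when invoked with `-p int8`; the exact call chain isn't shown here). File names are the ones used in the runs below.

```python
# Sketch only: let onnxruntime produce the int8 model before running it
# through the MIGraphX EP. quantize_dynamic inserts DynamicQuantizeLinear /
# MatMulInteger style ops for the tensors it quantizes.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="./onnx_models/gpt2_1.onnx",            # exported fp32 model
    model_output="./onnx_models/gpt2_1_int8_gpu.onnx",  # int8 model used below
    weight_type=QuantType.QInt8,
)
```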
Running these by hand to verify
bert_large_uncased int8 Quantized.
Finished quantizing model: ./onnx_models/bert_large_uncased_1_int8_gpu.onnx
Run onnxruntime on bert-large-uncased with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-large-uncased', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:36:35.944933', 'test_times': 100, 'latency_variance': '0.13', 'latency_90_percentile': '175.18', 'latency_95_percentile': '175.84', 'latency_99_percentile': '211.92', 'average_latency_ms': '176.60', 'QPS': '5.66'}
Run onnxruntime on bert-large-uncased with input shape [16, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-large-uncased', 'inputs': 1, 'threads': 16, 'batch_size': 16, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:36:54.211083', 'test_times': 100, 'latency_variance': '1164.47', 'latency_90_percentile': '5226.89', 'latency_95_percentile': '5381.81', 'latency_99_percentile': '5932.76', 'average_latency_ms': '3711.79', 'QPS': '4.31'}
bert_base_cased int8 Quantized
Finished quantizing model: ./onnx_models/bert_base_cased_1_int8_gpu.onnx
Run onnxruntime on bert-base-cased with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-base-cased', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:51:15.985903', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '12.94', 'latency_95_percentile': '15.32', 'latency_99_percentile': '15.75', 'average_latency_ms': '12.85', 'QPS': '77.80'}
Run onnxruntime on bert-base-cased with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-base-cased', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:51:17.620586', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '54.57', 'latency_95_percentile': '54.61', 'latency_99_percentile': '55.08', 'average_latency_ms': '54.20', 'QPS': '18.45'}
Run onnxruntime on bert-base-cased with input shape [32, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-base-cased', 'inputs': 1, 'threads': 16, 'batch_size': 32, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:51:23.099155', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '127.41', 'latency_95_percentile': '127.57', 'latency_99_percentile': '128.02', 'average_latency_ms': '126.68', 'QPS': '252.61'}
Run onnxruntime on bert-base-cased with input shape [32, 384]
distilgpt2 int8 Quantized
Size of quantized ONNX model(MB):116.36144828796387
Finished quantizing model: ./onnx_models/distilgpt2_1_int8_gpu.onnx
Run onnxruntime on distilgpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:54:43.699617', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '8.07', 'latency_95_percentile': '10.20', 'latency_99_percentile': '10.26', 'average_latency_ms': '8.14', 'QPS': '122.79'}
Run onnxruntime on distilgpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:54:44.836106', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '45.96', 'latency_95_percentile': '45.97', 'latency_99_percentile': '46.01', 'average_latency_ms': '45.78', 'QPS': '21.84'}
Run onnxruntime on distilgpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:54:49.480584', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '32.90', 'latency_95_percentile': '32.95', 'latency_99_percentile': '33.15', 'average_latency_ms': '32.70', 'QPS': '244.62'}
Run onnxruntime on distilgpt2 with input shape [8, 384]
Got a gpt2 run with int8 quant here.
quantized model saved to:./onnx_models/gpt2_1_int8_gpu.onnx
Size of quantized ONNX model(MB):157.34468364715576
Finished quantizing model: ./onnx_models/gpt2_1_int8_gpu.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:05.271922', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '17.57', 'latency_95_percentile': '17.61', 'latency_99_percentile': '17.68', 'average_latency_ms': '15.39', 'QPS': '65.00'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:07.146352', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '72.94', 'latency_95_percentile': '73.17', 'latency_99_percentile': '73.99', 'average_latency_ms': '72.80', 'QPS': '13.74'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:14.528135', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '52.48', 'latency_95_percentile': '52.50', 'latency_99_percentile': '52.58', 'average_latency_ms': '52.30', 'QPS': '152.97'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:19.830109', 'test_times': 100, 'latency_variance': '0.01', 'latency_90_percentile': '529.51', 'latency_95_percentile': '530.92', 'latency_99_percentile': '544.40', 'average_latency_ms': '529.13', 'QPS': '15.12'}
Fusion statistics is saved to csv file: benchmark_fusion_20231206-205813.csv
Detail results are saved to csv file: /tmp/results.csv
Summary results are saved to csv file: benchmark_summary_20231206-205813.csv
Changes pushed into #2468.
Not seeing the proper code path when running a trace compile for the tests in DLM with MIGRAPHX_TRACE_EVAL=1, after investigating the large drop in performance compared to the fp16 versions. Sorting this out before I close this out.
Seeing about an order of magnitude drop between fp16 and int8 runs, e.g. distilgpt2 below:
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:38.877793', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '10.61', 'latency_95_percentile': '10.79', 'latency_99_percentile': '11.09', 'average_latency_ms': '10.57', 'QPS': '94.61'}
Run onnxruntime on distilgpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:40.258887', 'test_times': 100, 'latency_variance': '0.19', 'latency_90_percentile': '80.93', 'latency_95_percentile': '81.12', 'latency_99_percentile': '81.47', 'average_latency_ms': '61.06', 'QPS': '16.38'}
Run onnxruntime on distilgpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:46.467054', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '36.39', 'latency_95_percentile': '36.77', 'latency_99_percentile': '36.95', 'average_latency_ms': '35.85', 'QPS': '223.17'}
Run onnxruntime on distilgpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:50.101417', 'test_times': 100, 'latency_variance': '1.09', 'latency_90_percentile': '362.16', 'latency_95_percentile': '409.24', 'latency_99_percentile': '554.73', 'average_latency_ms': '363.10', 'QPS': '22.03'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:06.612680', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '1.03', 'latency_95_percentile': '1.03', 'latency_99_percentile': '1.04', 'average_latency_ms': '1.01', 'QPS': '987.05'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:23.724592', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.03', 'latency_95_percentile': '2.04', 'latency_99_percentile': '2.11', 'average_latency_ms': '2.00', 'QPS': '500.15'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:01.631328', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.02', 'latency_95_percentile': '2.13', 'latency_99_percentile': '2.16', 'average_latency_ms': '1.97', 'QPS': '4051.12'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:19.493161', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '10.90', 'latency_95_percentile': '10.92', 'latency_99_percentile': '11.03', 'average_latency_ms': '10.72', 'QPS': '746.60'}
It appears we're bouncing between all 3 EPs when doing int8 runs.
Attempting to run the model in the driver I'm seeing the following:
terminate called after throwing an instance of 'migraphx::version_2_8_0::exception'
what(): /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/AMDMIGraphX/src/onnx/onnx_parser.cpp:417: parse_graph: Unknown operator: DynamicQuantizeLinear
Aborted (core dumped)
This is most likely why the EP doesn't find that op and falls back, with the remaining nodes placed onto the other EPs.
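A quick way to confirm where nodes end up (a sketch of plain onnxruntime usage, not the benchmark harness): create the session with verbose logging and the same provider priority list; the graph partitioning and any fallback to the ROCm/CPU EPs shows up in the log output.

```python
import onnxruntime as ort

# Sketch only: verbose logging makes ORT print how the graph is partitioned
# across the registered execution providers, so fallbacks become visible.
so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE

sess = ort.InferenceSession(
    "./onnx_models/gpt2_1_int8_gpu.onnx",
    sess_options=so,
    providers=[
        "MIGraphXExecutionProvider",
        "ROCMExecutionProvider",
        "CPUExecutionProvider",
    ],
)
print(sess.get_providers())
```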
Hmm we added support for that operator
Looks like we're using an older version.
MIGraphX Version: 2.8.0.7f8f0fd0f
Also, DynamicQuantizeLinear isn't in the EP's list of supported ops. Have a patch coming in to add it.
Got fixes up to here:
Upstream - https://github.com/microsoft/onnxruntime/pull/18798 Internal - https://github.com/ROCmSoftwarePlatform/onnxruntime/pull/26
Running a test with the latest develop + the onnxruntime change + the DLM container for the gpt2 int8 test. Will try to run the other models and analyze once complete as well.
Using this to build the end-to-end pipeline:
python3 tools/run_models.py --tags migx_onnxrt_gpt2_quant_benchmarks --liveOutput --cleanDockerCache --additionalContext "{'guest_os':'UBUNTU', \
'docker_build_arg':{\
'BASE_DOCKER':'compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.0:88_ubuntu22.04_py3.10_pytorch_release-2.1_011de5c', \
'ORT_UNIT_TESTS':'false', 'ORT_BUILD':'true', 'ONNXRUNTIME_BRANCH':'add_dynamic_quantize_linear','ONNXRUNTIME_REPO':'https://github.com/ROCmSoftwarePlatform/onnxruntime', 'MIGX_BUILD':'true'}}"
@causten for the MIGraphX side, looking at develop, you're right, we should have this op in. Need to figure out where APT is getting things from here.
A build off develop seems to read the int8 model correctly:
@2689 = @return(@2688,@1075,@1077,@1211,@1213,@1347,@1349,@1483,@1485,@1619,@1621,@1755,@1757,@1891,@1893,@2027,@2029,@2163,@2165,@2299,@2301,@2435,@2437,@2571,@2573), target_id=0
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver read gpt2_1_int8_gpu.onnx
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver run gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids
From the ORT benchmark test, the fp16 run got:
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:06.612680', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '1.03', 'latency_95_percentile': '1.03', 'latency_99_percentile': '1.04', 'average_latency_ms': '1.01', 'QPS': '987.05'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:23.724592', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.03', 'latency_95_percentile': '2.04', 'latency_99_percentile': '2.11', 'average_latency_ms': '2.00', 'QPS': '500.15'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:01.631328', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.02', 'latency_95_percentile': '2.13', 'latency_99_percentile': '2.16', 'average_latency_ms': '1.97', 'QPS': '4051.12'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:19.493161', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '10.90', 'latency_95_percentile': '10.92', 'latency_99_percentile': '11.03', 'average_latency_ms': '10.72', 'QPS': '746.60'}
Edit: rerunning this test off develop + the latest changes for FP16, I get the following timings, which are about 50% lower than previously. My understanding here is that we have fast math disabled for both cases, so that shouldn't be the cause.
Model saved to ./onnx_models/gpt2_1_fp16_gpu.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-13 03:06:14.856181', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '1.83', 'latency_95_percentile': '1.84', 'latency_99_percentile': '1.87', 'average_latency_ms': '1.82', 'QPS': '549.61'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-13 03:06:42.573559', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '3.63', 'latency_95_percentile': '3.64', 'latency_99_percentile': '3.67', 'average_latency_ms': '3.55', 'QPS': '282.03'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-13 03:07:27.814779', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '3.41', 'latency_95_percentile': '3.66', 'latency_99_percentile': '3.71', 'average_latency_ms': '3.38', 'QPS': '2368.26'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-13 03:07:56.767108', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '18.09', 'latency_95_percentile': '18.24', 'latency_99_percentile': '18.71', 'average_latency_ms': '17.62', 'QPS': '454.15'}
Trying a perf run in the driver with only int8 enabled, we're seeing the following on our end:
gpu::code_object::quantizelinear_kernel: 1.37962ms / 60 = 0.0229937ms, 21%
gpu::code_object::contiguous_kernel: 0.835693ms / 36 = 0.0232137ms, 13%
gpu::code_object::mlir_reshape_quant_dot: 0.755034ms / 23 = 0.0328275ms, 12%
gpu::code_object::layernorm_mul_add_quantizelinear_kernel: 0.581659ms / 24 = 0.0242358ms, 9%
gpu::code_object::dequantizelinear_add_add_kernel: 0.525061ms / 23 = 0.0228287ms, 8%
gpu::code_object::mlir_reshape_quant_dot_dequantizelinear_add: 0.387697ms / 13 = 0.0298228ms, 6%
gpu::code_object::mlir_quant_dot: 0.338115ms / 12 = 0.0281762ms, 6%
gpu::code_object::dequantizelinear_add_pow_mul_add_mul_tanh_add_mul_mul_quantizelinear_kernel: 0.309476ms / 12 = 0.0257897ms, 5%
gpu::code_object::softmax_kernel: 0.289521ms / 12 = 0.0241267ms, 5%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul_where: 0.283153ms / 12 = 0.0235961ms, 5%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.279093ms / 12 = 0.0232578ms, 5%
load: 0.119553ms / 219 = 0.000545903ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.117032ms / 1 = 0.117032ms, 2%
multibroadcast: 0.111057ms / 98 = 0.00113324ms, 2%
hip::hip_copy_literal: 0.0854196ms / 149 = 0.000573286ms, 2%
reshape_lazy: 0.0637735ms / 95 = 0.0006713ms, 1%
transpose: 0.0545841ms / 48 = 0.00113717ms, 1%
slice: 0.0419648ms / 36 = 0.00116569ms, 1%
gpu::code_object::add_layernorm_quantizelinear_kernel: 0.0241358ms / 1 = 0.0241358ms, 1%
gpu::code_object::gather_kernel: 0.0231269ms / 1 = 0.0231269ms, 1%
gpu::code_object::add_kernel: 0.0230521ms / 1 = 0.0230521ms, 1%
gpu::code_object::convert_kernel: 0.0227299ms / 1 = 0.0227299ms, 1%
@param: 0.00901038ms / 26 = 0.000346553ms, 1%
hip::hip_allocate_memory: 0.0008224ms / 1 = 0.0008224ms, 1%
check_context::migraphx::gpu::context: 0.00066418ms / 1 = 0.00066418ms, 1%
Batch size: 1
Rate: 585.383 inferences/sec
Total time: 1.70828ms
Total instructions time: 6.66105ms
Overhead time: 0.18396ms, -4.95276ms
Overhead: 11%, -290%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --int8
A solely fp16 run gives us:
Summary:
gpu::code_object::mlir_reshape_dot: 0.826736ms / 23 = 0.0359451ms, 16%
gpu::code_object::convert_kernel: 0.57429ms / 25 = 0.0229716ms, 11%
gpu::code_object::layernorm_mul_add_kernel: 0.566644ms / 24 = 0.0236102ms, 11%
gpu::code_object::contiguous_kernel: 0.533037ms / 24 = 0.0222099ms, 10%
gpu::code_object::add_add_kernel: 0.514395ms / 23 = 0.022365ms, 10%
gpu::code_object::mlir_reshape_dot_add: 0.391352ms / 13 = 0.030104ms, 8%
gpu::code_object::mlir_transpose_reshape_dot: 0.324614ms / 12 = 0.0270511ms, 6%
gpu::code_object::add_pow_mul_add_mul_tanh_add_mul_mul_kernel: 0.300936ms / 12 = 0.025078ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_reshape_slice_transpose_dot_mul_where: 0.276505ms / 12 = 0.023042ms, 6%
gpu::code_object::softmax_kernel: 0.275471ms / 12 = 0.0229559ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_dot: 0.27312ms / 12 = 0.02276ms, 6%
gpu::code_object::mlir_dot_add_convert: 0.148124ms / 1 = 0.148124ms, 3%
multibroadcast: 0.0958457ms / 98 = 0.000978017ms, 2%
load: 0.0918247ms / 171 = 0.000536986ms, 2%
hip::hip_copy_literal: 0.0814319ms / 149 = 0.000546523ms, 2%
reshape_lazy: 0.0574439ms / 83 = 0.000692095ms, 2%
slice: 0.0288253ms / 24 = 0.00120106ms, 1%
gpu::code_object::add_layernorm_kernel: 0.0230485ms / 1 = 0.0230485ms, 1%
gpu::code_object::gather_kernel: 0.0226782ms / 1 = 0.0226782ms, 1%
gpu::code_object::add_kernel: 0.0221361ms / 1 = 0.0221361ms, 1%
transpose: 0.0186464ms / 24 = 0.000776932ms, 1%
@param: 0.0092624ms / 26 = 0.000356246ms, 1%
hip::hip_allocate_memory: 0.0007636ms / 1 = 0.0007636ms, 1%
check_context::migraphx::gpu::context: 0.0006462ms / 1 = 0.0006462ms, 1%
Batch size: 1
Rate: 624.128 inferences/sec
Total time: 1.60223ms
Total instructions time: 5.45778ms
Overhead time: 0.148524ms, -3.85554ms
Overhead: 9%, -241%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --fp16
With mixed int8 and fp16 we get the following
Summary:
gpu::code_object::quantizelinear_kernel: 1.38454ms / 60 = 0.0230757ms, 20%
gpu::code_object::contiguous_kernel: 0.819941ms / 36 = 0.0227761ms, 12%
gpu::code_object::mlir_reshape_quant_dot: 0.759796ms / 23 = 0.0330346ms, 11%
gpu::code_object::convert_kernel: 0.597603ms / 25 = 0.0239041ms, 9%
gpu::code_object::layernorm_mul_add_quantizelinear_kernel: 0.576765ms / 24 = 0.0240319ms, 8%
gpu::code_object::dequantizelinear_add_add_kernel: 0.526679ms / 23 = 0.0228991ms, 8%
gpu::code_object::mlir_reshape_quant_dot_dequantizelinear_add: 0.388696ms / 13 = 0.0298997ms, 6%
gpu::code_object::mlir_quant_dot: 0.339814ms / 12 = 0.0283178ms, 5%
gpu::code_object::dequantizelinear_add_pow_mul_add_mul_tanh_add_mul_mul_quantizelinear_kernel: 0.311157ms / 12 = 0.0259297ms, 5%
gpu::code_object::softmax_kernel: 0.286588ms / 12 = 0.0238823ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul_where: 0.28528ms / 12 = 0.0237733ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.282882ms / 12 = 0.0235735ms, 4%
load: 0.139606ms / 243 = 0.000574509ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear_add_convert: 0.11139ms / 1 = 0.11139ms, 2%
multibroadcast: 0.110408ms / 98 = 0.00112661ms, 2%
hip::hip_copy_literal: 0.0868224ms / 149 = 0.000582701ms, 2%
reshape_lazy: 0.0709634ms / 95 = 0.000746983ms, 1%
transpose: 0.054488ms / 48 = 0.00113517ms, 1%
slice: 0.0448662ms / 36 = 0.00124628ms, 1%
gpu::code_object::add_kernel: 0.0239382ms / 1 = 0.0239382ms, 1%
gpu::code_object::add_layernorm_quantizelinear_kernel: 0.0238166ms / 1 = 0.0238166ms, 1%
gpu::code_object::gather_kernel: 0.0233278ms / 1 = 0.0233278ms, 1%
@param: 0.0093938ms / 26 = 0.0003613ms, 1%
hip::hip_allocate_memory: 0.0007862ms / 1 = 0.0007862ms, 1%
check_context::migraphx::gpu::context: 0.0006766ms / 1 = 0.0006766ms, 1%
Batch size: 1
Rate: 455.81 inferences/sec
Total time: 2.1939ms
Total instructions time: 7.26023ms
Overhead time: 0.193532ms, -5.06633ms
Overhead: 9%, -231%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --fp16 --int8
After building DynamicQuantizeLinear support into the MIGraphX EP + a newer develop, we're seeing the following.
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-13 02:18:41.520488', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '17.70', 'latency_95_percentile': '17.72', 'latency_99_percentile': '17.74', 'average_latency_ms': '17.60', 'QPS': '56.81'}
QPS: 56.81, which is almost half of the initial int8 run.
I think it is related to fast math.
Running fp16 on our driver as a baseline
Summary:
gpu::code_object::mlir_reshape_dot: 0.764539ms / 23 = 0.0332408ms, 14%
gpu::code_object::convert_kernel: 0.593802ms / 25 = 0.0237521ms, 11%
gpu::code_object::layernorm_mul_add_kernel: 0.570796ms / 24 = 0.0237832ms, 11%
gpu::code_object::contiguous_kernel: 0.538753ms / 24 = 0.022448ms, 10%
gpu::code_object::add_add_kernel: 0.51837ms / 23 = 0.0225378ms, 10%
gpu::code_object::mlir_reshape_dot_add: 0.387976ms / 13 = 0.0298443ms, 8%
gpu::code_object::mlir_transpose_reshape_dot: 0.331672ms / 12 = 0.0276394ms, 7%
gpu::code_object::add_pow_mul_add_mul_tanh_add_mul_mul_kernel: 0.304019ms / 12 = 0.0253349ms, 6%
gpu::code_object::softmax_kernel: 0.28573ms / 12 = 0.0238108ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_reshape_slice_transpose_dot_mul_where: 0.285179ms / 12 = 0.0237649ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_dot: 0.285079ms / 12 = 0.0237566ms, 6%
gpu::code_object::mlir_dot_add_convert: 0.148713ms / 1 = 0.148713ms, 3%
multibroadcast: 0.11099ms / 98 = 0.00113255ms, 3%
load: 0.0965821ms / 171 = 0.000564807ms, 2%
hip::hip_copy_literal: 0.0866334ms / 149 = 0.000581432ms, 2%
reshape_lazy: 0.0606654ms / 83 = 0.000730908ms, 2%
slice: 0.0382431ms / 24 = 0.00159346ms, 1%
gpu::code_object::add_layernorm_kernel: 0.0234014ms / 1 = 0.0234014ms, 1%
gpu::code_object::gather_kernel: 0.0232851ms / 1 = 0.0232851ms, 1%
gpu::code_object::add_kernel: 0.0230552ms / 1 = 0.0230552ms, 1%
transpose: 0.0194102ms / 24 = 0.00080876ms, 1%
@param: 0.00970014ms / 26 = 0.000373082ms, 1%
hip::hip_allocate_memory: 0.0007364ms / 1 = 0.0007364ms, 1%
check_context::migraphx::gpu::context: 0.0006434ms / 1 = 0.0006434ms, 1%
Batch size: 1
Rate: 562.497 inferences/sec
Total time: 1.77779ms
Total instructions time: 5.50797ms
Overhead time: 0.151713ms, -3.73019ms
Overhead: 9%, -210%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_fp16_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --fp16
This is similar to the fp16 run through the EP, which is around 549 QPS; in the EP we have fast math off by default right now due to the accuracy issue we saw previously.
Running this off the latest develop, I'm seeing this error now when trying to run the latest int8 model. Rolling back to the change from two days ago that added DynamicQuantizeLinear into the opset.
@1010 = convert[target_type=4](@1009) -> uint8_type, {1}, {1}, target_id=0
@1011 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1005) -> float_type, {1, 768}, {0, 0}, target_id=0
@1012 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1010) -> uint8_type, {1, 768}, {0, 0}, target_id=0
@1013 = quantizelinear(@999,@1011,@1012) -> uint8_type, {1, 768}, {768, 1}, target_id=0
@1014 = mul(@1005,@858) -> float_type, {1}, {1}, target_id=0
terminate called after throwing an instance of 'migraphx::version_1::exception'
what(): /workspace/migraphx/src/src/include/migraphx/check_shapes.hpp:210: same_type: quant_dot: Types do not match
Aborted (core dumped)
root@aus-navi3x-02:/workspace/onnxruntime/build/Linux/Release/onnxruntime/transformers/onnx_models# cd /workspace/migraphx/src/
I think this is related to migraphx::shape::uint8_type cropping up in the output. If I loosen the same_type constraint to only the fp8 types in quant_dot for now, I can get gpt2 to read correctly like before.
It appears that either we're not adding uint8 as part of our supported types, or we're hitting a case where the convert listed above happens after we perform the quantization, so when we go to compute_shape() we fail.
I'll add an eliminate_data_type pass and see if this helps to convert uint8->int8, although I think we'd need to be concerned about narrowing here for accuracy; currently building something to test this.
Runs still seem to fail. Without the eliminate_data_type pass for uint8, I'm getting rocBLAS failures.
With the added pass I get
Reading: gpt2_1_int8_gpu.onnx
terminate called after throwing an instance of 'migraphx::version_1::exception'
what(): /workspace/AMDMIGraphX/src/targets/gpu/mlir.cpp:706: run_high_level_pipeline: Invalid MLIR created: Error: 'migraphx.dot' op operand #0 must be !migraphx.shaped of 32-bit float or 16-bit float or bfloat16 type values, but got '!migraphx.shaped<32x768xi8, 768x1>'
Note: see current operation: %0 = "migraphx.dot"(%arg0, %arg1) : (!migraphx.shaped<32x768xi8, 768x1>, !migraphx.shaped<768x2304xi8, 2304x1>) -> !migraphx.shaped<32x2304xi8, 2304x1>
Aborted (core dumped)
Pushed up changes to the debug_quant_dot branch. May try a later ROCm build since I'm still using build 88 from the 6.0 release cycle.
Tried some more things, taking a look at quantizelinear, which defaults to returning uint8 if we only have two input args. Messing with that also seems to break with the same MLIR result as above.
Running the following after a read gives me this output, where we're seeing the uint8 type popping up:
migraphx-driver read gpt2_1_int8_gpu.onnx | grep quant_dot -b25 | head -26
83303-@1359 = gather[axis=0](@375,@1074) -> int64_type, {1}, {0}, target_id=0
83375-@1360 = gather[axis=0](@374,@1073) -> int64_type, {1}, {0}, target_id=0
83447-@1361 = slice[axes={0},starts={-1},ends={9223372036854775807}](@373) -> int64_type, {1}, {1}, target_id=0
83553-@1362 = unsqueeze[axes={0},steps={}](@1359) -> int64_type, {1}, {1}, target_id=0
83634-@1363 = unsqueeze[axes={0},steps={}](@1360) -> int64_type, {1}, {1}, target_id=0
83715-@1364 = squeeze[axes={0}](@1361) -> int64_type, {1}, {0}, target_id=0
83785-@1365 = concat[axis=0](@1362,@1363,@1068) -> int64_type, {3}, {1}, target_id=0
83864-@1366 = unsqueeze[axes={0},steps={}](@1364) -> int64_type, {1}, {1}, target_id=0
83945-@1367 = concat[axis=0](@1069,@1366) -> int64_type, {2}, {1}, target_id=0
84018-@1368 = reshape[dims={-1, 768}](@1358) -> float_type, {1, 768}, {768, 1}, target_id=0
84104-@1369 = reshape[dims={768}](@1368) -> float_type, {768}, {1}, target_id=0
84178-@1370 = concat[axis=0](@1369,@372) -> float_type, {769}, {1}, target_id=0
84252-@1371 = reduce_max[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84325-@1372 = reduce_min[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84398-@1373 = sub(@1371,@1372) -> float_type, {1}, {1}, target_id=0
84460-@1374 = div(@1373,@371) -> float_type, {1}, {1}, target_id=0
84521-@1375 = sub(@370,@1372) -> float_type, {1}, {1}, target_id=0
84582-@1376 = div(@1375,@1374) -> float_type, {1}, {1}, target_id=0
84644-@1377 = clip(@1376,@370,@369) -> float_type, {1}, {1}, target_id=0
84711-@1378 = nearbyint(@1377) -> float_type, {1}, {1}, target_id=0
84773-@1379 = convert[target_type=4](@1378) -> uint8_type, {1}, {1}, target_id=0
84848-@1380 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1374) -> float_type, {1, 768}, {0, 0}, target_id=0
84958-@1381 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1379) -> uint8_type, {1, 768}, {0, 0}, target_id=0
85068-@1382 = quantizelinear(@1368,@1380,@1381) -> uint8_type, {1, 768}, {768, 1}, target_id=0
85157-@1383 = mul(@1374,@1227) -> float_type, {1}, {1}, target_id=0
85219:@1384 = quant_dot(@1382,@1225) -> int32_type, {1, 2304}, {2304, 1}, target_id=0
Looks like this uint8 is sneaking in from how we handle DynamicQuantizeLinear. Not sure why we're assuming the type is supposed to be uint8 here instead of int8. Will need to investigate further next week.
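For context, a rough numpy rendering of DynamicQuantizeLinear as the ONNX spec defines it; the spec fixes both the quantized output and the zero point to uint8, which is where the uint8_type above comes from. Values and shape below are illustrative only.

```python
import numpy as np

def dynamic_quantize_linear(x: np.ndarray):
    """Reference-style DynamicQuantizeLinear: uint8 output and zero point."""
    qmin, qmax = 0.0, 255.0
    x_min = min(float(x.min()), 0.0)   # range is adjusted to include 0
    x_max = max(float(x.max()), 0.0)
    y_scale = (x_max - x_min) / (qmax - qmin)
    y_zero_point = np.uint8(np.clip(round(qmin - x_min / y_scale), qmin, qmax))
    y = np.clip(np.round(x / y_scale) + y_zero_point, qmin, qmax).astype(np.uint8)
    return y, np.float32(y_scale), y_zero_point

y, scale, zp = dynamic_quantize_linear(np.random.randn(1, 768).astype(np.float32))
```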
Changing the output target type for the zero point to int8 around line 140 in parse_dynamicquantizelinear seems to stop the uint8 from being inserted now:
83303-@1359 = gather[axis=0](@375,@1074) -> int64_type, {1}, {0}, target_id=0
83375-@1360 = gather[axis=0](@374,@1073) -> int64_type, {1}, {0}, target_id=0
83447-@1361 = slice[axes={0},starts={-1},ends={9223372036854775807}](@373) -> int64_type, {1}, {1}, target_id=0
83553-@1362 = unsqueeze[axes={0},steps={}](@1359) -> int64_type, {1}, {1}, target_id=0
83634-@1363 = unsqueeze[axes={0},steps={}](@1360) -> int64_type, {1}, {1}, target_id=0
83715-@1364 = squeeze[axes={0}](@1361) -> int64_type, {1}, {0}, target_id=0
83785-@1365 = concat[axis=0](@1362,@1363,@1068) -> int64_type, {3}, {1}, target_id=0
83864-@1366 = unsqueeze[axes={0},steps={}](@1364) -> int64_type, {1}, {1}, target_id=0
83945-@1367 = concat[axis=0](@1069,@1366) -> int64_type, {2}, {1}, target_id=0
84018-@1368 = reshape[dims={-1, 768}](@1358) -> float_type, {1, 768}, {768, 1}, target_id=0
84104-@1369 = reshape[dims={768}](@1368) -> float_type, {768}, {1}, target_id=0
84178-@1370 = concat[axis=0](@1369,@372) -> float_type, {769}, {1}, target_id=0
84252-@1371 = reduce_max[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84325-@1372 = reduce_min[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84398-@1373 = sub(@1371,@1372) -> float_type, {1}, {1}, target_id=0
84460-@1374 = div(@1373,@371) -> float_type, {1}, {1}, target_id=0
84521-@1375 = sub(@370,@1372) -> float_type, {1}, {1}, target_id=0
84582-@1376 = div(@1375,@1374) -> float_type, {1}, {1}, target_id=0
84644-@1377 = clip(@1376,@370,@369) -> float_type, {1}, {1}, target_id=0
84711-@1378 = nearbyint(@1377) -> float_type, {1}, {1}, target_id=0
84773-@1379 = convert[target_type=5](@1378) -> int8_type, {1}, {1}, target_id=0
84847-@1380 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1374) -> float_type, {1, 768}, {0, 0}, target_id=0
84957-@1381 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1379) -> int8_type, {1, 768}, {0, 0}, target_id=0
85066-@1382 = quantizelinear(@1368,@1380,@1381) -> uint8_type, {1, 768}, {768, 1}, target_id=0
85155-@1383 = mul(@1374,@1227) -> float_type, {1}, {1}, target_id=0
85217:@1384 = quant_dot(@1382,@1225) -> int32_type, {1, 2304}, {2304, 1}, target_id=0
Add a convert step at the end of parse_dynamicquantizelinear to handle this, as otherwise we'll bump up against MLIR convert issues.
Upscaled to int16 before doing the convert, to handle saturation before the conversion to int8 (uint8 -> int16, subtract 127, -> int8); a numpy sketch of this step follows the conversion block below.
Still seeing a perf drop though. Need to go over in the morning whether I need to add more to simplify_qdq.
Block performing the conversion
@1422 = sub(@1420,@1421) -> float_type, {1}, {1}, target_id=0
@1423 = div(@1422,@420) -> float_type, {1}, {1}, target_id=0
@1424 = sub(@419,@1421) -> float_type, {1}, {1}, target_id=0
@1425 = div(@1424,@1423) -> float_type, {1}, {1}, target_id=0
@1426 = clip(@1425,@419,@418) -> float_type, {1}, {1}, target_id=0
@1427 = nearbyint(@1426) -> float_type, {1}, {1}, target_id=0
@1428 = convert[target_type=4](@1427) -> uint8_type, {1}, {1}, target_id=0
@1429 = convert[target_type=7](@1428) -> int16_type, {1}, {1}, target_id=0
@1430 = add(@1429,@417) -> int16_type, {1}, {1}, target_id=0
@1431 = convert[target_type=5](@1430) -> int8_type, {1}, {1}, target_id=0
@1432 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1423) -> float_type, {1, 768}, {0, 0}, target_id=0
@1433 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1431) -> int8_type, {1, 768}, {0, 0}, target_id=0
@1434 = quantizelinear[out_type=nullopt](@1417,@1432,@1433) -> int8_type, {1, 768}, {768, 1}, target_id=0
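A small numpy sketch of what that widening buys (not the MIGraphX code itself): doing the shift in int16 means it cannot overflow an 8-bit intermediate before the final narrow to int8. The offset of 128 below is an assumption that recentres the full uint8 range onto int8; the note above mentions 127.

```python
import numpy as np

OFFSET = 128  # assumption: maps uint8 [0, 255] onto int8 [-128, 127]

u8 = np.array([0, 1, 127, 128, 254, 255], dtype=np.uint8)

# Widen to int16 first so the subtraction cannot wrap, then narrow to int8.
i8 = (u8.astype(np.int16) - OFFSET).astype(np.int8)
print(i8)  # [-128 -127 -1 0 126 127]
```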
Perf output
Summary:
gpu::code_object::reduce_min_kernel: 2.06482ms / 49 = 0.0421391ms, 11%
gpu::code_object::reduce_max_sub_mul_kernel: 2.0575ms / 49 = 0.0419897ms, 11%
gpu::code_object::mul_quantizelinear_kernel: 1.67524ms / 48 = 0.0349009ms, 9%
gpu::code_object::mul_kernel: 1.59728ms / 50 = 0.0319455ms, 9%
gpu::code_object::mlir_quant_dot: 1.35165ms / 47 = 0.0287586ms, 8%
gpu::code_object::quantizelinear_kernel: 1.20892ms / 37 = 0.0326734ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16375ms / 49 = 0.0237501ms, 7%
gpu::code_object::concat_kernel: 1.15656ms / 49 = 0.0236034ms, 7%
gpu::code_object::convert_kernel: 1.15122ms / 50 = 0.0230244ms, 6%
gpu::code_object::contiguous_kernel: 1.13786ms / 48 = 0.0237053ms, 6%
gpu::code_object::neg_div_clip_nearbyint_add_kernel: 1.13631ms / 49 = 0.0231901ms, 6%
gpu::code_object::layernorm_mul_add_kernel: 0.585409ms / 24 = 0.0243921ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.538739ms / 23 = 0.0234234ms, 3%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.389363ms / 13 = 0.029951ms, 3%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.363619ms / 13 = 0.0279707ms, 2%
load: 0.360501ms / 600 = 0.000600835ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.295485ms / 12 = 0.0246237ms, 2%
gpu::code_object::dequantizelinear_add_mul_mul_mul_mul_add_neg_sub_exp_add_div_mul_kernel: 0.288516ms / 12 = 0.024043ms, 2%
multibroadcast: 0.251145ms / 296 = 0.000848464ms, 2%
reshape_lazy: 0.128273ms / 180 = 0.00071263ms, 1%
hip::hip_copy_literal: 0.104623ms / 151 = 0.00069287ms, 1%
transpose: 0.063942ms / 48 = 0.00133213ms, 1%
slice: 0.0456442ms / 36 = 0.00126789ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0248973ms / 1 = 0.0248973ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0237355ms / 1 = 0.0237355ms, 1%
gpu::code_object::gather_kernel: 0.0237274ms / 1 = 0.0237274ms, 1%
@param: 0.0103636ms / 26 = 0.0003986ms, 1%
hip::hip_allocate_memory: 0.0011818ms / 1 = 0.0011818ms, 1%
check_context::migraphx::gpu::context: 0.0008362ms / 1 = 0.0008362ms, 1%
Batch size: 1
Rate: 157.206 inferences/sec
Total time: 6.36109ms
Total instructions time: 19.2011ms
Overhead time: 0.435307ms, -12.84ms
Overhead: 7%, -202%
[ MIGraphX Version: 2.9.0. ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --int8
Hey @pfultz2, got any ideas on how best to speed this one up? Should our quantize also be in MLIR here, not just the dequantize?
Latest changes in the PR seem to speed things up (remove the flattening via reshape/concat, serialize the min/max operations).
This alone appears to create about a 20% speedup relative to the original run on the int8 model.
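As a sanity check on the flattening removal, a small numpy sketch (assumed equivalence, hypothetical shapes): reducing the tensor directly and folding the 0 in afterwards gives the same min/max that the reshape/concat path produced.

```python
import numpy as np

x = np.random.randn(1, 768).astype(np.float32)

# Old style (as in the parsed graph): flatten, append 0, then reduce.
flat = np.concatenate([x.reshape(-1), np.zeros(1, dtype=np.float32)])
min_a, max_a = flat.min(), flat.max()

# Simplified style: reduce the tensor directly, then fold 0 into the range.
min_b, max_b = min(float(x.min()), 0.0), max(float(x.max()), 0.0)

assert min_a == min_b and max_a == max_b
```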
Summary:
gpu::code_object::reduce_min_min_kernel: 2.04025ms / 49 = 0.0416377ms, 13%
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.03854ms / 49 = 0.0416029ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66959ms / 48 = 0.0347831ms, 10%
gpu::code_object::mlir_quant_dot: 1.34554ms / 47 = 0.0286286ms, 9%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16123ms / 49 = 0.0236985ms, 7%
gpu::code_object::convert_kernel: 1.14344ms / 50 = 0.0228688ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.12494ms / 49 = 0.022958ms, 7%
gpu::code_object::mul_kernel: 1.11627ms / 49 = 0.022781ms, 7%
gpu::code_object::contiguous_kernel: 0.845617ms / 36 = 0.0234894ms, 6%
gpu::code_object::quantizelinear_kernel: 0.835653ms / 36 = 0.0232126ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.583655ms / 24 = 0.024319ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.537387ms / 23 = 0.0233647ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.388378ms / 13 = 0.0298752ms, 3%
load: 0.332797ms / 537 = 0.000619734ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.29401ms / 12 = 0.0245009ms, 2%
gpu::code_object::dequantizelinear_add_mul_mul_mul_mul_add_neg_sub_exp_add_div_mul_kernel: 0.286982ms / 12 = 0.0239152ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.280409ms / 12 = 0.0233674ms, 2%
multibroadcast: 0.252798ms / 295 = 0.000856943ms, 2%
hip::hip_copy_literal: 0.105973ms / 150 = 0.000706487ms, 1%
reshape_lazy: 0.0981803ms / 131 = 0.000749468ms, 1%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.0888045ms / 1 = 0.0888045ms, 1%
transpose: 0.059129ms / 48 = 0.00123185ms, 1%
slice: 0.0458985ms / 36 = 0.00127496ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0248396ms / 1 = 0.0248396ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0239322ms / 1 = 0.0239322ms, 1%
gpu::code_object::gather_kernel: 0.0237319ms / 1 = 0.0237319ms, 1%
@param: 0.0106455ms / 26 = 0.000409442ms, 1%
check_context::migraphx::gpu::context: 0.0011866ms / 1 = 0.0011866ms, 1%
hip::hip_allocate_memory: 0.00104084ms / 1 = 0.00104084ms, 1%
Batch size: 1
Rate: 191.057 inferences/sec
Total time: 5.23404ms
Total instructions time: 16.7609ms
Overhead time: 0.382316ms, -11.5268ms
Overhead: 7%, -220%
[ MIGraphX Version: 2.9.0. ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --int8
Still seeing a large amount of time spent in the reduction min/max steps.
Curious if the above block before the quantizelinear can be fused, as it adds a significant amount of time to the run:
gpu::code_object::reduce_min_min_kernel: 2.04496ms / 49 = 0.0417338ms, 13%
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.04381ms / 49 = 0.0417105ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.68006ms / 48 = 0.0350012ms, 10%
gpu::code_object::mlir_quant_dot: 1.35275ms / 47 = 0.028782ms, 9%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16634ms / 49 = 0.0238029ms, 7%
gpu::code_object::convert_kernel: 1.1519ms / 50 = 0.0230379ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.13642ms / 49 = 0.0231923ms, 7%
gpu::code_object::mul_kernel: 1.1247ms / 49 = 0.0229531ms, 7%
@pfultz2 the gpt2 model this issue stemmed from has the following repeated everywhere as part of the inserted dynamic quantization step.
Have initial changes in after also reworking MatMulInteger, following a discussion with @pfultz2.
@causten we're seeing about a 30% increase once we properly handle the input as quant_dot instead of plain dot for the ONNX model.
Summary of a run with only the change to the MatMulInteger parser. Toggling disable-fast-math on/off gives roughly the same ballpark speedup (211-212 QPS):
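For reference, a minimal ONNX graph containing a MatMulInteger node (shapes borrowed from the gpt2 IR above, otherwise illustrative): 8-bit A/B plus zero points producing an int32 accumulator, which is what now maps onto quant_dot rather than a plain dot.

```python
import onnx
from onnx import TensorProto, helper

# MatMulInteger: 8-bit operands and zero points, int32 result.
node = helper.make_node(
    "MatMulInteger",
    inputs=["A", "B", "a_zero_point", "b_zero_point"],
    outputs=["Y"],
)
graph = helper.make_graph(
    [node],
    "matmulinteger_example",
    inputs=[
        helper.make_tensor_value_info("A", TensorProto.UINT8, [1, 768]),
        helper.make_tensor_value_info("B", TensorProto.INT8, [768, 2304]),
        helper.make_tensor_value_info("a_zero_point", TensorProto.UINT8, []),
        helper.make_tensor_value_info("b_zero_point", TensorProto.INT8, []),
    ],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.INT32, [1, 2304])],
)
onnx.checker.check_model(helper.make_model(graph))
```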
Summary:
gpu::code_object::reduce_min_kernel: 2.06349ms / 49 = 0.042112ms, 14%
gpu::code_object::reduce_max_sub_mul_kernel: 2.05264ms / 49 = 0.0418906ms, 13%
gpu::code_object::mlir_quant_dot: 1.87745ms / 61 = 0.0307779ms, 12%
gpu::code_object::concat_kernel: 1.15502ms / 49 = 0.0235718ms, 8%
gpu::code_object::quantizelinear_sub_convert_add_convert_kernel: 1.14867ms / 49 = 0.0234422ms, 8%
gpu::code_object::mul_kernel: 1.13639ms / 49 = 0.0231916ms, 8%
gpu::code_object::neg_div_clip_nearbyint_convert_kernel: 1.13261ms / 49 = 0.0231144ms, 8%
gpu::code_object::contiguous_kernel: 1.12907ms / 48 = 0.0235223ms, 8%
gpu::code_object::quantizelinear_kernel: 0.836898ms / 36 = 0.0232472ms, 6%
gpu::code_object::layernorm_mul_add_kernel: 0.585637ms / 24 = 0.0244015ms, 4%
gpu::code_object::convert_mul_add_add_kernel: 0.545402ms / 23 = 0.0237131ms, 4%
gpu::code_object::convert_mul_add_kernel: 0.310343ms / 13 = 0.0238725ms, 2%
load: 0.307436ms / 515 = 0.000596963ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.294008ms / 12 = 0.0245007ms, 2%
gpu::code_object::convert_mul_add_mul_mul_add_mul_exp_add_div_kernel: 0.28876ms / 12 = 0.0240634ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.285526ms / 12 = 0.0237939ms, 2%
multibroadcast: 0.220696ms / 246 = 0.000897138ms, 2%
reshape_lazy: 0.119711ms / 180 = 0.000665061ms, 1%
hip::hip_copy_literal: 0.100321ms / 151 = 0.00066438ms, 1%
transpose: 0.0685118ms / 48 = 0.00142733ms, 1%
slice: 0.0486096ms / 36 = 0.00135027ms, 1%
gpu::code_object::convert_mul_kernel: 0.0291348ms / 1 = 0.0291348ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0249158ms / 1 = 0.0249158ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0237881ms / 1 = 0.0237881ms, 1%
gpu::code_object::gather_kernel: 0.023749ms / 1 = 0.023749ms, 1%
gpu::code_object::convert_kernel: 0.0230378ms / 1 = 0.0230378ms, 1%
@param: 0.00976646ms / 26 = 0.000375633ms, 1%
hip::hip_allocate_memory: 0.000916ms / 1 = 0.000916ms, 1%
check_context::migraphx::gpu::context: 0.000771ms / 1 = 0.000771ms, 1%
Batch size: 1
Rate: 212.998 inferences/sec
Total time: 4.69488ms
Total instructions time: 15.8433ms
Overhead time: 0.376437ms, -11.1484ms
Overhead: 8%, -237%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --int8
Changes pushed to: https://github.com/ROCm/AMDMIGraphX/pull/2903
Seeing a larger speedup with the MatMulInteger (#2903) + DynamicQuantizeLinear (#2896) fixes when running through ORT right now for gpt2. Testing other models through the driver appeared to show the correct speedup as well.
For gpt2 (shown below) it appears we're slightly faster than the fp16 runs now.
int8
root@aus-navi3x-02:/onnxruntime/onnxruntime/python/tools/transformers# python3 benchmark.py -g -m gpt2 --model_class AutoModelForCausalLM --sequence_length 32 384 --batch_sizes 1 8 --provider=migraphx -p int8 --disable_gelu --disable_layer_norm --disable_attention --disable_skip_layer_norm --disable_embed_layer_norm --disable_bias_skip_layer_norm --disable_bias_gelu -o no_opt
Arguments: Namespace(models=['gpt2'], model_source='pt', model_class='AutoModelForCausalLM', engines=['onnxruntime'], cache_dir='./cache_models', onnx_dir='./onnx_models', use_gpu=True, provider='migraphx', precision=<Precision.INT8: 'int8'>, verbose=False, overwrite=False, optimizer_info=<OptimizerInfo.NOOPT: 'no_opt'>, validate_onnx=False, fusion_csv=None, detail_csv=None, result_csv=None, input_counts=[1], test_times=100, batch_sizes=[1, 8], sequence_lengths=[32, 384], disable_ort_io_binding=False, num_threads=[16], force_num_layers=None, disable_attention=True, disable_skip_layer_norm=True, disable_embed_layer_norm=True, disable_bias_skip_layer_norm=True, disable_bias_gelu=True, disable_layer_norm=True, disable_gelu=True, enable_gelu_approximation=False, disable_shape_inference=False, enable_gemm_fast_gelu=False, use_mask_index=False, use_raw_attention_mask=False, no_attention_mask=False, use_multi_head_attention=False, disable_group_norm=False, disable_skip_group_norm=False, disable_packed_kv=False, disable_packed_qkv=False, disable_bias_add=False, disable_bias_splitgelu=False, disable_nhwc_conv=False, use_group_norm_channels_first=False, disable_rotary_embeddings=False)
OptimizerInfo is set to no_opt, graph optimizations specified in FusionOptions are not applied.
Model class name: AutoModelForCausalLM
Skip export since model existed: ./onnx_models/gpt2_1.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:29:43.346929', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.96', 'latency_95_percentile': '3.00', 'latency_99_percentile': '3.11', 'average_latency_ms': '2.67', 'QPS': '374.99'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:30:08.887040', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '7.75', 'latency_95_percentile': '7.78', 'latency_99_percentile': '7.81', 'average_latency_ms': '7.54', 'QPS': '132.68'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:30:44.223341', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '5.99', 'latency_95_percentile': '6.31', 'latency_99_percentile': '6.46', 'average_latency_ms': '5.92', 'QPS': '1351.07'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:31:09.392412', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '49.11', 'latency_95_percentile': '49.53', 'latency_99_percentile': '49.77', 'average_latency_ms': '48.40', 'QPS': '165.29'}
Detail results are saved to csv file: benchmark_detail_20240320-133156.csv
Summary results are saved to csv file: benchmark_summary_20240320-133156.csv
fp16 runs
root@aus-navi3x-02:/onnxruntime/onnxruntime/python/tools/transformers# python3 benchmark.py -g -m gpt2 --model_class AutoModelForCausalLM --sequence_length 32 384 --batch_sizes 1 8 --provider=migraphx -p fp16 --disable_gelu --disable_layer_norm --disable_attention --disable_skip_layer_norm --disable_embed_layer_norm --disable_bias_skip_layer_norm --disable_bias_gelu -o no_opt
Arguments: Namespace(models=['gpt2'], model_source='pt', model_class='AutoModelForCausalLM', engines=['onnxruntime'], cache_dir='./cache_models', onnx_dir='./onnx_models', use_gpu=True, provider='migraphx', precision=<Precision.FLOAT16: 'fp16'>, verbose=False, overwrite=False, optimizer_info=<OptimizerInfo.NOOPT: 'no_opt'>, validate_onnx=False, fusion_csv=None, detail_csv=None, result_csv=None, input_counts=[1], test_times=100, batch_sizes=[1, 8], sequence_lengths=[32, 384], disable_ort_io_binding=False, num_threads=[16], force_num_layers=None, disable_attention=True, disable_skip_layer_norm=True, disable_embed_layer_norm=True, disable_bias_skip_layer_norm=True, disable_bias_gelu=True, disable_layer_norm=True, disable_gelu=True, enable_gelu_approximation=False, disable_shape_inference=False, enable_gemm_fast_gelu=False, use_mask_index=False, use_raw_attention_mask=False, no_attention_mask=False, use_multi_head_attention=False, disable_group_norm=False, disable_skip_group_norm=False, disable_packed_kv=False, disable_packed_qkv=False, disable_bias_add=False, disable_bias_splitgelu=False, disable_nhwc_conv=False, use_group_norm_channels_first=False, disable_rotary_embeddings=False)
OptimizerInfo is set to no_opt, graph optimizations specified in FusionOptions are not applied.
Model class name: AutoModelForCausalLM
Skip export since model existed: ./onnx_models/gpt2_1.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:35:52.919367', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.76', 'latency_95_percentile': '2.78', 'latency_99_percentile': '2.79', 'average_latency_ms': '2.71', 'QPS': '368.84'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:36:16.149473', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '7.67', 'latency_95_percentile': '7.69', 'latency_99_percentile': '7.72', 'average_latency_ms': '7.49', 'QPS': '133.57'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:36:48.681642', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '6.07', 'latency_95_percentile': '6.33', 'latency_99_percentile': '6.52', 'average_latency_ms': '6.00', 'QPS': '1334.37'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:37:10.933650', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '48.85', 'latency_95_percentile': '49.07', 'latency_99_percentile': '49.15', 'average_latency_ms': '47.97', 'QPS': '166.79'}
Detail results are saved to csv file: benchmark_detail_20240320-133755.csv
Summary results are saved to csv file: benchmark_summary_20240320-133755.csv
rocMLIR will be added to MIGraphX. This will cover all the data types, but this issue will be to ensure the existing tests in DLM pass using int8.