Seems like #2300 will aid in accuracy for these runs.
Reopening; need to rerun testing with these changes + an updated ORT EP.
Resnet50 runs; need to do further analysis to compare against fp16 through the same pipeline.
Plan of attack
Added models will leverage existing e2e code; may need to write/borrow code for the benchmark.py that's used in parity checks.
Doing this as part of QA validation for the resnet50 pipeline added in onnxruntime-inference-examples.
Got DLM changes for benchmark.py, but they're failing off 6.0; sorting out issues with the run scripts.
Tracked by JIRA: https://ontrack-internal.amd.com/browse/SWDEV-410597
Able to get bert-large, bert-base-cased and distilgpt2 model runs into DLM by reusing existing runs. Missing gpt2 as referenced in #1905. Will need to add gpt2 + the equivalent int8 run.
We're letting onnxruntime do the quantization of the model before we do a run through the MIGraphX EP right now.
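A minimal sketch of that quantization step, assuming the standard ORT dynamic-quantization path (benchmark.py appears to drive this itself when invoked with `-p int8`; the exact call chain isn't shown here). File names are the ones used in the runs below.

```python
# Sketch only: let onnxruntime produce the int8 model before running it
# through the MIGraphX EP. quantize_dynamic inserts DynamicQuantizeLinear /
# MatMulInteger style ops for the tensors it quantizes.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="./onnx_models/gpt2_1.onnx",            # exported fp32 model
    model_output="./onnx_models/gpt2_1_int8_gpu.onnx",  # int8 model used below
    weight_type=QuantType.QInt8,
)
```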
Running these by hand to verify
bert_large_uncased int8 Quantized.
Finished quantizing model: ./onnx_models/bert_large_uncased_1_int8_gpu.onnx
Run onnxruntime on bert-large-uncased with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-large-uncased', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:36:35.944933', 'test_times': 100, 'latency_variance': '0.13', 'latency_90_percentile': '175.18', 'latency_95_percentile': '175.84', 'latency_99_percentile': '211.92', 'average_latency_ms': '176.60', 'QPS': '5.66'}
Run onnxruntime on bert-large-uncased with input shape [16, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-large-uncased', 'inputs': 1, 'threads': 16, 'batch_size': 16, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:36:54.211083', 'test_times': 100, 'latency_variance': '1164.47', 'latency_90_percentile': '5226.89', 'latency_95_percentile': '5381.81', 'latency_99_percentile': '5932.76', 'average_latency_ms': '3711.79', 'QPS': '4.31'}
bert_base_cased int8 Quantized
Finished quantizing model: ./onnx_models/bert_base_cased_1_int8_gpu.onnx
Run onnxruntime on bert-base-cased with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-base-cased', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:51:15.985903', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '12.94', 'latency_95_percentile': '15.32', 'latency_99_percentile': '15.75', 'average_latency_ms': '12.85', 'QPS': '77.80'}
Run onnxruntime on bert-base-cased with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-base-cased', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:51:17.620586', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '54.57', 'latency_95_percentile': '54.61', 'latency_99_percentile': '55.08', 'average_latency_ms': '54.20', 'QPS': '18.45'}
Run onnxruntime on bert-base-cased with input shape [32, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'bert-base-cased', 'inputs': 1, 'threads': 16, 'batch_size': 32, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:51:23.099155', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '127.41', 'latency_95_percentile': '127.57', 'latency_99_percentile': '128.02', 'average_latency_ms': '126.68', 'QPS': '252.61'}
Run onnxruntime on bert-base-cased with input shape [32, 384]
distilgpt2 int8 Quantized
Size of quantized ONNX model(MB):116.36144828796387
Finished quantizing model: ./onnx_models/distilgpt2_1_int8_gpu.onnx
Run onnxruntime on distilgpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:54:43.699617', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '8.07', 'latency_95_percentile': '10.20', 'latency_99_percentile': '10.26', 'average_latency_ms': '8.14', 'QPS': '122.79'}
Run onnxruntime on distilgpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 19:54:44.836106', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '45.96', 'latency_95_percentile': '45.97', 'latency_99_percentile': '46.01', 'average_latency_ms': '45.78', 'QPS': '21.84'}
Run onnxruntime on distilgpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 19:54:49.480584', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '32.90', 'latency_95_percentile': '32.95', 'latency_99_percentile': '33.15', 'average_latency_ms': '32.70', 'QPS': '244.62'}
Run onnxruntime on distilgpt2 with input shape [8, 384]
Got a gpt2 run with int8 quant here.
quantized model saved to:./onnx_models/gpt2_1_int8_gpu.onnx
Size of quantized ONNX model(MB):157.34468364715576
Finished quantizing model: ./onnx_models/gpt2_1_int8_gpu.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:05.271922', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '17.57', 'latency_95_percentile': '17.61', 'latency_99_percentile': '17.68', 'average_latency_ms': '15.39', 'QPS': '65.00'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:07.146352', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '72.94', 'latency_95_percentile': '73.17', 'latency_99_percentile': '73.99', 'average_latency_ms': '72.80', 'QPS': '13.74'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:14.528135', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '52.48', 'latency_95_percentile': '52.50', 'latency_99_percentile': '52.58', 'average_latency_ms': '52.30', 'QPS': '152.97'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-06 20:57:19.830109', 'test_times': 100, 'latency_variance': '0.01', 'latency_90_percentile': '529.51', 'latency_95_percentile': '530.92', 'latency_99_percentile': '544.40', 'average_latency_ms': '529.13', 'QPS': '15.12'}
Fusion statistics is saved to csv file: benchmark_fusion_20231206-205813.csv
Detail results are saved to csv file: /tmp/results.csv
Summary results are saved to csv file: benchmark_summary_20231206-205813.csv
Changes pushed into #2468.
Not seeing the proper code path when running a trace compile for the tests in DLM with MIGRAPHX_TRACE_EVAL=1, after investigating the large drop in performance compared to the fp16 versions. Sorting this out before I close this out.
Seeing about an order of magnitude drop between fp16 and int8 runs, e.g. distilgpt2 below:
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:38.877793', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '10.61', 'latency_95_percentile': '10.79', 'latency_99_percentile': '11.09', 'average_latency_ms': '10.57', 'QPS': '94.61'}
Run onnxruntime on distilgpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:40.258887', 'test_times': 100, 'latency_variance': '0.19', 'latency_90_percentile': '80.93', 'latency_95_percentile': '81.12', 'latency_99_percentile': '81.47', 'average_latency_ms': '61.06', 'QPS': '16.38'}
Run onnxruntime on distilgpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:46.467054', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '36.39', 'latency_95_percentile': '36.77', 'latency_99_percentile': '36.95', 'average_latency_ms': '35.85', 'QPS': '223.17'}
Run onnxruntime on distilgpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:41:50.101417', 'test_times': 100, 'latency_variance': '1.09', 'latency_90_percentile': '362.16', 'latency_95_percentile': '409.24', 'latency_99_percentile': '554.73', 'average_latency_ms': '363.10', 'QPS': '22.03'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:06.612680', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '1.03', 'latency_95_percentile': '1.03', 'latency_99_percentile': '1.04', 'average_latency_ms': '1.01', 'QPS': '987.05'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:23.724592', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.03', 'latency_95_percentile': '2.04', 'latency_99_percentile': '2.11', 'average_latency_ms': '2.00', 'QPS': '500.15'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:01.631328', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.02', 'latency_95_percentile': '2.13', 'latency_99_percentile': '2.16', 'average_latency_ms': '1.97', 'QPS': '4051.12'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:19.493161', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '10.90', 'latency_95_percentile': '10.92', 'latency_99_percentile': '11.03', 'average_latency_ms': '10.72', 'QPS': '746.60'}
It appears we're bouncing between all 3 EPs when doing int8 runs.
Attempting to run the model in the driver I'm seeing the following:
terminate called after throwing an instance of 'migraphx::version_2_8_0::exception'
what(): /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/AMDMIGraphX/src/onnx/onnx_parser.cpp:417: parse_graph: Unknown operator: DynamicQuantizeLinear
Aborted (core dumped)
This is most likely why the EP doesn't find that op and falls back, with the remaining nodes placed onto the other EPs.
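A quick way to confirm where nodes end up (a sketch of plain onnxruntime usage, not the benchmark harness): create the session with verbose logging and the same provider priority list; the graph partitioning and any fallback to the ROCm/CPU EPs shows up in the log output.

```python
import onnxruntime as ort

# Sketch only: verbose logging makes ORT print how the graph is partitioned
# across the registered execution providers, so fallbacks become visible.
so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE

sess = ort.InferenceSession(
    "./onnx_models/gpt2_1_int8_gpu.onnx",
    sess_options=so,
    providers=[
        "MIGraphXExecutionProvider",
        "ROCMExecutionProvider",
        "CPUExecutionProvider",
    ],
)
print(sess.get_providers())
```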
Hmm we added support for that operator
Looks like we're using an older version.
MIGraphX Version: 2.8.0.7f8f0fd0f
Also, DynamicQuantizeLinear isn't in the EP's list of supported ops. Have a patch coming in to add it.
Got fixes up to here:
Upstream - https://github.com/microsoft/onnxruntime/pull/18798 Internal - https://github.com/ROCmSoftwarePlatform/onnxruntime/pull/26
Running a test with the latest develop + the onnxruntime change + the DLM container for the gpt2 int8 test. Will try to run the other models and analyze once complete as well.
Using this to build the end-to-end pipeline:
python3 tools/run_models.py --tags migx_onnxrt_gpt2_quant_benchmarks --liveOutput --cleanDockerCache --additionalContext "{'guest_os':'UBUNTU', \
'docker_build_arg':{\
'BASE_DOCKER':'compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.0:88_ubuntu22.04_py3.10_pytorch_release-2.1_011de5c', \
'ORT_UNIT_TESTS':'false', 'ORT_BUILD':'true', 'ONNXRUNTIME_BRANCH':'add_dynamic_quantize_linear','ONNXRUNTIME_REPO':'https://github.com/ROCmSoftwarePlatform/onnxruntime', 'MIGX_BUILD':'true'}}"
@causten for the MIGraphX side, looking at develop, you're right, we should have this op in. Need to figure out where APT is getting things from here.
A build off develop seems to read the int8 model correctly:
@2689 = @return(@2688,@1075,@1077,@1211,@1213,@1347,@1349,@1483,@1485,@1619,@1621,@1755,@1757,@1891,@1893,@2027,@2029,@2163,@2165,@2299,@2301,@2435,@2437,@2571,@2573), target_id=0
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver read gpt2_1_int8_gpu.onnx
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver run gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids
From the ORT benchmark test, the fp16 run got:
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:06.612680', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '1.03', 'latency_95_percentile': '1.03', 'latency_99_percentile': '1.04', 'average_latency_ms': '1.01', 'QPS': '987.05'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:43:23.724592', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.03', 'latency_95_percentile': '2.04', 'latency_99_percentile': '2.11', 'average_latency_ms': '2.00', 'QPS': '500.15'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:01.631328', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.02', 'latency_95_percentile': '2.13', 'latency_99_percentile': '2.16', 'average_latency_ms': '1.97', 'QPS': '4051.12'}
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'distilgpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-08 22:44:19.493161', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '10.90', 'latency_95_percentile': '10.92', 'latency_99_percentile': '11.03', 'average_latency_ms': '10.72', 'QPS': '746.60'}
Edit: rerunning this test off develop + the latest changes for FP16, I get the following timings, which are about 50% lower than previously. My understanding here is that we have fast math disabled for both cases, so that shouldn't be the cause.
Model saved to ./onnx_models/gpt2_1_fp16_gpu.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-13 03:06:14.856181', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '1.83', 'latency_95_percentile': '1.84', 'latency_99_percentile': '1.87', 'average_latency_ms': '1.82', 'QPS': '549.61'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-13 03:06:42.573559', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '3.63', 'latency_95_percentile': '3.64', 'latency_99_percentile': '3.67', 'average_latency_ms': '3.55', 'QPS': '282.03'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-13 03:07:27.814779', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '3.41', 'latency_95_percentile': '3.66', 'latency_99_percentile': '3.71', 'average_latency_ms': '3.38', 'QPS': '2368.26'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2023-12-13 03:07:56.767108', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '18.09', 'latency_95_percentile': '18.24', 'latency_99_percentile': '18.71', 'average_latency_ms': '17.62', 'QPS': '454.15'}
Trying a perf run in the driver with only int8 enabled, we're seeing the following on our end:
gpu::code_object::quantizelinear_kernel: 1.37962ms / 60 = 0.0229937ms, 21%
gpu::code_object::contiguous_kernel: 0.835693ms / 36 = 0.0232137ms, 13%
gpu::code_object::mlir_reshape_quant_dot: 0.755034ms / 23 = 0.0328275ms, 12%
gpu::code_object::layernorm_mul_add_quantizelinear_kernel: 0.581659ms / 24 = 0.0242358ms, 9%
gpu::code_object::dequantizelinear_add_add_kernel: 0.525061ms / 23 = 0.0228287ms, 8%
gpu::code_object::mlir_reshape_quant_dot_dequantizelinear_add: 0.387697ms / 13 = 0.0298228ms, 6%
gpu::code_object::mlir_quant_dot: 0.338115ms / 12 = 0.0281762ms, 6%
gpu::code_object::dequantizelinear_add_pow_mul_add_mul_tanh_add_mul_mul_quantizelinear_kernel: 0.309476ms / 12 = 0.0257897ms, 5%
gpu::code_object::softmax_kernel: 0.289521ms / 12 = 0.0241267ms, 5%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul_where: 0.283153ms / 12 = 0.0235961ms, 5%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.279093ms / 12 = 0.0232578ms, 5%
load: 0.119553ms / 219 = 0.000545903ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.117032ms / 1 = 0.117032ms, 2%
multibroadcast: 0.111057ms / 98 = 0.00113324ms, 2%
hip::hip_copy_literal: 0.0854196ms / 149 = 0.000573286ms, 2%
reshape_lazy: 0.0637735ms / 95 = 0.0006713ms, 1%
transpose: 0.0545841ms / 48 = 0.00113717ms, 1%
slice: 0.0419648ms / 36 = 0.00116569ms, 1%
gpu::code_object::add_layernorm_quantizelinear_kernel: 0.0241358ms / 1 = 0.0241358ms, 1%
gpu::code_object::gather_kernel: 0.0231269ms / 1 = 0.0231269ms, 1%
gpu::code_object::add_kernel: 0.0230521ms / 1 = 0.0230521ms, 1%
gpu::code_object::convert_kernel: 0.0227299ms / 1 = 0.0227299ms, 1%
@param: 0.00901038ms / 26 = 0.000346553ms, 1%
hip::hip_allocate_memory: 0.0008224ms / 1 = 0.0008224ms, 1%
check_context::migraphx::gpu::context: 0.00066418ms / 1 = 0.00066418ms, 1%
Batch size: 1
Rate: 585.383 inferences/sec
Total time: 1.70828ms
Total instructions time: 6.66105ms
Overhead time: 0.18396ms, -4.95276ms
Overhead: 11%, -290%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --int8
A solely fp16 run gives us:
Summary:
gpu::code_object::mlir_reshape_dot: 0.826736ms / 23 = 0.0359451ms, 16%
gpu::code_object::convert_kernel: 0.57429ms / 25 = 0.0229716ms, 11%
gpu::code_object::layernorm_mul_add_kernel: 0.566644ms / 24 = 0.0236102ms, 11%
gpu::code_object::contiguous_kernel: 0.533037ms / 24 = 0.0222099ms, 10%
gpu::code_object::add_add_kernel: 0.514395ms / 23 = 0.022365ms, 10%
gpu::code_object::mlir_reshape_dot_add: 0.391352ms / 13 = 0.030104ms, 8%
gpu::code_object::mlir_transpose_reshape_dot: 0.324614ms / 12 = 0.0270511ms, 6%
gpu::code_object::add_pow_mul_add_mul_tanh_add_mul_mul_kernel: 0.300936ms / 12 = 0.025078ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_reshape_slice_transpose_dot_mul_where: 0.276505ms / 12 = 0.023042ms, 6%
gpu::code_object::softmax_kernel: 0.275471ms / 12 = 0.0229559ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_dot: 0.27312ms / 12 = 0.02276ms, 6%
gpu::code_object::mlir_dot_add_convert: 0.148124ms / 1 = 0.148124ms, 3%
multibroadcast: 0.0958457ms / 98 = 0.000978017ms, 2%
load: 0.0918247ms / 171 = 0.000536986ms, 2%
hip::hip_copy_literal: 0.0814319ms / 149 = 0.000546523ms, 2%
reshape_lazy: 0.0574439ms / 83 = 0.000692095ms, 2%
slice: 0.0288253ms / 24 = 0.00120106ms, 1%
gpu::code_object::add_layernorm_kernel: 0.0230485ms / 1 = 0.0230485ms, 1%
gpu::code_object::gather_kernel: 0.0226782ms / 1 = 0.0226782ms, 1%
gpu::code_object::add_kernel: 0.0221361ms / 1 = 0.0221361ms, 1%
transpose: 0.0186464ms / 24 = 0.000776932ms, 1%
@param: 0.0092624ms / 26 = 0.000356246ms, 1%
hip::hip_allocate_memory: 0.0007636ms / 1 = 0.0007636ms, 1%
check_context::migraphx::gpu::context: 0.0006462ms / 1 = 0.0006462ms, 1%
Batch size: 1
Rate: 624.128 inferences/sec
Total time: 1.60223ms
Total instructions time: 5.45778ms
Overhead time: 0.148524ms, -3.85554ms
Overhead: 9%, -241%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --fp16
With mixed int8 and fp16 we get the following
Summary:
gpu::code_object::quantizelinear_kernel: 1.38454ms / 60 = 0.0230757ms, 20%
gpu::code_object::contiguous_kernel: 0.819941ms / 36 = 0.0227761ms, 12%
gpu::code_object::mlir_reshape_quant_dot: 0.759796ms / 23 = 0.0330346ms, 11%
gpu::code_object::convert_kernel: 0.597603ms / 25 = 0.0239041ms, 9%
gpu::code_object::layernorm_mul_add_quantizelinear_kernel: 0.576765ms / 24 = 0.0240319ms, 8%
gpu::code_object::dequantizelinear_add_add_kernel: 0.526679ms / 23 = 0.0228991ms, 8%
gpu::code_object::mlir_reshape_quant_dot_dequantizelinear_add: 0.388696ms / 13 = 0.0298997ms, 6%
gpu::code_object::mlir_quant_dot: 0.339814ms / 12 = 0.0283178ms, 5%
gpu::code_object::dequantizelinear_add_pow_mul_add_mul_tanh_add_mul_mul_quantizelinear_kernel: 0.311157ms / 12 = 0.0259297ms, 5%
gpu::code_object::softmax_kernel: 0.286588ms / 12 = 0.0238823ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul_where: 0.28528ms / 12 = 0.0237733ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.282882ms / 12 = 0.0235735ms, 4%
load: 0.139606ms / 243 = 0.000574509ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear_add_convert: 0.11139ms / 1 = 0.11139ms, 2%
multibroadcast: 0.110408ms / 98 = 0.00112661ms, 2%
hip::hip_copy_literal: 0.0868224ms / 149 = 0.000582701ms, 2%
reshape_lazy: 0.0709634ms / 95 = 0.000746983ms, 1%
transpose: 0.054488ms / 48 = 0.00113517ms, 1%
slice: 0.0448662ms / 36 = 0.00124628ms, 1%
gpu::code_object::add_kernel: 0.0239382ms / 1 = 0.0239382ms, 1%
gpu::code_object::add_layernorm_quantizelinear_kernel: 0.0238166ms / 1 = 0.0238166ms, 1%
gpu::code_object::gather_kernel: 0.0233278ms / 1 = 0.0233278ms, 1%
@param: 0.0093938ms / 26 = 0.0003613ms, 1%
hip::hip_allocate_memory: 0.0007862ms / 1 = 0.0007862ms, 1%
check_context::migraphx::gpu::context: 0.0006766ms / 1 = 0.0006766ms, 1%
Batch size: 1
Rate: 455.81 inferences/sec
Total time: 2.1939ms
Total instructions time: 7.26023ms
Overhead time: 0.193532ms, -5.06633ms
Overhead: 9%, -231%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --fp16 --int8
After building DynamicQuantizeLinear support into the MIGraphX EP + a newer develop, we're seeing the following.
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.BYSCRIPT: 'by_script'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2023-12-13 02:18:41.520488', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '17.70', 'latency_95_percentile': '17.72', 'latency_99_percentile': '17.74', 'average_latency_ms': '17.60', 'QPS': '56.81'}
QPS: 56.81, which is almost half of the initial int8 run.
I think it is related to fast math.
Running fp16 on our driver as a baseline
Summary:
gpu::code_object::mlir_reshape_dot: 0.764539ms / 23 = 0.0332408ms, 14%
gpu::code_object::convert_kernel: 0.593802ms / 25 = 0.0237521ms, 11%
gpu::code_object::layernorm_mul_add_kernel: 0.570796ms / 24 = 0.0237832ms, 11%
gpu::code_object::contiguous_kernel: 0.538753ms / 24 = 0.022448ms, 10%
gpu::code_object::add_add_kernel: 0.51837ms / 23 = 0.0225378ms, 10%
gpu::code_object::mlir_reshape_dot_add: 0.387976ms / 13 = 0.0298443ms, 8%
gpu::code_object::mlir_transpose_reshape_dot: 0.331672ms / 12 = 0.0276394ms, 7%
gpu::code_object::add_pow_mul_add_mul_tanh_add_mul_mul_kernel: 0.304019ms / 12 = 0.0253349ms, 6%
gpu::code_object::softmax_kernel: 0.28573ms / 12 = 0.0238108ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_reshape_slice_transpose_dot_mul_where: 0.285179ms / 12 = 0.0237649ms, 6%
gpu::code_object::mlir_reshape_transpose_slice_dot: 0.285079ms / 12 = 0.0237566ms, 6%
gpu::code_object::mlir_dot_add_convert: 0.148713ms / 1 = 0.148713ms, 3%
multibroadcast: 0.11099ms / 98 = 0.00113255ms, 3%
load: 0.0965821ms / 171 = 0.000564807ms, 2%
hip::hip_copy_literal: 0.0866334ms / 149 = 0.000581432ms, 2%
reshape_lazy: 0.0606654ms / 83 = 0.000730908ms, 2%
slice: 0.0382431ms / 24 = 0.00159346ms, 1%
gpu::code_object::add_layernorm_kernel: 0.0234014ms / 1 = 0.0234014ms, 1%
gpu::code_object::gather_kernel: 0.0232851ms / 1 = 0.0232851ms, 1%
gpu::code_object::add_kernel: 0.0230552ms / 1 = 0.0230552ms, 1%
transpose: 0.0194102ms / 24 = 0.00080876ms, 1%
@param: 0.00970014ms / 26 = 0.000373082ms, 1%
hip::hip_allocate_memory: 0.0007364ms / 1 = 0.0007364ms, 1%
check_context::migraphx::gpu::context: 0.0006434ms / 1 = 0.0006434ms, 1%
Batch size: 1
Rate: 562.497 inferences/sec
Total time: 1.77779ms
Total instructions time: 5.50797ms
Overhead time: 0.151713ms, -3.73019ms
Overhead: 9%, -210%
[ MIGraphX Version: 2.9.0.5fe1b07 ] Complete: migraphx-driver perf gpt2_1_fp16_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --fp16
This is similar to the fp16 run through the EP, which is around 549 QPS; in the EP we have fast math off by default right now due to the accuracy issue we saw previously.
Running this off the latest develop, I'm seeing this error now when trying to run the latest int8 model. Rolling back to the change from two days ago that added DynamicQuantizeLinear into the opset.
@1010 = convert[target_type=4](@1009) -> uint8_type, {1}, {1}, target_id=0
@1011 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1005) -> float_type, {1, 768}, {0, 0}, target_id=0
@1012 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1010) -> uint8_type, {1, 768}, {0, 0}, target_id=0
@1013 = quantizelinear(@999,@1011,@1012) -> uint8_type, {1, 768}, {768, 1}, target_id=0
@1014 = mul(@1005,@858) -> float_type, {1}, {1}, target_id=0
terminate called after throwing an instance of 'migraphx::version_1::exception'
what(): /workspace/migraphx/src/src/include/migraphx/check_shapes.hpp:210: same_type: quant_dot: Types do not match
Aborted (core dumped)
root@aus-navi3x-02:/workspace/onnxruntime/build/Linux/Release/onnxruntime/transformers/onnx_models# cd /workspace/migraphx/src/
I think this is related to migraphx::shape::uint8_type cropping up in the output. If I loosen the same_type constraint to only the fp8 types in quant_dot for now, I can get gpt2 to read correctly like before.
It appears that either we're not adding uint8 as part of our supported types, or we're hitting a case where the convert listed above happens after we perform the quantization, so when we go to compute_shape() we fail.
I'll add an eliminate_data_type pass and see if this helps to convert uint8->int8, although I think we'd need to be concerned about narrowing here for accuracy; currently building something to test this.
Runs still seem to fail. Without the eliminate_data_type pass for uint8, I'm getting rocBLAS failures.
With the added pass I get
Reading: gpt2_1_int8_gpu.onnx
terminate called after throwing an instance of 'migraphx::version_1::exception'
what(): /workspace/AMDMIGraphX/src/targets/gpu/mlir.cpp:706: run_high_level_pipeline: Invalid MLIR created: Error: 'migraphx.dot' op operand #0 must be !migraphx.shaped of 32-bit float or 16-bit float or bfloat16 type values, but got '!migraphx.shaped<32x768xi8, 768x1>'
Note: see current operation: %0 = "migraphx.dot"(%arg0, %arg1) : (!migraphx.shaped<32x768xi8, 768x1>, !migraphx.shaped<768x2304xi8, 2304x1>) -> !migraphx.shaped<32x2304xi8, 2304x1>
Aborted (core dumped)
Pushed up changes to the debug_quant_dot branch. May try a later ROCm build since I'm still using build 88 from the 6.0 release cycle.
Tried some more things, taking a look at quantizelinear, which defaults to returning uint8 if we only have two input args. Messing with that also seems to break with the same MLIR result as above.
Running the following after a read gives me this output, where we're seeing the uint8 type popping up:
migraphx-driver read gpt2_1_int8_gpu.onnx | grep quant_dot -b25 | head -26
83303-@1359 = gather[axis=0](@375,@1074) -> int64_type, {1}, {0}, target_id=0
83375-@1360 = gather[axis=0](@374,@1073) -> int64_type, {1}, {0}, target_id=0
83447-@1361 = slice[axes={0},starts={-1},ends={9223372036854775807}](@373) -> int64_type, {1}, {1}, target_id=0
83553-@1362 = unsqueeze[axes={0},steps={}](@1359) -> int64_type, {1}, {1}, target_id=0
83634-@1363 = unsqueeze[axes={0},steps={}](@1360) -> int64_type, {1}, {1}, target_id=0
83715-@1364 = squeeze[axes={0}](@1361) -> int64_type, {1}, {0}, target_id=0
83785-@1365 = concat[axis=0](@1362,@1363,@1068) -> int64_type, {3}, {1}, target_id=0
83864-@1366 = unsqueeze[axes={0},steps={}](@1364) -> int64_type, {1}, {1}, target_id=0
83945-@1367 = concat[axis=0](@1069,@1366) -> int64_type, {2}, {1}, target_id=0
84018-@1368 = reshape[dims={-1, 768}](@1358) -> float_type, {1, 768}, {768, 1}, target_id=0
84104-@1369 = reshape[dims={768}](@1368) -> float_type, {768}, {1}, target_id=0
84178-@1370 = concat[axis=0](@1369,@372) -> float_type, {769}, {1}, target_id=0
84252-@1371 = reduce_max[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84325-@1372 = reduce_min[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84398-@1373 = sub(@1371,@1372) -> float_type, {1}, {1}, target_id=0
84460-@1374 = div(@1373,@371) -> float_type, {1}, {1}, target_id=0
84521-@1375 = sub(@370,@1372) -> float_type, {1}, {1}, target_id=0
84582-@1376 = div(@1375,@1374) -> float_type, {1}, {1}, target_id=0
84644-@1377 = clip(@1376,@370,@369) -> float_type, {1}, {1}, target_id=0
84711-@1378 = nearbyint(@1377) -> float_type, {1}, {1}, target_id=0
84773-@1379 = convert[target_type=4](@1378) -> uint8_type, {1}, {1}, target_id=0
84848-@1380 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1374) -> float_type, {1, 768}, {0, 0}, target_id=0
84958-@1381 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1379) -> uint8_type, {1, 768}, {0, 0}, target_id=0
85068-@1382 = quantizelinear(@1368,@1380,@1381) -> uint8_type, {1, 768}, {768, 1}, target_id=0
85157-@1383 = mul(@1374,@1227) -> float_type, {1}, {1}, target_id=0
85219:@1384 = quant_dot(@1382,@1225) -> int32_type, {1, 2304}, {2304, 1}, target_id=0
Looks like this uint8 is sneaking in from how we handle DynamicQuantizeLinear. Not sure why we're assuming the type is supposed to be uint8 here instead of int8. Will need to investigate further next week.
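For context, a rough numpy rendering of DynamicQuantizeLinear as the ONNX spec defines it; the spec fixes both the quantized output and the zero point to uint8, which is where the uint8_type above comes from. Values and shape below are illustrative only.

```python
import numpy as np

def dynamic_quantize_linear(x: np.ndarray):
    """Reference-style DynamicQuantizeLinear: uint8 output and zero point."""
    qmin, qmax = 0.0, 255.0
    x_min = min(float(x.min()), 0.0)   # range is adjusted to include 0
    x_max = max(float(x.max()), 0.0)
    y_scale = (x_max - x_min) / (qmax - qmin)
    y_zero_point = np.uint8(np.clip(round(qmin - x_min / y_scale), qmin, qmax))
    y = np.clip(np.round(x / y_scale) + y_zero_point, qmin, qmax).astype(np.uint8)
    return y, np.float32(y_scale), y_zero_point

y, scale, zp = dynamic_quantize_linear(np.random.randn(1, 768).astype(np.float32))
```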
Changing the output target type for the zero point to int8 around line 140 in parse_dynamicquantizelinear seems to stop the uint8 from being inserted now:
83303-@1359 = gather[axis=0](@375,@1074) -> int64_type, {1}, {0}, target_id=0
83375-@1360 = gather[axis=0](@374,@1073) -> int64_type, {1}, {0}, target_id=0
83447-@1361 = slice[axes={0},starts={-1},ends={9223372036854775807}](@373) -> int64_type, {1}, {1}, target_id=0
83553-@1362 = unsqueeze[axes={0},steps={}](@1359) -> int64_type, {1}, {1}, target_id=0
83634-@1363 = unsqueeze[axes={0},steps={}](@1360) -> int64_type, {1}, {1}, target_id=0
83715-@1364 = squeeze[axes={0}](@1361) -> int64_type, {1}, {0}, target_id=0
83785-@1365 = concat[axis=0](@1362,@1363,@1068) -> int64_type, {3}, {1}, target_id=0
83864-@1366 = unsqueeze[axes={0},steps={}](@1364) -> int64_type, {1}, {1}, target_id=0
83945-@1367 = concat[axis=0](@1069,@1366) -> int64_type, {2}, {1}, target_id=0
84018-@1368 = reshape[dims={-1, 768}](@1358) -> float_type, {1, 768}, {768, 1}, target_id=0
84104-@1369 = reshape[dims={768}](@1368) -> float_type, {768}, {1}, target_id=0
84178-@1370 = concat[axis=0](@1369,@372) -> float_type, {769}, {1}, target_id=0
84252-@1371 = reduce_max[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84325-@1372 = reduce_min[axes={0}](@1370) -> float_type, {1}, {1}, target_id=0
84398-@1373 = sub(@1371,@1372) -> float_type, {1}, {1}, target_id=0
84460-@1374 = div(@1373,@371) -> float_type, {1}, {1}, target_id=0
84521-@1375 = sub(@370,@1372) -> float_type, {1}, {1}, target_id=0
84582-@1376 = div(@1375,@1374) -> float_type, {1}, {1}, target_id=0
84644-@1377 = clip(@1376,@370,@369) -> float_type, {1}, {1}, target_id=0
84711-@1378 = nearbyint(@1377) -> float_type, {1}, {1}, target_id=0
84773-@1379 = convert[target_type=5](@1378) -> int8_type, {1}, {1}, target_id=0
84847-@1380 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1374) -> float_type, {1, 768}, {0, 0}, target_id=0
84957-@1381 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1379) -> int8_type, {1, 768}, {0, 0}, target_id=0
85066-@1382 = quantizelinear(@1368,@1380,@1381) -> uint8_type, {1, 768}, {768, 1}, target_id=0
85155-@1383 = mul(@1374,@1227) -> float_type, {1}, {1}, target_id=0
85217:@1384 = quant_dot(@1382,@1225) -> int32_type, {1, 2304}, {2304, 1}, target_id=0
Add a convert step at the end of parse_dynamicquantizelinear to handle this, as otherwise we'll bump up against MLIR convert issues.
Upscaled to int16 before doing the convert, to handle saturation before the conversion to int8 (uint8 -> int16, subtract 127, -> int8); a numpy sketch of this step follows the conversion block below.
Still seeing a perf drop though. Need to go over in the morning whether I need to add more to simplify_qdq.
Block performing the conversion
@1422 = sub(@1420,@1421) -> float_type, {1}, {1}, target_id=0
@1423 = div(@1422,@420) -> float_type, {1}, {1}, target_id=0
@1424 = sub(@419,@1421) -> float_type, {1}, {1}, target_id=0
@1425 = div(@1424,@1423) -> float_type, {1}, {1}, target_id=0
@1426 = clip(@1425,@419,@418) -> float_type, {1}, {1}, target_id=0
@1427 = nearbyint(@1426) -> float_type, {1}, {1}, target_id=0
@1428 = convert[target_type=4](@1427) -> uint8_type, {1}, {1}, target_id=0
@1429 = convert[target_type=7](@1428) -> int16_type, {1}, {1}, target_id=0
@1430 = add(@1429,@417) -> int16_type, {1}, {1}, target_id=0
@1431 = convert[target_type=5](@1430) -> int8_type, {1}, {1}, target_id=0
@1432 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1423) -> float_type, {1, 768}, {0, 0}, target_id=0
@1433 = multibroadcast[out_lens={1, 768},out_dyn_dims={}](@1431) -> int8_type, {1, 768}, {0, 0}, target_id=0
@1434 = quantizelinear[out_type=nullopt](@1417,@1432,@1433) -> int8_type, {1, 768}, {768, 1}, target_id=0
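A small numpy sketch of what that widening buys (not the MIGraphX code itself): doing the shift in int16 means it cannot overflow an 8-bit intermediate before the final narrow to int8. The offset of 128 below is an assumption that recentres the full uint8 range onto int8; the note above mentions 127.

```python
import numpy as np

OFFSET = 128  # assumption: maps uint8 [0, 255] onto int8 [-128, 127]

u8 = np.array([0, 1, 127, 128, 254, 255], dtype=np.uint8)

# Widen to int16 first so the subtraction cannot wrap, then narrow to int8.
i8 = (u8.astype(np.int16) - OFFSET).astype(np.int8)
print(i8)  # [-128 -127 -1 0 126 127]
```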
Perf output
Summary:
gpu::code_object::reduce_min_kernel: 2.06482ms / 49 = 0.0421391ms, 11%
gpu::code_object::reduce_max_sub_mul_kernel: 2.0575ms / 49 = 0.0419897ms, 11%
gpu::code_object::mul_quantizelinear_kernel: 1.67524ms / 48 = 0.0349009ms, 9%
gpu::code_object::mul_kernel: 1.59728ms / 50 = 0.0319455ms, 9%
gpu::code_object::mlir_quant_dot: 1.35165ms / 47 = 0.0287586ms, 8%
gpu::code_object::quantizelinear_kernel: 1.20892ms / 37 = 0.0326734ms, 7%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16375ms / 49 = 0.0237501ms, 7%
gpu::code_object::concat_kernel: 1.15656ms / 49 = 0.0236034ms, 7%
gpu::code_object::convert_kernel: 1.15122ms / 50 = 0.0230244ms, 6%
gpu::code_object::contiguous_kernel: 1.13786ms / 48 = 0.0237053ms, 6%
gpu::code_object::neg_div_clip_nearbyint_add_kernel: 1.13631ms / 49 = 0.0231901ms, 6%
gpu::code_object::layernorm_mul_add_kernel: 0.585409ms / 24 = 0.0243921ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.538739ms / 23 = 0.0234234ms, 3%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.389363ms / 13 = 0.029951ms, 3%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.363619ms / 13 = 0.0279707ms, 2%
load: 0.360501ms / 600 = 0.000600835ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.295485ms / 12 = 0.0246237ms, 2%
gpu::code_object::dequantizelinear_add_mul_mul_mul_mul_add_neg_sub_exp_add_div_mul_kernel: 0.288516ms / 12 = 0.024043ms, 2%
multibroadcast: 0.251145ms / 296 = 0.000848464ms, 2%
reshape_lazy: 0.128273ms / 180 = 0.00071263ms, 1%
hip::hip_copy_literal: 0.104623ms / 151 = 0.00069287ms, 1%
transpose: 0.063942ms / 48 = 0.00133213ms, 1%
slice: 0.0456442ms / 36 = 0.00126789ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0248973ms / 1 = 0.0248973ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0237355ms / 1 = 0.0237355ms, 1%
gpu::code_object::gather_kernel: 0.0237274ms / 1 = 0.0237274ms, 1%
@param: 0.0103636ms / 26 = 0.0003986ms, 1%
hip::hip_allocate_memory: 0.0011818ms / 1 = 0.0011818ms, 1%
check_context::migraphx::gpu::context: 0.0008362ms / 1 = 0.0008362ms, 1%
Batch size: 1
Rate: 157.206 inferences/sec
Total time: 6.36109ms
Total instructions time: 19.2011ms
Overhead time: 0.435307ms, -12.84ms
Overhead: 7%, -202%
[ MIGraphX Version: 2.9.0. ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --int8
Hey @pfultz2, got any ideas on how best to speed this one up? Should our quantize also be in MLIR here, not just the dequantize?
Latest changes in the PR seem to speed things up (remove the flattening via reshape/concat, serialize the min/max operations).
This alone appears to create about a 20% speedup relative to the original run on the int8 model.
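As a sanity check on the flattening removal, a small numpy sketch (assumed equivalence, hypothetical shapes): reducing the tensor directly and folding the 0 in afterwards gives the same min/max that the reshape/concat path produced.

```python
import numpy as np

x = np.random.randn(1, 768).astype(np.float32)

# Old style (as in the parsed graph): flatten, append 0, then reduce.
flat = np.concatenate([x.reshape(-1), np.zeros(1, dtype=np.float32)])
min_a, max_a = flat.min(), flat.max()

# Simplified style: reduce the tensor directly, then fold 0 into the range.
min_b, max_b = min(float(x.min()), 0.0), max(float(x.max()), 0.0)

assert min_a == min_b and max_a == max_b
```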
Summary:
gpu::code_object::reduce_min_min_kernel: 2.04025ms / 49 = 0.0416377ms, 13%
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.03854ms / 49 = 0.0416029ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.66959ms / 48 = 0.0347831ms, 10%
gpu::code_object::mlir_quant_dot: 1.34554ms / 47 = 0.0286286ms, 9%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16123ms / 49 = 0.0236985ms, 7%
gpu::code_object::convert_kernel: 1.14344ms / 50 = 0.0228688ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.12494ms / 49 = 0.022958ms, 7%
gpu::code_object::mul_kernel: 1.11627ms / 49 = 0.022781ms, 7%
gpu::code_object::contiguous_kernel: 0.845617ms / 36 = 0.0234894ms, 6%
gpu::code_object::quantizelinear_kernel: 0.835653ms / 36 = 0.0232126ms, 5%
gpu::code_object::layernorm_mul_add_kernel: 0.583655ms / 24 = 0.024319ms, 4%
gpu::code_object::dequantizelinear_add_add_kernel: 0.537387ms / 23 = 0.0233647ms, 4%
gpu::code_object::mlir_quant_dot_dequantizelinear_add: 0.388378ms / 13 = 0.0298752ms, 3%
load: 0.332797ms / 537 = 0.000619734ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.29401ms / 12 = 0.0245009ms, 2%
gpu::code_object::dequantizelinear_add_mul_mul_mul_mul_add_neg_sub_exp_add_div_mul_kernel: 0.286982ms / 12 = 0.0239152ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.280409ms / 12 = 0.0233674ms, 2%
multibroadcast: 0.252798ms / 295 = 0.000856943ms, 2%
hip::hip_copy_literal: 0.105973ms / 150 = 0.000706487ms, 1%
reshape_lazy: 0.0981803ms / 131 = 0.000749468ms, 1%
gpu::code_object::mlir_quant_dot_dequantizelinear_mul: 0.0888045ms / 1 = 0.0888045ms, 1%
transpose: 0.059129ms / 48 = 0.00123185ms, 1%
slice: 0.0458985ms / 36 = 0.00127496ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0248396ms / 1 = 0.0248396ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0239322ms / 1 = 0.0239322ms, 1%
gpu::code_object::gather_kernel: 0.0237319ms / 1 = 0.0237319ms, 1%
@param: 0.0106455ms / 26 = 0.000409442ms, 1%
check_context::migraphx::gpu::context: 0.0011866ms / 1 = 0.0011866ms, 1%
hip::hip_allocate_memory: 0.00104084ms / 1 = 0.00104084ms, 1%
Batch size: 1
Rate: 191.057 inferences/sec
Total time: 5.23404ms
Total instructions time: 16.7609ms
Overhead time: 0.382316ms, -11.5268ms
Overhead: 7%, -220%
[ MIGraphX Version: 2.9.0. ] Complete: migraphx-driver perf gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --int8
Still seeing a large amount of time spent in the reduction min/max steps.
Curious if the above block before the quantizelinear can be fused, as it adds a significant amount of time to the run:
gpu::code_object::reduce_min_min_kernel: 2.04496ms / 49 = 0.0417338ms, 13%
gpu::code_object::reduce_max_max_sub_mul_kernel: 2.04381ms / 49 = 0.0417105ms, 13%
gpu::code_object::mul_quantizelinear_kernel: 1.68006ms / 48 = 0.0350012ms, 10%
gpu::code_object::mlir_quant_dot: 1.35275ms / 47 = 0.028782ms, 9%
gpu::code_object::quantizelinear_convert_sub_quantizelinear_kernel: 1.16634ms / 49 = 0.0238029ms, 7%
gpu::code_object::convert_kernel: 1.1519ms / 50 = 0.0230379ms, 7%
gpu::code_object::div_neg_clip_nearbyint_kernel: 1.13642ms / 49 = 0.0231923ms, 7%
gpu::code_object::mul_kernel: 1.1247ms / 49 = 0.0229531ms, 7%
@pfultz2 the gpt2 model this issue stemmed from has the following repeated everywhere as part of the inserted dynamic quantization step.
Have initial changes in after also reworking MatMulInteger, following a discussion with @pfultz2.
@causten we're seeing about a 30% increase once we properly handle the input as quant_dot instead of plain dot for the ONNX model.
Summary of a run with only the change to the MatMulInteger parser. Toggling disable-fast-math on/off gives roughly the same ballpark speedup (211-212 QPS):
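For reference, a minimal ONNX graph containing a MatMulInteger node (shapes borrowed from the gpt2 IR above, otherwise illustrative): 8-bit A/B plus zero points producing an int32 accumulator, which is what now maps onto quant_dot rather than a plain dot.

```python
import onnx
from onnx import TensorProto, helper

# MatMulInteger: 8-bit operands and zero points, int32 result.
node = helper.make_node(
    "MatMulInteger",
    inputs=["A", "B", "a_zero_point", "b_zero_point"],
    outputs=["Y"],
)
graph = helper.make_graph(
    [node],
    "matmulinteger_example",
    inputs=[
        helper.make_tensor_value_info("A", TensorProto.UINT8, [1, 768]),
        helper.make_tensor_value_info("B", TensorProto.INT8, [768, 2304]),
        helper.make_tensor_value_info("a_zero_point", TensorProto.UINT8, []),
        helper.make_tensor_value_info("b_zero_point", TensorProto.INT8, []),
    ],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.INT32, [1, 2304])],
)
onnx.checker.check_model(helper.make_model(graph))
```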
Summary:
gpu::code_object::reduce_min_kernel: 2.06349ms / 49 = 0.042112ms, 14%
gpu::code_object::reduce_max_sub_mul_kernel: 2.05264ms / 49 = 0.0418906ms, 13%
gpu::code_object::mlir_quant_dot: 1.87745ms / 61 = 0.0307779ms, 12%
gpu::code_object::concat_kernel: 1.15502ms / 49 = 0.0235718ms, 8%
gpu::code_object::quantizelinear_sub_convert_add_convert_kernel: 1.14867ms / 49 = 0.0234422ms, 8%
gpu::code_object::mul_kernel: 1.13639ms / 49 = 0.0231916ms, 8%
gpu::code_object::neg_div_clip_nearbyint_convert_kernel: 1.13261ms / 49 = 0.0231144ms, 8%
gpu::code_object::contiguous_kernel: 1.12907ms / 48 = 0.0235223ms, 8%
gpu::code_object::quantizelinear_kernel: 0.836898ms / 36 = 0.0232472ms, 6%
gpu::code_object::layernorm_mul_add_kernel: 0.585637ms / 24 = 0.0244015ms, 4%
gpu::code_object::convert_mul_add_add_kernel: 0.545402ms / 23 = 0.0237131ms, 4%
gpu::code_object::convert_mul_add_kernel: 0.310343ms / 13 = 0.0238725ms, 2%
load: 0.307436ms / 515 = 0.000596963ms, 2%
gpu::code_object::dequantizelinear_mul_where_reduce_max_sub_exp_reduce_sum_div_quantizelinear_kernel: 0.294008ms / 12 = 0.0245007ms, 2%
gpu::code_object::convert_mul_add_mul_mul_add_mul_exp_add_div_kernel: 0.28876ms / 12 = 0.0240634ms, 2%
gpu::code_object::mlir_quant_dot_dequantizelinear: 0.285526ms / 12 = 0.0237939ms, 2%
multibroadcast: 0.220696ms / 246 = 0.000897138ms, 2%
reshape_lazy: 0.119711ms / 180 = 0.000665061ms, 1%
hip::hip_copy_literal: 0.100321ms / 151 = 0.00066438ms, 1%
transpose: 0.0685118ms / 48 = 0.00142733ms, 1%
slice: 0.0486096ms / 36 = 0.00135027ms, 1%
gpu::code_object::convert_mul_kernel: 0.0291348ms / 1 = 0.0291348ms, 1%
gpu::code_object::add_layernorm_mul_add_kernel: 0.0249158ms / 1 = 0.0249158ms, 1%
gpu::code_object::dequantizelinear_add_kernel: 0.0237881ms / 1 = 0.0237881ms, 1%
gpu::code_object::gather_kernel: 0.023749ms / 1 = 0.023749ms, 1%
gpu::code_object::convert_kernel: 0.0230378ms / 1 = 0.0230378ms, 1%
@param: 0.00976646ms / 26 = 0.000375633ms, 1%
hip::hip_allocate_memory: 0.000916ms / 1 = 0.000916ms, 1%
check_context::migraphx::gpu::context: 0.000771ms / 1 = 0.000771ms, 1%
Batch size: 1
Rate: 212.998 inferences/sec
Total time: 4.69488ms
Total instructions time: 15.8433ms
Overhead time: 0.376437ms, -11.1484ms
Overhead: 8%, -237%
[ MIGraphX Version: 2.10.0. ] Complete: bin/driver perf ../int8_models/gpt2_1_int8_gpu.onnx --input-dim @input_ids 1 32 --fill1 input_ids --disable-fast-math --int8
Changes pushed to: https://github.com/ROCm/AMDMIGraphX/pull/2903
Seeing a larger speedup with the MatMulInteger (#2903) + DynamicQuantizeLinear (#2896) fixes when running through ORT right now for gpt2. Testing other models through the driver appeared to show the correct speedup as well.
For gpt2 (shown below) it appears we're slightly faster than the fp16 runs now.
int8
root@aus-navi3x-02:/onnxruntime/onnxruntime/python/tools/transformers# python3 benchmark.py -g -m gpt2 --model_class AutoModelForCausalLM --sequence_length 32 384 --batch_sizes 1 8 --provider=migraphx -p int8 --disable_gelu --disable_layer_norm --disable_attention --disable_skip_layer_norm --disable_embed_layer_norm --disable_bias_skip_layer_norm --disable_bias_gelu -o no_opt
Arguments: Namespace(models=['gpt2'], model_source='pt', model_class='AutoModelForCausalLM', engines=['onnxruntime'], cache_dir='./cache_models', onnx_dir='./onnx_models', use_gpu=True, provider='migraphx', precision=<Precision.INT8: 'int8'>, verbose=False, overwrite=False, optimizer_info=<OptimizerInfo.NOOPT: 'no_opt'>, validate_onnx=False, fusion_csv=None, detail_csv=None, result_csv=None, input_counts=[1], test_times=100, batch_sizes=[1, 8], sequence_lengths=[32, 384], disable_ort_io_binding=False, num_threads=[16], force_num_layers=None, disable_attention=True, disable_skip_layer_norm=True, disable_embed_layer_norm=True, disable_bias_skip_layer_norm=True, disable_bias_gelu=True, disable_layer_norm=True, disable_gelu=True, enable_gelu_approximation=False, disable_shape_inference=False, enable_gemm_fast_gelu=False, use_mask_index=False, use_raw_attention_mask=False, no_attention_mask=False, use_multi_head_attention=False, disable_group_norm=False, disable_skip_group_norm=False, disable_packed_kv=False, disable_packed_qkv=False, disable_bias_add=False, disable_bias_splitgelu=False, disable_nhwc_conv=False, use_group_norm_channels_first=False, disable_rotary_embeddings=False)
OptimizerInfo is set to no_opt, graph optimizations specified in FusionOptions are not applied.
Model class name: AutoModelForCausalLM
Skip export since model existed: ./onnx_models/gpt2_1.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:29:43.346929', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.96', 'latency_95_percentile': '3.00', 'latency_99_percentile': '3.11', 'average_latency_ms': '2.67', 'QPS': '374.99'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:30:08.887040', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '7.75', 'latency_95_percentile': '7.78', 'latency_99_percentile': '7.81', 'average_latency_ms': '7.54', 'QPS': '132.68'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:30:44.223341', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '5.99', 'latency_95_percentile': '6.31', 'latency_99_percentile': '6.46', 'average_latency_ms': '5.92', 'QPS': '1351.07'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.INT8: 'int8'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:31:09.392412', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '49.11', 'latency_95_percentile': '49.53', 'latency_99_percentile': '49.77', 'average_latency_ms': '48.40', 'QPS': '165.29'}
Detail results are saved to csv file: benchmark_detail_20240320-133156.csv
Summary results are saved to csv file: benchmark_summary_20240320-133156.csv
fp16 runs
root@aus-navi3x-02:/onnxruntime/onnxruntime/python/tools/transformers# python3 benchmark.py -g -m gpt2 --model_class AutoModelForCausalLM --sequence_length 32 384 --batch_sizes 1 8 --provider=migraphx -p fp16 --disable_gelu --disable_layer_norm --disable_attention --disable_skip_layer_norm --disable_embed_layer_norm --disable_bias_skip_layer_norm --disable_bias_gelu -o no_opt
Arguments: Namespace(models=['gpt2'], model_source='pt', model_class='AutoModelForCausalLM', engines=['onnxruntime'], cache_dir='./cache_models', onnx_dir='./onnx_models', use_gpu=True, provider='migraphx', precision=<Precision.FLOAT16: 'fp16'>, verbose=False, overwrite=False, optimizer_info=<OptimizerInfo.NOOPT: 'no_opt'>, validate_onnx=False, fusion_csv=None, detail_csv=None, result_csv=None, input_counts=[1], test_times=100, batch_sizes=[1, 8], sequence_lengths=[32, 384], disable_ort_io_binding=False, num_threads=[16], force_num_layers=None, disable_attention=True, disable_skip_layer_norm=True, disable_embed_layer_norm=True, disable_bias_skip_layer_norm=True, disable_bias_gelu=True, disable_layer_norm=True, disable_gelu=True, enable_gelu_approximation=False, disable_shape_inference=False, enable_gemm_fast_gelu=False, use_mask_index=False, use_raw_attention_mask=False, no_attention_mask=False, use_multi_head_attention=False, disable_group_norm=False, disable_skip_group_norm=False, disable_packed_kv=False, disable_packed_qkv=False, disable_bias_add=False, disable_bias_splitgelu=False, disable_nhwc_conv=False, use_group_norm_channels_first=False, disable_rotary_embeddings=False)
OptimizerInfo is set to no_opt, graph optimizations specified in FusionOptions are not applied.
Model class name: AutoModelForCausalLM
Skip export since model existed: ./onnx_models/gpt2_1.onnx
Run onnxruntime on gpt2 with input shape [1, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:35:52.919367', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '2.76', 'latency_95_percentile': '2.78', 'latency_99_percentile': '2.79', 'average_latency_ms': '2.71', 'QPS': '368.84'}
Run onnxruntime on gpt2 with input shape [1, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 1, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:36:16.149473', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '7.67', 'latency_95_percentile': '7.69', 'latency_99_percentile': '7.72', 'average_latency_ms': '7.49', 'QPS': '133.57'}
Run onnxruntime on gpt2 with input shape [8, 32]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 32, 'custom_layer_num': None, 'datetime': '2024-03-20 13:36:48.681642', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '6.07', 'latency_95_percentile': '6.33', 'latency_99_percentile': '6.52', 'average_latency_ms': '6.00', 'QPS': '1334.37'}
Run onnxruntime on gpt2 with input shape [8, 384]
{'engine': 'onnxruntime', 'version': '1.17.0', 'providers': 'migraphx', 'device': 'cuda', 'optimizer': <OptimizerInfo.NOOPT: 'no_opt'>, 'precision': <Precision.FLOAT16: 'fp16'>, 'io_binding': True, 'model_name': 'gpt2', 'inputs': 1, 'threads': 16, 'batch_size': 8, 'sequence_length': 384, 'custom_layer_num': None, 'datetime': '2024-03-20 13:37:10.933650', 'test_times': 100, 'latency_variance': '0.00', 'latency_90_percentile': '48.85', 'latency_95_percentile': '49.07', 'latency_99_percentile': '49.15', 'average_latency_ms': '47.97', 'QPS': '166.79'}
Detail results are saved to csv file: benchmark_detail_20240320-133755.csv
Summary results are saved to csv file: benchmark_summary_20240320-133755.csv
rocMLIR will be added to MIGraphX. This will cover all the data types, but this issue will be to ensure the existing tests in DLM pass using int8.