Closed xinyazhang closed 4 months ago
Known failures:
test_op_bwd_with_matrix_bias[False-1.2-dtype2-0.0-2048-143-256-4-4]
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-4-4-256-1-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-4-4-256-1-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-4-4-256-4-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-4-4-256-4-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-8-8-256-1-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-8-8-256-1-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-8-8-256-4-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[False-1.2-dtype1-0.0-True-8-8-256-4-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-4-4-256-1-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-4-4-256-1-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-4-4-256-4-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-4-4-256-4-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-8-8-256-1-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-8-8-256-1-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-8-8-256-4-1] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-8-8-256-4-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
Produced by `pytest test/test_backward.py -v -k 1.2` on MI300X.
The UT on MI200 has better results:
FAILED ../test/test_backward.py::test_op_bwd[True-1.2-dtype1-0.0-True-8-8-256-4-4] - AssertionError: dk_allclose=True dv_allclose=False dq_allclose=True db_allclose=True
Tested with pytest test/test_backward.py -v -k 1.2
Looks mostly good. What is the new library size?
I don't have the size for the all-architecture, no-zstd build. The MI300X-only build with zstd is 321M.
- Add `aotriton::v2::flash::ExtraArguments` to all `aotriton::v2::flash` APIs.
- Add `AOTRITON_BUILD_FOR_TUNING` to build all possible GPU kernels. The configurations are supplied by `KernelDescription.gen_autotune_configs`, which is compatible with `triton.Config`.
- `AOTRITON_BUILD_FOR_TUNING` also enables `force_kernel_index` and other fields in `aotriton::v2::flash::ExtraArguments`. Users can manually select a kernel and bypass the autotune mechanism.
- Add `test/tune_flash.py` and `cpp_autotune.py`, and change `test/attn_torch_function.py` to support AOT autotune (aka cpp autotune).
- `test/tune_flash.py` will run the UT before testing a `triton.Config`'s performance, to avoid including faulty kernels.
- Add `--use_multigpu` to `test/tune_flash.py`. This script now supports tuning GPU kernels on all GPUs simultaneously, with the following extra features:
  - `tune_flash.py` needs 1(main)+n(worker)+n(minesweeper)+1(db access)+1*(table_tool.py) processes.
  - `--json_file` is also added, since the new architecture has a unified database access process that accepts outputs from all worker processes, and this new process can write to a separate json file. This is currently the recommended way to store the results of the tuning script. Users are supposed to run `v2python.table_tool` later to update the tuning database.
  - `--continue_from_json_file` is introduced. Meanwhile, `result` and `_debug_task_id` fields are also attached to the output json object, so that a tuning process can be resumed according to the `_debug_task_id` and its tuning status.
- `v2python.table_tool` is improved to support the new version of the json file.

CAVEAT: The new AOT based autotune script `test/tune_flash.py` isn't capable of handling the backward pass yet.
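
To illustrate the idea behind `--continue_from_json_file`, here is a minimal sketch of how a resumed run could skip tasks that already finished. It is not the code in `test/tune_flash.py`. It assumes the output file holds one json object per line and that a `result` value of `"tuned"` marks a completed task; both are assumptions made for this example, as are the helper names `load_finished_task_ids` and `pending_tasks`. Only the field names `result` and `_debug_task_id` come from the description above.

```python
import json


def load_finished_task_ids(json_path):
    """Return the set of _debug_task_id values recorded as finished.

    Assumption: one json object per line, and result == "tuned" means done.
    """
    finished = set()
    try:
        with open(json_path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                if record.get("result") == "tuned":  # hypothetical status value
                    finished.add(record["_debug_task_id"])
    except FileNotFoundError:
        # No previous output: nothing has been tuned yet.
        pass
    return finished


def pending_tasks(all_task_ids, json_path):
    """Keep only the task ids that still need tuning, in their original order."""
    finished = load_finished_task_ids(json_path)
    return [task_id for task_id in all_task_ids if task_id not in finished]
```

With this shape, a task whose previous attempt was recorded with any other status is retried on the next run, while completed tasks are skipped.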