Open · xinji1 opened 6 months ago
Yes, the current steps to add new kernels are:

1. Write the new kernel under `tritonsrc`.
2. Create `v2python/rules/<new kernel family>`. Avoid adding rules under `v2python/rules` directly, since it is unlikely the new Triton source only contains one kernel.
3. Edit `v2python/rules/__init__.py` to add the new `KernelDescription` objects to the `v2python.rules.kernels` variable.

(Note: keep this issue open until the `README.md` is updated.)
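The registration pattern in step 3 can be sketched as follows. This is a hypothetical stand-in: the real `KernelDescription` class lives in AOTriton's `v2python` package and has different fields; only the aggregation idea is illustrated here.

```python
# Hypothetical sketch of how v2python/rules/__init__.py aggregates per-family
# kernel lists into the v2python.rules.kernels variable.  The stand-in class
# below is NOT AOTriton's real KernelDescription.

class KernelDescription:
    def __init__(self, name: str, family: str):
        self.name = name
        self.family = family

# Each kernel family package (e.g. v2python/rules/flash/) exports its own list.
flash_kernels = [
    KernelDescription("attn_fwd", "flash"),
    KernelDescription("attn_bwd", "flash"),
]
my_family_kernels = [KernelDescription("my_kernel", "my_family")]

# v2python/rules/__init__.py concatenates every family into one list.
kernels = flash_kernels + my_family_kernels
print([k.name for k in kernels])
```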
Thanks for the reply! Another little question here: what should we do when it comes to `v2python/rules/[my_own_kernel]/tune-kernel-gfx90a.json`? I just found that you provide quite a lot of tuning templates for different kernels. Specifically:

1. Is it related to the final result? I mean, could I get the final `.so` even with an empty `tune-kernel-gfx90a.json`?
2. If not, how many configurations should I provide?
3. I found that not all of the input variables are necessary for the corresponding kernel (like `N_HEADS` and `D_HEAD` for `attn_fwd`), so which parameters are necessary in this `.json` file? (I would appreciate it if you could take `flash.attn_fwd` as an example.)
4. In your "attn_fwd" part of this `.json` file, you only provide one tensor's shape and dtype. Does that mean I need to give other tensors' metainfo if they are not the same shape/dtype?
> Is it related to the final result? I mean, could I get the final `.so` even with an empty `tune-kernel-gfx90a.json`?
Theoretically you can, but it's an untested code path right now.
The JSON file is what we called the "tuning database" in internal slides (nothing secret, just unsuitable to release as part of the source code repos). The tuning database did not exist until commit https://github.com/ROCm/aotriton/pull/2/commits/099141ae9a045ebe4e24496a3c65d902022cd6fc; before that, you needed to specify the `PERF_CHOICES` manually in `KernelDescription` subclasses.

However, since its introduction, the tuning database has become central to the build process and guides AOTriton on how to dispatch user inputs to actual GPU kernels. The original approach is no longer well tested, and its performance is abysmal (it could be 10x slower).
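The dispatch role of the tuning database can be sketched like this. The keys, file names, and lookup policy below are illustrative assumptions, not AOTriton's actual schema or logic:

```python
# Hypothetical sketch of tuning-database-driven dispatch: at runtime, the
# user's input signature selects one of the pre-compiled GPU kernel images.
TUNING_DB = {
    # (seqlen_q, seqlen_k, d_head, dtype) -> pre-compiled kernel image (made-up names)
    (1024, 1024, 64, "float16"): "attn_fwd-BLOCK_M=128-num_warps=4.hsaco",
    (2048, 2048, 64, "float16"): "attn_fwd-BLOCK_M=128-num_warps=8.hsaco",
}

def dispatch(seqlen_q: int, seqlen_k: int, d_head: int, dtype: str) -> str:
    """Map an input signature to the kernel image tuned for it."""
    key = (seqlen_q, seqlen_k, d_head, dtype)
    if key not in TUNING_DB:
        # An empty database leaves nothing to dispatch to, which is why the
        # database-less path is untested and can be an order of magnitude slower.
        raise KeyError(f"no tuned kernel for {key}")
    return TUNING_DB[key]

print(dispatch(1024, 1024, 64, "float16"))
```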
> If not, how many configurations should I provide?
This totally depends on your needs and the actual kernel you want to compile. The tuning database is an AOT version of `@triton.autotune`; the actual configurations used can be found in `tritonsrc/attn_torch_function.py`. Meanwhile, `tritonsrc/tune_flash.py` provides a list of `(seqlen_q, seqlen_k, D_HEAD, Q_dtype, ...)` to probe for the optimal configurations.

Note: a Triton kernel compiled with certain configurations can run without segfaults or other runtime errors but still not give you correct results. See `tritonsrc/attn_torch_function.py` for a possible solution.
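The probing idea can be sketched in a few lines. Everything here is an assumption for illustration: the config fields and the synthetic `run_kernel` timing model are made up, and the real `tritonsrc/tune_flash.py` actually launches and measures compiled kernels.

```python
# Hypothetical sketch of a tune_flash.py-style probe: for each problem size,
# time every candidate config and keep the fastest, producing the rows that
# would end up in the tuning database.
CONFIGS = [
    dict(BLOCK_M=64, num_warps=4),
    dict(BLOCK_M=128, num_warps=8),
]

def run_kernel(seqlen_q, seqlen_k, d_head, cfg):
    # Stand-in for launching the compiled Triton kernel and measuring it;
    # returns a synthetic "runtime" so the example is self-contained.
    return (seqlen_q * seqlen_k * d_head) / (cfg["BLOCK_M"] * cfg["num_warps"])

def probe(problems):
    best = {}
    for sq, sk, dh in problems:
        timings = [(run_kernel(sq, sk, dh, c), i) for i, c in enumerate(CONFIGS)]
        _, idx = min(timings)          # fastest candidate wins
        best[(sq, sk, dh)] = CONFIGS[idx]
    return best

print(probe([(1024, 1024, 64)]))
```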
> I found that not all of the input variables are necessary for the corresponding kernel (like `N_HEADS` and `D_HEAD` for `attn_fwd`), so which parameters are necessary in this `.json` file? (I would appreciate it if you could take `flash.attn_fwd` as an example.)
The JSON version is a little bit verbose. However, we have since replaced the JSON files with an SQLite3 database. You can check the `UNIQUE` constraints to see which columns are used to locate tuning database entries, with `sqlite3 v2python/rules/tuning_database.sqlite3 '.schema'`.
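To illustrate how a `UNIQUE` constraint pins down the lookup columns, here is a miniature stand-in built with Python's stdlib `sqlite3`. The table and column names are assumptions; the real schema's columns differ and should be inspected with the `.schema` command above.

```python
import sqlite3

# Miniature stand-in for the tuning database: the UNIQUE constraint names the
# columns used to locate an entry; the remaining columns hold the tuned config.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE attn_fwd (
        seqlen_q INTEGER, seqlen_k INTEGER, d_head INTEGER, dtype TEXT,
        block_m INTEGER, block_n INTEGER, num_warps INTEGER,
        UNIQUE (seqlen_q, seqlen_k, d_head, dtype)
    )
""")
conn.execute("INSERT INTO attn_fwd VALUES (1024, 1024, 64, 'float16', 128, 64, 4)")

# Look up the tuned launch configuration for one input signature.
row = conn.execute(
    "SELECT block_m, block_n, num_warps FROM attn_fwd"
    " WHERE seqlen_q = ? AND seqlen_k = ? AND d_head = ? AND dtype = ?",
    (1024, 1024, 64, "float16"),
).fetchone()
print(row)
```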
> In your "attn_fwd" part of this `.json` file, you only provide one tensor's shape and dtype. Does that mean I need to give other tensors' metainfo if they are not the same shape/dtype?
No, you don't need to. The current database already describes all tensors' shapes and dtypes. You can take a look at `v2python/rules/flash/attn_fwd.py` for the constraints among tensors. More specifically, for the `attn_fwd` kernel, Q/K/V/O have the same dtype and roughly the same shape, except for `seqlen_q` vs. `seqlen_k`, and `seqlen_q` and `seqlen_k` are already provided separately.
Thanks for your reply! Another question here: according to this, rocm/triton will be deprecated soon. Will AOTriton switch to supporting openai/triton?

> Will AOTriton switch to supporting openai/triton?
Certainly we will, but the migration takes time and extensive regression tests.
Description of errors

In the current `README.md`, we need to change the configurations in `python/rules.py`, but actually we need to make our own folders like `/v2python/rules/flash/`, right?