Open · xinji1 opened 6 months ago
Yes, the current steps to add new kernels are:

1. Write the new kernel under `tritonsrc`.
2. Create `v2python/rules/<new kernel family>`. Avoid adding rules under `v2python/rules` directly, since it is unlikely the new Triton source only contains one kernel.
3. Edit `v2python/rules/__init__.py` to add the new `KernelDescription` objects to the `v2python.rules.kernels` variable.

(Note: keep this issue open until the `README.md` is updated.)
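The registration pattern in step 3 can be sketched as follows. This is a hypothetical stand-in: the real `KernelDescription` class lives in AOTriton's `v2python` package and has different fields; only the aggregation idea is illustrated here.

```python
# Hypothetical sketch of how v2python/rules/__init__.py aggregates per-family
# kernel lists into the v2python.rules.kernels variable.  The stand-in class
# below is NOT AOTriton's real KernelDescription.

class KernelDescription:
    def __init__(self, name: str, family: str):
        self.name = name
        self.family = family

# Each kernel family package (e.g. v2python/rules/flash/) exports its own list.
flash_kernels = [
    KernelDescription("attn_fwd", "flash"),
    KernelDescription("attn_bwd", "flash"),
]
my_family_kernels = [KernelDescription("my_kernel", "my_family")]

# v2python/rules/__init__.py concatenates every family into one list.
kernels = flash_kernels + my_family_kernels
print([k.name for k in kernels])
```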
Thanks for the reply! Another little question here: what should we do when it comes to `v2python/rules/[my_own_kernel]/tune-kernel-gfx90a.json`? I just found that you provide quite a lot of tuning templates for different kernels. Specifically:

1. Is it related to the final result? I mean, could I get the final `.so` even with an empty `tune-kernel-gfx90a.json`?
2. If not, how many configurations should I provide?
3. I found that not all of the input variables are necessary for the corresponding kernel (like `N_HEADS` and `D_HEAD` for `attn_fwd`), so which parameters are necessary in this `.json` file? (I would appreciate it if you could take `flash.attn_fwd` as an example.)
4. In your "attn_fwd" part of this `.json` file, you only provide one tensor's shape and dtype. Does that mean I need to give other tensors' metainfo if they are not the same shape/dtype?
> Is it related to the final result? I mean, could I get the final `.so` even with an empty `tune-kernel-gfx90a.json`?
Theoretically you can, but it's an untested code path right now.
The JSON file is what we called the "tuning database" in internal slides (nothing secret, just unsuitable to release as part of the source code repos). The tuning database did not exist until commit https://github.com/ROCm/aotriton/pull/2/commits/099141ae9a045ebe4e24496a3c65d902022cd6fc; before that, you needed to specify the `PERF_CHOICES` manually in `KernelDescription` subclasses.

However, since its introduction, the tuning database has become central to the build process and guides AOTriton on how to dispatch user inputs to actual GPU kernels. The original approach is no longer well tested, and its performance is abysmal (it could be 10x slower).
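The dispatch role of the tuning database can be sketched like this. The keys, file names, and lookup policy below are illustrative assumptions, not AOTriton's actual schema or logic:

```python
# Hypothetical sketch of tuning-database-driven dispatch: at runtime, the
# user's input signature selects one of the pre-compiled GPU kernel images.
TUNING_DB = {
    # (seqlen_q, seqlen_k, d_head, dtype) -> pre-compiled kernel image (made-up names)
    (1024, 1024, 64, "float16"): "attn_fwd-BLOCK_M=128-num_warps=4.hsaco",
    (2048, 2048, 64, "float16"): "attn_fwd-BLOCK_M=128-num_warps=8.hsaco",
}

def dispatch(seqlen_q: int, seqlen_k: int, d_head: int, dtype: str) -> str:
    """Map an input signature to the kernel image tuned for it."""
    key = (seqlen_q, seqlen_k, d_head, dtype)
    if key not in TUNING_DB:
        # An empty database leaves nothing to dispatch to, which is why the
        # database-less path is untested and can be an order of magnitude slower.
        raise KeyError(f"no tuned kernel for {key}")
    return TUNING_DB[key]

print(dispatch(1024, 1024, 64, "float16"))
```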
> If not, how many configurations should I provide?
This totally depends on your needs and the actual kernel you want to compile. The tuning database is an AOT version of `@triton.autotune`; the actual configurations used can be found in `tritonsrc/attn_torch_function.py`. Meanwhile, `tritonsrc/tune_flash.py` provides a list of `(seqlen_q, seqlen_k, D_HEAD, Q_dtype, ...)` to probe for the optimal configurations.

Note: a Triton kernel compiled with certain configurations can run without segfaults or other runtime errors but still not give you correct results. See `tritonsrc/attn_torch_function.py` for a possible solution.
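The probing idea can be sketched in a few lines. Everything here is an assumption for illustration: the config fields and the synthetic `run_kernel` timing model are made up, and the real `tritonsrc/tune_flash.py` actually launches and measures compiled kernels.

```python
# Hypothetical sketch of a tune_flash.py-style probe: for each problem size,
# time every candidate config and keep the fastest, producing the rows that
# would end up in the tuning database.
CONFIGS = [
    dict(BLOCK_M=64, num_warps=4),
    dict(BLOCK_M=128, num_warps=8),
]

def run_kernel(seqlen_q, seqlen_k, d_head, cfg):
    # Stand-in for launching the compiled Triton kernel and measuring it;
    # returns a synthetic "runtime" so the example is self-contained.
    return (seqlen_q * seqlen_k * d_head) / (cfg["BLOCK_M"] * cfg["num_warps"])

def probe(problems):
    best = {}
    for sq, sk, dh in problems:
        timings = [(run_kernel(sq, sk, dh, c), i) for i, c in enumerate(CONFIGS)]
        _, idx = min(timings)          # fastest candidate wins
        best[(sq, sk, dh)] = CONFIGS[idx]
    return best

print(probe([(1024, 1024, 64)]))
```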
> I found that not all of the input variables are necessary for the corresponding kernel (like `N_HEADS` and `D_HEAD` for `attn_fwd`), so which parameters are necessary in this `.json` file? (I would appreciate it if you could take `flash.attn_fwd` as an example.)
The JSON version is a little bit verbose. However, we have since replaced the JSON files with an SQLite3 database. You can check the `UNIQUE` constraints to see which columns are used to locate tuning database entries, with `sqlite3 v2python/rules/tuning_database.sqlite3 '.schema'`.
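To illustrate how a `UNIQUE` constraint pins down the lookup columns, here is a miniature stand-in built with Python's stdlib `sqlite3`. The table and column names are assumptions; the real schema's columns differ and should be inspected with the `.schema` command above.

```python
import sqlite3

# Miniature stand-in for the tuning database: the UNIQUE constraint names the
# columns used to locate an entry; the remaining columns hold the tuned config.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE attn_fwd (
        seqlen_q INTEGER, seqlen_k INTEGER, d_head INTEGER, dtype TEXT,
        block_m INTEGER, block_n INTEGER, num_warps INTEGER,
        UNIQUE (seqlen_q, seqlen_k, d_head, dtype)
    )
""")
conn.execute("INSERT INTO attn_fwd VALUES (1024, 1024, 64, 'float16', 128, 64, 4)")

# Look up the tuned launch configuration for one input signature.
row = conn.execute(
    "SELECT block_m, block_n, num_warps FROM attn_fwd"
    " WHERE seqlen_q = ? AND seqlen_k = ? AND d_head = ? AND dtype = ?",
    (1024, 1024, 64, "float16"),
).fetchone()
print(row)
```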
> In your "attn_fwd" part of this `.json` file, you only provide one tensor's shape and dtype. Does that mean I need to give other tensors' metainfo if they are not the same shape/dtype?
No, you don't need to. The current database already describes all tensors' shapes and dtypes. You can take a look at `v2python/rules/flash/attn_fwd.py` for the constraints among tensors. More specifically, for the `attn_fwd` kernel, Q/K/V/O have the same dtype and roughly the same shape, except for `seqlen_q` vs. `seqlen_k`, and `seqlen_q` and `seqlen_k` are already provided separately.
Thanks for your reply! Another question here: according to this, rocm/triton will be deprecated soon. Will AOTriton switch to supporting openai/triton?

> Will AOTriton switch to supporting openai/triton?
Certainly we will, but the migration takes time and extensive regression tests.
Description of errors

In the current `README.md`, we need to change the configurations in `python/rules.py`, but actually we need to make our own folders like `/v2python/rules/flash/`, right?