Open vedantroy opened 1 year ago
When running the model--especially in a serverless environment where there may be many cold starts--it would be desirable to cache the auto-tuning results. Is this possible?

Thank you for the issue. Yes, this is possible; I have it in progress.

Interesting. Last time I used Triton, I wasn't sure if they exposed an API for caching autotune results--I'm guessing they now do? I might take a stab at hacking on this myself, if I can find the API, since I'm trying to ship something soon.

I wish they did. They do, however, have a cache_key attribute on kernels. So I was going to throw something together by storing the results out as JSON in a cache directory, keyed off of cache_key, so it reuses results only if the environment and kernel source are the same (just like they do for caching kernel compilations), e.g. llama_mlp_fused_4_kernel.fn.cache_key.