fpgaminer / GPTQ-triton

GPTQ inference Triton kernel

Cache auto-tuning? #15

Open vedantroy opened 1 year ago

vedantroy commented 1 year ago

When running the model--especially in a serverless environment where there may be many cold starts--it would be desirable to cache the auto-tuning results. Is this possible?

fpgaminer commented 1 year ago

Thank you for the issue. Yes, this is possible; I have it in progress.

vedantroy commented 1 year ago

> Thank you for the issue. Yes, this is possible; I have it in progress.

Interesting. Last time I used Triton, I wasn't sure whether they exposed an API for caching autotune results -- I'm guessing they do now? I might take a stab at hacking on this myself, if I can find the API, since I'm trying to ship something soon.

fpgaminer commented 1 year ago

I wish they did. They do, however, have a `cache_key` attribute on kernels. So I was going to throw something together that stores the results as JSON in a cache directory, keyed off `cache_key`, so that results are re-used only if the environment and kernel source are the same (just like Triton does when caching kernel compilations), e.g. `llama_mlp_fused_4_kernel.fn.cache_key`.
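
Roughly, a minimal sketch of what I have in mind (untested; it pokes at Triton internals that aren't a stable public API -- the `Autotuner` wrapper's `cache` dict, the wrapped `JITFunction` at `.fn`, and its `cache_key`. The cache directory and both function names here are made up for illustration):

```python
import ast
import json
import os

import triton

# Hypothetical cache location; pick whatever fits your deployment.
CACHE_DIR = os.path.expanduser("~/.cache/gptq-triton-autotune")


def save_autotune_cache(kernel):
    """Dump an autotuned kernel's best-config table to JSON, named by the
    wrapped JITFunction's cache_key so entries are only ever reused when
    the kernel source and environment match."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    entries = {
        # Autotuner cache keys are tuples of the key-arg values; JSON
        # keys must be strings, so round-trip them through repr().
        repr(key): {
            "kwargs": cfg.kwargs,
            "num_warps": cfg.num_warps,
            "num_stages": cfg.num_stages,
        }
        for key, cfg in kernel.cache.items()  # internal Autotuner state
    }
    with open(os.path.join(CACHE_DIR, kernel.fn.cache_key + ".json"), "w") as f:
        json.dump(entries, f)


def load_autotune_cache(kernel):
    """Pre-seed the autotuner from a previous run so cold starts skip the
    benchmarking sweep. Silently a no-op if nothing on disk matches the
    kernel's current cache_key (i.e. the source or environment changed)."""
    path = os.path.join(CACHE_DIR, kernel.fn.cache_key + ".json")
    if not os.path.exists(path):
        return
    with open(path) as f:
        entries = json.load(f)
    for key_repr, cfg in entries.items():
        key = ast.literal_eval(key_repr)  # recover the original tuple key
        kernel.cache[key] = triton.Config(
            cfg["kwargs"], num_warps=cfg["num_warps"], num_stages=cfg["num_stages"]
        )
```

Usage would be something like `load_autotune_cache(llama_mlp_fused_4_kernel)` once at startup, then `save_autotune_cache(llama_mlp_fused_4_kernel)` after a warmup pass has triggered autotuning.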