bigcode-project / transformers


Add gpu optimizations to base model #14

Closed · jlamypoirier closed this 1 year ago

jlamypoirier commented 1 year ago

This allows KV cache pre-allocation and key-length padding outside of the inference runner. With this, the inference runner becomes exclusively a CPU optimization (except for small GPU gains from CUDA graphs).
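For context, here is a minimal sketch of what KV cache pre-allocation and key-length padding typically look like in PyTorch. This is not the PR's actual code; the function names, the buffer layout, and the padding multiple of 64 are illustrative assumptions.

```python
import torch

def preallocate_kv_cache(batch_size, num_heads, max_seq_len, head_dim,
                         dtype=torch.float16, device="cuda"):
    # Allocate the full key/value buffers up front so each decoding step
    # writes in place instead of concatenating (and re-allocating) per token.
    keys = torch.empty(batch_size, num_heads, max_seq_len, head_dim,
                       dtype=dtype, device=device)
    values = torch.empty_like(keys)
    return keys, values

def pad_key_length(current_length, multiple=64):
    # Round the attended key length up to a fixed multiple so attention
    # kernels only ever see a small set of static shapes, which is what
    # makes the computation CUDA-graph friendly.
    return ((current_length + multiple - 1) // multiple) * multiple
```

At decode step `t`, the new key/value pair would be written at position `t` of the pre-allocated buffers, and attention would run over `keys[:, :, :pad_key_length(t + 1)]` with the padded tail masked out.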

This is kept as a separate PR for now because the inference runner still needs to be adapted.