IBM / text-generation-inference

IBM development fork of https://github.com/huggingface/text-generation-inference
Apache License 2.0

:sparkles: allow single-shard paged attention #86

Open joerunde opened 6 months ago

joerunde commented 6 months ago

This is a small change to allow llama and bigcode models to work with paged attention on a single shard. Currently, if FLASH_ATTENTION is not also set, it will raise an error.
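Roughly, the gating in question looks like the sketch below. The FLASH_ATTENTION and PAGED_ATTENTION names come from this thread; the selection function and its logic are illustrative stand-ins, not the fork's actual code.

```python
import os

# Illustrative sketch only: env-var gating along the lines discussed here.
FLASH_ATTENTION = os.getenv("FLASH_ATTENTION", "").lower() in ("1", "true")
PAGED_ATTENTION = os.getenv("PAGED_ATTENTION", "").lower() in ("1", "true")


def resolve_attention_impl(model_type: str, sharded: bool) -> str:
    """Pick an attention implementation for llama / bigcode style models."""
    if PAGED_ATTENTION:
        # Before this change (per the PR description), this path effectively
        # also required FLASH_ATTENTION and would raise without it; the change
        # lets a single-shard deployment use paged attention on its own.
        return "paged"
    if FLASH_ATTENTION:
        return "flash"
    if sharded:
        raise NotImplementedError(
            f"Sharded {model_type} requires FLASH_ATTENTION or PAGED_ATTENTION"
        )
    return "naive"
```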

tdoublep commented 6 months ago

> Currently, if FLASH_ATTENTION is not also set, it will raise an error.

Not 100% sure, but I think we do actually want FLASH_ATTENTION to be set in addition to PAGED_ATTENTION. I can't remember exactly why... going to look into it.
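If both flags do turn out to be required, the single-shard workaround would just be to set both before launch; a minimal sketch, assuming the flags are read from the environment at startup as above:

```python
import os

# Assumption: both flags are read from the environment when the server starts,
# so a single-shard run would export them together until the requirement is
# confirmed or relaxed.
os.environ["FLASH_ATTENTION"] = "true"
os.environ["PAGED_ATTENTION"] = "true"
```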

joerunde commented 6 months ago

@tdoublep Ah, I was assuming they were mutually exclusive. If they both need to be set, let me know if you find out why!