FasterDecoding / REST

REST: Retrieval-Based Speculative Decoding, NAACL 2024
Apache License 2.0

Support DraftRetriever datastore read/write for large tokenizers and vocabulary sizes (i.e. llama3+) #23

Closed by scandukuri 6 days ago

scandukuri commented 1 week ago

This PR makes the changes needed to read and write suffixes in memory and on disk for large tokenizers; the original implementation only supported token IDs up to Rust's u16::MAX (65,535).

Crucially, using Rust i32 (instead of the original u16) for reading and writing individual token IDs lets the tool support token IDs up to i32::MAX (2,147,483,647), while still allowing negative placeholder IDs, such as -2, for padding.
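As a hedged illustration of the widened encoding (the function names and byte layout here are hypothetical, not the actual DraftRetriever on-disk format), each token ID is now serialized as 4 little-endian bytes instead of 2, which accommodates llama3-scale vocabularies and negative padding values:

```rust
// Hypothetical sketch: widen per-token storage from u16 (2 bytes)
// to i32 (4 bytes). Names here are illustrative, not REST's real API.
fn write_tokens(ids: &[i32]) -> Vec<u8> {
    // Each token ID becomes 4 little-endian bytes.
    ids.iter().flat_map(|id| id.to_le_bytes()).collect()
}

fn read_tokens(buf: &[u8]) -> Vec<i32> {
    buf.chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    // llama3 token IDs exceed u16::MAX (65,535), and -2 is a
    // negative padding placeholder that u16 could not represent.
    let ids = vec![128_000, 70_000, 42, -2];
    let bytes = write_tokens(&ids);
    assert_eq!(bytes.len(), ids.len() * 4);
    assert_eq!(read_tokens(&bytes), ids);
    println!("ok");
}
```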

zhenyuhe00 commented 6 days ago

Hi, thank you for the fix. Would you consider creating a new branch? The change from u16 to i32 isn't needed for models with small vocabularies and would increase disk storage usage unnecessarily.
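The storage concern is straightforward to quantify; a back-of-envelope sketch (the 1B-token datastore size is an assumed example, not a figure from REST):

```rust
fn main() {
    // Per-token cost of each encoding: u16 is 2 bytes, i32 is 4 bytes,
    // so the widened layout doubles raw token storage on disk.
    let n_tokens: u64 = 1_000_000_000; // assumed example: a 1B-token datastore
    let u16_bytes = n_tokens * std::mem::size_of::<u16>() as u64;
    let i32_bytes = n_tokens * std::mem::size_of::<i32>() as u64;
    assert_eq!(i32_bytes, 2 * u16_bytes);
    println!("u16: {} GB, i32: {} GB", u16_bytes / 1_000_000_000, i32_bytes / 1_000_000_000);
}
```

For a small-vocabulary model that never emits an ID above 65,535, those extra 2 bytes per token buy nothing, which is why keeping the i32 layout on a separate branch makes sense.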

scandukuri commented 6 days ago

Yes! I can make a 'llama3' branch with the existing DraftRetriever changes plus the necessary changes to modeling_llama_kv.py.

zhenyuhe00 commented 6 days ago

That sounds great! Appreciate your effort.