FasterDecoding / REST

REST: Retrieval-Based Speculative Decoding, NAACL 2024
Apache License 2.0

Gracefully handles when draft_choices is not cut enough #10

Closed. wangpatrick57 closed this 5 months ago

wangpatrick57 commented 5 months ago

In lib.rs, cut_to_choices() makes only a "best effort" attempt to cut a tree down to at most `choices` tokens. Empirically, with choices=64, it fails to cut the tree down far enough about once in a million calls. This change makes the resulting error much easier to track down. I did not add any handling for this error in gen_model_answer_rest.py, because I think crashing is the right behavior in that specific use case. In other situations (e.g. querying REST 100 million times to gather statistics about it), it is better to catch and ignore the error. With this PR, developers using the DraftReceiver wheel can choose how to handle the error themselves, instead of the wheel always crashing when cut_to_choices() fails.
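
As a rough illustration of the two handling strategies described above, here is a minimal Python sketch. It does not use the real wheel: `fake_search` and `DraftNotCutError` are hypothetical stand-ins, and the exception type the wheel actually raises may differ. The point is only to show the caller deciding between crashing and catch-and-ignore.

```python
class DraftNotCutError(Exception):
    """Hypothetical error type; the real exception raised by the wheel may differ."""


def fake_search(draft_len, choices=64):
    """Stand-in for a draft query (NOT the real API).

    Raises if the 'tree' still holds more than `choices` tokens after cutting.
    """
    if draft_len > choices:
        raise DraftNotCutError(
            f"draft has {draft_len} tokens, expected <= {choices}"
        )
    return list(range(draft_len))


def gather_stats(draft_lens, choices=64):
    """Statistics use case: skip the rare failed query and keep going."""
    sizes, skipped = [], 0
    for n in draft_lens:
        try:
            sizes.append(len(fake_search(n, choices)))
        except DraftNotCutError:
            skipped += 1  # ignore the failure, but count it
    return sizes, skipped


def generate_answer(draft_len, choices=64):
    """Benchmark use case: no try/except, so a failure stops the run."""
    return fake_search(draft_len, choices)
```

In the statistics case the caller keeps a count of skipped queries so the failure rate stays visible; in the benchmark case the exception is left to propagate so a bad draft is never silently used.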