Introduces a new file, generation_algorithm.py, which implements a speculative sampling algorithm. The algorithm is wired into the BaseOnsiteLLM class through a new speculative_sampling attribute.
The speculative sampling algorithm receives its parameters through generation_kw_args when the BaseOnsiteLLM class is initialized: draft_model_uri, plus two optional hyperparameters, k and scheduler, which control the number of tokens generated per iteration.
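For illustration, a minimal configuration sketch; the import path, constructor signature, and example values are assumptions, not the exact API introduced by this PR:

```python
# Sketch only: import path, constructor signature, and values are assumed.
from llm_vm.onsite_llm import BaseOnsiteLLM  # hypothetical import path

llm = BaseOnsiteLLM(
    generation_kw_args={
        "draft_model_uri": "name-or-path-of-smaller-draft-model",  # required
        "k": 4,             # optional: number of tokens drafted per iteration
        "scheduler": None,  # optional: strategy for adjusting k across iterations
    }
)
```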
When the speculative_sampling attribute is present, the algorithm is invoked through the complete method of the BaseOnsiteLLM class, which returns the newly generated token IDs. The method also accepts an optional alignment parameter that controls how closely the probabilities of the draft tokens must match those of the target tokens.
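A hedged usage sketch, assuming complete takes the prompt as its first positional argument (only the alignment parameter and the returned token IDs are stated by this PR):

```python
# Sketch only: the exact signature of complete() is an assumption.
new_token_ids = llm.complete(
    "Explain speculative sampling in one sentence.",
    alignment=1,  # 1 = perfect alignment with the target model (default)
)
print(new_token_ids)  # the newly generated token IDs
```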
With alignment set to 1 (perfect alignment, the default), the algorithm aims to predict exactly the same answers as the target model would. The implementation handles a batch size of 1, matching the current behavior of the generate method in the BaseOnsiteLLM class.
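For readers unfamiliar with the technique, here is a generic, self-contained sketch of one speculative sampling iteration using the standard exact acceptance rule, which corresponds to the alignment = 1 behavior described above. It is not the code in generation_algorithm.py; the model objects and their next_token_probs / probs_for_positions helpers are hypothetical:

```python
# Generic illustration of one speculative sampling iteration; not the
# implementation in generation_algorithm.py. The model objects and their
# next_token_probs / probs_for_positions helpers are hypothetical.
import torch

def speculative_step(target, draft, input_ids, k=4):
    """Draft k tokens with the small model, then accept or reject them
    against the target model's distribution. Batch size 1."""
    draft_ids, draft_probs = [], []
    ctx = input_ids                                    # shape (1, seq_len)
    for _ in range(k):
        p = draft.next_token_probs(ctx)                # shape (vocab,)
        tok = torch.multinomial(p, 1)                  # sample one draft token
        draft_ids.append(tok)
        draft_probs.append(p)
        ctx = torch.cat([ctx, tok.view(1, 1)], dim=-1)

    # One target forward pass scores all k drafted positions at once.
    target_probs = target.probs_for_positions(ctx, k)  # list of k (vocab,) tensors

    accepted = []
    for i, tok in enumerate(draft_ids):
        q, p = draft_probs[i][tok], target_probs[i][tok]
        if torch.rand(1) < torch.clamp(p / q, max=1.0):
            accepted.append(tok)                       # draft token accepted
        else:
            # Rejected: resample from the residual distribution and stop.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1))
            break
    return torch.cat(accepted)                         # newly generated token IDs
```

When every draft token is accepted, speculative sampling implementations typically also sample one extra token from the target model's final distribution; that step is omitted here for brevity.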
fixes #367