This changes text generation serving to only return the new text (without the prompt). This is consistent with streaming. Also, encoder-decoder models like BART already don't return the input text, since it is used as a "context" rather than a "prompt" to complete.
This is a small breaking change, but the next release is going to be 0.5.0, so I think it's fine.
Initially I thought about adding an option like `:return_full_text`, but to handle leading spaces in a generic way we would need to make another tokenizer pass over the input and then do a prefix replacement (that's what hf/transformers does). I don't think this is worth it, because end users know which model they are working with, so they can easily concatenate the prompt themselves, with or without a space. We can revisit the option if an actual use case comes up, but it's usually the new text that users care about.
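A minimal sketch of the caller-side concatenation mentioned above. The serving result here is hardcoded to stand in for a real text generation serving, and the `%{results: [%{text: ...}]}` shape is assumed to match what the serving returns:

```elixir
prompt = "Elixir is"

# Hypothetical serving output after this change: :text holds only
# the newly generated text, without the prompt.
result = %{results: [%{text: " a functional programming language."}]}
%{results: [%{text: new_text}]} = result

# Callers who want the full text concatenate it themselves,
# adding a space only if the model's tokenizer doesn't emit one.
full_text = prompt <> new_text
```
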
Closes #247.