To avoid copying the entire vector when you want to get all tokens at once, please use the callback function. Here are two examples:
Chat:
for (int i = 0; i < in_strs.size(); ++i) {
    auto in_str = in_strs[i];
    auto input_tensor = tokenizer.tokenize(in_str, i);
    std::cout << "[Q] " << in_str << std::endl;
    std::cout << "[A] " << std::flush;
    LlmTextGeneratorOpts opt{
        .max_new_tokens = 100,
        .do_sample = true,
        .temperature = 0.3f,
        .top_k = 50,
        .top_p = 0.f,
    };
    // The callback is invoked once per generated token; return false to stop generation early.
    model.generate(input_tensor, opt, [&](unsigned int out_token) -> bool {
        auto out_string = tokenizer.detokenize({out_token});
        auto [isOk, print_string] = processOutput(out_string);
        if (isOk) {
            std::cout << print_string << std::flush;
        } else {
            return false;
        }
        return true;
    });
    printf("\n");
}
Get all Tokens:
for (int i = 0; i < in_strs.size(); ++i) {
    auto in_str = in_strs[i];
    auto input_tensor = tokenizer.tokenize(in_str, i);
    LlmTextGeneratorOpts opt{
        .max_new_tokens = 100,
        .do_sample = true,
        .temperature = 0.3f,
        .top_k = 50,
        .top_p = 0.f,
    };
    // Collect every generated token in the callback, then detokenize once at the end.
    std::vector<unsigned int> tokens;
    model.generate(input_tensor, opt, [&](unsigned int out_token) -> bool {
        tokens.emplace_back(out_token);
        return true;
    });
    auto out_string = tokenizer.detokenize(tokens);
}
Greedy search, top-k sampling, and top-p sampling are supported for language generation. See ref: https://huggingface.co/blog/how-to-generate
Note: The tensor provided to the top-p generator should sum to 1, indicating that a softmax operation should be applied first.
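For intuition, here is a minimal, self-contained sketch of what the top-p path expects: softmax the raw logits so they sum to 1, then sample from the smallest set of tokens whose cumulative probability reaches top_p. The function topPSample below is a hypothetical standalone helper for illustration, not part of the library's API.
Top-p sampling sketch:
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: softmax the logits, then nucleus (top-p) sample.
unsigned int topPSample(const std::vector<float>& logits, float top_p, std::mt19937& rng) {
    // Softmax with max-subtraction for numerical stability; the output sums to 1.
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (auto& p : probs) p /= sum;

    // Order token ids by descending probability.
    std::vector<unsigned int> ids(probs.size());
    std::iota(ids.begin(), ids.end(), 0u);
    std::sort(ids.begin(), ids.end(),
              [&](unsigned int a, unsigned int b) { return probs[a] > probs[b]; });

    // Keep the smallest prefix whose cumulative mass reaches top_p (the nucleus).
    float cum = 0.f;
    size_t nucleus = 0;
    while (nucleus < ids.size()) {
        cum += probs[ids[nucleus]];
        ++nucleus;
        if (cum >= top_p) break;
    }

    // Draw uniformly in [0, cum): this renormalizes over the kept tokens.
    std::uniform_real_distribution<float> dist(0.f, cum);
    float r = dist(rng);
    float acc = 0.f;
    for (size_t i = 0; i < nucleus; ++i) {
        acc += probs[ids[i]];
        if (r <= acc) return ids[i];
    }
    return ids[nucleus - 1];
}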