mynhardtburger closed this 8 months ago
FYI: @markstur
The error case in `_truncate_input_tokens()` could also be refactored to make use of `_sum_token_count()`, to avoid having to rerun the tokenization.
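A minimal sketch of what that refactor might look like, assuming the error path already has the `BatchEncoding` in hand and only needs the count for its message (both helper bodies below are stand-ins, not the PR's actual code, and assume the tokenizer was called without `return_tensors`):

```python
from transformers import BatchEncoding


def _sum_token_count(tokenized: BatchEncoding) -> int:
    # Stand-in for the helper discussed in this thread: counts only the
    # positions the model attends to, so [PAD] never inflates the total.
    return sum(sum(mask) for mask in tokenized["attention_mask"])


def _raise_if_too_long(tokenized: BatchEncoding, max_tokens: int) -> None:
    # Hypothetical error case: reuse the existing BatchEncoding instead of
    # re-running the tokenizer just to report how long the input was.
    token_count = _sum_token_count(tokenized)
    if token_count > max_tokens:
        raise ValueError(
            f"Token sequence length ({token_count}) exceeds the "
            f"maximum allowed ({max_tokens})."
        )
```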
@mynhardtburger caikit 0.26.14 is available with the data model update
Depends on caikit data model updates in this PR: https://github.com/caikit/caikit/pull/675
This extends the embedding module to include the `input_token_count` in the results of the `EmbeddingModule`'s `run_*` methods.
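Roughly, the shape of the change from a caller's point of view (the class below is an illustrative stand-in, not the actual caikit data model object, which comes from the linked caikit PR):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EmbeddingResultSketch:
    """Illustrative stand-in for a run_* result carrying the new field."""

    values: List[float]      # the embedding vector, as before
    input_token_count: int   # new: tokens the model actually attended to


# Callers can now read the token count directly off the result:
result = EmbeddingResultSketch(values=[0.1, 0.2, 0.3], input_token_count=7)
print(result.input_token_count)  # 7
```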
The `sum_token_count(tokenized: BatchEncoding) -> int` function calculates the count of tokens requiring model attention, based on the `Encoding.attention_mask` property, as returned by `SentenceTransformerWithTruncate.tokenizer()`. `[PAD]` is irrelevant for truncation and the `max_token_count` parameter, while `[CLS]` and `[SEP]` are counted by the model when it considers the max length and truncation.

Additionally, tests were added to confirm that sort order is maintained.
Various other quality-of-life type hints were added.
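For reference, a self-contained sketch of the counting behaviour described above, using a plain Hugging Face fast tokenizer in place of `SentenceTransformerWithTruncate.tokenizer()` (the model name and helper body are illustrative, not the PR's exact implementation):

```python
from transformers import AutoTokenizer, BatchEncoding


def sum_token_count(tokenized: BatchEncoding) -> int:
    """Sum of tokens requiring model attention across the whole batch."""
    # Each fast-tokenizer Encoding exposes attention_mask as a list of 0/1s:
    # 1 for real tokens ([CLS], wordpieces, [SEP]), 0 for [PAD].
    return sum(sum(encoding.attention_mask) for encoding in tokenized.encodings)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["hello world", "a somewhat longer second sentence"], padding=True)

# Padding never inflates the count, while [CLS]/[SEP] are counted,
# matching how the model sees max length and truncation.
print(sum_token_count(batch))
```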