caikit / caikit-nlp

Apache License 2.0

Enable using kwargs for selecting pad-to-max-length strategy for tokenizer in embeddings #393

Closed kcirred closed 1 month ago

kcirred commented 1 month ago

Allows users to pass tokenizer keyword arguments to select the desired tokenizer settings — in this case, enabling pad_to_max_length.
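A minimal sketch of the kwargs-forwarding idea, using a toy whitespace tokenizer rather than the real caikit-nlp or Hugging Face API (the function names `encode` and `toy_tokenizer` are hypothetical, for illustration only):

```python
def toy_tokenizer(texts, truncation=True, padding=False, max_length=8):
    # Toy stand-in for a real tokenizer: whitespace split, pad ids with 0s.
    ids = [[hash(w) % 100 + 1 for w in t.split()][:max_length] for t in texts]
    if padding == "max_length":
        ids = [row + [0] * (max_length - len(row)) for row in ids]
    return {
        "input_ids": ids,
        "attention_mask": [[1 if i else 0 for i in row] for row in ids],
    }

def encode(texts, tokenizer, **tokenizer_kwargs):
    # Forward caller-selected kwargs (e.g. padding="max_length") straight
    # through to the tokenizer instead of hard-coding a padding strategy.
    return tokenizer(texts, **tokenizer_kwargs)

out = encode(["hello world"], toy_tokenizer, padding="max_length")
print(len(out["input_ids"][0]))       # padded to max_length = 8
print(sum(out["attention_mask"][0]))  # only 2 real tokens counted
```

With `padding="max_length"` the shapes of `input_ids` and `attention_mask` grow to the maximum length, but the mask still marks only the real tokens.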

kcirred commented 1 month ago

@gkumbhat I also wrote a unit test so let me push that as well.

I suspect sum_token_count in embeddings.py is doing what is intended: by computing sum(encoding.attention_mask), the padding 0s drop out of the count.
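A quick sketch of why the count is padding-invariant (the one-line `sum_token_count` here is an assumption about the helper's behavior, not the actual embeddings.py source):

```python
def sum_token_count(attention_mask):
    # Assumed behavior: sum the attention mask, so only positions marked 1
    # (real tokens) contribute; padded positions contribute 0.
    return sum(attention_mask)

unpadded = [1, 1, 1, 1]               # 4 real tokens, no padding
padded = [1, 1, 1, 1, 0, 0, 0, 0]     # same 4 tokens, padded to length 8

print(sum_token_count(unpadded))  # 4
print(sum_token_count(padded))    # 4 -- padding does not change the count
```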

In my test case, I originally wanted to show that the result stays the same even though the shapes of the tokenizer's input_ids and attention_mask change. I will not assert on input_token_count, which comes from sum_token_count, because it comes out the same regardless of the change in attention_mask. For now I only test that changing the tokenizer option in our use case does not change the final result.