Closed kcirred closed 1 month ago
@gkumbhat I also wrote a unit test so let me push that as well.
I suspect def sum_token_count in embeddings.py is doing what is intended. By computing sum(encoding.attention_mask), the 0s from padding are dropped.
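A minimal sketch of why summing the attention mask is padding-insensitive (the helper name mirrors sum_token_count, but the function body here is a hypothetical stand-in, not the actual embeddings.py implementation):

```python
# Hypothetical stand-in for sum_token_count: padded positions are 0 in the
# attention mask, so they contribute nothing to the sum and the count of
# real tokens is unchanged by padding.
def sum_token_count(attention_mask):
    return sum(attention_mask)

unpadded = [1, 1, 1, 1]               # 4 real tokens, no padding
padded = [1, 1, 1, 1, 0, 0, 0, 0]     # same tokens, padded to max length

# Same token count despite the different attention_mask shape.
assert sum_token_count(unpadded) == sum_token_count(padded) == 4
```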
In my test case, I originally wanted to show that the results stay the same even though the tokenizer's input_ids and attention_mask change shape. I will not assert on input_token_count, which comes from sum_token_count, because it comes out the same despite the change in attention_mask. For now I only test that changing the tokenizer option in our use case does not change the final result.
Allows users to pass a tokenizer keyword argument to select the desired tokenizer settings; in this case, enabling pad_to_max_length.
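A rough sketch of the pass-through pattern being described, using a stub tokenizer; the function and parameter names (embed, tokenizer_kwargs, stub_tokenizer) are hypothetical and do not reflect the actual PR code:

```python
# Hypothetical pass-through: user-supplied tokenizer options (e.g. padding to
# max length) are forwarded to the tokenizer call rather than hard-coded.
def embed(texts, tokenizer, tokenizer_kwargs=None):
    tokenizer_kwargs = tokenizer_kwargs or {}
    return tokenizer(texts, **tokenizer_kwargs)

# Stub standing in for a real tokenizer, just to show the kwargs forwarding.
def stub_tokenizer(texts, padding=False, max_length=None):
    return {"padding": padding, "max_length": max_length}

result = embed(["hi"], stub_tokenizer,
               {"padding": "max_length", "max_length": 8})
```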