Closed chengchingwen closed 11 months ago
Codecov report: 41 lines in the changes are missing coverage. Coverage went from 46.52% at base (91a3fe0) to 60.89% at head (a9bcc53).
This PR reworked the text encoder interface. The previous `TransformerTextEncoder`, `BertTextEncoder`, `GPT2TextEncoder`, and `T5TextEncoder` are unified into `TrfTextEncoder`. `TrfTextEncoder` has multiple fields that can modify the encode/decode process:

- `annotate` (default: `TextEncoders.annotate_strings`): annotates the input string for the tokenizer, e.g. a `String` is treated as a single sentence, not a single word.
- `process`: the preprocess function applied to the tokenization result, e.g. adding the special end-of-sentence token, computing the attention mask, etc.
- `onehot` (default: `TextEncoders.lookup_first`): applies one-hot encoding to the preprocess result; the default behavior takes the first element of the preprocess result and one-hot encodes it.
- `decode` (default: `identity`): the function that converts each token id back to a string. This can be used to handle tokenizers that use a different set of vocabulary, such as GPT-2's byte-level vocabulary.
- `textprocess` (default: `TextEncodeBase.join_text`): the function that joins the `decode`-d results into complete sentence(s).

A new API, `decode_text`, is also provided to simplify text generation. These designs allow us to unify the behavioral differences between the old `<X>TextEncoder`s and to extract the text decoder directly from the HuggingFace tokenizer file.
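To illustrate the design, here is a language-agnostic sketch (in Python, since the actual package is Julia) of the pattern the PR describes: one encoder type whose encode/decode behavior is customized through function-valued fields rather than separate subclasses. All names and signatures below are hypothetical stand-ins, not the Transformers.jl API; the `onehot` stage is simplified to an id lookup.

```python
# Hypothetical sketch of the TrfTextEncoder design: a single encoder type
# configured by function-valued fields (annotate/process/onehot/decode/textprocess).
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TrfTextEncoderSketch:
    vocab: Dict[str, int]                                # token -> id
    annotate: Callable[[str], List[str]]                 # treat input as a sentence & split (simplified)
    process: Callable[[List[str]], List[str]]            # e.g. append an end-of-sentence token
    onehot: Callable[["TrfTextEncoderSketch", List[str]], List[int]]
    decode: Callable[[str], str] = lambda tok: tok       # per-token post-processing (default: identity)
    textprocess: Callable[[List[str]], str] = " ".join   # join decoded tokens into a sentence

    def encode(self, text: str) -> List[int]:
        # annotate -> process -> onehot
        tokens = self.process(self.annotate(text))
        return self.onehot(self, tokens)

    def decode_text(self, ids: List[int]) -> str:
        # inverse lookup -> decode each token -> join into text
        inv = {i: t for t, i in self.vocab.items()}
        return self.textprocess([self.decode(inv[i]) for i in ids])


vocab = {"hello": 0, "world": 1, "</s>": 2}
enc = TrfTextEncoderSketch(
    vocab=vocab,
    annotate=str.split,                                  # whole string is one sentence of words
    process=lambda toks: toks + ["</s>"],                # add the special end-of-sentence token
    onehot=lambda e, toks: [e.vocab[t] for t in toks],   # id lookup standing in for one-hot
)
ids = enc.encode("hello world")       # -> [0, 1, 2]
text = enc.decode_text(ids)           # -> "hello world </s>"
```

Swapping any field (e.g. a byte-level `decode`, or a `process` that also builds an attention mask) changes the pipeline without introducing a new encoder type, which is the point of unifying the old per-model encoders.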