chengchingwen / Transformers.jl

Julia Implementation of Transformer models
MIT License

Rework text encoder #161

Closed chengchingwen closed 11 months ago

chengchingwen commented 11 months ago

This PR reworks the text encoder interface. The previous TransformerTextEncoder, BertTextEncoder, GPT2TextEncoder, and T5TextEncoder are unified into a single TrfTextEncoder. TrfTextEncoder has several fields that customize the encode/decode process:

  1. annotate (defaults to TextEncoders.annotate_strings): Annotates the input string for the tokenizer, e.g. a String is treated as a single sentence, not a single word.
  2. process: The preprocessing function applied to the tokenization result, e.g. adding the special end-of-sentence token, computing the attention mask, etc.
  3. onehot (defaults to TextEncoders.lookup_first): Applies one-hot encoding to the preprocessing result. The default behavior takes the first element of the preprocessing result and one-hot encodes it.
  4. decode (defaults to identity): The function that converts each token id back to a string. This can be used to handle tokenizers that use a different vocabulary set, such as gpt2's byte-level vocabulary.
  5. textprocess (defaults to TextEncodeBase.join_text): The function that joins the decoded result into complete sentence(s).
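
The way these five fields chain together can be sketched with a toy stand-in. This is a minimal illustration, not the Transformers.jl implementation: `ToyEncoder`, `toy_encode`, `toy_decode`, `lookup`, the whitespace tokenizer, and the tiny vocabulary are all hypothetical names and assumptions made up for this example. Only the roles of the five fields come from the description above.

```julia
# Toy stand-in for the encoder described above; NOT the Transformers.jl API.
# Every name in this block is hypothetical and exists only for illustration.
struct ToyEncoder{A,P,O,D,T}
    vocab::Vector{String}
    annotate::A     # wraps the raw input before tokenization
    process::P      # post-tokenization step (special tokens, masks, ...)
    onehot::O       # turns the processed result into vocabulary indices
    decode::D       # maps each token back to text
    textprocess::T  # joins decoded tokens into sentence(s)
end

# Index of `tok` in `vocab`, falling back to index 1 ("<unk>").
lookup(vocab, tok) = something(findfirst(==(tok), vocab), 1)

function toy_encode(e::ToyEncoder, text)
    sentence = e.annotate(text)
    tokens = String.(split(sentence))   # assumed whitespace tokenizer
    processed = e.process(tokens)
    return e.onehot(e.vocab, processed)
end

function toy_decode(e::ToyEncoder, ids)
    tokens = [e.decode(e.vocab[i]) for i in ids]
    return e.textprocess(tokens)
end

vocab = ["<unk>", "<s>", "</s>", "hello", "world"]
enc = ToyEncoder(
    vocab,
    identity,
    # add start/end tokens and compute a mask alongside the tokens
    toks -> (token = ["<s>"; toks; "</s>"], mask = trues(length(toks) + 2)),
    # mimic "take the first element of the processed result"
    (v, r) -> [lookup(v, t) for t in first(r)],
    identity,
    toks -> join(toks, " "),
)

ids = toy_encode(enc, "hello world")   # [2, 4, 5, 3]
text = toy_decode(enc, ids)            # "<s> hello world </s>"
```

Here indices stand in for one-hot vectors to keep the sketch short. Swapping any single field (for example a different `process` closure) changes one stage without touching the others, which is the point of the unified design.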

A new API, decode_text, is also provided to simplify text generation. This design allows us to unify the behavioral differences between the old <X>TextEncoders and to extract the text decoder directly from the huggingface tokenizer file.
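
To illustrate why a separate per-token decode step and a joining step are useful for text generation, here is a small sketch under stated assumptions. The names `toy_vocab`, `toy_byte_decode`, `toy_join`, and `toy_decode_text` are hypothetical and do not reflect the real decode_text signature. The only real convention used is that GPT-2's byte-level vocabulary renders a leading space as 'Ġ' (U+0120).

```julia
# Hypothetical helpers for illustration; not the Transformers.jl API.
toy_vocab = ["Hello", "Ġworld", "!"]

# Per-token decode: map the byte-level space marker back to a real space.
toy_byte_decode(tok) = replace(tok, 'Ġ' => ' ')

# Joining step: concatenate the decoded tokens into one string.
toy_join(toks) = join(toks)

# Look up each id, decode each token, then join the result.
toy_decode_text(ids; decode = identity, textprocess = toy_join) =
    textprocess([decode(toy_vocab[i]) for i in ids])

raw = toy_decode_text([1, 2, 3])                            # "HelloĠworld!"
clean = toy_decode_text([1, 2, 3]; decode = toy_byte_decode) # "Hello world!"
```

With `decode = identity` the vocabulary's internal representation leaks into the output. Supplying a decode function restores readable text without changing the joining logic.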

codecov[bot] commented 11 months ago

Codecov Report

Attention: 41 lines in your changes are missing coverage. Please review.

Comparison is base (91a3fe0) 46.52% compared to head (a9bcc53) 60.89%.

| Files | Patch % | Lines |
|---|---|---|
| src/textencoders/TextEncoders.jl | 66.66% | 14 Missing :warning: |
| src/huggingface/tokenizer/fast_tkr.jl | 88.18% | 13 Missing :warning: |
| src/textencoders/gpt_textencoder.jl | 37.50% | 5 Missing :warning: |
| src/textencoders/utils.jl | 84.21% | 3 Missing :warning: |
| src/huggingface/tokenizer/tokenizer.jl | 60.00% | 2 Missing :warning: |
| src/tokenizer/unigram/unigram.jl | 81.81% | 2 Missing :warning: |
| src/textencoders/t5_textencoder.jl | 75.00% | 1 Missing :warning: |
| src/tokenizer/unigram/tokenization.jl | 0.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@             Coverage Diff             @@
##              0.3     #161       +/-   ##
===========================================
+ Coverage   46.52%   60.89%   +14.37%
===========================================
  Files          85       85
  Lines        4400     4547     +147
===========================================
+ Hits         2047     2769     +722
+ Misses       2353     1778     -575
```
