Rework text encoder #161

chengchingwen / Transformers.jl

Julia Implementation of Transformer models

MIT License

526 stars 75 forks

This PR reworks the text encoder interface. The previous TransformerTextEncoder, BertTextEncoder, GPT2TextEncoder, and T5TextEncoder are unified into TrfTextEncoder. TrfTextEncoder has several fields that customize the encode/decode process:

annotate (default to TextEncoders.annotate_strings): Annotates the input for the tokenizer, e.g. a String is treated as a single sentence, not a single word.
process: The preprocessing function applied to the tokenization results, e.g. adding a special end-of-sentence token or computing the attention mask.
onehot (default to TextEncoders.lookup_first): Applies one-hot encoding to the preprocessed result. The default behavior takes the first element of the preprocessed result and one-hot encodes it.
decode (default to identity): The function that converts each token id back to a string. This can be used to handle tokenizers that use a different vocabulary representation, such as GPT-2's byte-level vocabulary.
textprocess (default to TextEncodeBase.join_text): The function that joins the decoded results into complete sentence(s).
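
To make the division of labor between these fields concrete, here is a minimal plain-Julia sketch of the same staged design. It is not the Transformers.jl implementation: `ToyEncoder`, `toy_tokenize`, `toy_encode`, `toy_decode`, and `lookup` are hypothetical names invented for this illustration, the tokenizer is a whitespace split, and the `onehot` stage returns plain token ids instead of one-hot arrays.

```julia
# Toy illustration only: ToyEncoder and every helper below are hypothetical
# and are not part of Transformers.jl.
struct ToyEncoder{A,P,O,D,T}
    vocab::Vector{String}
    annotate::A     # prepare the raw input for the tokenizer
    process::P      # post-process tokens (add special tokens, ...)
    onehot::O       # map processed tokens to ids
    decode::D       # map each token back to a display string
    textprocess::T  # join decoded tokens into text
end

# Stand-in tokenizer: split on whitespace.
toy_tokenize(sentence::AbstractString) = String.(split(sentence))

# Encoding runs annotate -> tokenize -> process -> onehot.
function toy_encode(e::ToyEncoder, x)
    tokens = toy_tokenize(e.annotate(x))
    processed = e.process(tokens)
    return e.onehot(e.vocab, processed)
end

# Decoding runs id lookup -> decode (per token) -> textprocess (join).
function toy_decode(e::ToyEncoder, ids)
    tokens = [e.decode(e.vocab[i]) for i in ids]
    return e.textprocess(tokens)
end

vocab = ["<s>", "</s>", "<unk>", "hello", "world"]

# Unknown tokens fall back to index 3 ("<unk>").
lookup(vocab, tokens) = [something(findfirst(==(t), vocab), 3) for t in tokens]

enc = ToyEncoder(
    vocab,
    identity,                           # annotate: pass the string through
    tokens -> ["<s>"; tokens; "</s>"],  # process: add start/end tokens
    lookup,                             # onehot stand-in: token ids
    identity,                           # decode: tokens are already readable
    tokens -> join(tokens, " "),        # textprocess: join with spaces
)

ids = toy_encode(enc, "hello world")
text = toy_decode(enc, ids)
```

Because every stage is an ordinary function stored in a field, swapping one stage (for example a different `process` that also computes an attention mask) leaves the rest of the pipeline untouched, which is the property the unified encoder relies on.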

A new API, decode_text, is also provided to simplify text generation. These designs allow us to unify the behavioral differences between the old <X>TextEncoders and to extract the text decoder directly from the Hugging Face tokenizer file.
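
The difference between decoding token by token and decoding to text can be sketched as follows. This is an assumption-laden toy, not the library's decode_text: `TOY_BYTE_MAP`, `toy_token_decode`, and `toy_decode_text` are hypothetical names, and the map covers only two of the symbols in GPT-2's byte-level vocabulary (the space byte rendered as 'Ġ' and the newline byte rendered as 'Ċ').

```julia
# Hypothetical helpers for illustration; not the Transformers.jl implementation.
# Partial byte-level map: GPT-2 renders the space byte as 'Ġ' and newline as 'Ċ'.
const TOY_BYTE_MAP = Dict('Ġ' => ' ', 'Ċ' => '\n')

# Per-token step: translate byte-level symbols back to the characters they stand for.
toy_token_decode(token::AbstractString) = map(c -> get(TOY_BYTE_MAP, c, c), token)

# Text step: decode every token, then join the pieces into one string.
toy_decode_text(tokens) = join(toy_token_decode.(tokens))

tokens = ["Hello", "Ġworld", "!"]
decoded = toy_token_decode.(tokens)  # still one string per token
text = toy_decode_text(tokens)       # a single joined string
```

For text generation the joined string is usually what is wanted, which is why a single call that performs both steps is convenient.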

Codecov Report

Attention: 41 lines in your changes are missing coverage. Please review.

Comparison is base (91a3fe0) 46.52% compared to head (a9bcc53) 60.89%.

| Files | Patch % | Lines |
|---|---|---|
| src/textencoders/TextEncoders.jl | 66.66% | 14 Missing :warning: |
| src/huggingface/tokenizer/fast_tkr.jl | 88.18% | 13 Missing :warning: |
| src/textencoders/gpt_textencoder.jl | 37.50% | 5 Missing :warning: |
| src/textencoders/utils.jl | 84.21% | 3 Missing :warning: |
| src/huggingface/tokenizer/tokenizer.jl | 60.00% | 2 Missing :warning: |
| src/tokenizer/unigram/unigram.jl | 81.81% | 2 Missing :warning: |
| src/textencoders/t5_textencoder.jl | 75.00% | 1 Missing :warning: |
| src/tokenizer/unigram/tokenization.jl | 0.00% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##              0.3     #161       +/-   ##
===========================================
+ Coverage   46.52%   60.89%   +14.37%
===========================================
  Files          85       85
  Lines        4400     4547      +147
===========================================
+ Hits         2047     2769      +722
+ Misses       2353     1778      -575
```
