dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

[FEATURE] relax spm sanity check; add t5 tokenizer tests #1474

Closed (yongyi-wu closed this 3 years ago)

yongyi-wu commented 3 years ago

Description

This PR relaxes the sanity check in SentencepieceTokenizer so that additional special tokens can be inserted more easily. This will later help add the tokens corresponding to noise span sentinels, as in the T5 tokenizer. Accordingly, the model and vocab for T5-base have been uploaded to S3 for use in new test cases.
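To illustrate what inserting sentinel tokens into a vocabulary can look like, here is a minimal, library-free sketch. It does not reproduce this PR's change to SentencepieceTokenizer. The helper names `make_sentinels` and `extend_vocab` are hypothetical, the `<extra_id_i>` naming and the default of 100 sentinels follow the convention commonly used for T5 tokenizers and are assumptions with respect to this PR, and the toy base vocabulary is made up.

```python
from typing import List


def make_sentinels(n: int = 100) -> List[str]:
    """Return n sentinel tokens named <extra_id_0> ... <extra_id_{n-1}>.

    The naming scheme and the default count are assumptions (T5 convention).
    """
    return [f"<extra_id_{i}>" for i in range(n)]


def extend_vocab(base_tokens: List[str], extra_tokens: List[str]) -> List[str]:
    """Append extra special tokens to a vocabulary.

    Hypothetical "relaxed" behaviour: a token that already exists is
    skipped instead of raising an error, so existing ids never change.
    """
    seen = set(base_tokens)
    extended = list(base_tokens)
    for token in extra_tokens:
        if token in seen:
            continue  # already present: keep its original position
        seen.add(token)
        extended.append(token)
    return extended


# Toy vocabulary, for illustration only.
base = ["<pad>", "</s>", "<unk>", "▁the"]
vocab = extend_vocab(base, make_sentinels(3))
print(vocab)
```

Appending at the end keeps every existing token id stable, which is why the sketch never reorders or overwrites entries.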

cc @dmlc/gluon-nlp-team

codecov[bot] commented 3 years ago

Codecov Report

Merging #1474 (d36a92f) into master (def0d70) will decrease coverage by 0.17%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1474      +/-   ##
==========================================
- Coverage   85.86%   85.68%   -0.18%     
==========================================
  Files          52       52              
  Lines        6911     8266    +1355     
==========================================
+ Hits         5934     7083    +1149     
- Misses        977     1183     +206     

Impacted Files                                 Coverage Δ
src/gluonnlp/op.py                             95.78% <ø> (+0.70%) ↑
src/gluonnlp/data/tokenizers/sentencepiece.py  78.44% <100.00%> (ø)
src/gluonnlp/models/electra.py                 68.90% <0.00%> (-7.94%) ↓
src/gluonnlp/models/roberta.py                 90.47% <0.00%> (-3.15%) ↓
src/gluonnlp/models/albert.py                  92.38% <0.00%> (-3.06%) ↓
src/gluonnlp/models/gpt2.py                    95.38% <0.00%> (-2.89%) ↓
src/gluonnlp/models/bert.py                    91.97% <0.00%> (-2.83%) ↓
src/gluonnlp/models/bart.py                    91.44% <0.00%> (-2.31%) ↓
src/gluonnlp/data/filtering.py                 78.03% <0.00%> (-0.24%) ↓
src/gluonnlp/models/transformer.py             98.89% <0.00%> (-0.05%) ↓
... and 13 more
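
The report's summary line says coverage drops by 0.17%, while the coverage diff shows -0.18%. The following arithmetic sketch recomputes both figures from the Hits and Lines counts in the diff. It assumes the percentages shown in the diff are truncated to two decimals rather than rounded. That is one reading that fits the numbers, not something the report states. The helper names `pct` and `truncate2` are made up for illustration.

```python
import math


def pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


def truncate2(x: float) -> float:
    """Truncate (not round) a positive value to two decimals. Assumed display rule."""
    return math.floor(x * 100) / 100


before = pct(5934, 6911)  # master: Hits / Lines
after = pct(7083, 8266)   # this PR: Hits / Lines

# Change computed from the unrounded percentages.
exact_delta = round(after - before, 2)
# Change computed from the truncated, displayed percentages.
table_delta = round(truncate2(after) - truncate2(before), 2)

print(truncate2(before), truncate2(after), exact_delta, table_delta)
```

Under that assumption the two figures agree: the only difference is whether the subtraction happens before or after the percentages are shortened to two decimals.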

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update def0d70...d36a92f.

github-actions[bot] commented 3 years ago

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1474/t5/index.html