dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0
2.55k stars, 538 forks

[Tokenizer] Fix huggingface wordpiece warning #1477

Closed by sxjscience 3 years ago

sxjscience commented 3 years ago

Description

Fix the Hugging Face WordPiece tokenizer. I noticed that the previous implementation could trigger a warning in AutoGluon.
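For reference, one way to check that a tokenizer call no longer emits a warning is to record warnings around the call with the standard library. This is a minimal sketch, not the test used in this PR: `noisy_tokenize` and `quiet_tokenize` are hypothetical stand-ins for the old and new behavior, and the warning text is invented for illustration.

```python
import warnings


def collect_warnings(fn, *args, **kwargs):
    """Call fn and return (result, list of warning messages it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones normally shown only once.
        warnings.simplefilter("always")
        result = fn(*args, **kwargs)
    return result, [str(w.message) for w in caught]


# Hypothetical stand-in for a tokenizer that still triggers a warning.
def noisy_tokenize(text):
    warnings.warn("deprecated argument", DeprecationWarning)
    return text.split()


# Hypothetical stand-in for the fixed tokenizer.
def quiet_tokenize(text):
    return text.split()


noisy_result, noisy_messages = collect_warnings(noisy_tokenize, "hello world")
quiet_result, quiet_messages = collect_warnings(quiet_tokenize, "hello world")
```

A regression test could then assert that the list of messages is empty for the fixed code path.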

Comments

cc @dmlc/gluon-nlp-team

sxjscience commented 3 years ago

@hymzoque Would you also take a look?

codecov[bot] commented 3 years ago

Codecov Report

Merging #1477 (c268cff) into master (d98a326) will decrease coverage by 0.09%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1477      +/-   ##
==========================================
- Coverage   85.95%   85.85%   -0.10%     
==========================================
  Files          52       52              
  Lines        6912     6909       -3     
==========================================
- Hits         5941     5932       -9     
- Misses        971      977       +6     
| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| src/gluonnlp/data/tokenizers/huggingface.py | 71.83% <100.00%> (-0.24%) | :arrow_down: |
| src/gluonnlp/data/loading.py | 81.13% <0.00%> (-2.27%) | :arrow_down: |
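
The summary line reports a 0.09% decrease while the diff table shows -0.10%. Recomputing from the Hits and Lines rows suggests both figures are consistent. The sketch below assumes the table truncates each percentage to two decimals before subtracting; that is an assumption about the display, not something the report states.

```python
import math


def coverage_pct(hits, lines):
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


def truncate2(value):
    """Truncate (not round) a positive value to two decimal places."""
    return math.floor(value * 100) / 100


base = coverage_pct(5941, 6912)  # master (d98a326)
head = coverage_pct(5932, 6909)  # PR #1477 (c268cff)

# Difference of the unrounded percentages, as in the summary line.
exact_delta = head - base

# Difference of the truncated percentages, as in the diff table (assumed).
displayed_delta = truncate2(head) - truncate2(base)

print(f"exact delta:     {exact_delta:+.2f}%")
print(f"displayed delta: {displayed_delta:+.2f}%")
```

Under that assumption, the exact difference is about -0.09 percentage points, and the difference of the two truncated values is -0.10.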

github-actions[bot] commented 3 years ago

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1477/fix_wordpiece/index.html