NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0
3.83k stars 899 forks

Add berttokenize unit. #722

jellying closed 5 years ago

jellying commented 5 years ago

717

bwanglzu commented 5 years ago

This would break the code structure of the whole project.

jellying commented 5 years ago

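> This would break the code structure of the whole project.

The key problem is that using BERT requires loading BERT's own vocab rather than one generated by the vocabulary unit. We need to work out a design that makes BERT's vocabulary and preprocessing compatible with the rest of the project.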
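To make the vocabulary point concrete, here is a minimal sketch of why a fitted vocabulary cannot stand in for BERT's. It assumes the usual BERT `vocab.txt` layout (one token per line, with the line index as the token id) and the greedy longest-match-first WordPiece split. The names `load_bert_vocab` and `wordpiece_tokenize` and the toy vocab are hypothetical illustrations, not MatchZoo or BERT API.

```python
from typing import Dict, Iterable, List


def load_bert_vocab(lines: Iterable[str]) -> Dict[str, int]:
    """Map each token to its line index (assumed vocab.txt layout)."""
    vocab = {}
    for index, line in enumerate(lines):
        vocab[line.rstrip("\n")] = index
    return vocab


def wordpiece_tokenize(word: str, vocab: Dict[str, int],
                       unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Toy stand-in for a real vocab.txt; a real file would be opened and
# passed to load_bert_vocab line by line.
toy_vocab_lines = ["[PAD]", "[UNK]", "[CLS]", "[SEP]",
                   "match", "##ing", "##zoo", "text"]
vocab = load_bert_vocab(toy_vocab_lines)

tokens = ["[CLS]"] + wordpiece_tokenize("matching", vocab) + ["[SEP]"]
ids = [vocab[token] for token in tokens]
print(tokens, ids)
```

The ids are fixed by the pretrained vocab file, so a unit that assigns indices while fitting on the training data would produce ids that do not line up with the pretrained embedding rows.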

codecov-io commented 5 years ago

Codecov Report

Merging #722 into 2.2-dev will increase coverage by 0.13%. The diff coverage is 96.87%.

@@             Coverage Diff             @@
##           2.2-dev     #722      +/-   ##
===========================================
+ Coverage    94.31%   94.45%   +0.13%     
===========================================
  Files           98      101       +3     
  Lines         3378     3570     +192     
===========================================
+ Hits          3186     3372     +186     
- Misses         192      198       +6

| Impacted Files | Coverage Δ | |
|---|---|---|
| matchzoo/preprocessors/units/vocabulary.py | 100% <100%> (ø) | :arrow_up: |
| matchzoo/preprocessors/build_vocab_unit.py | 100% <100%> (ø) | :arrow_up: |
| matchzoo/preprocessors/units/bert_clean.py | 94.11% <94.11%> (ø) | |
| matchzoo/preprocessors/bert_preprocessor.py | 96.22% <96.22%> (ø) | |
| matchzoo/preprocessors/units/tokenize.py | 96.72% <96.42%> (-3.28%) | :arrow_down: |
| matchzoo/utils/bert_utils.py | 97.67% <97.67%> (ø) | |
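
The totals in the report can be cross-checked with a little arithmetic. The sketch below assumes that project coverage is hits divided by lines. The helper `coverage_percent` is a hypothetical name, not part of any Codecov tooling.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, assuming coverage = hits / lines."""
    return 100.0 * hits / lines


# Figures taken from the coverage diff for 2.2-dev (base) and #722 (head).
base_hits, base_misses, base_lines = 3186, 192, 3378
head_hits, head_misses, head_lines = 3372, 198, 3570

base = coverage_percent(base_hits, base_lines)
head = coverage_percent(head_hits, head_lines)
delta = head - base

print(f"base={base:.4f}% head={head:.4f}% delta={delta:+.4f}%")
```

Under that assumption the unrounded change is roughly 0.138 percentage points, which matches the reported +0.13% only if the displayed values are truncated rather than rounded. That is an inference from these numbers, not documented Codecov behaviour.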

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6b09cde...6693bd9.

bwanglzu commented 5 years ago

@uduse can you review?