NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

implement BPE preprocessor #717

Closed ZizhenWang closed 5 years ago

ZizhenWang commented 5 years ago

BPE is used in many NLP tasks, such as machine translation, text generation, and pre-training. We will implement a BPE preprocessor to support these requirements.

bwanglzu commented 5 years ago

But why do you think it's related to text matching?

ZizhenWang commented 5 years ago

It is the basic tokenizer of BERT.

bwanglzu commented 5 years ago

If it's a tokenizer, then it should be designed as a processor unit.

sth4k commented 5 years ago

I think it's more about solving the OOV (out-of-vocabulary) problem. Currently the vocabulary is built only from the training data, so it may be limited by how you choose your training data.
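
To illustrate the OOV point, here is a minimal sketch of byte-pair encoding in plain Python. It is not MatchZoo code: the names `learn_bpe`, `merge_pair`, and `encode`, the `</w>` end-of-word marker, and the toy corpus are all assumptions made for the example. The idea is that merge rules are learned from the training words, and an unseen word is still segmented into known subword symbols rather than mapped to a single unknown token.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    # Start from characters; "</w>" (a hypothetical marker) closes each word.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy training data (made up for this example).
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=10)

# "lowest" never appears in the corpus, yet it is split into learned subwords.
print(encode("lowest", merges))
```

In a real preprocessor the same logic would presumably sit behind a processor unit's transform step, as suggested above, with the merge table fitted once on the training data and reused at inference time.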