implement BPE preprocessor

NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.

Apache License 2.0

3.85k stars 900 forks

implement BPE preprocessor #717

Closed by ZizhenWang 5 years ago

ZizhenWang commented 5 years ago

BPE (byte pair encoding) is used in many NLP tasks, such as machine translation, text generation, and pre-training. We will implement a BPE preprocessor to support these use cases.

bwanglzu commented 5 years ago

But why do you think it's related to text matching?

ZizhenWang commented 5 years ago

It is the basic tokenizer of BERT.

bwanglzu commented 5 years ago

If it's a tokenizer, then it should be designed as a processor unit.

sth4k commented 5 years ago

I think it's more about solving the OOV (out-of-vocabulary) problem. Currently the vocabulary is built only from the training data, and thus may be limited by how you choose your training data.
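
To make the OOV point concrete, here is a minimal sketch of how a BPE tokenizer could be shaped as a stateful processor unit. It is only an illustration under assumptions: `BPEUnit`, `fit`, `transform`, and `num_merges` are hypothetical names chosen for this example, not MatchZoo's actual API, and the merge learning is the plain textbook procedure (repeatedly merge the most frequent adjacent symbol pair), not an optimized implementation.

```python
from collections import Counter
from typing import List, Tuple


class BPEUnit:
    """Hypothetical stateful unit: fit() learns merges, transform() applies them."""

    END = "</w>"  # end-of-word marker appended to every word

    def __init__(self, num_merges: int = 10):
        self.num_merges = num_merges
        self.merges: List[Tuple[str, str]] = []

    @staticmethod
    def _merge(symbols: Tuple[str, ...], pair: Tuple[str, str]) -> Tuple[str, ...]:
        # Replace every adjacent occurrence of `pair` with one merged symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        return tuple(merged)

    def fit(self, tokens: List[str]) -> "BPEUnit":
        # Start from characters: each word is a tuple of symbols with a count.
        vocab = Counter(tuple(word) + (self.END,) for word in tokens)
        self.merges = []
        for _ in range(self.num_merges):
            pairs: Counter = Counter()
            for symbols, freq in vocab.items():
                for pair in zip(symbols, symbols[1:]):
                    pairs[pair] += freq
            if not pairs:
                break  # every word is already a single symbol
            # Most frequent pair; ties are broken by the pair itself so the
            # result is deterministic.
            best = max(pairs, key=lambda p: (pairs[p], p))
            self.merges.append(best)
            new_vocab: Counter = Counter()
            for symbols, freq in vocab.items():
                new_vocab[self._merge(symbols, best)] += freq
            vocab = new_vocab
        return self

    def transform(self, tokens: List[str]) -> List[str]:
        # Split each word into characters, then replay the learned merges in order.
        pieces: List[str] = []
        for word in tokens:
            symbols = tuple(word) + (self.END,)
            for pair in self.merges:
                symbols = self._merge(symbols, pair)
            pieces.extend(symbols)
        return pieces


# Toy training data (made up for this example); "lowest" never appears in it.
corpus = ["low", "low", "lower", "newest", "newest", "widest"]
unit = BPEUnit(num_merges=10).fit(corpus)
pieces = unit.transform(["lowest"])
print(pieces)
```

Because an unseen word is always split into characters first and then merged into learned subwords, a word such as "lowest" is still representable instead of falling back to a single OOV token. The only way to hit a truly unknown symbol in this sketch is a character that never occurred in the training data.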