There's also the conversion script and the pre-trained model for Chinese.
Task for the Chinese pre-trained model: `<S>` and `<T>` need to be reserved tokens in the vocab. We also need to update the fine-tuning script to use the BERTTokenizer API introduced in https://github.com/dmlc/gluon-nlp/pull/464. Does anyone want to take that?
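For reference, a minimal sketch of what using that API could look like (this assumes a gluon-nlp version that ships `nlp.data.BERTTokenizer` and exposes the Chinese vocab under the `wiki_cn_cased` dataset name; adjust names to match the actual release):

```python
# Sketch: tokenizing Chinese text with the BERTTokenizer API from PR #464.
# The dataset name 'wiki_cn_cased' is an assumption; check get_model's docs.
import gluonnlp as nlp

# Load the vocabulary shipped with the Chinese pre-trained model
# (pretrained=False skips downloading the weights but still returns the vocab).
_, vocab = nlp.model.get_model('bert_12_768_12',
                               dataset_name='wiki_cn_cased',
                               pretrained=False)

# BERTTokenizer runs basic tokenization plus WordPiece on top of the vocab.
tokenizer = nlp.data.BERTTokenizer(vocab, lower=False)
tokens = tokenizer(u'今天天气不错')
token_ids = vocab[tokens]  # map subword tokens to ids for the model
print(tokens, token_ids)
```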
@eric-haibin-lin, I can take a look.
What's the status of BertDetokenizer? I can't seem to find it anywhere.
Looks like I missed it in the list. I created a new issue to track that: https://github.com/dmlc/gluon-nlp/issues/1047. The embedding script in https://github.com/dmlc/gluon-nlp/blob/v0.8.x/scripts/bert/embedding.py#L183-L197 has a function that does a rough form of de-tokenization, but it's not available through an API yet. @DushyantaDhyani, would you like to contribute one?
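For anyone picking this up, here is a rough standalone sketch of the idea behind that helper: merge `##`-prefixed WordPiece continuation pieces back into the preceding token. `detokenize` is a hypothetical name for illustration, not an existing gluon-nlp API:

```python
# Sketch of WordPiece de-tokenization: glue '##' continuation pieces
# back onto the previous token to recover whole words.
def detokenize(tokens):
    words = []
    for token in tokens:
        if token.startswith('##') and words:
            words[-1] += token[2:]  # continuation piece: append to last word
        else:
            words.append(token)     # start of a new word
    return words

print(detokenize(['de', '##tok', '##eni', '##zation', 'works']))
# ['detokenization', 'works']
```

A proper API would also need to handle spacing around punctuation and any special tokens like [CLS]/[SEP], which this sketch ignores.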