UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.23k stars 2.47k forks

Finetune the pretrained model on a Chinese dataset #643

Open Haitons opened 3 years ago

Haitons commented 3 years ago

Hello! I want to fine-tune your model distiluse-base-multilingual-cased on a Chinese corpus like LCQMC. Do Chinese sentences need word segmentation first?

nreimers commented 3 years ago

It uses the mBERT (multilingual BERT) code from Huggingface. I am not sure whether mBERT requires segmentation; I never used segmentation in my experiments on Chinese data, and it worked well.

Haitons commented 3 years ago

OK. I got a 512-dimensional sentence vector. Is there any way to reduce the dimension?

nreimers commented 3 years ago

https://sbert.net/examples/training/distillation/README.html