Does Chinese data need additional change in the code?

amzn / trans-encoder

Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations

Apache License 2.0


Does Chinese data need additional change in the code? #3

Closed drxmy closed 2 years ago

drxmy commented 2 years ago

Or will it work just fine with Chinese? Really interesting work!

hardyqr commented 2 years ago

Thanks for your interest! In principle, you shouldn't need to change much. You do need to switch the English BERT models (including the initial bi-encoder and the base model) to Chinese ones. For example, for the bi-encoder you could use simcse-chinese-roberta-wwm-ext (I picked this one more or less at random from a Google search); the corresponding base model is hfl/chinese-roberta-wwm-ext.
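
A minimal sketch of the model swap described in this comment, assuming the checkpoints are loaded through Hugging Face `transformers`. `ModelPair`, `CHINESE_PAIR`, and `load_encoder` are hypothetical names for illustration and are not part of the trans-encoder code; the bi-encoder identifier is copied as written in the comment, so it may need a hub namespace prefix before it resolves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPair:
    """Hypothetical container pairing an initial bi-encoder with its base model."""
    bi_encoder: str
    base_model: str


# Names taken from the comment above. The bi-encoder id is written exactly
# as it appears there (no hub namespace), so treat it as a placeholder.
CHINESE_PAIR = ModelPair(
    bi_encoder="simcse-chinese-roberta-wwm-ext",
    base_model="hfl/chinese-roberta-wwm-ext",
)


def load_encoder(name):
    """Load a tokenizer and encoder by name.

    Assumes the `transformers` package is installed and the checkpoint is
    reachable; the import is local so the rest of the sketch runs without it.
    """
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    return tokenizer, model
```

With this in place, `load_encoder(CHINESE_PAIR.base_model)` would fetch the Chinese base model. How the two names are passed to the training scripts depends on the repository's own arguments, which this sketch does not cover.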

drxmy commented 2 years ago

Thank you for replying so quickly. Yes, the models definitely need to change. I was mainly concerned about the data processing. I will try it tomorrow.

zpp13 commented 2 years ago

@drxmy Have you tried it yet? How well did it work?

drxmy commented 2 years ago

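@drxmy Have you tried it yet? How well did it work?

I used my own dataset, and the results so far are a bit strange; I may not have changed the code correctly somewhere. The bi-encoder validation metric is suspiciously high, and later it turns into NaN. I haven't had time recently to look into what the problem is.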
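
One way to start narrowing down a NaN like the one reported here is to check whether the similarity scores or pseudo-labels already contain non-finite values before they are used for training. The helper below is a hypothetical sketch using only the standard library; `summarize_scores` is not part of the trans-encoder code, and where to call it in the pipeline is an assumption.

```python
import math


def summarize_scores(scores):
    """Count non-finite values and report the range of the finite ones."""
    finite = [s for s in scores if math.isfinite(s)]
    return {
        "n_total": len(scores),
        "n_nonfinite": len(scores) - len(finite),  # NaN and +/-inf entries
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
    }


# Example: two of the four values are not finite.
report = summarize_scores([0.1, float("nan"), 0.9, float("inf")])
print(report)
```

Running a check like this on the labels and predictions at each stage shows whether bad values enter with the data or only appear after training has begun.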