UKPLab / gpl

Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577
Apache License 2.0

Can gpl be used on Chinese models? #2

Open wduo opened 2 years ago

wduo commented 2 years ago

Great job! Can I use GPL with Chinese models? Which query generator, base model, and retrieval model should I use? Looking forward to your reply. Thanks. @jcklie @reckart @dpetrak @nreimers @mbugert

nreimers commented 2 years ago

At the moment we have the doc2query model only for English. Also the Cross-Encoder is only available for English.

But they could be trained on this new dataset: https://arxiv.org/abs/2203.10232

@kwang2049 What do you think, should we train doc2query & cross-encoder for Chinese?

liushenglei commented 2 years ago

I would really appreciate it if you could train doc2query & cross-encoder models for Chinese. Thanks a lot!

kwang2049 commented 2 years ago

Hi @liushenglei, thanks for your interest! Sorry for the late reply. I have just come back from my holiday :).

@nreimers Yes! I am also very interested in that and would be very happy if there were some models for my mother tongue :). I think the big question is the training data. Do you have any suggestions? Personally, I only know of Baidu's DuReader_retrieval. It has >80K query-passage pairs obtained from the Baidu search engine.

maxdata commented 2 years ago

@kwang2049 @liushenglei @wduo Can you please keep me in the loop? I am also interested in the CN model. My email is max@workhere.org. Can we connect? Thank you.

nreimers commented 2 years ago

@kwang2049 I think the DuReader dataset is good. But as I don't know Chinese, I'm not able to use that dataset, since the explanations etc. are mostly in Chinese.

kwang2049 commented 2 years ago

> @kwang2049 @liushenglei @wduo Can you please keep me in the loop? I am also interested in the CN model. My email is max@workhere.org. Can we connect? Thank you.

Yeah, sure. I think we can just write and post things here for now. I will also post updates here if I find anything new on this topic :)