UKPLab / gpl

Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577
Apache License 2.0

TSDAE + GPL with french data #37

Closed: houssine2000 closed this issue 10 months ago

houssine2000 commented 1 year ago

Hello, thanks for the amazing work. I am trying to do domain adaptation using TSDAE + GPL with an unlabeled French dataset. For TSDAE there are a few good base models, such as CamemBERT. Once I have pretrained with TSDAE, I intend to use GPL like so:

import gpl

gpl.train(
    path_to_generated_data="generated",
    base_ckpt="MY TSDAE MODEL",  
    gpl_score_function="dot",
    batch_size_gpl=32,
    gpl_steps=-1,
    new_size=-1,
    queries_per_passage=1,
    output_dir="output",
    generator="doc2query/msmarco-french-mt5-base-v1",
    retrievers=["antoinelouis/biencoder-msmarco-distilbert-cos-v5-mmarcoFR", "antoinelouis/biencoder-msmarco-MiniLM-L12-cos-v5-mmarcoFR"],
    retriever_score_functions=["cos_sim", "cos_sim"],
    cross_encoder="cross-encoder/mmarco-mMiniLMv2-L12-H384-v1",
    qgen_prefix="qgen",
    do_evaluation=False,
    # use_amp=True   # One can use this flag for enabling the efficient float16 precision
)

The generator, retrievers and cross-encoder are all French models. The code seems to work, but I'm not sure whether I chose the right models, since there is no information about using GPL for other languages. Does this configuration seem okay to you?

Also, can you please confirm my understanding of (1) TSDAE on ${target} -> (2) MarginMSE on MSMARCO -> (3) GPL on ${target}? The base model (CamemBERT in my case) will be pretrained via TSDAE (step 1), and when I plug it into GPL, step (2) will be done automatically (training on the MSMARCO dataset, which is apparently provided in the GPL package); then the actual GPL will be done on my unlabeled corpus (the target, which is the same one used in step 1).

And if this is true, how do I train on a French version of MS MARCO? Actually, this whole "MarginMSE on MSMARCO" step confuses me: why do we do it if the retrievers are already trained on such datasets?
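For reference, here is a minimal sketch of what a MarginMSE objective computes, as I understand it from the GPL paper: the student bi-encoder's score margin between a positive and a negative passage is regressed onto the margin given by the cross-encoder teacher. This is only an illustration under assumptions, not the gpl package's implementation: the function names `dot` and `margin_mse_loss` are hypothetical, embeddings are plain Python lists instead of tensors, and the teacher margins are made-up numbers.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin_mse_loss(
    query_embs: Sequence[Sequence[float]],
    pos_embs: Sequence[Sequence[float]],
    neg_embs: Sequence[Sequence[float]],
    teacher_margins: Sequence[float],
) -> float:
    """Mean squared error between student and teacher score margins.

    Hypothetical helper for illustration only. For each (query, positive,
    negative) triple, the student margin is score(q, pos) - score(q, neg);
    the teacher margin would come from a cross-encoder.
    """
    n = len(teacher_margins)
    if n == 0 or not (len(query_embs) == len(pos_embs) == len(neg_embs) == n):
        raise ValueError("all inputs must be non-empty and of equal length")

    squared_errors = []
    for q, p, neg, teacher_margin in zip(query_embs, pos_embs, neg_embs, teacher_margins):
        student_margin = dot(q, p) - dot(q, neg)
        squared_errors.append((student_margin - teacher_margin) ** 2)
    return sum(squared_errors) / n


# Toy batch of two triples with made-up 2-d embeddings and teacher margins.
loss = margin_mse_loss(
    query_embs=[[1.0, 0.0], [0.0, 1.0]],
    pos_embs=[[2.0, 0.0], [0.0, 1.0]],
    neg_embs=[[0.5, 0.0], [0.0, 3.0]],
    teacher_margins=[1.5, 1.0],
)
print(loss)  # 4.5: the first triple matches the teacher, the second is off by 3
```

If this reading is right, the model being trained is the student (the `base_ckpt`), and the teacher signal comes from the cross-encoder rather than from the retrievers.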

Thanks.