ArvinZhuang / DSI-QG

The official repository for "Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation", Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon and Daxin Jiang.
MIT License

Train DSI-QG model #3

Open hibiki12y opened 1 year ago

hibiki12y commented 1 year ago

In step 1 of the README, the script is described as training query generation.

But in run.py, "DocTqueryTrainer" uses "IndexingTrainDataset", which builds a document/query-to-docid dataset. So the trained model would just generate docids.

Is this correct? In step 2, the "castorini/doc2query-t5-large-msmarco" model generates questions fine, but my own model (trained with the step 1 script) just generates docids.

ArvinZhuang commented 1 year ago

Hi, thanks for the question.

For the docTquery training task (step 1), I'm basically reusing the IndexingTrainDataset class. If you check xorqa_docTquery_train_data.json, it has a similar format to xorqa_DSI_train_data.json. I just treat the questions as 'docids', so that the trained model generates questions for the given document (for the DSI training task, it generates docids for the given document). Does that make sense to you?

Sorry for the unclear class naming here.
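
The reuse described in this reply can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the class name `TextToTargetDataset`, the field names `text` and `text_id`, and the sample records are all assumptions made up for the example. It only shows that a single dataset class mapping a document to a target string will yield docids or questions depending on what the data file puts in the target field.

```python
import json
from typing import List, Tuple


class TextToTargetDataset:
    """Hypothetical stand-in for a dataset class that maps text to a target string.

    The field names "text" and "text_id" are assumptions for this sketch.
    """

    def __init__(self, json_lines: List[str]) -> None:
        # Each line is one JSON record, as in a JSON-lines data file.
        self.records = [json.loads(line) for line in json_lines]

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int) -> Tuple[str, str]:
        record = self.records[idx]
        # The "target" is whatever sits in the id field: a docid or a question.
        return record["text"], str(record["text_id"])


document = "Paris is the capital of France."

# Made-up DSI-style record: the target field holds a docid.
dsi_lines = [json.dumps({"text_id": "42", "text": document})]

# Made-up docTquery-style record: the same field holds a question instead.
qg_lines = [
    json.dumps({"text_id": "What is the capital of France?", "text": document})
]

dsi_data = TextToTargetDataset(dsi_lines)
qg_data = TextToTargetDataset(qg_lines)

print(dsi_data[0])  # (document, docid) pair
print(qg_data[0])   # (document, question) pair
```

Under these assumptions, the class itself never needs to know which task it serves. A seq2seq model trained on the second file learns to emit questions, even though the class name suggests docids.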