ArvinZhuang / DSI-QG

The official repository for "Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation", Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon and Daxin Jiang.
MIT License

NQ320k preprocessing? #16

Open kisozinov opened 2 months ago

kisozinov commented 2 months ago

Hi, I tried to reproduce the results of your experiments on the NQ320k dataset, as per the table in your paper.

To do this, I referred to your script from the old repository, but I ran into a problem: simply changing NUM_TRAIN=307000 and NUM_EVAL=7000 makes the script terminate in the middle (it stops at ~107000 documents), probably due to repeated titles.

```python
title_set = set()
for ind in rand_inds:
    # The title is used as the doc identifier, to prevent two docs from having the same text
    title = data[ind]['document']['title']
    if title not in title_set:
        title_set.add(title)
```
Hence my question: what script or settings (train/val split) did you use to process NQ320k?

ArvinZhuang commented 2 months ago

Hi @kisozinov, sorry for the late reply.

I think the reason is that the NQ dataset on the Hugging Face dataset hub now only has 10.6k training examples.


But originally it should have 307k. I am not sure what happened...

kisozinov commented 2 months ago

@ArvinZhuang This is not a problem. I successfully downloaded the dataset again today, and it has 307k/7k samples (maybe it was a temporary bug?). In the case of the full dataset, did you use the standard train/test split (307k/7k)? As far as I understand, the unique-doc-title restriction from your script is not suitable here :)

ArvinZhuang commented 2 months ago

@kisozinov Yes, I was using the standard train/test split. If nothing about the dataset is wrong, it is what it is. The reason I have the doc-title filtering is that it is not acceptable to assign two different document IDs to the same document (identified by its title); the filtering makes sure this cannot happen. So we end up with fewer documents because the train/test sets contain fewer unique documents.

Maybe the proper way is to sample some other documents from Wikipedia and add them to the corpus, to make up a corpus of 320k docs.
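That top-up idea could be sketched roughly as follows. This is only an illustration, not code from the repository: the `pad_corpus` name and the assumption that each document is a dict with a `'title'` key are mine, and `extra_pool` stands in for documents drawn from a separate Wikipedia dump:

```python
def pad_corpus(corpus, extra_pool, target_size):
    """Top up a deduplicated corpus with extra documents until it reaches
    target_size, skipping any title already present in the corpus.

    corpus / extra_pool: lists of dicts with a 'title' key (hypothetical
    schema); extra_pool would come from an external Wikipedia sample.
    """
    seen = {doc['title'] for doc in corpus}
    padded = list(corpus)
    for doc in extra_pool:
        if len(padded) >= target_size:
            break
        if doc['title'] not in seen:
            seen.add(doc['title'])
            padded.append(doc)
    return padded
```

The title check keeps the padded corpus consistent with the original dedup rule, so the extra Wikipedia documents never reintroduce duplicate identifiers.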

kisozinov commented 2 months ago

I got it, thanks for the answer :)