Open: kisozinov opened this issue 2 months ago
Hi @kisozinov, sorry for the late reply.
I think the reason is that the NQ dataset on the Hugging Face dataset hub now only has 10.6k training examples, but originally it should have 307k. I am not sure what happened..
@ArvinZhuang This doesn't seem to be a problem anymore. I successfully downloaded the dataset again today, and it has 307k/7k samples (maybe it was a temporary bug?). In the case of the full dataset, did you use the standard 307k/7k train/test split? As far as I understand, the unique-doc-title restriction from your script is then not suitable :)
@kisozinov Yes, I was using the standard train/test split. If nothing about the dataset is wrong, it is what it is. The reason I have the doc-title filtering is that it is not suitable to have two different document IDs for the same document (identified by its title); the filtering makes sure this won't happen. So we end up with fewer documents because the train/test splits had fewer unique documents.
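The title filtering described above can be sketched roughly like this (a minimal illustration with made-up records; `build_corpus` and the example data are hypothetical, not taken from the DSI-QG code):

```python
# Sketch of title-based filtering: keep only the first occurrence of each
# document title, so every title maps to exactly one document ID.

def build_corpus(examples):
    """Assign one integer doc ID per unique document title."""
    title_to_id = {}
    corpus = []
    for ex in examples:
        title = ex["title"]
        if title not in title_to_id:  # repeated title -> skip the example
            title_to_id[title] = len(title_to_id)
            corpus.append({"docid": title_to_id[title],
                           "title": title,
                           "text": ex["text"]})
    return corpus, title_to_id

# Toy data: two examples share the same title, so only the first is kept.
examples = [
    {"title": "Python", "text": "a language"},
    {"title": "Python", "text": "a snake"},   # duplicate title, dropped
    {"title": "Rust", "text": "a language"},
]
corpus, mapping = build_corpus(examples)
print(len(corpus))  # 2
```

With real NQ data the same rule means the corpus size is bounded by the number of unique titles, not by the number of examples.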
Maybe the proper way is to sample some other documents from Wikipedia and add them to the corpus to make up a corpus with 320k docs.
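The padding idea could be sketched as follows (a hedged illustration; `pad_corpus` and the stand-in document pools are invented here, and 320 stands in for the real 320k target):

```python
import random

def pad_corpus(corpus, extra_pool, target_size, seed=0):
    """Sample filler documents from a larger pool (e.g. extra Wikipedia
    pages) until the corpus reaches target_size. Titles already present
    in the corpus are skipped to keep doc IDs unique per title."""
    have = {doc["title"] for doc in corpus}
    candidates = [d for d in extra_pool if d["title"] not in have]
    needed = target_size - len(corpus)
    rng = random.Random(seed)  # fixed seed for a reproducible corpus
    return corpus + rng.sample(candidates, needed)

# Toy stand-ins: 100 kept documents, a pool of 1000 extra pages.
corpus = [{"title": f"doc{i}"} for i in range(100)]
pool = [{"title": f"wiki{i}"} for i in range(1000)]
padded = pad_corpus(corpus, pool, 320)
print(len(padded))  # 320
```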
I got it, thanks for the answer :)
Hi, I tried to reproduce the results of your experiments on the NQ320k dataset as per the table from your paper: ![image](https://github.com/ArvinZhuang/DSI-QG/assets/71232712/15142d38-aaef-483f-8cc4-85b304bc606e)
To do this, I referred to your script from the old repository, but I ran into the problem that simply changing `NUM_TRAIN=307000` and `NUM_EVAL=7000` makes the script terminate in the middle, probably due to repeated titles (it stops at ~107000). Hence my question: what script or settings (train/val split) do you use to process NQ320k?
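A toy sketch of the failure mode I am seeing (purely illustrative, not the actual script): if the preprocessing keeps only one example per unique title, asking for more training documents than there are unique titles means the loop runs out of fresh titles early.

```python
# Collector that keeps the first example per title; made-up data below
# uses only 5 unique titles, so requesting 10 documents falls short.

def collect_unique(examples, num_train):
    seen, picked = set(), []
    for ex in examples:
        if ex["title"] in seen:
            continue  # repeated title, skip
        seen.add(ex["title"])
        picked.append(ex)
        if len(picked) == num_train:
            break
    return picked

examples = [{"title": f"t{i % 5}"} for i in range(100)]  # 5 unique titles
picked = collect_unique(examples, 10)
print(len(picked))  # 5, short of the requested 10
```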