Bala93 closed this issue 4 months ago.
@czwlines I have the same problem.
Sorry, I forgot to set access rights. It can be downloaded now!
Thanks for the update. Could you also provide some information about the top-10 document selection? Which dataset was used as the reference, and which embedding model was used?
We use Contriever to retrieve relevant documents from Wikipedia passages for both NQ and TQA datasets. For HQA, the raw data already includes the relevant documents.
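For reference, here is a minimal sketch of how Contriever-based top-10 retrieval could look (the mean pooling follows the facebook/contriever model card; the example question, passages, and cutoff are illustrative, not the exact pipeline from this repo):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load Contriever, the dense retriever used to score question-passage pairs.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def mean_pooling(token_embeddings, mask):
    # Average token embeddings, ignoring padding positions.
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    return token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]

def embed(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])

question = "who received the first nobel prize in physics"
passages = [
    "Wilhelm Conrad Roentgen received the first Nobel Prize in Physics in 1901.",
    "The Nobel Prize in Physics is awarded annually by the Royal Swedish Academy of Sciences.",
]

q_emb = embed([question])               # (1, hidden_size)
p_emb = embed(passages)                 # (num_passages, hidden_size)
scores = (q_emb @ p_emb.T).squeeze(0)   # dot-product relevance scores
top_k = scores.topk(k=min(10, len(passages))).indices.tolist()
print([passages[i] for i in top_k])
```

In the actual setup the passage pool is the full set of Wikipedia passages rather than a handful of strings, so the scoring is typically done with a precomputed passage index.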
Thanks for your quick response.
For clarification, is the order of the contexts also ranked for all the datasets? It looks that way when I check the code: https://github.com/DeepLearnXMU/QGC/blob/main/src/dataset.py#L220. Thanks.
The contexts are already sorted in the files whose names contain `sorted`. In the other files, the contexts remain unordered.
When I was analyzing the dataset, I found that some of the contexts are much larger than typical. For instance, the average token length of the contexts is ~180, while some contexts are >1500 tokens. Why is this the case? And can we simply handle them by truncating in the tokenizer?
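In case it is useful, here is a minimal sketch of what I mean by truncating in the tokenizer (the tokenizer name and the 512-token cap are placeholders, not values taken from this repo):

```python
from transformers import AutoTokenizer

# Placeholder tokenizer; swap in whichever model's tokenizer the pipeline actually uses.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def truncate_context(context: str, max_tokens: int = 512) -> str:
    # Encode without special tokens, cut at max_tokens, and decode back to text.
    ids = tokenizer.encode(
        context, add_special_tokens=False, truncation=True, max_length=max_tokens
    )
    return tokenizer.decode(ids, skip_special_tokens=True)

# Example: a context far longer than the ~180-token average gets capped at 512 tokens.
very_long_context = " ".join(["passage"] * 2000)
short = truncate_context(very_long_context)
print(len(tokenizer.encode(short, add_special_tokens=False)))  # <= 512
```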
Can you make the contents in the drive link public?