DeepLearnXMU / QGC

Code for "Retaining Key Information under High Compression Rates: Query-Guided Compressor for LLMs" (ACL 2024)

Regarding dataset access. #1

Closed Bala93 closed 4 months ago

Bala93 commented 4 months ago

Can you make the contents in the drive link public?

dingjingzhen commented 4 months ago

@czwlines Same problem here.

czwlines commented 4 months ago

Sorry, I forgot to set access rights. It can be downloaded now!

Bala93 commented 4 months ago

Thanks for the update. Could you also provide some information about how the top 10 documents were selected? What dataset was used as a reference, and which embedding was used?

czwlines commented 4 months ago

We use Contriever to retrieve relevant documents from Wikipedia passages for both NQ and TQA datasets. For HQA, the raw data already includes the relevant documents.
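A minimal sketch of the top-10 selection step described above. It assumes query and passage embeddings have already been computed (e.g., with mean-pooled `facebook/contriever` encodings); the ranking itself is just a dot-product similarity followed by a descending sort. The random embeddings below are placeholders, not real data.

```python
import numpy as np

def top_k_passages(query_emb, passage_embs, k=10):
    """Rank passages by inner-product similarity and return the top-k.

    Contriever-style retrievers score a query against each passage with
    the dot product of their embeddings; the k highest-scoring passages
    form the retrieved context set.
    """
    scores = passage_embs @ query_emb      # (num_passages,)
    order = np.argsort(-scores)            # indices, descending by score
    top = order[:k]
    return top.tolist(), scores[top].tolist()

# Toy example: random placeholder embeddings standing in for encoded
# Wikipedia passages and an encoded question.
rng = np.random.default_rng(0)
passages = rng.normal(size=(100, 768))
query = rng.normal(size=768)
idx, scores = top_k_passages(query, passages, k=10)
```

In the actual pipeline the embeddings would come from the Contriever encoder, but the ranking logic is the same.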

Bala93 commented 4 months ago

Thanks for your quick response

Bala93 commented 4 months ago

For clarification, is the order of the contexts also ranked for all the datasets? It looks that way when I check the code: https://github.com/DeepLearnXMU/QGC/blob/main/src/dataset.py#L220. Thanks.

czwlines commented 4 months ago

> For clarification, is the order of the contexts also ranked for all the datasets? It looks that way when I check the code: https://github.com/DeepLearnXMU/QGC/blob/main/src/dataset.py#L220. Thanks.

The contexts are already sorted in the files whose names contain `sorted`. In the other files, the contexts remain unordered.
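The distinction above (a `*sorted*` file vs. an unordered one) can be sketched as a simple per-example re-ordering by retrieval score. This is a hypothetical illustration: the field names `ctxs` and `score` are assumptions about the data layout, not confirmed by the thread.

```python
def sort_contexts(example):
    """Order one example's contexts from highest to lowest retrieval score.

    Hypothetical sketch: assumes each context dict carries a `score`
    field, as a `*sorted*` data file might be derived from an unordered one.
    """
    example = dict(example)  # avoid mutating the caller's dict
    example["ctxs"] = sorted(example["ctxs"],
                             key=lambda c: c["score"],
                             reverse=True)
    return example

example = {
    "question": "who wrote hamlet?",
    "ctxs": [
        {"text": "a", "score": 0.2},
        {"text": "b", "score": 0.9},
        {"text": "c", "score": 0.5},
    ],
}
sorted_example = sort_contexts(example)
# contexts are now ordered b, c, a
```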

Bala93 commented 2 months ago

When I was analyzing the dataset, I found that some of the contexts are very large compared to the usual ones. For instance, the average token length of the contexts is ~180, while some contexts exceed 1500 tokens. Why is this the case? And can we simply cut them down via truncation in the tokenizer?
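The truncation option mentioned in the question could look like the sketch below. It uses whitespace tokens purely for illustration; with a Hugging Face tokenizer, the equivalent is `tokenizer(text, truncation=True, max_length=max_tokens)`. The cap of 512 is an assumed value, not one from the repo.

```python
def truncate_context(text, max_tokens=512):
    """Cap a context at max_tokens whitespace tokens.

    Illustrative sketch only: a real pipeline would truncate at the
    subword level with the model's own tokenizer so the token budget
    matches what the compressor/LLM actually sees.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text  # short contexts pass through unchanged
    return " ".join(tokens[:max_tokens])

long_ctx = "word " * 2000
short_ctx = "just a few words"
truncated = truncate_context(long_ctx, max_tokens=512)
kept = truncate_context(short_ctx, max_tokens=512)
```

Whether truncation is the right fix depends on where the relevant passage sits inside the long context; cutting the tail may discard the answer span.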