fani-lab / RePair

Extensible and Configurable Toolkit for Query Refinement Gold Standard Generation Using Transformers

Implementation of Dense Retrievals #49

Open DelaramRajaei opened 11 months ago

DelaramRajaei commented 11 months ago

In this issue, I will keep a record of all my findings as I work on refining all aspects of the retrieval system on different datasets using dense retrieval.

DelaramRajaei commented 11 months ago

Hey @hosseinfani, as mentioned here, I've downloaded the dbpedia and antique datasets. Could you please share the robust04 files with me so that I can start the dense indexing? There appears to be a problem extracting the stored tar files from Teams when using Windows.

Looking ahead, our next steps involve obtaining the clueweb12, clueweb09, and gov2 datasets. As with robust04, gov2 requires signing a contract, after which they will send us a copy of the drive, as explained here. In the meantime, I can begin indexing the antique and dbpedia datasets.

hosseinfani commented 11 months ago

Hi @DelaramRajaei, I'm uploading the extracted files to our RePair > Datasets .. > Corpora > Robust04. Can you upload the rest there as well? I submitted the request for gov2.

DelaramRajaei commented 11 months ago

@hosseinfani Yes, I will upload the raw datasets in teams.

DelaramRajaei commented 10 months ago

Hi @hosseinfani,

I wanted to give you an update on the indexing process. I downloaded the antique and dbpedia corpora and converted them to the JSONL format required by Pyserini, as described in its documentation. I uploaded the JSONL files to Teams > RePair channel > Files > Datasets & indexes > Corpora. Currently, I'm facing an issue when using Pyserini for indexing. There was a dependency conflict with pygaggle; I removed pygaggle and substituted other libraries, but I'm still running into some issues with the library.
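For reference, the conversion step can be sketched roughly as below. Pyserini's `JsonCollection` expects one JSON object per line with `id` and `contents` fields; the sample documents and output file name here are hypothetical, and the real corpus readers would replace the inline list.

```python
import json

def to_pyserini_jsonl(docs, out_path):
    """Write (doc_id, text) pairs in Pyserini's JsonCollection format:
    one JSON object per line with 'id' and 'contents' fields."""
    with open(out_path, "w", encoding="utf-8") as f:
        for doc_id, text in docs:
            f.write(json.dumps({"id": doc_id, "contents": text}) + "\n")

# Hypothetical sample documents standing in for the real corpus.
docs = [
    ("antique_001", "A sample passage from the antique corpus."),
    ("dbpedia_001", "A sample abstract from the dbpedia corpus."),
]
to_pyserini_jsonl(docs, "corpus.jsonl")
```

The same per-line schema works for both sparse and dense indexing pipelines in Pyserini, so one conversion pass per corpus should suffice.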

Hi @yogeswarl,

I noticed that you created the dense indexes for the aol dataset. I followed the steps you described in the README and in Pyserini's documentation, but I'm facing some problems. One issue relates to torch and CUDA: I installed torch with CUDA support, but it still doesn't recognize CUDA. Have you ever encountered this problem? I also have another question: given the size of the datasets and the possibility of running out of memory, did you build the indexes on your local system?
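A quick diagnostic for the CUDA issue, as a sketch: a common cause is that a CPU-only torch wheel (version string ending in `+cpu`) was installed instead of a CUDA build, which makes `torch.cuda.is_available()` return False even on a GPU machine. The helper name below is hypothetical.

```python
import torch

def cuda_diagnostics():
    """Report whether the installed torch build can actually use CUDA.

    A '+cpu' suffix in torch.__version__ indicates a CPU-only wheel,
    a frequent reason CUDA is not recognized despite a GPU being present.
    """
    info = {
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info

print(cuda_diagnostics())
```

If the version string shows `+cpu`, reinstalling torch from PyTorch's CUDA wheel index (matching the installed driver's CUDA version) usually resolves it.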