Open DelaramRajaei opened 11 months ago
Hey @hosseinfani,
As mentioned here, I've downloaded the dbpedia and antique datasets. Could you please share the robust04 files with me so that I can start the dense indexing? There appears to be a problem extracting the tar files stored in Teams when using Windows.
Looking ahead, our next steps involve obtaining the clueweb12, clueweb09, and gov2 datasets. Similar to robust04, for gov2, we'll need to sign a contract, and they will send us a copy of the drive, as explained here.
I can begin by indexing the antique and dbpedia datasets.
Hi @DelaramRajaei, I'm uploading the extracted files to our RePair > Datasets .. > Corpora >> Robust04. Can you upload the rest there as well? I submitted the request for gov2.
@hosseinfani Yes, I will upload the raw datasets to Teams.
Hi @hosseinfani,
I wanted to provide you with an update on the indexing process. I downloaded the antique and dbpedia corpora and converted them to the jsonl format required by the documentation. I uploaded the jsonl files to Teams > RePair channel > files > Datasets & indexes > Corpora. Currently, I'm facing an issue when using pyserini for indexing. There seemed to be a conflict with pygaggle, so I removed pygaggle and used other libraries instead. However, I'm still encountering some issues with pyserini.
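For reference, the conversion step can be sketched roughly as below. This is a minimal sketch, not the exact script used here: it assumes the raw corpus is a tab-separated file with one `doc_id<TAB>text` pair per line, and the function name `tsv_to_jsonl` and the file paths in the usage comment are hypothetical. The `id` and `contents` field names are the ones pyserini's JSON collection format expects.

```python
import json


def tsv_to_jsonl(tsv_path, jsonl_path):
    """Convert a `doc_id<TAB>text` file into one JSON object per line.

    Assumption: the input is tab-separated with the document id first.
    Returns the number of documents written.
    """
    count = 0
    with open(tsv_path, encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so tabs inside the text survive.
            doc_id, _, text = line.partition("\t")
            record = {"id": doc_id, "contents": text}
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical usage (paths are placeholders):
# tsv_to_jsonl("antique-collection.txt", "antique/corpus.jsonl")
```

Writing one object per line keeps memory use flat even for large corpora, since no document list is held in memory.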
Hi @yogeswarl,
I noticed that you created the dense indexes for the aol dataset. I followed the steps you explained in the Readme and pyserini's documentation. However, I'm facing some problems. One issue is related to torch using CUDA: I installed torch with CUDA, but it still does not recognize CUDA. Have you ever encountered this problem? I also have another question. Considering the size of the datasets and the possibility of running out of memory, did you create the indexes on your local system or somewhere else?
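To narrow down the CUDA problem, a small diagnostic like the one below may help. It is only a sketch: the helper name `cuda_report` is made up for illustration, and it just collects what torch itself reports. If `built_with_cuda` comes back as `None`, the installed wheel is a CPU-only build, which is a common reason for CUDA not being recognized even when the driver is fine.

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the installed torch build and CUDA visibility."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # CUDA version torch was compiled against; None means a CPU-only build.
    report["built_with_cuda"] = torch.version.cuda
    # True only if a usable driver and at least one GPU are visible.
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

Posting this output together with the error message should make it easier to tell an installation problem apart from a driver problem.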
Here is the issue. I will keep a record of all my findings as I work on refining all aspects of the retrieval system on different datasets using dense retrieval.