fani-lab / ReQue

A Benchmark Workflow and Dataset Collection for Query Refinement
https://hosseinfani.github.io/ReQue/

Quick start and setup #39

Open DelaramRajaei opened 9 months ago

DelaramRajaei commented 9 months ago

@michelecatani

This is an issue page to log your progress.

Please read the IR and backtranslation document by Monday. Its metrics section is not yet complete. You can find some helpful information at this link.

Please let us know if you have any concerns or questions.

michelecatani commented 9 months ago

Thank you Delaram. I've read through the docs you've linked and have received your Teams message regarding next steps.

michelecatani commented 9 months ago

I've cloned the RePair repo and am testing the commands listed there on some preexisting datasets. The plan is to then download the robust dataset, put it in the repo, and change the path so that the commands index that one.

michelecatani commented 9 months ago

I am having quite a few issues installing Pyserini on my M1 Mac. I remember running into similar issues on the last project I did with the Fani Lab. Notably, I seem to be unable to install nmslib, which is a dependency required for Pyserini. I will continue to debug this issue, but I was wondering if you knew anybody at the lab with an M1 Mac who has had similar issues and resolved them.
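When asking for help with an install problem like this, it can be useful to attach a short environment report. The following is a minimal sketch using only the Python standard library. The function name `describe_environment` is hypothetical, and the check only tells you whether a package can be found for import, not whether it works correctly.

```python
import importlib.util
import platform


def describe_environment(packages=("nmslib", "pyserini")):
    """Return platform details and whether each named package can be found."""
    report = {
        "system": platform.system(),    # e.g. "Darwin" on macOS
        "machine": platform.machine(),  # e.g. "arm64" on Apple Silicon
        "python": platform.python_version(),
    }
    for name in packages:
        # find_spec returns None when the top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Pasting the printed report into the issue makes it easier for others to tell whether they have the same architecture and Python version.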

michelecatani commented 9 months ago

I've gotten it installed. That was annoying and I regret buying an M1 Mac lol... but on to the next

DelaramRajaei commented 9 months ago

Perfect. I remember having a similar issue, which I described at this link. If you run into other problems, that issue may help.

michelecatani commented 9 months ago

Thanks! I've abandoned attempts at running it locally but have been able to push forward using Colab. However, when I try to download the robust dataset, I am faced with a licensing dilemma. Do you have any idea about this?

[Screenshot: 2023-09-19 at 9:39 PM]

I know part of the error is omitted due to the scrolling, so I've pasted it here. If you want, I can try a different dataset or maybe you can spot something I did incorrectly.

[INFO] Please confirm you agree to the TREC data usage agreement found at https://trec.nist.gov/data/cd45/index.html
[INFO] The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here: https://trec.nist.gov/data/cd45/index.html.
More details about the procedure can be found here: https://ir-datasets.com/trec-robust04.html#DataAccess.
Once completed, place the uncompressed source here: /root/.ir_datasets/disks45/corpus
This should contain directories like NEWS_data/FBIS, NEWS_data/FR94, etc.
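Once the licensed files are in hand, the layout described in the message can be checked before retrying the download step. This is a minimal sketch, not part of ir_datasets: the helper name `missing_corpus_dirs` is hypothetical, and it only checks the two subdirectories that the message names explicitly, since the full list behind "etc." is not given there.

```python
from pathlib import Path

# Only the subdirectories named in the message; the real corpus has more.
EXPECTED_SUBDIRS = ("NEWS_data/FBIS", "NEWS_data/FR94")


def missing_corpus_dirs(corpus_root, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that are absent under corpus_root."""
    root = Path(corpus_root)
    return [sub for sub in expected if not (root / sub).is_dir()]


if __name__ == "__main__":
    # Path taken from the message above; it differs outside a root Colab session.
    missing = missing_corpus_dirs("/root/.ir_datasets/disks45/corpus")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("The named corpus directories are present.")
```

An empty result only means the two named directories exist. It does not confirm that the corpus is complete or correctly uncompressed.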

michelecatani commented 9 months ago

It seems that I can obtain a license myself. However, the page also mentions that an organization may already hold one. Do we have an organizational license?

I'm obtaining this info by reading this page: https://ir-datasets.com/trec-robust04.html#DataAccess

michelecatani commented 9 months ago

I have been given copies of the datasets and am working on running the commands now.

michelecatani commented 9 months ago

@hosseinfani Hi Professor, I am running the encoding command in a Google Colab session and it's working, which is great. However, it's moving really slowly, and I believe I still have an indexing command to run afterwards, which will also take a long time. I was told you could provide me with access to Compute Canada... any chance you could help me out with that? Unfortunately, I have been unable to run these commands locally, as I have run into ARM processor dependency hell.

hosseinfani commented 9 months ago

@michelecatani Hi Mike, sure. Please follow the steps here: https://github.com/fani-lab/Library/blob/main/ComputeCanada.md

michelecatani commented 9 months ago

@hosseinfani Thank you!

michelecatani commented 9 months ago

I have provided Delaram with the encoded file and faiss file for robust04. I await her confirmation to see if everything's correct, then I will work on indexing as many of the other datasets as I can over the weekend.