fani-lab / ReQue

A Benchmark Workflow and Dataset Collection for Query Refinement
https://hosseinfani.github.io/ReQue/

Installation and Setup #23

Closed DelaramRajaei closed 1 year ago

DelaramRajaei commented 1 year ago

This is the issue where I document my observations and installation difficulties encountered during this project.

DelaramRajaei commented 1 year ago

@hosseinfani While searching for and learning IR terms, I have been collecting them in a document, which is still a work in progress.

I encountered a question regarding the difference between a corpus and a dataset. I noticed the following difference, but I wasn't sure what it meant:

Corpus mainly appears in NLP area or application domain related to texts/documents, because of its meaning "a collection of written texts, esp. the entire works of a particular author or a body of writing on a particular subject." In contrast, dataset appears in every application domain --- a collection of any kind of data is a dataset.

I also found this table of comparison:

| | Prototypical corpus | Prototypical dataset |
|---|---|---|
| Language | unrestricted production | specific phenomenon |
| Context | wide | restricted |
| Purpose | general | research question |

Links to two websites: https://corpuslinguisticmethods.wordpress.com/2013/12/28/corpora-versus-datasets/ https://copyprogramming.com/howto/difference-in-meaning-of-these-terms-dataset-vs-corpus#difference-in-meaning-of-these-terms-dataset-vs-corpus

Link to my document: https://docs.google.com/document/d/1swPuzXEyx6UvmaQtjEEQIYO6LN38quJInSHY1Y5IXhY/edit?usp=sharing

hosseinfani commented 1 year ago

@DelaramRajaei thank you for the nice report. my only suggestion is to dump the gdoc as a word doc or pdf and attach it such that the link does not depend on your gdrive.

DelaramRajaei commented 1 year ago

Rough draft of the IR document DelaramRajaei_Information_Retrieval.pdf

DelaramRajaei commented 1 year ago

Updated IR document. Created the presentation's slides DelaramRajaei_Information_Retrieval.pdf IR_Presentation.ppt.pptx

DelaramRajaei commented 1 year ago

@hosseinfani

Installing ReQue process:

There are a few questions I have:
As I write this, I am still working on installing anserini and trec_eval. How do I compile trec_eval using the Visual Studio build tools? I installed the tools but couldn't figure out how to use them. The trec_eval README.windows.md explains how to compile it, but it leaves out the details, and the link it provides returns a 404 error.

Also, I wanted to know whether I need "cair" to add and test the expander in this project.

In addition, this closed issue also helped me in this process.

hosseinfani commented 1 year ago

@DelaramRajaei Awesome startup log. thank you.

DelaramRajaei commented 1 year ago

Updated the IR document:

Information Retrieval.pdf

Installing ReQue:

DelaramRajaei commented 1 year ago

@hosseinfani

I have successfully installed and run the ReQue project.

First, I checked whether the Python interpreter was set to use the ReQue environment. This video was helpful. In PyCharm: Settings -> Project: ProjectName -> Python Interpreter. Afterwards, I set the Python interpreter to the ReQue environment, using the path where Python was installed. To find the path, these commands can be used in the terminal:

      $> conda env list 

List all the environments.

      $> conda activate (name of the environment)

      $> which python
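The same check can be made from inside Python. The sketch below is only an illustration: the helper name `interpreter_in_env` is hypothetical, and it assumes conda's usual layout, in which the interpreter path contains a folder named after the environment.

```python
import os
import sys
from pathlib import Path


def interpreter_in_env(executable: str, env_name: str) -> bool:
    """Return True if a folder in the interpreter path is named env_name.

    Assumption: conda keeps each environment in a folder named after it,
    for example .../envs/ReQue/bin/python.
    """
    parts = [part.lower() for part in Path(executable).parts]
    return env_name.lower() in parts


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    # CONDA_DEFAULT_ENV is set by `conda activate`.
    print("active conda env:", os.environ.get("CONDA_DEFAULT_ENV", "(not set)"))
    print("inside ReQue env:", interpreter_in_env(sys.executable, "ReQue"))
```

Running this from the PyCharm run configuration shows which interpreter the IDE actually uses, which may differ from the one in the terminal.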

The next problem was with the NLTK library, whose data had not been downloaded and installed properly.

The following commands fix this:

      >>> import nltk

      >>> nltk.download("stopwords")

      >>> nltk.download("punkt")

Also, I installed the PyGaggle library.
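For the NLTK downloads above, a small script can fetch only the resources that are missing. This is a sketch, not part of ReQue: the names `RESOURCES`, `missing_resources`, and `ensure_nltk_data` are hypothetical, and it covers only the two resources mentioned in this comment.

```python
from typing import Callable, Dict, List

# Resource paths as NLTK's data loader expects them, keyed by download name.
RESOURCES: Dict[str, str] = {
    "stopwords": "corpora/stopwords",
    "punkt": "tokenizers/punkt",
}


def missing_resources(find: Callable[[str], object],
                      resources: Dict[str, str]) -> List[str]:
    """Return the download names whose data `find` cannot locate."""
    missing = []
    for name, path in resources.items():
        try:
            find(path)
        except LookupError:
            # nltk.data.find raises LookupError for data that is not installed.
            missing.append(name)
    return missing


def ensure_nltk_data() -> None:
    """Download only the resources that are not installed yet."""
    import nltk  # imported here so the helper above works without NLTK

    for name in missing_resources(nltk.data.find, RESOURCES):
        nltk.download(name)
```

Calling `ensure_nltk_data()` once before the first run should be enough; later calls download nothing if the data is already in place.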

To make sure everything was fine, I ran the pipeline first with only the ['generate'] operation and then with all the operations ['generate', 'search', 'evaluate', 'build'].

The results are attached.

robust04.zip

Based on the installation video, I analyzed the results; however, all the results in topics.robust04.bm25.map.dataset.csv were zero. I was wondering if something in the process was wrong or if the first query always gives the most accurate results, which seems strange to me!
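For context on why an all-zero map column looks suspicious, here is a minimal sketch of average precision for a single query. It follows the textbook definition and is not the code used by ReQue or trec_eval; the document ids in the example are made up.

```python
from typing import List, Set


def average_precision(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


if __name__ == "__main__":
    print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))
    print(average_precision(["d2"], {"d1"}))
```

Under this definition the score for a query is zero only when none of its relevant documents appear in the ranked list, so zeros for every query suggest that the search or evaluation step did not produce usable output.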

I am currently working on adding the new expander.

hosseinfani commented 1 year ago

@DelaramRajaei perfect. we can discuss the result tmr :)

DelaramRajaei commented 1 year ago

@hosseinfani

The problem that occurred during the running process has been fixed.

There was a problem with the trec_eval address.

I have changed the following line in ./qe/main.py in evaluate():

Before: eval_cmd = '{}eval/trec_eval.9.0.4/trec_eval'.format(anserini)
After: eval_cmd = '{}eval/trec_eval'.format(anserini)
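A wrong trec_eval address can be caught before the pipeline runs. The sketch below tries the two relative paths mentioned in this comment and returns the first one that exists; the function name `find_trec_eval` and the `.exe` fallback for Windows builds are assumptions, not ReQue code.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_trec_eval(anserini_root: str, candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate path under anserini_root that is a file."""
    root = Path(anserini_root)
    for relative in candidates:
        path = root / relative
        # Assumption: a Windows build may carry an .exe suffix.
        for option in (path, path.with_name(path.name + ".exe")):
            if option.is_file():
                return option
    return None
```

For example, `find_trec_eval(anserini, ["eval/trec_eval.9.0.4/trec_eval", "eval/trec_eval"])` returns `None` when neither location holds the binary, which is a clearer signal than a results file full of zeros.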

I ran the project using the following commands with only the LovinsStemmer expander:

$> python -u main.py --corpus robust04 --output ./output/robust04/ --ranker bm25 --metric map 2>&1 | tee robust04.bm25.log &
$> python -u main.py --corpus robust04 --output ./output/robust04/ --ranker qld --metric map 2>&1 | tee robust04.qld.log 
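The two runs above differ only in the ranker, so the argument lists can also be generated. This sketch uses only the flags shown in the commands; the helper name `build_command` is hypothetical, and log redirection is left out.

```python
import shlex
from typing import List


def build_command(corpus: str, ranker: str, metric: str) -> List[str]:
    """Argument list for one run, using only the flags shown above."""
    return [
        "python", "-u", "main.py",
        "--corpus", corpus,
        "--output", f"./output/{corpus}/",
        "--ranker", ranker,
        "--metric", metric,
    ]


if __name__ == "__main__":
    for ranker in ("bm25", "qld"):
        # shlex.join renders the list as a copy-pasteable shell command.
        print(shlex.join(build_command("robust04", ranker, "map")))
```

Each list can be passed to `subprocess.run(command, check=True)` to execute the runs one after the other.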

The new results are attached. robust04.zip

hosseinfani commented 1 year ago

@DelaramRajaei perfect. thanks. Can you add a new setting in ./cmn/param.py for the address of trec_eval such that the pipeline reads the address from there like this:

https://github.com/fani-lab/ReQue/blob/1ae7580782a53a9575c4594fda15aefcce9a672f/qe/cmn/param.py#L39

https://github.com/fani-lab/ReQue/blob/1ae7580782a53a9575c4594fda15aefcce9a672f/qe/main.py#L56
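A rough sketch of that idea follows. The `settings` dict, its key names, and the example paths are stand-ins chosen for illustration and may not match the real ./cmn/param.py; `build_eval_command` is likewise a hypothetical helper. The `-q` and `-m` options are standard trec_eval flags.

```python
from typing import Dict, List

# Hypothetical stand-in for a settings dict in ./cmn/param.py; the real
# key names and paths in ReQue may differ.
settings: Dict[str, Dict[str, str]] = {
    "anserini": {
        "path": "../anserini/",
        "trec_eval": "../anserini/eval/trec_eval",
    }
}


def build_eval_command(metric: str, qrels: str, run_file: str) -> List[str]:
    """Build a trec_eval call whose binary path comes from settings."""
    trec_eval = settings["anserini"]["trec_eval"]
    # -q prints per-query scores, -m selects the measure to report.
    return [trec_eval, "-q", "-m", metric, qrels, run_file]
```

With the path in one place, moving trec_eval only requires editing the settings entry rather than the code in evaluate().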

DelaramRajaei commented 1 year ago

I cloned the latest version of the ReQue project and followed these steps to set up and run the project.

  1. First I create a new environment and activate it.
$ conda create -n ReQue
$ conda activate ReQue

If you are using PyCharm, you have the option to create a new project and select Conda as the new environment with Python 3.8. This will create a new environment named after your file location.

You can also check the Python interpreter in PyCharm under File > Settings > Project: #ProjectName > Python Interpreter.

  2. Instead of using the environment.yml file, I use the requirements.txt file to install the required packages.
$ conda install python=3.8 -n ReQue
$ pip install -r requirements.txt   

If you try to install the required packages with "conda install" instead of pip, you will encounter the following error and will need to install the listed packages manually.

PackagesNotFoundError: The following packages are not available from current channels:

  • pydantic==1.5
  • wn==0.0.23
  • community
  • pywsd
  • prettytable==2.1.0
  • pyserini
  • spacy==2.2.4
  • tagme
  • transformers==4.0.0
  • tokenizers==0.9.4
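After the pip install, one way to confirm that the pinned packages actually landed in the environment is to compare requirements.txt against the installed versions. This is a sketch with hypothetical helper names; it assumes each line is either a bare package name or a `name==version` pin and does not handle other specifiers.

```python
from importlib import metadata
from typing import Dict, List, Optional, Tuple


def parse_requirement(line: str) -> Optional[Tuple[str, Optional[str]]]:
    """Split 'name==version' into (name, version); version is None if unpinned."""
    line = line.split("#", 1)[0].strip()  # drop comments and whitespace
    if not line:
        return None
    name, _, version = line.partition("==")
    return name.strip(), (version.strip() or None)


def check_requirements(lines: List[str]) -> Dict[str, str]:
    """Map each problematic package to a short description of the problem."""
    problems = {}
    for line in lines:
        parsed = parse_requirement(line)
        if parsed is None:
            continue
        name, wanted = parsed
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if wanted is not None and installed != wanted:
            problems[name] = f"installed {installed}, wanted {wanted}"
    return problems
```

A call such as `check_requirements(open("requirements.txt").read().splitlines())` returns an empty dict when everything matches.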

  3. To install Anserini, you can clone the project from the provided link and follow the instructions outlined in the repository.

Install Cygwin and use its make to build trec_eval and ndeval.

Once you have completed these steps, you will need to build the Anserini project (for example, using Apache NetBeans IDE) and then execute the following command in the terminal.

$ mvn clean package appassembler:assemble

  4. To install Pyserini, I followed the instructions for both the development installation and the pip installation in the provided link.

If you do not already have JDK 11 installed, install it via conda:

$ conda install -c conda-forge openjdk=11

Ensure that you install version 11 to avoid encountering an error.

The pip installation ran into errors, so it is better to use the development installation.
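To confirm which Java version is on the path before installing Pyserini, the output of `java -version` can be parsed. This is a sketch with hypothetical function names; it assumes the usual banner format, such as `openjdk version "11.0.13"`.

```python
import re
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and older report themselves as 1.x (for example "1.8.0_292").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Run `java -version`; return the major version, or None if unavailable."""
    try:
        result = subprocess.run(["java", "-version"],
                                capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # The JVM prints its version banner on stderr.
    return parse_java_major(result.stderr + result.stdout)
```

If `installed_java_major()` returns anything other than 11, the conda command above is the place to start.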

  5. If you encounter errors while running the program, reinstalling Pyserini may resolve them: first run pip uninstall pyserini, then run pip install pyserini.

  6. In this step you may get the following error:

DLL load failed while importing _swigfaiss: The specified module could not be found.

It can be fixed with this command:

$ conda install -c conda-forge faiss
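A quick way to see whether the fix worked is to try importing the affected modules and report any failures. This is a minimal sketch; `check_imports` is a hypothetical helper, not part of ReQue.

```python
import importlib
from typing import Dict, Iterable


def check_imports(module_names: Iterable[str]) -> Dict[str, str]:
    """Try to import each module; map its name to 'ok' or the error text."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = "ok"
        except ImportError as error:
            # A failed DLL load on Windows also surfaces as ImportError.
            results[name] = f"failed: {error}"
    return results
```

For this setup, `check_imports(["faiss", "pyserini", "nltk"])` would show at a glance which of the three still fails to load.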

I followed the above steps to successfully install the ReQue project, which can now be used for further tasks.