Closed DelaramRajaei closed 1 year ago
@hosseinfani While searching for and learning IR terms, I prepared a document that collects them. It is still a work in progress.
I encountered a question regarding the difference between a corpus and a dataset. I noticed the following difference, but I wasn't sure what it meant:
Corpus mainly appears in NLP area or application domain related to texts/documents, because of its meaning "a collection of written texts, esp. the entire works of a particular author or a body of writing on a particular subject." In contrast, dataset appears in every application domain --- a collection of any kind of data is a dataset.
I also found this table of comparison:
| | Prototypical corpus | Prototypical dataset |
|---|---|---|
| Language | unrestricted production | specific phenomenon |
| Context | wide | restricted |
| Purpose | general | research question |
Links to two websites: https://corpuslinguisticmethods.wordpress.com/2013/12/28/corpora-versus-datasets/ https://copyprogramming.com/howto/difference-in-meaning-of-these-terms-dataset-vs-corpus#difference-in-meaning-of-these-terms-dataset-vs-corpus
Link to my document: https://docs.google.com/document/d/1swPuzXEyx6UvmaQtjEEQIYO6LN38quJInSHY1Y5IXhY/edit?usp=sharing
@DelaramRajaei thank you for the nice report. my only suggestion is to dump the gdoc as a word doc or pdf and attach it such that the link does not depend on your gdrive.
Rough draft of the IR document DelaramRajaei_Information_Retrieval.pdf
Updated IR document. Created the presentation's slides DelaramRajaei_Information_Retrieval.pdf IR_Presentation.ppt.pptx
@hosseinfani
$> cd ReQue
$> conda env create -f environment.yml
$> conda activate ReQue
There are a few questions I have:
As I write this, I am still working on installing anserini and trec_eval.
How do I compile trec_eval using the Visual Studio build tools? I installed them but couldn't figure out how to use them.
The trec_eval README.windows.md explains how to compile it, but it omits the details, and the link it provides returns a 404 error.
Also, do I need "cair" to add and test the expander in this project?
In addition, this closed issue also helped me in this process.
@DelaramRajaei Awesome startup log. thank you.
make to build trec_eval
cair for now
Update the IR document:
IR
Add IR methods
Vector Space Models IR method
Relevance
Word embedding
Installing ReQue:
@hosseinfani
I have successfully installed and run the ReQue project.
First, I checked whether the Python interpreter was set to use the ReQue environment. This video was helpful. PyCharm -> Settings -> Project: ProjectName -> Python Interpreter. Afterwards, I set the Python interpreter to the ReQue environment, using the path where its python is installed. To find the path, these commands can be used in the terminal:
$> conda env list
List all the environments.
$> conda activate (name of the environment)
$> which python
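The same check can be made from inside Python, which helps when PyCharm and the terminal disagree about the active environment. This is a minimal sketch using only the standard library; the function name is made up for illustration, and it assumes `conda activate` has set the `CONDA_DEFAULT_ENV` variable, as it normally does:

```python
import os
import sys


def describe_interpreter():
    """Return the running interpreter's path and, if set, the active conda env."""
    return {
        "executable": sys.executable,  # should match what `which python` prints
        "prefix": sys.prefix,          # root folder of the active environment
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),  # None outside conda
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `conda_env` is not ReQue when this runs inside PyCharm, the project interpreter is probably still pointing at another environment.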
The next problem was that the NLTK data had not been downloaded and installed properly.
The following commands fix it:
>>> import nltk
>>> nltk.download("stopwords")
>>> nltk.download("punkt")
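To avoid re-downloading on every run, the two downloads can be wrapped in a check. This is only a sketch: the helper name is hypothetical, and the lookup paths `corpora/stopwords` and `tokenizers/punkt` are assumed to be where NLTK stores these two packages. The `nltk` module is passed in as an argument so the helper itself imports nothing.

```python
def ensure_nltk_resources(nltk_module, resources):
    """Download each NLTK resource only if it is not already on disk.

    `resources` maps the lookup path used by nltk.data.find() to the
    package name expected by nltk.download().
    """
    downloaded = []
    for lookup_path, package in resources.items():
        try:
            nltk_module.data.find(lookup_path)  # raises LookupError if missing
        except LookupError:
            nltk_module.download(package)
            downloaded.append(package)
    return downloaded


REQUIRED = {
    "corpora/stopwords": "stopwords",
    "tokenizers/punkt": "punkt",
}

# Usage, once nltk is importable in the ReQue environment:
#   import nltk
#   ensure_nltk_resources(nltk, REQUIRED)
```

The return value lists what was actually fetched, so an empty list means both resources were already in place.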
Also, I installed the PyGaggle library.
To make sure everything was fine, I ran the following commands, first with only the ['generate'] operation and then with all the operations ['generate', 'search', 'evaluate', 'build'].
The results are attached.
Based on the installation video, I analyzed the results; however, all the results in topics.robust04.bm25.map.dataset.csv were zero. I was wondering if something in the process was wrong or if the first query always gives the most accurate results, which seems strange to me!
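A quick way to spot this kind of result is to scan the CSV for numeric columns that are zero in every row. The sketch below is generic: the column names and the two sample rows are invented for illustration and are not taken from the actual topics.robust04.bm25.map.dataset.csv.

```python
import csv
import io


def all_zero_columns(rows):
    """Return the names of numeric columns whose every value is 0."""
    numeric = {}
    for row in rows:
        for name, value in row.items():
            try:
                number = float(value)
            except (TypeError, ValueError):
                continue  # skip text columns such as the query itself
            numeric.setdefault(name, []).append(number)
    return [name for name, values in numeric.items() if all(v == 0 for v in values)]


# Made-up sample with hypothetical column names.
SAMPLE = """qid,query,map
301,international organized crime,0.0
302,poliomyelitis and post-polio,0.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(all_zero_columns(rows))
```

A metric column that is zero for every query is more likely a sign that the evaluation step did not run than a genuine score, so it is worth checking the evaluation log before interpreting the numbers.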
I am currently working on adding the new expander.
@DelaramRajaei perfect. we can discuss the result tmr :)
@hosseinfani
The problem that occurred during the running process has been fixed.
The path to trec_eval was wrong.
I have changed the following line in ./qe/main.py in evaluate():
# before
eval_cmd = '{}eval/trec_eval.9.0.4/trec_eval'.format(anserini)
# after
eval_cmd = '{}eval/trec_eval'.format(anserini)
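Since the location of the binary differs between setups, one option is to try several candidate locations and fail loudly when none exists. This is a sketch, not the code in ./qe/main.py: the helper name is hypothetical, the two candidates are the paths mentioned above, and the `.exe` fallback assumes a Cygwin build on Windows.

```python
import os

DEFAULT_CANDIDATES = (
    "eval/trec_eval",
    "eval/trec_eval.9.0.4/trec_eval",
)


def find_trec_eval(anserini_root, candidates=DEFAULT_CANDIDATES):
    """Return the first candidate path under anserini_root that is a file."""
    for relative in candidates:
        path = os.path.join(anserini_root, relative)
        # Assumption: a Cygwin build on Windows may carry an .exe suffix.
        for option in (path, path + ".exe"):
            if os.path.isfile(option):
                return option
    raise FileNotFoundError(
        f"trec_eval not found under {anserini_root!r}; tried {list(candidates)}"
    )
```

Raising `FileNotFoundError` here makes a wrong path visible immediately instead of leaving the evaluation step without any scores.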
I ran the project using the following commands with only the LovinsStemmer expander:
$> python -u main.py --corpus robust04 --output ./output/robust04/ --ranker bm25 --metric map 2>&1 | tee robust04.bm25.log &
$> python -u main.py --corpus robust04 --output ./output/robust04/ --ranker qld --metric map 2>&1 | tee robust04.qld.log
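The two invocations differ only in the ranker, so they can also be generated from one place. A minimal sketch, assuming it is run from the folder that contains main.py; the helper names are made up, and unlike `tee`, this version only writes the log file and does not echo output to the terminal:

```python
import subprocess
import sys


def build_command(ranker, corpus="robust04", metric="map"):
    """Build the main.py invocation for one ranker as an argument list."""
    return [
        sys.executable, "-u", "main.py",
        "--corpus", corpus,
        "--output", f"./output/{corpus}/",
        "--ranker", ranker,
        "--metric", metric,
    ]


def run_ranker(ranker, **kwargs):
    """Run one ranker, writing stdout and stderr to <corpus>.<ranker>.log."""
    command = build_command(ranker, **kwargs)
    corpus = kwargs.get("corpus", "robust04")
    with open(f"{corpus}.{ranker}.log", "w") as log:
        return subprocess.run(command, stdout=log, stderr=subprocess.STDOUT).returncode


# Print the commands without running them.
for ranker in ("bm25", "qld"):
    print(" ".join(build_command(ranker)))
```

Calling `run_ranker("bm25")` and then `run_ranker("qld")` runs the two rankers one after the other rather than in the background.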
The new results are attached. robust04.zip
@DelaramRajaei perfect. thanks. Can you add a new setting in ./cmn/param.py for the address of trec_eval such that the pipeline reads the address from there like this:
https://github.com/fani-lab/ReQue/blob/1ae7580782a53a9575c4594fda15aefcce9a672f/qe/cmn/param.py#L39
https://github.com/fani-lab/ReQue/blob/1ae7580782a53a9575c4594fda15aefcce9a672f/qe/main.py#L56
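A rough sketch of what such a setting could look like. The dictionary layout, key names, and file names below are hypothetical and may not match the real ./qe/cmn/param.py; `-q` (per-query output) and `-m` (select one measure) are standard trec_eval options.

```python
# Hypothetical settings; the real key names in ./qe/cmn/param.py may differ.
settings = {
    "anserini": {
        "path": "../anserini/",
        "trec_eval": "../anserini/eval/trec_eval",
    },
}


def build_eval_command(settings, metric, qrels, run_file):
    """Build a trec_eval call from the configured binary path."""
    trec_eval = settings["anserini"]["trec_eval"]
    # -q prints per-query scores, -m restricts the output to one measure.
    return [trec_eval, "-q", "-m", metric, qrels, run_file]


# File names here are placeholders for illustration only.
print(build_eval_command(settings, "map", "qrels.robust04.txt", "topics.robust04.bm25.txt"))
```

With the path in one setting, a different trec_eval location only needs a change in param.py and no edit to evaluate().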
I cloned the latest version of the ReQue project and followed the steps below to set up and run the project.
$ conda create -n ReQue
$ conda activate ReQue
If you are using PyCharm, you have the option to create a new project and select Conda as the new environment with Python 3.8. This will create a new environment named after your file location.
Also, you can check the Python interpreter in PyCharm, by following these steps: File > Settings > Project: #ProjectName > Python Interpreter
$ conda install python=3.8 -n ReQue
$ pip install -r requirements.txt
If you choose to use the "conda install" command, you will encounter the following error, and you will need to manually install additional packages.
PackagesNotFoundError: The following packages are not available from current channels:
- pydantic==1.5
- wn==0.0.23
- community
- pywsd
- prettytable==2.1.0
- pyserini
- spacy==2.2.4
- tagme
- transformers==4.0.0
- tokenizers==0.9.4
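After installing with pip, it can be useful to confirm which of the packages from the error message are actually present. A small sketch using `importlib.metadata` from the standard library (available since Python 3.8); the helper name is made up, and the names are copied from the list above with the version pins dropped, so some may not match the distribution name that pip uses:

```python
from importlib import metadata  # standard library since Python 3.8


def missing_distributions(names):
    """Return the names that have no installed distribution metadata."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Names taken from the PackagesNotFoundError message (version pins dropped).
NEEDED = ["pydantic", "wn", "community", "pywsd", "prettytable",
          "pyserini", "spacy", "tagme", "transformers", "tokenizers"]
print("still missing:", missing_distributions(NEEDED))
```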
Install Cygwin and use its make to build trec_eval and ndeval.
Once you have completed these steps, you will need to build the Anserini project (for example, using Apache NetBeans IDE) and then execute the following command in the terminal.
$ mvn clean package appassembler:assemble
If you do not already have JDK 11 installed, install it via conda:
$ conda install -c conda-forge openjdk=11
Ensure that you install version 11 to avoid encountering an error.
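To confirm which Java version is active before building, the version banner can be parsed. This is a sketch with made-up helper names; it assumes `java` is on the PATH and that the banner contains a quoted version string, as OpenJDK builds print:

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    major, minor = match.group(1), match.group(2)
    # Java 8 and older report themselves as 1.x (for example 1.8.0_292).
    if major == "1" and minor is not None:
        return int(minor)
    return int(major)


def installed_java_major():
    """Run `java -version`; the JVM prints the banner on stderr."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)


print(parse_java_major('openjdk version "11.0.13" 2021-10-19'))
```

Calling `installed_java_major()` inside the ReQue environment should return 11 if the conda-forge openjdk package is the one being picked up.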
The pip installation runs into errors, so it is better to use the development installation.
If you encounter any errors while running the program, you can resolve them by reinstalling Pyserini: first run pip uninstall pyserini, and then run pip install pyserini.
In this step, you may get the following error:
DLL load failed while importing _swigfaiss: The specified module could not be found.
It can be fixed with this command:
$ conda install -c conda-forge faiss
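Because a DLL load failure only shows up when the module is actually imported, a short import check after the reinstall can confirm the fix. A minimal sketch with a hypothetical helper name:

```python
import importlib


def check_import(module_name):
    """Try to import a module and report success or the error text."""
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError) as error:  # "DLL load failed" is an ImportError
        return False, str(error)
    return True, ""


for name in ("faiss", "pyserini"):
    ok, message = check_import(name)
    print(name, "OK" if ok else f"FAILED: {message}")
```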
I followed the above steps to successfully install the ReQue project, which can now be used for further tasks.
This is the issue where I document my observations and installation difficulties encountered during this project.