fani-lab / RePair

Extensible and Configurable Toolkit for Query Refinement Gold Standard Generation Using Transformers

Merging ReQue and RePair #42

Open DelaramRajaei opened 10 months ago

DelaramRajaei commented 10 months ago

This is the issue where I log all my processes while adding ReQue's expanders to RePair.

DelaramRajaei commented 9 months ago

Hello @hosseinfani,

I successfully integrated all the refiners and merged the ReQue project with the RePair project. With this, I introduced the query_refinement setting in the parameter file. When set to true, the selected expanders will be invoked, generating refined queries stored in a refiner.#name_of_the_refiner file.

Example:

refiner.backtranslation_pes_arab

Currently, it is distinct from T5, but I plan to include T5 as a refiner alongside the others. In the refiner module, we now have the AbstractQRefiner class, which returns the original query unchanged. I observed that the main code treats the original query separately from the generated refined queries; I propose treating AbstractQRefiner as a refiner in its own right and calling it along with the others.

Moreover, I incorporated the semsim (Semantic Similarity) score as a mandatory score for all the refiners. It has been relocated from Backtranslation to the AbstractQRefiner class. After generating q', semsim is calculated and stored.
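A minimal sketch of that contract, under assumptions: only the class name `AbstractQRefiner` comes from the description above; the method names are illustrative, and a bag-of-words cosine stands in for the real transformer-based semsim score so the sketch stays self-contained.

```python
import math
from collections import Counter

class AbstractQRefiner:
    """Base refiner: its 'refinement' is the original query itself."""

    def get_refined_query(self, query: str) -> str:
        return query

    def semsim(self, q: str, q_prime: str) -> float:
        # Cosine similarity over term counts -- a stand-in for the
        # embedding-based semantic similarity used in the project.
        a, b = Counter(q.lower().split()), Counter(q_prime.lower().split())
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

class Backtranslation(AbstractQRefiner):
    """A concrete refiner overrides get_refined_query; semsim is inherited."""

    def __init__(self, translate):
        self.translate = translate  # callable: q -> q'

    def get_refined_query(self, query: str) -> str:
        return self.translate(query)
```

With this layout, every refiner (including the original-query base) produces a q' and a semsim score through the same interface.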

I introduced preprocess_query_batch to accommodate refiners like the backtranslation model that can work with batches. However, I haven't had the time to debug it yet.
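One way the batch interface could work (a sketch with illustrative class names; only `preprocess_query_batch` is taken from the description above): the base class falls back to the single-query path, so only batch-capable refiners such as backtranslation need to override it.

```python
from typing import List

class Refiner:
    # Hypothetical base interface; names other than preprocess_query_batch
    # are illustrative.
    def preprocess_query(self, query: str) -> str:
        return query.strip().lower()

    def preprocess_query_batch(self, queries: List[str]) -> List[str]:
        # Default: fall back to the single-query path, one query at a time.
        return [self.preprocess_query(q) for q in queries]

class BatchRefiner(Refiner):
    def preprocess_query_batch(self, queries: List[str]) -> List[str]:
        # A batch-capable model (e.g., a translation model) would process all
        # queries in one forward pass; the cleanup here just simulates that.
        return [q.strip().lower() for q in queries]
```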

After incorporating the refiners, I added the Query class. In the Dataset class, I implemented a function that reads all the queries from the dataset's path, creates a query object, and stores them in a list. The msmarco and aol child classes override this function according to their datasets. With this addition, RePair can now work with datasets like robust04, gov2, and others that were part of ReQue.
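A sketch of how that design could look. Only the `Query`/`Dataset` split and the overriding child classes come from the description above; field names and the tab-separated record format are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Query:
    qid: str
    text: str
    refinements: dict = field(default_factory=dict)  # refiner name -> refined text

class Dataset:
    def read_queries(self, lines):
        # Generic reader: one "qid<TAB>query" record per line.
        queries = []
        for line in lines:
            qid, text = line.rstrip("\n").split("\t", 1)
            queries.append(Query(qid, text))
        return queries

class MsMarco(Dataset):
    # Child classes (msmarco, aol, ...) override read_queries for their own
    # on-disk formats; the generic tsv layout above serves as the default.
    pass
```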

The main pipeline structure has been modified according to the Query class. Although it can still be optimized, I anticipate it will eventually transition to using the Query class exclusively.

I initially planned to include the search, eval, and other pipeline commands in the Query class as we discussed. However, I realized that keeping these functions in the Dataset might be more practical for accessing all queries, running with batches, and other functionalities. I am still deliberating on the most suitable architecture.

Tasks for the future:

hosseinfani commented 9 months ago

@DelaramRajaei Awesome! Thanks.

DelaramRajaei commented 9 months ago

Hey @hosseinfani ,

I wanted to provide you with a project update. Currently, the pipeline is operational, although I'm addressing some minor bugs related to reading different datasets. I've initiated backtranslation on two datasets, robust04 and dbpedia, across 10 languages. Below are the logs. robust04_dbpedia_backtranslation.zip

DelaramRajaei commented 9 months ago

Hey @hosseinfani,

I wanted to give you an update on the project.

The merger of ReQue and RePair is now complete. I have run backtranslation for all five datasets, using two IR rankers (BM25, QLD) and two evaluation metrics (MAP, MRR).

I encountered challenges in loading different datasets, particularly clueweb09b and gov2, whose queries are split across multiple TREC topic files. Currently, the code reads all the files at once, but I plan to modify it to run each topic file separately and aggregate the results, following the approach used in the ReQue project.
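The per-file aggregation could be sketched as below: parse each TREC topic file separately into a qid-to-query map and merge the maps. The minimal `<num>`/`<title>` parser is illustrative only, not the project's actual reader (real topic files carry more fields).

```python
import re

def parse_topics(text):
    # Extract qid -> title from one TREC-style topic file's text.
    topics = {}
    for m in re.finditer(r'<num>\s*Number:\s*(\S+).*?<title>\s*([^<\n]+)',
                         text, re.S):
        topics[m.group(1)] = m.group(2).strip()
    return topics

def aggregate(topic_texts):
    # Run each split topic file separately, then merge the results.
    merged = {}
    for text in topic_texts:
        merged.update(parse_topics(text))
    return merged
```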

The project is currently running all expanders for gov2 across the various IR rankers and evaluation metrics. The log of the ongoing run is attached.

logs.zip

The log file contains records for Backtranslation, Conceptnet, Thesaurus, Wordnet, and Tagme refiners. I have also updated the RePair_StoryBoard in the Query Refinement channel on Teams.

In parallel, I am working on the Query class and RAG fusion, though there hasn't been significant progress in those areas yet. I am making sure the expanders run flawlessly and addressing other bugs.

Additionally, a minor change has been made in the output structure. After creating a folder for each dataset, it will store the refined data there and subsequently store the results of the ranker and metric in a new folder within the dataset folder. Below is an overview of the file storage:

├── output
│   ├── gov2 [Dataset's name]
│   │   ├── refined_queries_files
│   │   └── ranker.metric [such as bm25.map]
│   │       └── [This is where all the results from the search, eval, aggregate, and boxing are stored]
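A tiny helper mirroring that layout (a sketch; the function name and arguments are illustrative, not the project's API): results for a given ranker/metric pair live in a `ranker.metric` subfolder of the dataset's output folder.

```python
import os

def result_dir(output_root, dataset, ranker, metric):
    # e.g. result_dir("output", "gov2", "bm25", "map") -> output/gov2/bm25.map
    return os.path.join(output_root, dataset, f"{ranker}.{metric}")
```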
hosseinfani commented 9 months ago

Hi @DelaramRajaei Thanks for the update. This is great. We need a meeting to demo a sample run for me.

DelaramRajaei commented 8 months ago

Hello, @hosseinfani

I am currently facing issues with the RelevanceFeedback refiner. Since we have transitioned from Anserini to Pyserini only, one potential solution is to use SimpleSearcher from Pyserini. However, that approach runs into problems with multiprocessing (multiprocessing as mp): SimpleSearcher is deprecated, and the library suggests using the Lucene-based classes instead. Unfortunately, I couldn't find a similar method there.

While reviewing slides on relevance feedback and the Rocchio algorithm, I am contemplating implementing the algorithm myself. This refiner is important: it serves as the parent for other key refiners such as RM3, BertQE, Termluster, and more.
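For reference, the Rocchio update from those slides can be written in a few lines. This is a minimal dense-vector sketch, not the refiner's implementation; the parameter defaults are common textbook values (alpha = 1, beta = 0.75, gamma = 0.15), and negative weights are clipped to zero as is conventional.

```python
def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant)
    dim = len(q)
    new_q = [alpha * x for x in q]
    if rel_docs:
        for i in range(dim):
            new_q[i] += beta * sum(d[i] for d in rel_docs) / len(rel_docs)
    if nonrel_docs:
        for i in range(dim):
            new_q[i] -= gamma * sum(d[i] for d in nonrel_docs) / len(nonrel_docs)
    # Negative term weights are usually clipped to zero.
    return [max(0.0, x) for x in new_q]
```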

All other refiners are functioning well and producing results for the gov2 dataset. While fixing issues, I encountered a minor problem with the Anchor and Wiki refiners: they had trouble calling and using their parent's variables. Additionally, the recent version of gensim (4.x) removed the vocab attribute from the Word2Vec model, replacing it with index_to_key. I found a helpful resource here.
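A small compatibility shim can absorb that gensim change (the helper name is my own; the two attribute names, `wv.vocab` pre-4.0 and `wv.index_to_key` from 4.0 on, are gensim's):

```python
def vocab_terms(wv):
    # gensim >= 4.0: KeyedVectors.index_to_key is the list of vocabulary terms.
    # gensim <  4.0: the terms were the keys of the wv.vocab dict.
    if hasattr(wv, "index_to_key"):
        return list(wv.index_to_key)
    return list(wv.vocab.keys())
```

Calling this instead of touching `wv.vocab` directly keeps a refiner working across both gensim versions.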

Currently, my focus is on resolving the RelevanceFeedback issue and on RAG fusion.

DelaramRajaei commented 8 months ago

Hello @hosseinfani,

I looked into a few more solutions to address the problem with the RelevanceFeedback refiner, but unfortunately, I couldn't find a successful fix. As a temporary measure, I'll stick to using only Anserini for this refiner until I come across a better solution.

Here's the code snippet that utilizes Anserini:

    def get_tfidf(self, docid):
        # Requires `import os` at module level.
        # Example invocation: IndexUtils -index lucene-index.robust04.pos+docvectors+rawdocs
        #                     -dumpDocVector FBIS4-40260 -docVectorWeight TF_IDF
        # Dumps the TF-IDF-weighted document vector for `docid` via Anserini's IndexUtils CLI.
        cli_cmd = f'"./src/anserini/target/appassembler/bin/IndexUtils" -index "{self.index}" -dumpDocVector "{docid}" -docVectorWeight TF_IDF'
        stream = os.popen(cli_cmd)
        return stream.read()

In the meantime, I discovered some resources that might be useful in resolving the issue.

Anserini:

- Extraction of TF-IDF vectors

Pyserini:

- Pyserini: Reproducing Vector PRF Results
- To Interpolate or not to Interpolate: PRF, Dense and Sparse Retrievers
- Pseudo Relevance Feedback with Deep Language Models and Dense Retrievers: Successes and Pitfalls
- Pseudo-Relevance Feedback with Dense Retrievers in Pyserini

Keyword extraction using TF-IDF:

- sklearn.feature_extraction.text
- gensim.models.TfidfModel

I came across this tool called Spacerini (link), which combines features from Pyserini and the Hugging Face ecosystem. It provides a simple and user-friendly method for researchers to explore and analyze large text datasets through interactive search applications. I'm not certain if we'll use it, but it could be helpful down the line.

hosseinfani commented 8 months ago

@DelaramRajaei thanks for the update. that's fine for the time being but create an issue page for it as a bug/issue so we can fix it in future.

for code reference, you can paste the codeline permanent link at github like this: https://github.com/fani-lab/RePair/blob/b752d8ecf7712b2c1134e6ac7f8ad9877f59ed6e/src/refinement/refiners/relevancefeedback.py#L56

DelaramRajaei commented 8 months ago

Hello @hosseinfani,

I've fixed the issues with RM3 and BertQE. Here's a brief overview of the changes:

RM3: I noticed that RM3 in pyserini is only used for document reranking, and a similar approach is used to select the top words in relevance feedback. To address this, I updated the get_topn_relevant_docids function; the refiner now calls get_refined_query from its parent, RelevanceFeedback.
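That inheritance pattern could look like the sketch below. Only the names `RelevanceFeedback`, `get_topn_relevant_docids`, and `get_refined_query` come from the description above; everything else (the toy scoring, term selection, and corpus access) is illustrative, in place of the real pyserini-backed retrieval.

```python
from collections import Counter

class RelevanceFeedback:
    def get_topn_relevant_docids(self, query):
        raise NotImplementedError  # each child supplies its own doc selection

    def doc_text(self, docid):
        raise NotImplementedError  # stand-in for index/corpus access

    def top_terms(self, docids, n=2):
        # Toy term selection: rank terms of the feedback docs by frequency.
        counts = Counter(t for d in docids for t in self.doc_text(d).split())
        return [t for t, _ in counts.most_common(n)]

    def get_refined_query(self, query):
        # Shared expansion logic: child only decides *which* docs feed it.
        docids = self.get_topn_relevant_docids(query)
        return f"{query} {' '.join(self.top_terms(docids))}"

class RM3(RelevanceFeedback):
    def __init__(self, corpus):
        self.corpus = corpus  # docid -> text (toy in-memory corpus)

    def doc_text(self, docid):
        return self.corpus[docid]

    def get_topn_relevant_docids(self, query, n=2):
        # Real code would retrieve/rerank via pyserini; here: term overlap.
        qt = set(query.split())
        ranked = sorted(self.corpus,
                        key=lambda d: -len(qt & set(self.corpus[d].split())))
        return ranked[:n]
```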

BertQE: Dealing with BertQE was challenging due to its reliance on pygaggle for importing transformers, which caused conflicts with other libraries. After reviewing their paper, I referred to this link and the BERT documentation to implement the code per their guidelines.

Both refiners are now working, and I've stored their results. Two other refiners (adoptonfields and onfields) are still pending; my current focus is on implementing RAG fusion and creating dense indexes to compare results with the existing refiners.

Other helpful links:

DelaramRajaei commented 7 months ago

Hello @hosseinfani, I wanted to let you know about the work I've accomplished in the past weeks.

I've stored the outcomes of the refinement process applied to RePair across all five datasets (robust04, gov2, antique, dbpedia, clueweb09b) for which sparse indices were available. Additionally, I've updated the RePair storyboard on Teams.

There have been changes to the pipeline, with the addition of more commands:

  1. query_refinement: This command triggers the execution of the refiners selected in refiner.param, including the original query. If the output files already exist, they are read and stored in the list of Query class objects for efficiency. If no refiners are selected, only the original-query refiner is called.

  2. similarity: This command computes rouge, bleu, and semsim for all refined queries along with the original query. All results are stored in the similarity folder. The output is structured as follows:

    ├── output
    │   └── dataset_name
    │       ├── similarity
    │       └── refined_queries_files
  3. rag_fusion: This command gathers the outcomes of the selected ranker for either all the refiners or just backtranslation, then calculates reciprocal rank fusion (RRF). Promising initial results have been achieved, although I'm still refining this step.
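The RRF step in item 3 follows a standard formulation: each document scores the sum, over the ranked lists being fused, of 1/(k + rank), with k conventionally set to 60. A minimal sketch (the function name and input shape are my own, not the project's API):

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked docid lists (one per refiner/ranker output).
    # score(d) = sum over lists of 1 / (k + rank(d)), ranks starting at 1.
    scores = {}
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] = scores.get(docid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of several lists accumulate the largest scores, which is why RRF rewards agreement across refiners without needing comparable raw scores.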

Additionally, several minor updates have been made to the RePair project:

Currently, my focus is on RAG fusion.