fani-lab / ReQue

A Benchmark Workflow and Dataset Collection for Query Refinement
https://hosseinfani.github.io/ReQue/

Query Refinement Backtranslation #27

Open DelaramRajaei opened 1 year ago

DelaramRajaei commented 1 year ago

This is the issue where I report my progress on the project.

DelaramRajaei commented 1 year ago

@hosseinfani Initially, I addressed two bugs related to reading and storing the CSV file in the project. To resolve the first, I replaced the deprecated DataFrame.append call with pandas.concat. For reading the CSV file, I examined the format and observed that all entries start with a <top> tag. Consequently, I implemented the following code to handle this situation:

# a '<top>' tag marks a TREC-style topic file; switch to tag parsing mode
if '<top>' in line and not is_tag_file:
    is_tag_file = True

If the file is a tagged .txt file, this flag is activated.

I also modified the .../qe/main.py file, replacing .format() calls with f-strings.
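For reference, here is a minimal sketch of the append-to-concat migration (the column names and sample row are hypothetical, not the actual project data):

import pandas as pd

# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# pd.concat is the supported replacement
df = pd.DataFrame(columns=['qid', 'query'])
new_row = pd.DataFrame([{'qid': '301', 'query': 'international organized crime'}])
df = pd.concat([df, new_row], ignore_index=True)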

Subsequently, I incorporated the backtranslation expander exclusively for the French language. You can find the relevant code snippet in the "../qe/expander/backtranslation.py" file. The settings of the backtranslation model and languages are in the "../qe/cmn/param.py" file.
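As an illustration, this is a minimal sketch of backtranslation through a pivot language, assuming an NLLB-style HuggingFace model (the model name and helper function are illustrative, not necessarily the exact code in backtranslation.py):

from transformers import pipeline

# NLLB identifies languages with codes such as 'eng_Latn' and 'fra_Latn'
to_french = pipeline('translation', model='facebook/nllb-200-distilled-600M',
                     src_lang='eng_Latn', tgt_lang='fra_Latn')
to_english = pipeline('translation', model='facebook/nllb-200-distilled-600M',
                      src_lang='fra_Latn', tgt_lang='eng_Latn')

def backtranslate(query):
    # translate to the pivot language and back to paraphrase the query
    pivot = to_french(query)[0]['translation_text']
    return to_english(pivot)[0]['translation_text']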

To facilitate result comparison, I have developed the "toy-compare.py" Python script, which can be found in the toy folder. However, I plan to relocate this file to the "../qe/eval" directory.

There are three functions available for comparing the results (see the sketch after this list):

  1. compare_mAP_each_row(): Compares the mAP row by row, i.e., the mAP of the original query against the mAP of the selected column(s); this can be a list of columns, for example the different languages of the backtranslation expander. The results are written to a CSV file.

  2. compare_mAP_all_row(): Calculates the mean of the mAP values for each column and writes the results to a txt file.

  3. plot_result(): This function is still under development; it is intended to plot the results for a selected dataset, displaying both the original and backtranslation results.
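A minimal sketch of the row-wise comparison; the dataframe column names ('qid' and the mAP columns) are hypothetical:

import pandas as pd

def compare_mAP_each_row(df, original_col, compare_cols, outfile):
    # compare the mAP of the original query against each selected column,
    # e.g. one column per backtranslation language
    out = df[['qid', original_col] + compare_cols].copy()
    for col in compare_cols:
        out[col + '.delta'] = out[col] - out[original_col]
    out.to_csv(outfile, index=False)
    return out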

Next, I made further updates to the code to enable it to handle multiple languages and generate queries accordingly. Subsequently, I compared the results using the "toy-compare.py" script.

However, there are still a few remaining bugs in the project:

  1. When running the code for multiple languages, it raises an error for the Dutch language whose source I have not yet identified.

  2. After generating the "topics.robust04.bm25.map.all.csv" file, the first query consistently returns a NaN value; I am currently investigating the cause (screenshot attached).

By next Friday, I have outlined the following tasks to be completed:

  1. My priority is to address and resolve the existing bugs.
  2. I aim to finalize and complete the plot function.
  3. I will conduct a thorough comparison of the results obtained and subsequently prepare a comprehensive report on the findings.
hosseinfani commented 1 year ago

@DelaramRajaei Thank you for the detailed report.

DelaramRajaei commented 1 year ago

@hosseinfani

The reported bugs have been successfully resolved.

Queries 301 and 672 were returning NaN values. The first bug was identified as an issue in the code and has been rectified. The second occurred because a topic lacks a qrel in the ir_datasets version of robust04; this absence is mentioned in the accompanying paper (screenshot attached).
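For reference, a quick way to spot such topics with ir_datasets, assuming robust04's dataset id ('disks45/nocr/trec-robust-2004'):

import ir_datasets

dataset = ir_datasets.load('disks45/nocr/trec-robust-2004')
judged = {qrel.query_id for qrel in dataset.qrels_iter()}
# topics present in the query set but absent from the relevance judgments
unjudged = [q.query_id for q in dataset.queries_iter() if q.query_id not in judged]
print(unjudged)  # per the report above, topic 672 lacks qrels in robust04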

Additionally, some modifications were made to improve the code. Specifically, the run() function in the main.py file was restructured, and duplicate lines were removed.

Another bug related to the backtranslation feature was identified and resolved. The issue stemmed from the model name being stored in lowercase in the df dataframe in main.py's build function: the model name contains the target language code, such as 'fra_Latn', but it was stored lowercased, causing the bug.
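A minimal illustration of the bug (the dataframe and model name here are hypothetical):

import pandas as pd

df = pd.DataFrame({'qid': ['301']})
model_name = 'backtranslation.fra_Latn'  # contains a case-sensitive language code
# buggy: lowercasing mangles the code, so later lookups of 'fra_Latn' fail
df['model'] = model_name.lower()         # 'backtranslation.fra_latn'
# fixed: preserve the original casing
df['model'] = model_name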

Both bugs are fixed and a pull request has been sent.

hosseinfani commented 1 year ago

@DelaramRajaei thank you. pls put a quick comment in the code about the query with no qrels. Next step will be the help/hurt chart, right?

DelaramRajaei commented 1 year ago

@hosseinfani I added the comment in the code and pushed it to my repository. Should I create a new pull request for this comment?

Yes, in the next step I am completing the plot and will report my findings about it.

hosseinfani commented 1 year ago

i don't think so. it automatically accumulates

DelaramRajaei commented 1 year ago

I have updated the code and pushed the new changes.

I fixed a bug with the antique dataset, changing main.py and abstractqexpander.py. There were some problems in reading and writing the new queries in .txt files.

Here are two logs of running the code with backtranslation expander on 5 different languages for robust04 and antique datasets. log_file_antique.txt log_file_robust04.txt

DelaramRajaei commented 1 year ago

@hosseinfani I need the results of other expanders for overall comparison.

hosseinfani commented 1 year ago

@DelaramRajaei I'm uploading them in the Query Refinement channel at ReQue >> v2.0 >> qe >> output.

hosseinfani commented 1 year ago

@DelaramRajaei done!

DelaramRajaei commented 1 year ago

@hosseinfani

I have run the program for the datasets below, and here are the logs of running the code with the backtranslation expander on 5 different languages for these datasets.

Unfortunately, I was unable to download the indexes for the ClueWeb datasets due to their large size. Could you please share the indexes with me?

I'm currently in the process of drafting the paper and analyzing the results to identify any trends.

hosseinfani commented 1 year ago

@DelaramRajaei For the record :D we got into problems when downloading the files from MS Teams. So, I gave Delaram the key to my office and asked her to open the computer casing and bring the hard disk (first the SSD, but then the other one). We internally attached the correct hard disk.

DelaramRajaei commented 1 year ago

I have run the program for clueweb09b and this is the log.

log_file_clueweb09b.txt

Unfortunately, I encountered a problem with the zip files for clueweb12b13, as they were found to be corrupted. I am currently exploring potential solutions to fix this issue.

In addition to that, I have been plotting the results and comparing the mean Average Precision (mAP) of the original queries with that of the backtranslated queries. So far, I have not achieved promising results: overall, performance did not improve with backtranslation. However, I am investigating ways to enhance it and trying to figure out which languages or datasets might yield better outcomes.
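For context, a minimal sketch of such a comparison plot, assuming a dataframe with per-query mAP values for the original and backtranslated queries (column names hypothetical):

import matplotlib.pyplot as plt

def plot_result(df, dataset_name, languages):
    # bar chart: mean mAP of the original queries vs. each pivot language
    labels = ['original'] + languages
    means = [df['original'].mean()] + [df[lang].mean() for lang in languages]
    plt.bar(labels, means)
    plt.ylabel('mean mAP')
    plt.title(f'original vs. backtranslated queries on {dataset_name}')
    plt.savefig(f'{dataset_name}.backtranslation.map.png')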

DelaramRajaei commented 1 year ago

@hosseinfani

After analyzing the results of these five datasets in five distinct languages, here are the findings: analyze.xlsx

Overall, it can be observed that the datasets "dbpedia" and "robust04" tend to yield superior results compared to the other datasets.

Additionally, Isaac compiled a list of new datasets related to law, medicine, and finance. I can process and analyze these new datasets. I can also change the translation model to see whether that improves the results.

DelaramRajaei commented 1 year ago

Hey @hosseinfani,

I'd like to fill you in on my activities this week. I've been working on adding tct-colbert as a dense retrieval method. I went through the RePair project and pyserini's documentation on dense retrieval. It seems that I need to modify the format of my stored files within the write_expanded_query function in the abstractqexpander.py file.

To maintain the integrity of the original code, I introduced new functions: read_queries and write_queries. The read_queries() function takes a file name as input and reads the file, handling various formats such as tagged or CSV. It's similar to the old read_expanded_queries function, with a minor adjustment. I also introduced a new variable for each expander, query_set, which holds the expanded queries produced by that expander.

There was a problem with the write_expanded_queries function: it read each query line and immediately expanded and wrote it to a new file in the same format. Unfortunately, this posed a challenge when attempting to add batching to the system and when using pyserini and colbert.

So, I decided to restructure the approach a bit. Let me give you an overview of how things are unfolding within the generate function.

[diagram: generate function process]

In this sequence, we begin by providing the filename of the original queries in any format. Once the file is read, it produces a dataframe. A loop then runs over the dataframe, and the preprocess_expanded function receives each query as input: it first expands the query according to the specific expander and method, and then cleans (preprocesses) it. Each expanded query is stored in the query_set variable. Here, we could also add batching (although this might require substantial changes for the other expanders). Additionally, we could consider integrating a message queue and broker such as Celery and Redis to run multiple instances of this function.

Afterward, by specifying the file name, the query_set is saved in a more user-friendly CSV format, as sketched below.
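This is a condensed sketch of that flow as a method of the expander class; the method and column names approximate the description above rather than the exact code:

def generate(self, infile, outfile):
    # 1. read the original queries (tagged or CSV) into a dataframe
    df = self.read_queries(infile)            # columns: qid, query
    # 2. expand and preprocess each query, accumulating results in query_set
    self.query_set = []
    for _, row in df.iterrows():              # batching could hook in here
        expanded = self.get_expanded_query(row['query'])
        self.query_set.append({'qid': row['qid'],
                               'query': self.preprocess_expanded(expanded)})
    # 3. write the accumulated queries to a user-friendly CSV
    self.write_queries(outfile)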

This architecture now supports using only pyserini, which let me remove anserini from the code in the search and evaluate functions. To modify the evaluation function, I referred to the documentation provided by Pyserini.

After encountering several bugs and errors, I'm pleased to share that I've managed to address all of them today, resulting in the project running seamlessly now.

Subsequently, I attempted to add tct_colbert to the project, and I succeeded. However, I'm currently facing an indexing issue with the datasets: specifically, I need to encode them to obtain the dense index.
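For context, a minimal sketch of dense retrieval with TCT-ColBERT in pyserini, assuming a Faiss index has already been encoded for the corpus (the index path and model name here are illustrative):

from pyserini.search.faiss import FaissSearcher, TctColBertQueryEncoder

encoder = TctColBertQueryEncoder('castorini/tct_colbert-v2-hnp-msmarco')
searcher = FaissSearcher('path/to/dense-index', encoder)  # local Faiss index
hits = searcher.search('international organized crime', k=10)
for hit in hits:
    print(hit.docid, hit.score)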

In the meantime, I've updated both the environment.yaml and requirement.txt files, changing some library versions and introducing new ones. Isaac reviewed these changes and confirmed they work smoothly.

I've made updates to the Excel task sheet. Regarding my upcoming tasks, here's the list:

hosseinfani commented 1 year ago

@DelaramRajaei Thank you very much for the detailed report. We need a code review together so I can fully understand the changes. About the todo list, not sure I understood the Celery and Redis for multiprocessing. We'll talk.