Closed: shadowworldqe closed this issue 1 week ago
Thank you very much for bringing the license issue to our attention. First and foremost, I sincerely apologize for my mistake. RAGLAB is the first repository I have developed and maintained. As a beginner, I lack experience in managing large projects, particularly in areas such as creating well-crafted README files, icons, and licenses.
FlashRAG is an outstanding open-source project that has made significant contributions to the RAG community. I acknowledge my mistake in copying FlashRAG's LICENSE file and referencing the style of their readme document. Our error is limited to the readme file only. However, we absolutely did not plagiarize FlashRAG's code. I deeply regret this error and once again apologize to everyone for my mistake.
Furthermore, we must clarify that all code in RAGLAB was developed independently by our team. We have open-sourced the raglab-exp repository to provide evidence that the RAGLAB repository was independently developed. The raglab-exp repository contains over 400 commit records, which serve as strong evidence supporting our claim. The reason we had not previously open-sourced the raglab-exp repository is that we were in the process of developing new algorithms. Additionally, we used copydetect to analyze the code similarity between RAGLAB and FlashRAG; the results can be found in the code copydetect result.
Reply to issue #1
Q1: In Table 1 of the paper, you claim that FlashRAG lacks fair comparison, but it seems to support fair comparisons of different methods?
The initial motivation for designing RAGLAB was our discovery during literature review that published algorithms do not provide a fair comparison. The key components used in different papers, such as the knowledge database, retriever model, generation model, and instructions, vary significantly. We pointed out that FlashRAG lacks a fair comparison because it does not provide a unified generation model.
In FlashRAG:
1) The SelfRAG algorithm uses selfrag-llama2-7B, which, compared to llama2-7B, was fine-tuned on an additional 140,000 data points, giving it an unfair advantage over the other generation models.
2) The Spring algorithm uses Llama2-7B-chat with a trained embedding table.
3) The Ret-Robust algorithm uses LLAMA2-13B with a trained LoRA.
4) The remaining algorithms use LLAMA3-8B-instruct.
Thus, the different algorithms in FlashRAG do not use the same base model as the generator, which is why we believe FlashRAG lacks fair comparisons. We have already highlighted this point in our paper.
During the evaluation of RAGLAB, we used Llama3-8B as the base model and trained four models based on it: llama3-8B-baseline, selfrag-llama3-8B, llama3-70B-adaptor, and selfrag-llama3-70B-adaptor. RAGLAB is the first project to train the selfrag-llama3-70B-adaptor model, and we have open-sourced it on Hugging Face. During training, we used the same training parameters, the same training scripts, and the same training data (SelfRAG's open-source training data url; all special tokens were removed when training the llama3-8B-baseline and llama3-70B-adaptor models). After training on the same data, we believe that llama3-8B-baseline and selfrag-llama3-8B are fairly comparable, as are llama3-70B-adaptor and selfrag-llama3-70B-adaptor. This is an aspect that FlashRAG has not addressed. Additionally, we have open-sourced the entire process, including the training dataset url, the training script parameters url, and all four models.
Q2: To my knowledge, FlashRAG does not have a trainer, which does not match the claim in the paper.
In the Trainer column of Table 1, we mistakenly indicated that FlashRAG includes a trainer function. This was a writing error on my part. You are correct: FlashRAG does not provide any trainer-related functionality. We have corrected this mistake in our paper on arXiv and uploaded a revised version. Once again, thank you for bringing this issue to our attention.
Q3: The license for the repository is from FlashRAG, and even the name is the same as its author.
Q4: The overall design structure and diagram are very similar.
I would like to make a brief statement regarding RAGLAB. First, the framework design of RAGLAB is different from that of FlashRAG. RAGLAB does not categorize different advanced RAG algorithms nor does it abstract a pipeline class; instead, each algorithm is abstracted into a separate class corresponding to an individual Python file. In contrast, FlashRAG categorizes different algorithms and designs distinct pipelines for each category, with algorithms using the same pipeline being grouped into a single Python file. RAGLAB considers all advanced algorithms as improvements upon NaiveRAG, so all algorithms in RAGLAB inherit from NaiveRAG, and all utils are defined within NaiveRAG. This is fundamentally different from FlashRAG, where different advanced RAG algorithms inherit from different pipelines.
Secondly, I would like to address the issue regarding the similarity between RAGLAB and FlashRAG diagrams. In RAGLAB, all advanced RAG algorithms inherit from the NaiveRAG algorithm, with NaiveRAG providing a wide range of utilities. On the other hand, in FlashRAG, there is no inheritance relationship between different pipelines, which is a fundamental difference between RAGLAB and FlashRAG. For more detailed differences, please refer to the raglab-vs-flashrag section.
To demonstrate that RAGLAB was developed independently, we have compared the creation timelines of RAGLAB and FlashRAG, as well as the progress of algorithm development. For specific details, please refer to the table in the next section.
Comparison of the Creation Times and Algorithm Development Progress between the RAGLAB and FlashRAG Repositories
| Time Nodes | RAGLAB | Evidence | FlashRAG | Evidence | Whether RAGLAB is earlier than FlashRAG |
|---|---|---|---|---|---|
| GitHub Repository Creation Time | 2024-02-07 | url init-time | 2024-03-14 | url init-time | ✅ |
| NaiveRAG's Development Time | 2024-02-12 | url | 2024-04-04 (Deleted) | url | ✅ |
| BasicPipeline's Development Time | Not Developed | - | 2024-03-19 | url | - |
| SequentialPipeline's Development Time | Not Developed | - | 2024-03-19 | url | - |
| selfrag's Development Time | 2024-02-26 | url | 2024-04-09 | url | ✅ |
| RRR's Development Time | 2024-03-04 | url | Not Developed | - | - |
| iter-gen's Development Time | 2024-03-09 | url | 2024-04-08 | url | ✅ |
| active rag's Development Time | 2024-03-09 | url | 2024-04-11 | url | ✅ |
| DSP's Development Time | 2024-03-29 (Developed), 2024-04-22 (Deleted) | url-dev url-remove | Not Developed | - | - |
| selfask's Development Time | 2024-04-08 | url | 2024-04-13 | url | ✅ |
| GitHub Repository Open Source Time | 2024-08-05 | url | 2024-05-24 | public-1 public-2 public-3 public-4 | ❌ |
| Arxiv Paper Publication Date | 2024-08-21 | url | 2024-05-22 | url | ❌ |
> [!NOTE]
> - The open-source date of the FlashRAG repository cannot be determined from the commit history, so we used a Google search with a time filter to find the date when the FlashRAG repository was made public.
> - The development time listed for NaiveRAG in FlashRAG is the date the record was deleted, not the original development date.
> - For algorithms that have not been developed, we filled in "Not Developed" in the table.
> - RAGLAB initially attempted to develop the DSP algorithm, but after multiple discussions we concluded that DSP is not suitable for comparison with advanced RAG algorithms. As a result, we deleted the DSP algorithm on 2024-04-22. The commit record of the deletion can be found here: url.
Summary: The timeline above shows that the RAGLAB repository was created before FlashRAG's, and that most of RAGLAB's algorithms were developed earlier than their FlashRAG counterparts, which supports our claim that RAGLAB was developed independently.
Code Similarity Test between RAGLAB and FlashRAG
```bash
# Install the copy-detection tool
pip install copydetect

# Clone both repositories
git clone https://github.com/RUC-NLPIR/FlashRAG.git
git clone https://github.com/fate-ubw/RAGLAB.git

# Compare all .py files, with RAGLAB as the test directory and FlashRAG as the reference
copydetect -t RAGLAB -r FlashRAG -e py
```
Differences in System Design between RAGLAB and FlashRAG
Different Framework Design Concepts: RAGLAB does not categorize different advanced RAG algorithms, nor does it abstract a pipeline class. Instead, each algorithm is abstracted into a separate class corresponding to an individual Python file. In contrast, FlashRAG categorizes different algorithms and designs distinct pipelines for each category, with algorithms using the same pipeline grouped into a single Python file. RAGLAB considers all advanced algorithms as improvements upon NaiveRAG, so all algorithms in RAGLAB inherit from NaiveRAG, and all utils are defined within NaiveRAG. This is a fundamental difference from FlashRAG, where different advanced RAG algorithms inherit from different pipelines. RAGLAB currently integrates six advanced algorithms, while FlashRAG integrates 14 algorithms.
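To make the described inheritance design concrete, here is a minimal Python sketch of the pattern in which every advanced algorithm inherits from NaiveRAG and reuses its utilities. The class names, method names, and the toy `IterGen` loop are illustrative assumptions for this example, not RAGLAB's actual code.

```python
class NaiveRAG:
    """Base algorithm: retrieve then generate. Shared utilities live here."""

    def __init__(self, retriever, generator):
        self.retriever = retriever  # callable: query -> list of passages
        self.generator = generator  # callable: prompt -> answer string

    def retrieve(self, query):
        return self.retriever(query)

    def generate(self, prompt):
        return self.generator(prompt)

    def infer(self, query):
        # NaiveRAG itself: one retrieval pass, then one generation pass.
        passages = self.retrieve(query)
        return self.generate(query + "\n" + "\n".join(passages))


class IterGen(NaiveRAG):
    """An 'advanced' algorithm overrides only infer(), reusing base utilities."""

    def __init__(self, retriever, generator, max_iter=2):
        super().__init__(retriever, generator)
        self.max_iter = max_iter

    def infer(self, query):
        # Iterative retrieval-generation: feed the last answer back into retrieval.
        answer = ""
        for _ in range(self.max_iter):
            passages = self.retrieve((query + " " + answer).strip())
            answer = self.generate(query + "\n" + "\n".join(passages))
        return answer
```

Under this layout, each algorithm lives in its own file and its own class, while the shared retrieval and generation plumbing stays in the base class, which is the contrast with a pipeline-per-category design.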
Generator Alignment Differences: During the evaluation of RAGLAB, we used Llama3-8B as the base model and trained four models based on it: llama3-8B-baseline, selfrag-llama3-8b, llama3-70B-adaptor, and selfrag-llama3-70B-adaptor. During the training process, we used the same training parameters, the same training scripts, and the same training data (using SelfRAG's open-source training data url; all special tokens were removed when training the llama3-8B-baseline and llama3-70B-adaptor models). After training on the same data, we believe that llama3-8B-baseline and selfrag-llama3-8b are fairly comparable, as are llama3-70B-adaptor and selfrag-70B-adaptor. FlashRAG, however, did not use a unified base model for training; instead, different algorithms used different models.
In our paper, we argued that FlashRAG lacks fair comparisons because it did not use a unified base model for training, but rather different algorithms used different models: 1) In FlashRAG, the SelfRAG algorithm uses selfrag-llama2-7B, which, compared to llama2-7B, was trained on an additional 140,000 data points during fine-tuning, creating an unfair advantage over other generation models. 2) The Spring algorithm in FlashRAG uses Llama2-7B-chat with a trained embedding table. 3) The Ret-Robust algorithm in FlashRAG uses LLAMA2-13B with a trained LoRA. 4) The remaining algorithms in FlashRAG use LLAMA3-8B-instruct.
Thus, the different algorithms in FlashRAG do not utilize the same base model as the generator, which is why we believe FlashRAG lacks fair comparisons. We have already highlighted this point in our paper.
Different Retrievers: RAGLAB integrates two retrieval models, ColBERT and Contriever, whereas FlashRAG does not integrate ColBERT or handle ColBERT embeddings. RAGLAB designed a ColBERT Server & API feature that allows more than 10 scripts to access the ColBERT server concurrently; this significantly conserves resources and mitigates our limited-resource constraints, and FlashRAG does not provide this functionality. On the other hand, FlashRAG integrates BM25, embedding models, and T5-based models, which RAGLAB does not.
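As a purely illustrative sketch of the "one shared retrieval server, many client scripts" pattern, the following stands up a tiny threaded HTTP search endpoint using only the Python standard library. The route, response format, and all names are invented for this example; they are not RAGLAB's actual ColBERT server API.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlparse, parse_qs


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        # In a real deployment this would query the expensive, GPU-resident
        # index exactly once, instead of every client script loading its own copy.
        body = json.dumps({"query": query, "passages": [f"passage about {query}"]})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())

    def log_message(self, *args):
        # Keep the demo server quiet.
        pass


def start_server():
    """Start a threaded server on a free port; many scripts can hit it at once."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), SearchHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server  # server.server_address[1] is the chosen port
```

Because the server is threaded, each evaluation script can issue requests concurrently while the heavyweight index stays loaded in a single process, which is the resource saving described above.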
Interact Mode & Evaluation Mode: We developed both an Interact Mode and an Eval Mode for each algorithm. The goal of Interact Mode is to allow non-research users to quickly understand the essence of the algorithms without needing to download datasets. Evaluation Mode is designed to replicate the results presented in the paper. FlashRAG did not design an Interact Mode for each algorithm; instead, it created a UI interface specifically for Simple RAG.
Evaluation Section: In addition to accuracy, F1 score, and exact match (EM), we also integrated FactScore and ALCE, two advanced metrics. However, FlashRAG did not integrate or evaluate FactScore and ALCE.
Completely Different Dataset Loader Design: RAGLAB does not preprocess raw datasets; instead, it designs a separate class for each dataset that loads the raw dataset directly. Please refer to the dataset class. FlashRAG, on the other hand, processed 32 datasets from scratch and unified the format of all datasets, so it does not need to create a new class for each dataset. The 32 datasets open-sourced by FlashRAG provide significant convenience for researchers and make a substantial contribution to the RAG open-source community.
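An illustrative sketch of the per-dataset loader idea described above: each dataset gets its own small class that reads its raw file format directly, with no offline preprocessing step. The class name and the JSON-lines format are assumptions for illustration, not RAGLAB's actual dataset classes.

```python
import json


class PopQALoader:
    """Hypothetical per-dataset class: loads a raw JSON-lines file as-is,
    returning one dict per example without reformatting the dataset."""

    def __init__(self, path):
        self.path = path

    def load(self):
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

The trade-off is that every new dataset needs its own small class, but the raw files never need to be converted into a unified format ahead of time.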
Instruction Lab: To align each component, and especially the impact of instructions on results, RAGLAB adopted a design that separates algorithm logic from data. We built instruction_lab.py, from which all algorithms load their instructions and prompts, ensuring alignment across different algorithms. FlashRAG, on the other hand, constructs instructions by defining PromptTemplate within the algorithm itself, as can be seen in pipeline.py. FlashRAG does not separate algorithm logic from specific instructions, while RAGLAB separates all instructions from algorithm logic, achieving centralized management of instructions. This is a fundamental difference between RAGLAB and FlashRAG in the implementation of instructions.
Logger: To facilitate the management of experimental results, we designed a logger that saves all log information generated during the evaluation process to a separate .log file. FlashRAG does not provide this functionality.
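A rough sketch of a per-run evaluation logger with the behavior described above: everything logged during an evaluation run is written to its own .log file. The function name, file naming, and record format are illustrative assumptions, not RAGLAB's actual logger.

```python
import logging
import os


def build_eval_logger(run_name, log_dir="."):
    """Create a logger whose records are written to <log_dir>/<run_name>.log."""
    logger = logging.getLogger(run_name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(os.path.join(log_dir, run_name + ".log"), mode="w")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Giving each evaluation run its own named logger and file keeps results from concurrent experiments separated on disk.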
Different Experimental Conclusions: RAGLAB evaluated the performance of six algorithms on the ARC, MMLU, PubHealth, StrategyQA, and Multiple-choice tasks, as well as on the FactScore and ALCE datasets. However, FlashRAG did not evaluate the aforementioned tasks; instead, FlashRAG evaluated the NQ and WebQA datasets, which RAGLAB did not assess.
Finally, we once again apologize for the mistakes we made. The issues related to copying the FlashRAG license and the problems in Table 1 of our paper have been corrected, and we have actively communicated with the authors. We will provide timely updates on the situation.
GitHub is an open platform, and we welcome everyone to monitor our work and raise any issues. We will also actively respond to any concerns. Although both RAGLAB and FlashRAG aim to compare different algorithms, our goals are not the same. RAGLAB strives for rigorous and fair comparisons, while FlashRAG did not align the generators of different algorithms, a point we have addressed in our paper and the above documentation. RAGLAB has invested a significant amount of time in training four models. Additionally, the resources and maintenance team for the RAGLAB project are much smaller than those of the FlashRAG team. The FlashRAG team has made outstanding contributions to the RAG community, and we are very willing to work with them to advance the progress of the RAG community.
If you have any questions, feel free to reopen this issue~
Hello, I have carefully read the content and code instructions of the paper, and it seems that the difference between this repository and FlashRAG is not significant. There also seem to be some incorrect claims in the paper's comparison (Table 1):
There are also some instances of plagiarism on the GitHub repository's pages: