Hi, we want to be clear that we did not intend to plagiarize your paper. From our understanding, your paper points out the phenomenon of evasive data contamination, while our paper focuses on how to detect it. We followed your experimental setup only in the preliminary experiment on in-distribution datasets, but we ran additional evaluations on OOD datasets such as MATH, which we believe is different from your dataset scope. The fine-tuning data settings follow yours; we are sorry that we did not mention this in Section 2, but we did cite your paper in a later section. We admit the missing reference is our fault, but it was not intentional plagiarism. We will cite your paper in Section 2 to emphasize your work's contribution in the next version of our paper.
Regarding the code, we did use your code to run the preliminary experiment, though we edited some parts of it so that we could run evaluations on OOD datasets. That said, we argue that the code for the main method of our paper, which uses the LLM's internal states for data contamination detection, is not from your repo but was implemented by ourselves. We are sorry that we forgot to reference your GitHub repo in our README. We will mention that our implementation is based on your GitHub repository; the credit is yours.
Overall, we are sorry that we missed some references to your paper and code; we did not intend to plagiarize your work. If you have any other questions, please let us know.
Hey @JasperDekoninck, in my opinion the main issue here is a lack of references and acknowledgement, whereas "plagiarism" means presenting others' novel work as one's own. As my colleague @ShangQingTu has added an acknowledgement of your previous work, would you mind changing the title of your issue?
Thank you for your reply. However, your current explanation is not sufficient. Specifically, the following remains unaddressed:
How does the conclusion in Section 2 differ from our paper? Specifically, “Observation 1. ID performance can be easily inflated by contamination” is the exact same observation we report in Table 1 of our paper. This is done using the exact same training data and method. While your paper does present results on other benchmarks (such as GSM-Hard and OOD datasets), you need to discuss more thoroughly why this is different from our setting and what the added benefit is. Right now, you do not discuss this at all, and it seems like you’re claiming the entire experiment is original, and that Observation 1 has not been observed before in the exact same setting with the exact same hyperparameters. This, in our opinion, is scientific misconduct. Given this experiment, we do believe the main issue is plagiarism and not simply a missing reference.
As we mentioned before, we saw that you cited us in Section 4 of your paper. However, the citation is not only insufficient but also makes an incorrect claim about our work (that we are using the same evaluation metric as you, which isn’t the case).
Can you please discuss more thoroughly in your paper where your experimental setup in Section 2 came from, and how your conclusions differ from ours? We would also like to see this correction as soon as possible, rather than simply in the next version of your paper.
Thank you for your kind request. We will modify our arXiv paper and resubmit it to arXiv on September 23, 2024. We plan to make these changes (in bold), with an illustrative data-mixing sketch after them:
\textbf{Experimental Setup.} Following the experimental settings and hyperparameters from EAL~\cite{dekoninck2024evading}, we fine-tune two representative LLMs, Llama2~\cite{touvron2023llama} and Phi2~\cite{javaheripi2023phi2}, on the instruction dataset OpenOrca~\cite{openorca}, which has no overlap with GSM8K's in-distribution data...
\textbf{Observation 1. ID performance can be easily inflated by contamination.} As depicted in Table \ref{tab1_benhcmark_alignment_tax}, we find that, similar to exact contamination, in-distribution contamination can also greatly improve the performance of LLMs on in-distribution tasks. At a 2\% contamination level, both exact and in-distribution contaminated models achieve about a 10\% absolute performance gain on ID benchmarks. Moreover, at a 10\% contamination level, the ID gain can be even larger (more than 20\%). This observation is consistent with the findings in EAL~\cite{dekoninck2024evading}; we further explore whether data contamination can improve models' performance on OOD datasets.
\textbf{Models.} Following prior work~\cite{dekoninck2024evading}, we simulate contaminated and uncontaminated models by fine-tuning LLaMA2-7B~\cite{touvron2023llama} to obtain three versions: LLaMA2-7B trained on OpenOrca (\textit{uncontaminated}), OpenOrca+GSM-i (\textit{contaminated}), and OpenOrca+GSM-i-Syn (\textit{contaminated}).
\textbf{Implementation Details.} We will delete: "Following prior works~\cite{dekoninck2024evading}"
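For concreteness, the contamination mixes described in the changes above could be constructed roughly as sketched below. This is an illustration only, not code from either repository; the dataset identifiers, slice sizes, and the `build_sft_mix` helper are assumptions made for this sketch.

```python
# Minimal illustrative sketch (not code from either repository) of how a contaminated
# SFT mix could be built. Dataset identifiers, slice sizes, and field handling are
# assumptions for illustration only.
import random
from datasets import load_dataset

def build_sft_mix(clean_data, contamination_source, contamination_level, seed=0):
    """Mix benchmark-derived samples into clean SFT data.

    `contamination_level` is read as the fraction of the final mix that comes from
    the benchmark (e.g. 0.02 for the 2% setting, 0.10 for the 10% setting).
    """
    random.seed(seed)
    clean = list(clean_data)
    source = list(contamination_source)
    # Solve c / (len(clean) + c) = contamination_level for the number of contaminated samples c.
    n_contaminated = int(len(clean) * contamination_level / (1 - contamination_level))
    mix = clean + random.sample(source, min(n_contaminated, len(source)))
    random.shuffle(mix)
    return mix

# Hypothetical usage: OpenOrca as the uncontaminated base; GSM8K test items (exact
# contamination) or in-distribution variants such as the GSM-i / GSM-i-Syn sets
# mentioned above as the contamination source.
openorca = load_dataset("Open-Orca/OpenOrca", split="train[:100000]")
gsm8k_test = load_dataset("gsm8k", "main", split="test")
mix_2_percent = build_sft_mix(openorca, gsm8k_test, contamination_level=0.02)
```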
Thank you again for your time. If you have any other questions, please let us know.
Thank you for your response. However, we believe that the extent of the copying from our paper necessitates a more thorough intervention. Specifically, in the introduction and abstract, there is no mention that prior work has already shown that “even training on data similar to benchmark data inflates performance on in-distribution tasks without improving overall capacity” and "training on data similar to benchmark data can lead to severe performance overestimation". In-distribution contamination is exactly what we demonstrated in our paper and should therefore be appropriately attributed both in the intro and the abstract, particularly given that the performance overestimation you measure occurs in the same experimental setup as ours. Similarly, the statement "we design an OOD test for a set of fine-tuned LLMs simulating different levels of in-distribution" should clarify that you again use the same experimental setup as us but additionally evaluate on different, “OOD”, datasets.
Moreover, the second part of your first contribution, "... it causes the model’s capability to be overestimated on ID benchmarks," is essentially the same contribution as ours. The only difference is that you frame it in comparison to OOD benchmarks, while we discuss overestimation in a broader sense. This is a nuanced distinction that warrants at least some discussion as to why it can be a major contribution.
We also noticed that in Section 5, you use the terms "sample-level contamination detection" and "benchmark-level contamination detection". Coincidentally, we defined these exact terms in Section 2 of our paper as the two main forms of contamination. These terms are also a key dimension along which we evaluate methods in Section 4 of our paper. It seems only appropriate to add a reference to where these terms were originally coined.
Overall, the paper appears to be heavily inspired by our work: you use the same code, the same experiment in Section 2, some similarities between your experiment in Section 4 and ours, and the same terminology to differentiate between contamination detection methods. It also seems that your new technique was developed as a way to address the attack we presented in our paper. The issue raised by one of your authors in our GitHub repository further supports this, as indicated by the comment (“First and foremost, I want to express my appreciation for the outstanding work your team has done. Your data rewriting method is very effective,” March 23), where the author explicitly states their intent to reproduce our experiments, making it clear that you were aware of our work. This inspiration should be clearly stated and attributed not only on GitHub but also in your paper.
In summary, while we do not dispute that your paper makes novel contributions (primarily DICE), we ask that you clearly distinguish these from the findings of our work. In particular, we ask that you clearly acknowledge that our work has already shown the increase in (ID) performance after contamination and the differentiation between sample- and benchmark-level contamination.
Hi, thanks for your response. We will add further references to your work. We plan to make these changes (in bold):
The advancement of large language models (LLMs) relies on evaluation using public benchmarks, but data contamination can lead to overestimated performance. Previous research focuses on detecting contamination by determining whether the model has seen the exact same data during training. Recently, EAL~\cite{dekoninck2024evading} has already shown that even training on data similar to benchmark data inflates performance, a phenomenon termed \emph{in-distribution contamination}. In this work, we argue that in-distribution contamination can lead to a performance drop on OOD benchmarks. To effectively detect in-distribution contamination, we propose DICE, a novel method that leverages the internal states of LLMs to locate-then-detect the contamination. ...
Recently, EAL~\cite{dekoninck2024evading} has already shown that training on data that bears similarity to the benchmark data can lead to severe performance overestimation, a phenomenon termed \emph{in-distribution contamination}. Since pre-training data is massive and hard to distinguish based on its distribution, we narrow our scope to the supervised fine-tuning (SFT) phase. We aim to answer the following research questions: (1) Does in-distribution contamination contribute to a model's overall math reasoning ability? (2) If not, how can we detect it to prevent overestimating the model's capabilities due to contamination?
To investigate whether in-distribution contamination can really improve LLMs' math reasoning ability, we design an OOD test for a set of fine-tuned LLMs simulating different levels of in-distribution contamination on GSM8K, following the experimental setup of prior work~\cite{dekoninck2024evading}. ...
We examine the impact of in-distribution contamination on LLMs' performance on both ID and OOD tasks, revealing that it can lead to a performance drop on OOD benchmarks.
\section{Related Work} Prior work~\cite{dekoninck2024evading} has divided current data contamination detection methods into two categories: benchmark-level contamination detection and sample-level contamination detection. ...
\section{Acknowledgements} We wish to express our appreciation for the pioneering work in the field of evasive data contamination~\cite{dekoninck2024evading}. Our work was developed as a way to address the attack presented in EAL~\cite{dekoninck2024evading}.
In addition, we will add an acknowledgement on GitHub and in the paper stating that our work is inspired by yours. Thank you again for your time. If you have any other questions, please let us know.
Thank you for your reply! That seems good to us.
We recently noticed your paper, which seems to present an interesting method for detecting contamination in language models. However, it appears to copy the entire experimental setup of our paper, "Evading Data Contamination is (too) Easy" (https://arxiv.org/abs/2402.02823), without any reference to our work as inspiration for this setup. Further, your implementation (i.e., this repository) copies large portions of our implementation without proper attribution. In more detail:
Implementation: This implementation shows blatant and obvious signs of plagiarism. In particular, our entire GitHub repo (https://github.com/eth-sri/malicious-contamination) was copied into this one, except that both the License and the README of our repo were removed. Even the results from our paper are still visible in the notebook you copied! This is not only scientific misconduct and plagiarism but also a copyright violation.
Experimental setup: Your experimental setup matches ours to a large degree without this being credited in any way:
Table 1:
Table 2:
Please fix both your GitHub repo, by including the correct licenses and prominent attribution of our work, and your paper, by acknowledging our work and comparing against it in detail. Otherwise, we will have to take further steps, such as issuing a takedown notice for your GitHub repo.