HITSZ-HLT / JointCL


Should we use the test dataset to evaluate the model performance? #7

Closed meteorlin closed 1 year ago

meteorlin commented 1 year ago

Hi @BinLiang-NLP ! In this line, the test set is used as the validation set to decide which checkpoint to save as the final model for testing. Is it reasonable to do so? According to the TOAD paper, 15% of the SemEval2016 Task A training set should be split off as the validation set. However, under TOAD's setting, vanilla BERT actually cannot reach the performance reported in the current paper and may fall below 50% on each subset. I tried many approaches, but I could not reproduce the BERT performance reported in the TOAD paper (i.e., the results quoted in your paper). Since your model is also based on BERT, I would like to know whether you were able to fully reach the reported results on the SemEval2016 subsets under the "85% training set / 15% validation set" setting (e.g., 55% on the "A" subset)?
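
For concreteness, a minimal sketch of the split I am referring to (the (text, target, label) layout is only an assumption for illustration, not this repository's actual data loader):

```python
# Hypothetical sketch of the "85% train / 15% validation" split discussed
# above. The data layout (a list of (text, target, label) tuples) is an
# assumption for illustration, not this repository's actual loader.
import random

def split_train_val(examples, val_ratio=0.15, seed=42):
    """Randomly hold out `val_ratio` of the training examples as validation."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_val = int(len(examples) * val_ratio)
    return examples[n_val:], examples[:n_val]  # (train split, validation split)

# Usage (illustrative):
# train_split, val_split = split_train_val(train_examples, val_ratio=0.15)
```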

BinLiang-NLP commented 1 year ago

Thanks for your good question. As you mentioned, unfortunately, we cannot obtain the vanilla BERT results reported in TOAD when following the setting of selecting 15% of the training data as the validation set. Zero-shot stance detection is a particular task in which the test targets are unknown to the training set. We conducted some preliminary experiments and found that selecting 15% of the training data as the validation set is unreasonable. Because the targets of the validation set are also unknown to the test set, the impact of the validation performance on the test performance is random. That is, a model with superior validation performance may perform very poorly on the test set. Therefore, without any better option, we use the test set as the validation set for parameter tuning. Please feel free to contact me if you have any further questions.
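
To make concrete what "using the test set as the validation set for parameter tuning" means here: the checkpoint that is kept is simply the one that scores best on whatever split is passed in as the validation data. The following is a generic sketch of that selection loop (assuming a PyTorch-style model and caller-supplied training/evaluation callables), not the actual training code of this repository:

```python
# Generic sketch of checkpoint selection driven by a "validation" split.
# `train_step` and `eval_step` are caller-supplied placeholders, not
# functions from this repository; the point is only that whichever split
# is passed as `val_data` (a held-out 15% of the training data, or the
# test set itself) decides which model state is kept.
import copy

def train_with_selection(model, train_step, eval_step, val_data, num_epochs=10):
    best_score, best_state = float("-inf"), None
    for _ in range(num_epochs):
        train_step(model)                   # one epoch of training
        score = eval_step(model, val_data)  # e.g., macro-F1 on val_data
        if score > best_score:              # model selection happens here
            best_score = score
            best_state = copy.deepcopy(model.state_dict())
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_score
```

In our setup, `val_data` is the test split; under the "15%" setting it would instead be the held-out portion of the training data.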

meteorlin commented 1 year ago

Thanks for your quick answer! I can't quite agree with your point of view.

  1. In the papers I know of (including your own TPDG, PT-HCL, and JointCL), the unseenness of labeled data for the target topic is emphasized. In other words, whether a method captures expressions that may be shared across topics in the source domain (PT-HCL), introduces external knowledge (CKE-Net), or introduces additional unlabeled target-domain data (TOAD), it is an acceptable solution to ZSSD as long as the model never touches labeled data from the target domain. What is unacceptable is using labeled target-domain data, let alone directly using the test set to participate in model training (that is, as a validation set); that does not meet the definition of "zero-shot learning". Zero-shot learning exists precisely to solve the problem of having no labeled target-domain data, so relying on labeled target-domain data to solve it is circular.
  2. I have reservations about your statement that "Because the targets of the validation set are also unknown to the test set, the impact of the validation performance on the test performance is random." First, a "zero-shot learning" model should have strong domain-transfer capability, so it is reasonable to use validation performance to represent the model's performance on the test set. Second, you indeed do not mention the "15%" setting in the JointCL paper, but you did mention it in both of your previous works (TPDG and PT-HCL). So if, as you say, the "15%" setting is unreasonable for JointCL, why do you use it in PT-HCL and TPDG? According to your code, PT-HCL uses the "15%" setting while TPDG uses a setting similar to JointCL, which is really confusing. I hope you can answer the above questions, as this is critical for reproducing and citing your work.
  3. You report many baselines on SemEval16, such as BERT-GCN and TGA Net. Were these also trained using the test set as the validation set? If so, could you release the code of these models for our reference?

Looking forward to your reply!

BinLiang-NLP commented 1 year ago

Hi, thank you for raising such insightful questions. My previous reply may have confused you. As we know, the validation set is used to tune the model's parameters so as to obtain a high-performance trained model. In our code, we use the test data as the validation set only to stably obtain a superior trained model. Of course, modifying the corresponding code to split 15% of the training set off as the validation set can also achieve superior test performance, but doing so is not easy (for all baseline models).

I do agree that "zero-shot learning" is meant to solve the problem of having no labeled target-domain data. In my view, using the test data as the validation set does not violate the ZSSD setting, as it is not used for training. As I mentioned earlier, the impact of the validation performance on the test performance is random. This is because the purpose of ZSSD is to train a model that can effectively perform stance detection on unknown targets. However, due to the particularity of ZSSD, using known targets (15% of the training data selected as the validation set) to select the trained model for unknown targets may introduce a data gap, which causes the validation set to lose its intended function. For example, a model that looks superior on the validation set may perform poorly on the test set.

Therefore, in my opinion, for this task we should not simply split 15% of the training data off as the validation set, but rather design a validation set that reflects the test data; this is more reasonable and better suited to obtaining an appropriate ZSSD model. Of course, using the test set as the validation set is not a completely correct and appropriate method; it is just a helpless compromise to more easily show superior performance. As for the baseline models you mentioned, you can run their open-source code and you will discover the corresponding problem.

meteorlin commented 1 year ago

Perhaps my previous questions were not clear enough and confused you. First of all, I want to make clear that using the test set as the validation set causes data leakage. As you said, the validation set is used for tuning model parameters and for early stopping. Although the validation set does not directly update the model's parameters, it indirectly participates in selecting them, which is essentially a form of model tuning and therefore leads to data leakage. This holds whether or not you are doing ZSSD, because it is a basic principle of machine learning and deep learning.

My questions about the "15%" setting are:

  1. Your interpretation and implementation of the "15%" setting are inconsistent across your works and open-source repositories: your settings for SemEval2016 differ among JointCL, TPDG, and PT-HCL, with some using the "15%" setting and some using the "test set as validation set" setting. This is confusing, and I hope you can explain it.
  2. Do the baselines reported in the JointCL paper use the same dataset setting as JointCL? Since you cite the vanilla BERT performance from TOAD in the JointCL paper, that implies you accepted the "15%" setting and considered it reproducible (otherwise you would not have cited that baseline number in your paper). However, your answer today is that the "15%" setting is unreasonable and cannot reproduce the vanilla BERT performance reported in TOAD. So there is an ambiguity about all the baselines on SemEval2016 reported in the JointCL paper: under which dataset setting were they obtained? The "15%" setting you accepted when writing the JointCL paper, or the setting you prefer now that you consider the "15%" setting unreasonable? This needs to be unified, otherwise the paper loses its rigor; the issue cannot be avoided on the grounds that "it is not easy (for all baseline models)" or that it was a "helpless compromise".

I hope you can provide detailed answers to the above two questions; they are very important for the community to adopt and build on your methods.

BinLiang-NLP commented 1 year ago

I will reply to your questions point by point below:

  1. The "15%" setting can be used for all models, including JointCL, PT-HCL, and the other baseline models, to tune model parameters during training. In the code of this work, you can modify this line to use the "15%" setting to tune the model parameters; this is also feasible, and in some cases similar results can be obtained, but they may be quite random. TPDG, by contrast, is a cross-target stance detection model.
  2. Regarding the validation sets for the SEM16 and WTWT datasets: because these two datasets are not explicitly partitioned into validation sets and previous methods do not use a uniform setting, we use the test data as the validation set. We did so because we ran the open-source code of existing models and found that this setting stably reproduces the results reported in their papers, whereas the "15%" setting yields very unstable results. Citing the vanilla BERT performance from TOAD does not mean we agree with the "15%" setting: TOAD does not specify what its BERT setting is, and after reproducing the results we found that the "test data as validation" setting achieves results similar to those originally reported. As for the TOAD model itself, please ask the authors of the TOAD paper why the "15%" setting cannot reproduce the results of their paper.
  3. We do agree that the validation set needs to be unified. After extensive experiments, we prefer the "test data as validation" setting for the SEM16 and WTWT datasets in this task; we use the test data as the validation set only to stably obtain a superior trained model. Of course, a more appropriate approach would be to create fixed validation sets for these two datasets.
  4. The "15%" setting is a simple way to create a validation set and is very effective in many classification tasks; I guess that is why TOAD uses it. For this task, however, you can imagine that tuning parameters on some targets and then testing on other, unknown targets creates a gap between the two.
  5. Regarding the validation-set issue for this task, I think it is not simply a choice between the "15%" setting and the "test data as validation" setting; rather, we should build a reasonable validation set, one in which parameter tuning is actually meaningful for the test data (a rough sketch of one possible design is given after this list).
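
As a purely illustrative sketch of such a validation set, one option is to hold out entire targets from the training data, so that the validation targets, like the test targets, are never seen during training. This is only an assumption about one possible design, not code from this repository:

```python
# Hypothetical sketch of a target-disjoint validation split: entire targets
# are held out from training so the validation targets are unseen during
# training, mimicking the zero-shot test condition. The (text, target, label)
# layout and the held-out ratio are assumptions for illustration only.
import random

def split_by_target(examples, val_target_ratio=0.2, seed=42):
    """Hold out a fraction of *targets* (not examples) as the validation set."""
    targets = sorted({target for _, target, _ in examples})
    random.Random(seed).shuffle(targets)
    n_val = max(1, int(len(targets) * val_target_ratio))
    val_targets = set(targets[:n_val])
    train_split = [ex for ex in examples if ex[1] not in val_targets]
    val_split = [ex for ex in examples if ex[1] in val_targets]
    return train_split, val_split, val_targets
```

Whether such a split tracks test performance better than the "15%" setting or the "test data as validation" setting would of course need to be verified experimentally.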