RicherMans / text_based_depression

Source code for the paper "Text-based Depression Detection: What Triggers An Alert"

About the experimental results #8

Closed: chengturbo closed this issue 1 year ago

chengturbo commented 1 year ago

Hello, author! I read your paper and code and have the following questions that I hope you can answer.

  1. You mention in the abstract that the macro F1 score on the DAIC-WOZ development set is 0.84, but Table 8 refers to a "validation subset". Does that mean the split carved out of the training set? And Table 9 reports an F1 of 64% on the validation set. This confuses me.
  2. I ran your latest code, but the resulting F1 value does not reach 0.84. I hope you can reply!
RicherMans commented 1 year ago

Hey there,

You mention in the abstract that the macro F1 score on the DAIC-WOZ development set is 0.84, but Table 8 refers to a "validation subset". Does that mean the split carved out of the training set? And Table 9 reports an F1 of 64% on the validation set. This confuses me.

As far as I remember (the last time I used that dataset was already years ago), we did not use the entire training dataset for training, but further split it into a "cv" and a "train" split. We reported results for this "cv" split and for the official "dev" split. There are no labels for the official "eval" split, so most works use "dev" as the final evaluation. The values differ because one of them (Table 8) gives average results that you can expect, while the other (Table 9) gives the maxima obtained during training.

Why do we report maximal results? Because this dataset has been used for a competition. In a competition it does not matter whether your results are "reproducible"; it only matters that they are good. Thus we show both: an "average" that you can expect when using our method, and a "best" result that is comparable to the other participants of the challenge.

I ran your latest code, but the resulting F1 value does not reach 0.84.

Of course. I ran the experiment 50-100 times and only in one of those runs did I obtain that value; in general, your experiments should give an F1 of around 0.6.
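(For reference, the "F1" discussed here is the macro-averaged F1 over the depressed/non-depressed classes; a minimal sketch with made-up labels, assuming scikit-learn:)

```python
from sklearn.metrics import f1_score

# Made-up binary labels: 1 = depressed, 0 = not depressed.
y_true = [0, 1, 0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]

# Macro F1 averages the per-class F1 scores, so both classes count equally.
print(f1_score(y_true, y_pred, average="macro"))
```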

In fact, I personally came to hate this dataset (DAIC-WOZ) for this "randomness" caused by its small size: many works published at the time simply reported some maximal values that are generally unobtainable, and I needed to follow suit. When trying to publish this kind of work, we always got two types of responses from reviewers: 1. the dataset is too small, it is not representative/generalizable (well, it is hard to collect clinical depression data; if there were a larger dataset we would run our experiments on that), or 2. your results are weak compared to previous work (because we post averages instead of maximal values). Both types of reviewers are too frequent to count. However, if you want to continue working on depression detection with DAIC-WOZ, I'd recommend the work of my colleague Wenwu, e.g. 1, 2. Her work is generally well thought out and not so focused on "maximal performance", which makes it easier to compete against.

Kind regards, Heinrich

chengturbo commented 1 year ago

Firstly, thank you very much for your response! I have been troubled by these results recently. Secondly, I still don't quite understand the first question. Can I take it that F1 = 0.84 is the best result on your partitioned CV set, while 0.64 is the result on the official development set?

RicherMans commented 1 year ago

Secondly, I still don't quite understand the first question. Can I take it that F1 = 0.84 is the best result on your partitioned CV set, while 0.64 is the result on the official development set?

I think you mean the first answer? As Table 8 states: "The reported results represent the best achieved results during our experiments for a single fold". 0.84 is the best result on the dev set, not on our cv split.

The Table 9 results are averages of the respective scores across multiple runs.

Let me put it this way: we run the experiment on the training data by first splitting it into 5 equal parts (folds), where each fold contains 20% of the available data. Note that this splitting uses a fixed seed and can therefore be repeated with a different seed, which leads to different train/cv splits. Then we train a classifier using 80% of the data as "train" and 20% as "cv". This is repeated 5 times (5-fold), once for each individual split. After training you obtain metrics such as F1, accuracy and so on for each fold.

The results in Table 8 are the maximal performance achieved with a single fold (i.e., we were lucky). The results in Table 9 are the average across these 5 folds over multiple runs (changing the seed) altogether.
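If it helps, here is a minimal sketch of that procedure. It is not the repository's actual code: `train_and_evaluate` is a hypothetical stand-in, the fold scores are placeholders, and only the scikit-learn `KFold` usage reflects the splitting described above.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)  # only used to generate the placeholder scores below

def train_and_evaluate(train_idx, cv_idx, seed):
    """Hypothetical stand-in: train on `train_idx` with the given seed, select the
    model on `cv_idx`, and return the macro-F1 on the official dev split."""
    return float(rng.normal(0.60, 0.08))  # placeholder score, not a real result

data_indices = np.arange(107)            # one index per training interview (placeholder count)
per_run_means, best_single_fold = [], 0.0

for seed in range(10):                   # repeat the whole procedure with different seeds
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)  # fixed seed -> reproducible splits
    fold_scores = [train_and_evaluate(tr, cv, seed) for tr, cv in kf.split(data_indices)]
    per_run_means.append(np.mean(fold_scores))               # Table 9 style: average over folds/runs
    best_single_fold = max(best_single_fold, *fold_scores)   # Table 8 style: luckiest single fold

print(f"average macro-F1 to expect: {np.mean(per_run_means):.2f}")
print(f"best single-fold macro-F1:  {best_single_fold:.2f}")
```

Reporting the maximum of a single fold corresponds to the Table 8 numbers, while averaging over folds and seeds corresponds to Table 9.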

Still confused?

Kind regards, Heinrich Dinkel

chengturbo commented 1 year ago

Thank you very much for your reply. I now understand the experimental results. I wish you a good life!