Machine-Learning-for-Medical-Language / long-clinical-doc


Question about the F1 score #2

Closed: empyriumz closed this issue 1 week ago

empyriumz commented 1 month ago

Hi,

I came across your work while researching literature that uses MIMIC-IV's clinical note data. In Table 2 of your paper, I noticed that many models have ROC-AUC scores greater than 0.8, but their F1 scores are very low. I'd like to ask about how you determine the threshold for calculating the F1 score.

For problems with highly imbalanced label distributions, using the default threshold of 0.5 is likely to underestimate the model's performance. For instance, in Table 2, Hierarchical Transformers (xdistill - bigchunk) has a lower AUC but a higher F1 score than Hierarchical Transformers (PubMedBERT).

Could you please clarify your threshold selection method for the F1 score calculation? Thank you for your time and for sharing your paper and the code.

wonjininfo commented 1 month ago

Hi @empyriumz ,

Thank you very much for your interest in our work.

The threshold for converting probabilities to binary labels (i.e., positive or negative) is set at 0.5. For calculating precision, recall, and F1 score, we used the default settings of scikit-learn and passed the binary labels to these functions. (This paragraph was edited from the previous comment.)

We are highly interested in finding better metrics for our task, but we felt this setup is the least confusing for a broader audience, including the NLP community. In the final version of the manuscript, we plan to include alternative metrics such as AUPRC as a complement.
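For reference, a minimal sketch of how AUPRC can be computed with scikit-learn's average_precision_score; the arrays below are illustrative, not our actual data or code:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Illustrative arrays: gold binary labels and model-predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_prob = np.array([0.10, 0.40, 0.80, 0.55, 0.30, 0.05, 0.60, 0.90])

# AUPRC (average precision) is computed directly from the probabilities, so no
# threshold is needed; it is generally more informative than AUROC under heavy
# class imbalance.
print("AUPRC:", average_precision_score(y_true, y_prob))
print("AUROC:", roc_auc_score(y_true, y_prob))
```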

Once again, thank you very much for your comment! WonJin

empyriumz commented 1 month ago

Hi @wonjininfo,

Thanks for the clarification! I agree that AUPRC is a useful complementary metric. However, I don't think a 0.5 cutoff is the standard for the F1 score, especially under class imbalance. The threshold that maximizes the F1 score can differ substantially from 0.5.

I ran a simple experiment using a sample of about 6,000 patients from your cohort with a small pretrained model (GatorTron base, no fine-tuning), and found that setting the threshold to 0.26 resulted in an F1 score of 0.32, comparable to GPT-4's performance. So I believe that with similar threshold tuning you would be able to get a higher F1 score. For reference, the AUC in my case is only 0.84.
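The threshold sweep I have in mind is roughly along the lines of the following sketch, built on scikit-learn's precision_recall_curve (variable names and arrays are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative arrays: gold binary labels and model-predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0])
y_prob = np.array([0.05, 0.20, 0.35, 0.15, 0.28, 0.10, 0.22, 0.70, 0.26, 0.40])

# precision_recall_curve evaluates every candidate threshold implied by y_prob.
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# F1 at each threshold; the last precision/recall pair has no threshold, so drop it.
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)

best = np.argmax(f1)
print(f"best threshold = {thresholds[best]:.2f}, best F1 = {f1[best]:.2f}")
```

To avoid an optimistic estimate, the threshold should of course be picked on a validation split and then applied unchanged to the test set.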

wonjininfo commented 1 week ago

Hi @empyriumz ,

I apologize for the delayed response. I had drafted a reply and mistakenly thought I had posted it, but clearly, I missed it—my mistake.

Regarding your point on the 0.5 cutoff, I agree that it might not be the best strategy for supervised models dealing with class imbalance.

Before diving into the threshold discussion, I'd like to note that, unlike other popular metrics such as AUPRC or AUROC, the F1 score (our primary metric) only accepts binary inputs. In other words, thresholding remains part of the modeling.

At this point, I need to correct my initial response: the 0.5 value isn't applied in the F1 score calculation itself; it is used as the threshold for deciding whether a prediction is labeled positive or negative, which belongs to the modeling part. On the evaluation side, scikit-learn's F1 score function only accepts binary labels. Sorry for the confusion.

So the code flow is:

  1. Our baseline models (the supervised ones) output probabilities.
  2. Thresholding: select the label (i.e., positive vs. negative); the 0.5 cutoff is applied here.
  3. We pass the resulting binary labels into sklearn's F1 calculation (a minimal sketch of this flow follows the list).
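A minimal, self-contained sketch of this flow; the arrays are illustrative placeholders, not our actual pipeline:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# 1. Model outputs: predicted probabilities, alongside the gold binary labels.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_prob = np.array([0.10, 0.40, 0.80, 0.55, 0.30, 0.05, 0.60, 0.90])

# 2. Thresholding (modeling side): binarize the probabilities at 0.5.
y_pred = (y_prob >= 0.5).astype(int)

# 3. Evaluation side: scikit-learn metrics on the binary predictions (default settings).
print("F1:       ", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))

# AUROC, by contrast, is computed directly from the probabilities.
print("AUROC:    ", roc_auc_score(y_true, y_prob))
```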

Therefore, the threshold is not part of the F1 score calculation, nor does the paper set a specific threshold value for the dataset. We chose 0.5 for our supervised baseline models. Researchers using our dataset are free to choose any threshold they find optimal for their specific models.

Coming back to the choice of 0.5: for our baseline models, we used a 0.5 threshold for two main reasons. First, our goal was to establish baseline performance without additional post-processing or methods that might boost results; we aimed for straightforward comparability. Second, extracting an answer from decoder-only models such as LLMs (unlike encoder-only models such as BERT) is not straightforward and remains an active research area. This uncertainty makes it challenging to apply consistent thresholding across models, so we chose to leave it as is.

I hope this provides some clarity on our approach. Thank you for your valuable feedback!

Best, WonJin

empyriumz commented 1 week ago

Hi @wonjininfo,

Thank you for the clarification! I now understand that you used a 0.5 threshold to binarize predictions before calculating the F1 score. It would be helpful if this point were clarified in the final version of your manuscript :)

wonjininfo commented 1 week ago

Hi @empyriumz, thank you so much for your comment! That's a great suggestion, and I'll include it in the manuscript.