SuperSupermoon / MedViLL

MedViLL official code. (Published IEEE JBHI 2021)
MIT License

Open I dataset #3

Closed: Christoforos00 closed this issue 2 years ago

Christoforos00 commented 2 years ago

Hello,

Thank you for your paper and repo. I have a few questions regarding your use of the OpenI dataset.

  1. The original OpenI dataset has around 1500 manual labels, of which you kept only 15. Is the code for this transformation somewhere in this repo?
  2. In the original OpenI dataset, around 40% of the cases are annotated as normal, but in Figure 5 of your paper the "No Findings" tag appears only 4.43% of the time. What explains this gap?
  3. In your json files (for example in MedViLL/data/openi/Train.jsonl), sometimes the label is set to an empty string. Does this mean that the instance belongs to the "Others" category?
  4. Since MedViLL can be asked to do multi-label classification, it is possible for the "No finding" label to be predicted together with another label. Isn't it wrong to predict "No finding" and "Pneumonia" together, since these labels conflict?

Thank you.

SuperSupermoon commented 2 years ago

Hello @Christoforos00, thanks for your questions.

  1. We used the CheXpert labeler to extract labels from the OpenI reports, so you can reproduce the labels by following that repo (a rough invocation sketch is at the end of this comment).
  2. The original OpenI dataset contains lateral views as well as frontal views. However, as described in the Dataset section of Materials and Methods, we consider only unique frontal-view studies in our work. The same criterion applies to the OpenI dataset (see the filtering sketch below).
  3. Yes. As answered in question 1, we used the CheXpert labeler to extract labels for OpenI. However, since the two datasets use different labeling schemes, some cases cannot be labeled by the CheXpert labeler and remain blank. We treat these blank labels as the "Others" class (see the jsonl sketch below). A detailed discussion is given in the Dataset Analysis section of Results and Discussion.
  4. That's a good question. We perform multi-label classification and report the average AUROC and F1 scores for positive findings across all diagnostic categories. The results for each model show good performance (except the CNN&Transformer model), indicating that such conflicts occur rarely for trained models (see the evaluation sketch below).
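
For question 1, the labeling step looks roughly like the following. This is only a sketch: it assumes the stanfordmlgroup/chexpert-labeler repo is cloned locally and that the OpenI findings/impression text has already been collected; the file names and exact CLI flags are illustrative and may differ by labeler version.

```python
# Sketch: feed OpenI report text to the CheXpert labeler.
# Assumes the chexpert-labeler repo is cloned and this script runs from inside it.
# File names and CLI flags below are illustrative.
import csv
import subprocess

reports = [
    "The cardiomediastinal silhouette is within normal limits. No acute disease.",
    "Patchy opacity in the right lower lobe, concerning for pneumonia.",
]

# The labeler expects a CSV with one report per row (no header).
with open("openi_reports.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for report in reports:
        writer.writerow([report])

# Run the labeler; the output CSV contains one column per CheXpert category.
subprocess.run(
    [
        "python", "label.py",
        "--reports_path", "openi_reports.csv",
        "--output_path", "openi_labeled.csv",
    ],
    check=True,
)
```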
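For question 2, the frontal-only filtering can be illustrated like this. The metadata file and the column names (`study_id`, `projection`) are hypothetical placeholders, not actual files in this repo.

```python
# Illustrative only: keep one frontal image per study, assuming a metadata
# table with hypothetical columns `study_id` and `projection`.
import pandas as pd

meta = pd.read_csv("openi_image_metadata.csv")  # hypothetical file

frontal = meta[meta["projection"].str.contains("frontal", case=False, na=False)]
# Keep a single frontal image per study so each study appears once.
unique_frontal = frontal.drop_duplicates(subset="study_id", keep="first")
unique_frontal.to_csv("openi_frontal_unique.csv", index=False)
```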
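For question 3, a minimal sketch of how the blank labels in Train.jsonl can be treated as "Others". The field name `label` is an assumption; please check the actual keys in the jsonl files.

```python
# Sketch: read Train.jsonl and map empty labels to the "Others" class.
# The key name "label" is assumed, not verified against the repo.
import json

with open("MedViLL/data/openi/Train.jsonl") as f:
    for line in f:
        example = json.loads(line)
        label = example.get("label", "")
        if label.strip() == "":
            label = "Others"  # blank labeler output -> "Others"
        # ... use `label` downstream ...
```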
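For question 4, this is roughly how the macro AUROC/F1 evaluation and a "No Finding" conflict check can be done. The label set, threshold, and toy arrays are illustrative, not the exact settings from the paper.

```python
# Sketch: multi-label AUROC/F1 plus a check of how often "No Finding" is
# predicted together with a positive finding. Toy data, illustrative threshold.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

labels = ["No Finding", "Pneumonia", "Cardiomegaly"]  # illustrative subset
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1]])
y_prob = np.array([[0.9, 0.1, 0.2], [0.2, 0.8, 0.1],
                   [0.1, 0.3, 0.7], [0.4, 0.6, 0.5]])
y_pred = (y_prob >= 0.5).astype(int)

print("macro AUROC:", roc_auc_score(y_true, y_prob, average="macro"))
print("macro F1   :", f1_score(y_true, y_pred, average="macro"))

# Fraction of predictions where "No Finding" co-occurs with a positive finding.
no_finding = y_pred[:, labels.index("No Finding")] == 1
others = [i for i, name in enumerate(labels) if name != "No Finding"]
other_positive = y_pred[:, others].sum(axis=1) > 0
print("conflict rate:", np.mean(no_finding & other_positive))
```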