DIDSR / iMRMC

iMRMC: Software to do multi-reader multi-case analysis of reader studies
http://didsr.github.io/iMRMC/

Not fully crossed study - BDG warning #185

Open JessieGommers opened 2 weeks ago

JessieGommers commented 2 weeks ago

We conducted a reader study with two different reading conditions using two datasets, each containing 30 exams with a 1:1 ratio of malignant to normal cases. Each of the 37 readers participated in a single reading session, reviewing both datasets: one under condition 1 and the other under condition 2. Due to logistical constraints, our study design is not fully crossed. We know that we pay a statistical price for this, but we hope that using 37 readers mitigates it.
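
For concreteness, here is a sketch of one such layout in long format (hypothetical reader, case, and condition labels; scores and truth omitted). Every reader sees both case sets, but each case set is read under only one condition, which is what makes the design a split-plot rather than fully crossed:

```r
readers <- sprintf("reader%02d", 1:37)
setA <- sprintf("caseA%02d", 1:30)  # 15 malignant + 15 normal
setB <- sprintf("caseB%02d", 1:30)  # 15 malignant + 15 normal

# Every reader reads set A under condition 1 and set B under condition 2;
# the (setA, condition2) and (setB, condition1) cells are empty.
design <- rbind(
  expand.grid(readerID = readers, caseID = setA, modalityID = "condition1"),
  expand.grid(readerID = readers, caseID = setB, modalityID = "condition2")
)
head(design)
```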


We ran iMRMC analyses of AUC, sensitivity, and specificity with the Java iMRMC software, but the BDG method produced warnings stating that the DF_BDG is below a minimum and has been set to 29.0.

e.g. for AUC: [screenshot of the warning]

e.g. for specificity: [screenshot of the warning]

This warning does not appear when we use the MLE analysis. We also observed that the BDG and MLE p-values differ, in particular for specificity: the two conditions turned out to be significantly different under BDG (p=0.0003, with the warning) but not under MLE (p=0.204).

We are uncertain which method is more appropriate for our study. I understand that MLE can avoid a negative total variance estimate; however, the total variance estimate from the BDG method does not appear to be negative. I would greatly appreciate your guidance on the best approach for our context.

brandon-gallas commented 1 week ago

I think I understand your application and question.

It looks like you are using the Java GUI. That is a static piece of software that is no longer maintained. I recommend that you use the R package going forward. You can find information here: iMRMC: Software to do Multi-reader Multi-case Statistical Analysis of Reader Studies | Center for Devices and Radiological Health (fda.gov)
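
For reference, a minimal sketch of an analysis with the R package, following the pattern in the package examples (function names from the package; the exact input encoding is documented in ?doIMRMC):

```r
# install.packages("iMRMC")
library(iMRMC)

# Simulate an example MRMC data set in the long format doIMRMC() expects:
# one row per reader-case-modality score, plus truth rows for each case.
config <- sim.gRoeMetz.config()
df <- sim.gRoeMetz(config)

result <- doIMRMC(df)
str(result, max.level = 1)  # U-statistic (BDG), MLE, and per-reader results
```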

The warning about the degrees of freedom (DF) is not a problem. When the DF estimates fall below the lower bound, they are set to the lower bound. The DF estimates carry uncertainty of their own, especially when data are limited. In your case, the number of exams is 15+15+15+15 (2 case sets x 2 truths), which is small for ROC analysis. The lower bound of 29 (= 30 - 1) comes from the number of signal-present cases for sensitivity, the number of signal-absent cases for specificity, or the minimum of the two for ROC. It's a bit of a waste of effort to have 37 readers evaluate the same small number of cases. Please see this paper on split-plot studies:

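To spell out the arithmetic behind that lower bound (a trivial check, using the case counts from this study):

```r
n_signal_present <- 15 + 15  # malignant cases across the two case sets
n_signal_absent  <- 15 + 15  # normal cases across the two case sets

n_signal_present - 1                        # DF lower bound for sensitivity: 29
n_signal_absent - 1                         # DF lower bound for specificity: 29
min(n_signal_present, n_signal_absent) - 1  # DF lower bound for ROC/AUC: 29
```
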
There is certainly something funny about the specificity results. The DF_BDG for specificity is calculated as 0.93!!! That is not good … red flag. I wouldn't use any p-values from the software. Notice the DF_BDG is ~24 for AUC. That is healthy. My guess is that many readers are making the exact same interpretations on the signal-absent cases … little to no reader variability. I’m curious to know if this is true.
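
One way to check that guess from the raw data, sketched in base R (assuming a long-format data frame `df` with readerID, caseID, modalityID, and score columns, and truth rows marked with readerID == "truth"; these names are assumptions, match them to your own file):

```r
# Signal-absent (normal) cases according to the truth rows
normal_cases <- unique(df$caseID[df$readerID == "truth" & df$score == 0])
scores0 <- df[df$caseID %in% normal_cases & df$readerID != "truth", ]

# Spread of reader scores per case and condition; sd == 0 means every
# reader gave the identical score on that case.
spread <- tapply(scores0$score, list(scores0$caseID, scores0$modalityID), sd)
summary(as.vector(spread))
sum(spread == 0, na.rm = TRUE)  # cases with no reader variability at all
```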

Your issue makes me think the software should return a different error when DF_BDG is below 3, or even 5.

Without more of the output or the input data, it is hard to say more. p-values are only one kind of output; they can be misinterpreted or be completely inappropriate. Point estimates and confidence intervals tell a much more complete story. I don't have a solution for you except to refer to the per-reader results ... BUT ... it isn't entirely clear whether the cases in the two datasets/modalities are independent or whether the same cases were read under two different reading conditions. If they differ only by the reading condition, they should carry the same case ID.
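
Two checks along those lines, sketched in base R (the modality labels and the data frame `df` are hypothetical, as above; the R package also reports per-reader results in its output):

```r
# 1) Do the two modalities share case IDs (same cases, two reading
#    conditions) or are the case sets disjoint?
ids1 <- unique(df$caseID[df$modalityID == "condition1"])
ids2 <- unique(df$caseID[df$modalityID == "condition2"])
length(intersect(ids1, ids2))  # 0 => disjoint case sets

# 2) Per-reader empirical AUC for one condition, via the Wilcoxon statistic
truth <- df[df$readerID == "truth", c("caseID", "score")]
obs <- merge(df[df$readerID != "truth" & df$modalityID == "condition1", ],
             truth, by = "caseID", suffixes = c("", ".truth"))
aucPerReader <- sapply(split(obs, obs$readerID), function(r) {
  x <- r$score[r$score.truth == 1]  # signal-present scores
  y <- r$score[r$score.truth == 0]  # signal-absent scores
  mean(outer(x, y, ">") + 0.5 * outer(x, y, "=="))  # empirical AUC
})
summary(aucPerReader)
```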

Finally, I would avoid the MLE results. They have not been validated for study designs that are not fully crossed, like yours, and I have observed weird results in such cases. Your question is nudging me to remove the MLE results from the current software entirely.