DIDSR / iMRMC

iMRMC: Software to do multi-reader multi-case analysis of reader studies
http://didsr.github.io/iMRMC/

Sizing reader study (radiologists versus standalone CAD) #168

Open jsprtwlt opened 2 years ago

jsprtwlt commented 2 years ago

Hi all!

For a project, we are setting up a reader study combined with an AI challenge and want to use the iMRMC software for analysis and sizing of our reader study.

Within our project we want to compare the performance of 20+ radiologists for prostate cancer detection in MRI against the best AI algorithms (from the challenge).

We want to perform a power analysis based on simulated pilot data and came across several questions while using the iMRMC software. With this post I want to present these issues and kindly ask whether you are willing, and have the time, to help us out.

Currently we are using pilot / simulated data to mimic a split-plot study design, in which 20+ readers are divided among 2 or 3 groups and read 100 cases each (200-300 cases in total, depending on the number of readers, with 2/3 negative and 1/3 positive for cancer). We want to compare their performance with standalone CAD (formatted as an independent reader, reading under its own modality). We are able to run the variance analysis in the software and now want to use its variance estimates to run a power analysis.
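As an illustration of the split-plot layout described above, here is a minimal sketch that divides readers into groups and gives each group its own block of cases. The function name, ID strings, and the near-equal reader split are assumptions made for the example; this is not code from iMRMC.

```python
import random

def split_plot_assignment(n_readers, n_groups, cases_per_group,
                          pos_fraction=1 / 3, seed=0):
    """Assign readers to groups; each group reads its own block of cases.

    All names and ID formats here are hypothetical, for illustration only.
    """
    rng = random.Random(seed)
    readers = [f"reader{i + 1}" for i in range(n_readers)]
    rng.shuffle(readers)
    design = []
    for g in range(n_groups):
        # Every n_groups-th reader goes to group g (near-equal group sizes).
        group_readers = readers[g::n_groups]
        n_pos = round(cases_per_group * pos_fraction)
        # Each case is (case_id, truth), with truth 1 = cancer, 0 = no cancer.
        cases = [(f"g{g + 1}_case{j + 1}", 1 if j < n_pos else 0)
                 for j in range(cases_per_group)]
        design.append({"group": g + 1,
                       "readers": group_readers,
                       "cases": cases})
    return design

design = split_plot_assignment(n_readers=20, n_groups=2, cases_per_group=100)
for block in design:
    n_pos = sum(truth for _, truth in block["cases"])
    print(block["group"], len(block["readers"]), len(block["cases"]), n_pos)
```

With 20 readers and 2 groups, each reader reads 100 cases while the study as a whole covers 200 cases. The standalone AI would be added separately as a reader that scores every case in every block.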

This is where we encounter some problems and uncertainties about whether the power analysis functionality is compatible with our study design.

• Based on a prior post (https://github.com/DIDSR/iMRMC/issues/166), we ran the software without changing the study design (if I understand correctly, the variance analysis accounts for the standalone reads of AI). Running this analysis gives us very low power (= 0.05). (Screenshot: example_power_1)

• Running the power estimation with e.g. 2 splits (as simulated in our pilot data) and “Paired Readers” set to “No” (radiologists and AI are independent readers), we see a more familiar power (= 0.68). (Screenshot: example_power_2)

However, we are still uncertain if this analysis is performed correctly.

Another question concerns the distribution of readers in our study design versus the distribution of readers used in the power analysis. In our design, readers are not equally distributed among groups and modalities: 20+ readers in a split-plot design are compared against a single reader (or multiple AI reruns/reads to account for variance in training) that reads all cases.

Based on our initial tests, we do have some uncertainties about whether our ideas are implementable within the power analysis functionality of iMRMC.

In the attachment, I have added the .csv file used for our simulated data. The data consists of +/- 30 simulated radiologists (based on radiologist consensus), reading in a 2 split-plot, and +/- 14 AI reads (to account for variance obtained during training) on the same dataset. Radiologist performance is binarized, so we are aware that the AUC analysis is not accurate or reliable; for now it is only used to test our study design ideas.
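To illustrate why binarized scores limit an AUC analysis: with only two score levels, the empirical (Mann-Whitney) AUC with ties counted as one half reduces to the average of sensitivity and specificity at that single operating point. The sketch below uses made-up scores to show this; it is not taken from the attached data.

```python
def empirical_auc(pos_scores, neg_scores):
    """Mann-Whitney AUC estimate; tied pairs count as 0.5."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))

# Made-up binary calls (1 = reader calls cancer), for illustration only.
pos = [1, 1, 1, 0]      # scores on cancer cases: sensitivity 3/4
neg = [0, 0, 0, 0, 1]   # scores on non-cancer cases: specificity 4/5

auc = empirical_auc(pos, neg)
sensitivity = sum(pos) / len(pos)
specificity = 1 - sum(neg) / len(neg)
print(auc, (sensitivity + specificity) / 2)
```

Both printed values agree, so a binary reader contributes one operating point rather than a full ROC curve.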

Many thanks for developing the software; we hope to hear from you soon. Thanks in advance.

Attachment: pilot_data_AI_r31_split_plot2.csv

brandon-gallas commented 2 years ago

Hello @jsprtwlt,

Thanks for posting your issue on Github, and thanks for sharing your simulated data. That really helps.

I don’t have a perfect answer or fix for your issue, but here’s what I know.

I’ve replicated your results. Your study design is interesting, and not really something that I have studied before. The iMRMC “Show Study Design” reports really helped me understand your study design: the readers in one modality follow a split-plot study design. The readers in the second modality are not paired with the first, and the corresponding study design is fully crossed.

Thanks, brandon

brandon-gallas commented 2 years ago

Uploading questions by email ...

Hi Brandon,

Thanks for your reply and looking into this. If I understand correctly:

Also some pending questions:

In our study design we have, for example, 20 radiologists and 1 AI read, totaling 21 readers. It probably does not translate well to fill in 21 readers in the sizing panel. Can we overcome this by e.g. filling in 40 readers (20 in mod 1, 20 in mod 2)? The 20 readers in mod 2 would basically be copied/identical reads obtained from the single AI reader.

We would appreciate your opinion on this, but can also imagine that it goes beyond topics concerning the software itself.

In addition, to clarify our simulation results: the readers are not representative of current practice and not accurate (they were obtained from a simple shuffle of labels from a consensus read to simulate various readers). For this reason we are not sure whether it is reliable and useful to perform the additional test to validate the iMRMC results. Also, since we are working with a limited amount of time (due to setting up the reader study and corresponding AI challenge), we would prefer to refrain from doing this.

Thanks again for your time. Our study does feel like a unique research domain/design for which only a few people have the right statistical knowledge. It is tough to get in contact with the right people, so we really appreciate your help.

brandon-gallas commented 2 years ago

For your data input, I believe that you can explore other study design options (# of split plots, number of readers and cases) as long as you mark “No” for “Paired Readers”. I still have to look into the sizing module when the input study design does not pair readers across modalities but the sizing module does. There seems to be a problem in that case.

It would be good if you had a better understanding of what the iMRMC software does. Please refer to this supplementary materials document (LINK) that can be found from this iMRMC GitHub repo wiki page https://github.com/DIDSR/iMRMC/wiki/iMRMC-Datasets and is stored at this GitHub repo https://didsr.github.io/viperData/ .

Quick overview of sizing method: The analysis section of the app estimates the fundamental variance components and the accompanying weights / coefficients for the input data. The sizing section changes the weights / coefficients based on the different study design choices. So, I would start by using as much data as you can to “simulate” / estimate all the variance components as well as you can. Then I would explore different study design parameters (weights / coefficients) to see how they affect the resulting total variance (or standard error). Following the example in the supplementary materials, I would do this exploration outside of the app so I have maximum flexibility and clarity about what is happening. (BTW, when I size a study, I usually focus on uncertainty rather than the power from a hypothesis test. It's a little easier to track.)
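The exploration outside the app can be sketched roughly as follows: treat the total variance as a weighted sum of estimated variance components, swap in the coefficients for each candidate design, and track the standard error (and, if wanted, an approximate power). The component names, all numeric values, and the normal-approximation power formula below are assumptions for illustration; they are not the components or the test that iMRMC reports.

```python
import math
from statistics import NormalDist

def total_variance(components, coefficients):
    """Weighted sum: sum over k of coefficient_k * component_k."""
    return sum(coefficients[name] * components[name] for name in components)

def power_two_sided(effect, se, alpha=0.05):
    """Approximate power of a two-sided z-test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Placeholder variance components (hypothetical names and values).
components = {"V1": 0.010, "V2": 0.020, "V3": 0.200}

# Placeholder coefficients for two candidate study designs.
designs = {
    "design_A": {"V1": 0.05, "V2": 0.010, "V3": 0.0005},
    "design_B": {"V1": 0.05, "V2": 0.005, "V3": 0.0010},
}

effect = 0.05  # assumed difference in the endpoint between modalities
for name, coefficients in designs.items():
    se = math.sqrt(total_variance(components, coefficients))
    print(name, round(se, 4), round(power_two_sided(effect, se), 3))
```

Note that in this approximation the power equals the significance level `alpha` whenever the assumed effect size is zero, so the effect size entered for sizing matters as much as the standard error.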

The sizing section does distribute the readers and cases equally across modalities and split groups. Your example is correct only if “Paired Readers” is marked “No”. If “Paired Readers” is marked “Yes”:

Your strategy to "fake simulate" 20 algorithms in modality 2 is what I would do: "The 20 readers in mod 2 basically are copied/identical reads obtained from the singular AI reader." You could also simulate 20 algorithms that differ according to the training data or methods. This would give interesting information about the uncertainty of your algorithm arising from training.
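A minimal sketch of the "copied reads" strategy, assuming the scores are held as long-format rows with one row per reader, case, and modality. The column names `readerID`, `caseID`, `modalityID`, and `score`, and the helper `replicate_reader`, are assumptions for this example; check them against your actual input file before use.

```python
def replicate_reader(rows, reader_id, n_copies):
    """Replace one reader's rows with n_copies identical pseudo-readers."""
    out = []
    for row in rows:
        if row["readerID"] == reader_id:
            for k in range(n_copies):
                # Same case, modality, and score; only the reader ID changes.
                out.append(dict(row, readerID=f"{reader_id}_copy{k + 1}"))
        else:
            out.append(row)
    return out

# Tiny made-up example: one radiologist and one AI reader on two cases.
rows = [
    {"readerID": "rad1", "caseID": "c1", "modalityID": "radiologist", "score": 1},
    {"readerID": "rad1", "caseID": "c2", "modalityID": "radiologist", "score": 0},
    {"readerID": "AI", "caseID": "c1", "modalityID": "AI", "score": 0.9},
    {"readerID": "AI", "caseID": "c2", "modalityID": "AI", "score": 0.2},
]

expanded = replicate_reader(rows, reader_id="AI", n_copies=20)
print(len(expanded))  # 2 radiologist rows + 2 cases * 20 AI copies
```

Because the copies are identical, they carry no reader-to-reader variability in modality 2. Replacing the copies with reads from separately trained algorithms would be the variant that captures training variability.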

The benefits of a split-plot study design depend on the variance components, but I have been impressed with the practical (experiment size) vs. statistical tradeoffs (endpoint uncertainty). Please check out the papers below. They should help you answer your question about the tradeoffs.
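As a rough illustration of that tradeoff, consider a toy additive random-effects model for a mean score (reader effect, case effect, reader-by-case interaction). This is a deliberate simplification and not the U-statistic variance that iMRMC estimates. With N_r readers and N_c cases split into G equal groups, the variance of the overall mean in this toy model is var_r/N_r + var_c/N_c + G*var_rc/(N_r*N_c). All values below are made up.

```python
def toy_variance(var_r, var_c, var_rc, n_readers, n_cases, n_groups=1):
    """Variance of the overall mean score under a toy additive model.

    n_groups=1 is fully crossed; n_groups>1 is a split-plot design where
    each reader reads n_cases / n_groups cases.
    """
    return (var_r / n_readers
            + var_c / n_cases
            + n_groups * var_rc / (n_readers * n_cases))

# Hypothetical variance components, chosen only for illustration.
var_r, var_c, var_rc = 0.002, 0.05, 0.10
n_readers, reads_per_reader = 20, 100

for n_groups in (1, 2, 3):
    # Fixed workload per reader; more groups means more cases in total.
    n_cases = reads_per_reader * n_groups
    v = toy_variance(var_r, var_c, var_rc, n_readers, n_cases, n_groups)
    print(n_groups, n_cases, round(v, 6))
```

In this toy model, holding each reader's workload fixed while adding groups shrinks the case term and leaves the other two terms unchanged, which is one way to see how a split-plot design can reduce uncertainty without asking more of each reader.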

I understand your reluctance to overuse a quasi-simulation. No problem. It was just a practical suggestion.

I would really appreciate knowing what you ultimately do and what your study looks like. Did the sizing analysis yield a study with the planned precision?

Good luck. Happy to help. I will likely move this issue to a discussion after you close it.

Brandon

jsprtwlt commented 2 years ago

Hi Brandon,

Many thanks for your help and explanations! It is now clear how we can continue with the sizing analysis. We will look into the supplementary materials and keep you posted when there are any updates.

Best,

Jasper