Sizing reader study (radiologists versus standalone CAD)

Hi all!

For a project, we are setting up a reader study combined with an AI challenge and want to use the iMRMC software for analysis and sizing of our reader study.

Within our project we want to compare the performance of 20+ radiologists for prostate cancer detection in MRI against the best AI algorithms (from the challenge).

We want to perform a power analysis based on simulated pilot data and came across several questions while using the iMRMC software. In this post I present these issues and kindly ask whether you are willing, and have the time, to help us out.

Currently we are using pilot / simulated data to mimic a split-plot study design, in which 20+ readers are divided among 2 or 3 groups and each reader reads 100 cases (200-300 cases in total depending on the number of readers, with 2/3 negative and 1/3 positive for cancer). We want to compare their performance with standalone CAD (formatted as an independent reader, reading under its own modality). We were able to run the variance analysis in the software and now want to use its variance estimates to run a power analysis.

This is where we encounter some problems and uncertainties about whether the sizing function is compatible with our study design.

• Based on a prior post (https://github.com/DIDSR/iMRMC/issues/166), we ran the software without changing the study design (if I understand correctly, the variance analysis accounts for the stand-alone reads of AI). Running this analysis gives us very low power (=0.05); see example_power_1.

• Running the power estimation with e.g. 2 splits (as simulated in our pilot data) and “Paired Readers” set to “No” (radiologists and AI are independent readers), we see a more familiar power (=0.68); see example_power_2.

However, we are still uncertain if this analysis is performed correctly.

Another question concerns the distribution of readers in our study design versus the distribution of readers assumed in the power analysis. In our design, readers are not equally distributed among groups and modalities: 20+ readers in a split-plot design are compared against a single reader (or multiple AI reruns/reads to account for variance in training) that reads all cases.

Based on our initial tests, we do have some uncertainties about whether our ideas are implementable within the power analysis functionality of iMRMC.

I have attached the .csv file with our simulated data. The data consist of +/- 30 simulated radiologists (based on radiologist consensus), reading in a 2-group split-plot design, and +/- 14 AI reads (to account for variance obtained during training) on the same dataset. Radiologist performance is binarized, so we are aware that the AUC analysis is not accurate or reliable; for now it is only used to test our study design ideas.

Many thanks for developing the software. We hope to hear from you soon. Thanks in advance.

pilot_data_AI_r31_split_plot2.csv

Hello @jsprtwlt,

Thanks for posting your issue on Github, and thanks for sharing your simulated data. That really helps.

I don’t have a perfect answer or fix for your issue, but here’s what I know.

I’ve replicated your results. Your study design is interesting, and not something that I have studied before. The iMRMC “Show Study Design” reports really helped me understand your study design: the readers in the first modality follow a split-plot study design. The readers in the second modality are not paired with the first, and their study design is fully crossed.

I see that the default result in the sizing panel (S.E. = 1.774E-1) is not consistent with the result in the analysis panel (S.E. = 1.997E-2).
This seems closely related to the “Paired Readers” setting. When I set it to “No”, the result in the sizing panel (S.E. = 2.041E-2) is consistent with the result in the analysis panel (S.E. = 1.997E-2).
I believe the sizing panel will provide useful results for your example study design when “Paired Readers” is set to “No”, but this hasn’t gone through any validation.
In this case, the sizing panel probably needs to be limited to a study design similar to the one that produced the variance components, though I’d have to look more carefully.
Since you have a simulation, you can use it to size your study and validate iMRMC. Simulate the null hypothesis many times to create the distribution of performance differences (the average will be close to zero). Use this empirical distribution to determine the empirical confidence interval. Then simulate the alternative hypothesis many times to create the distribution of performance differences (the average will be close to the effect size in your model). The power of your test is the fraction of performance differences from the alternative distribution that lie outside the empirical confidence interval from the null hypothesis. You will likely have to redo this a few times, changing the study design parameters (size of the experiment and effect size).
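The simulation procedure described above can be sketched as follows. This is only a minimal illustration under a toy model that is not taken from the thread or from iMRMC: the endpoint is percent correct rather than AUC, each radiologist gets an independent random accuracy, case correlation between readers is ignored, and every name and default value (`base_acc`, `reader_sd`, `empirical_power`, and so on) is a hypothetical placeholder to be replaced by your own simulation.

```python
import numpy as np


def simulate_differences(rng, n_sims, n_readers, n_cases,
                         base_acc, effect, reader_sd):
    """Return n_sims simulated values of (AI accuracy - mean reader accuracy).

    Toy model (assumption): binomial percent-correct scores, with a
    normally distributed reader-specific accuracy around base_acc.
    """
    reader_acc = np.clip(
        rng.normal(base_acc, reader_sd, size=(n_sims, n_readers)), 0.01, 0.99
    )
    reader_scores = rng.binomial(n_cases, reader_acc) / n_cases
    ai_acc = min(base_acc + effect, 0.99)
    ai_scores = rng.binomial(n_cases, ai_acc, size=n_sims) / n_cases
    return ai_scores - reader_scores.mean(axis=1)


def empirical_power(n_readers, n_cases, effect, n_sims=2000, alpha=0.05,
                    base_acc=0.80, reader_sd=0.05, seed=0):
    """Estimate power as the fraction of alternative-hypothesis differences
    that fall outside the empirical (1 - alpha) interval of the null."""
    rng = np.random.default_rng(seed)
    # Step 1: null hypothesis, no true difference between AI and readers.
    null = simulate_differences(rng, n_sims, n_readers, n_cases,
                                base_acc, 0.0, reader_sd)
    # Step 2: empirical confidence interval from the null distribution.
    lo, hi = np.quantile(null, [alpha / 2, 1 - alpha / 2])
    # Step 3: alternative hypothesis, true difference equal to `effect`.
    alt = simulate_differences(rng, n_sims, n_readers, n_cases,
                               base_acc, effect, reader_sd)
    # Step 4: power is the fraction of alternative results outside the interval.
    return float(np.mean((alt < lo) | (alt > hi)))


if __name__ == "__main__":
    # Explore a few hypothetical study sizes and effect sizes.
    for n_readers in (10, 20, 30):
        for effect in (0.05, 0.10):
            p = empirical_power(n_readers, n_cases=100, effect=effect)
            print(f"readers={n_readers:2d} effect={effect:.2f} power={p:.2f}")
```

Rerunning the loop with different reader counts, case counts, and effect sizes corresponds to the "redo this a few times" step; the resulting power values could then be compared with what the sizing panel reports for the same design.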
I’d be happy to check your work at the end. We could write a short paper on this together.

Thanks, brandon


Hi Brandon,

Thanks for your reply and looking into this. If I understand correctly:

With “Paired Readers” set to “No”, the sizing section probably provides fairly reliable results; however, to be sure, this needs further validation.
We can only use the sizing tab to determine power for a study type similar to the one in the pilot data (since, e.g., the reader variances are obtained from it). This means that if we want to experiment with various split plots / enrichments and numbers of readers, it is best to change our input data -> calculate variances -> use them in the sizing section?

Also some pending questions:

Does the sizing section distribute the number of readers equally among modalities and split groups? For example, 40 readers in the sizing panel means:
- 20 readers Modality 1 (10 in split group 1A, 10 in split group 1B) and 20 readers Modality 2 (10 in split group 2A, 10 in split group 2B)

In our study design we have, for example, 20 radiologists and 1 AI read, totaling 21 readers. Filling in 21 readers in the sizing panel probably does not translate well. Can we overcome this by e.g. filling in 40 readers (20 in mod 1, 20 in mod 2)? The 20 readers in mod 2 would basically be copied/identical reads obtained from the single AI reader.

Considering our study design (20+ radiologists reading 100 cases in a 2-group split-plot design compared to standalone CAD reading all 200 cases), does using the split-plot design greatly influence the power of our study? Or might it be better to have all our readers stick to reading 100 cases?

We would appreciate your opinion on this, but can also imagine this goes beyond the topics concerning the software itself.

In addition, to clarify our simulation results: the readers are not representative of current practice and not accurate (they were obtained from a simple shuffle of labels from a consensus read to simulate various readers). For this reason we are not sure whether it is reliable and useful to perform the additional test to validate the iMRMC results. Also, since we are working with a limited amount of time (due to setting up the reader study and corresponding AI challenge), we will kindly refrain from doing this.

Thanks again for your time. Our study does feel like a unique research domain/design in which only a few people have the right statistical knowledge. It is tough to get in contact with the right people, so we really appreciate your help.

For your data input, I believe that you can explore other study design options (# of split plots, number of readers and cases) as long as you mark “No” for “Paired Readers”. I still have to look into the sizing module when the input study design does not pair readers across modalities but the sizing module does. There seems to be a problem in that case.

It would be good if you had a better understanding of what the iMRMC software does. Please refer to this supplementary materials document (LINK) that can be found from this iMRMC GitHub repo wiki page https://github.com/DIDSR/iMRMC/wiki/iMRMC-Datasets and is stored at this GitHub repo https://didsr.github.io/viperData/ .

Quick overview of the sizing method: The analysis section of the app estimates the fundamental variance components and the accompanying weights / coefficients for the input data. The sizing section changes the weights / coefficients based on the different study design choices. So, I would start by using as much data as you can “simulate” to estimate all the variance components as well as you can. Then I would explore different study design parameters (weights / coefficients) to see how they affect the resulting total variance (or standard error). Following the example in the supplementary materials, I would do this exploration outside of the app so I have maximum flexibility and clarity about what is happening. (BTW, when I size a study, I usually focus on uncertainty rather than the power from a hypothesis test. It's a little easier to track.)
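To make the link between variance components, standard error, and power concrete, here is a minimal sketch of doing that exploration outside the app. It assumes the total variance is a weighted sum of variance components and that power can be approximated with a two-sided normal (z) test; iMRMC's own decomposition and test statistic may differ. The function names and the effect size of 0.05 are assumptions for illustration, while the two standard errors are the sizing-panel values quoted earlier in this thread.

```python
from math import sqrt
from statistics import NormalDist


def total_se(components, coefficients):
    """Standard error from variance components and their design-dependent
    coefficients (assumed form: variance = sum of coefficient * component)."""
    return sqrt(sum(c * v for c, v in zip(coefficients, components)))


def power_two_sided(effect, se, alpha=0.05):
    """Normal-approximation power of a two-sided test for a given true
    effect size and standard error of the estimated difference."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / se
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


if __name__ == "__main__":
    effect = 0.05  # assumed effect size, not stated in the thread
    for label, se in [("Paired Readers = Yes", 1.774e-1),
                      ("Paired Readers = No", 2.041e-2)]:
        print(f"{label}: SE={se:.4f} power={power_two_sided(effect, se):.2f}")
```

Under this approximation, a standard error that is large relative to the effect size drives the power down toward the significance level, which is one way to read the very low power seen with the default sizing settings.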

The sizing section does distribute the readers and cases equally across modalities and split groups. Your example is correct only if “Paired Readers” is marked “No”. If “Paired Readers” is marked “Yes”:

40 readers in the sizing panel puts 40 readers in Modality 1 (20 in split group 1A, 20 in split group 1B) and the same 40 readers in Modality 2 (20 in split group 2A, 20 in split group 2B).
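The allocation rule described above can be written down as a small helper. This is only an illustrative sketch of the behaviour as described in this thread, not code from iMRMC; the function name, its arguments, and the assumption that reader counts divide evenly are all hypothetical.

```python
def allocate_readers(n_readers, n_split_groups=2, paired=False, n_modalities=2):
    """Describe how a total reader count entered in the sizing panel is
    spread over modalities and split groups (equal distribution assumed)."""
    if paired:
        # The same readers read in every modality.
        per_modality = n_readers
    else:
        # Distinct readers per modality, split equally.
        per_modality = n_readers // n_modalities
    return {
        "readers_per_modality": per_modality,
        "readers_per_split_group": per_modality // n_split_groups,
    }


if __name__ == "__main__":
    print("Paired = No: ", allocate_readers(40, paired=False))
    print("Paired = Yes:", allocate_readers(40, paired=True))
```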

Your strategy to "fake simulate" 20 algorithms in modality 2 is what I would do: "The 20 readers in mod 2 basically are copied/identical reads obtained from the singular AI reader." You could also simulate 20 algorithms that differ according to the training data or methods. This would give interesting information about the uncertainty of your algorithm arising from training.

The benefits of a split-plot study design depend on the variance components, but I have been impressed with the tradeoff between practical considerations (experiment size) and statistical considerations (endpoint uncertainty). Please check out the papers below. They should help you answer your question about the tradeoffs.

Page 6 of the supplementary materials mentioned above states, "The split-plot design was going to save about 75% of the reading time (75% of our costs) and with a moderate impact on precision."
The main manuscript also discusses the benefits: LINK.
This paper approaches the problem analytically. Chen2018_J-Med-Img_v5p031410

I understand your reluctance to rely too heavily on a quasi-simulation. No problem. It was just a practical suggestion.

I would really appreciate knowing what you ultimately do and what your study looks like. Did the sizing analysis yield a study with the planned precision?

Good luck. Happy to help. I will likely move this issue to a discussion after you close it.

Brandon

Hi Brandon,

Many thanks for your help and explanations! It is clear how we can continue with the sizing analysis. We will look into the supplementary materials and keep you posted when there are any updates.

Best,

Jasper

Sizing reader study (radiologists versus standalone CAD) #168