callahantiff / PheKnowVec

Translational Computational Phenotyping

TODO - Presentations + Publications Task: NLM TG Analysis Plan + Slides Outline #91

Closed callahantiff closed 5 years ago

callahantiff commented 5 years ago

Presentations + Publications Task: NLM TG Conference Analysis Plan + Slides Outline Due Dates:


Goals
This issue provides a general overview of the work I would like to complete and the preliminary results I would like to have by the conference date. Rather than duplicate information, I have spent the last two days overhauling parts of the project wiki. Please review it; it should now contain enough detail to understand the experiments described below.

The experiments described on the Wiki will be implemented with the following changes:

Phenotypes: At minimum, I will perform experiments on ADHD, sickle cell disease, and sleep apnea. If there is enough time, all phenotypes will be processed.

Experiment 1

Experiment 2

Experiment 3


SLIDES OUTLINE: Google Sheets Presentation Draft

tdbennett commented 5 years ago

Hi Tiffany-

Great work on the project plan and the wiki. I agree with your tentative decision to focus on experiments 1 and 2 for the conference. My comments below mostly relate to experiments 1 and 2.

1a. In each case, am I reading things right that Controls are those without the phenotype? Classes will be really imbalanced (positive in <1%) if that's the case. Maybe that's why you're focusing on FPR and FNR.

1b. Overall, experiments 1 and 2 will generate 2x2 tables. FNR and FPR are useful metrics to report, but all of the usual classification metrics will be available to you with trivial or no additional coding. I think some overall measures of performance (Brier's, F1, AUPRC [classes unbalanced]) would also be useful to the reader. Readers/reviewers will ask about precision and recall so you may as well have them in a table somewhere, even if it's in a supplement.
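For illustration only: the usual metrics come from the same predictions that build the 2x2 table, so reporting them really is near-zero extra coding. A minimal scikit-learn sketch with made-up, heavily imbalanced labels (~1% positive, not real cohort data):

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             brier_score_loss, average_precision_score)

# Toy imbalanced data: ~1% positives, loosely mimicking the cohort setting.
rng = np.random.default_rng(42)
y_true = (rng.random(10_000) < 0.01).astype(int)
y_pred = y_true.copy()
flip = rng.random(10_000) < 0.005          # small amount of label noise
y_pred[flip] = 1 - y_pred[flip]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                        # false positive rate
fnr = fn / (fn + tp)                        # false negative rate

print(f"FPR={fpr:.4f}  FNR={fnr:.4f}")
print(f"F1={f1_score(y_true, y_pred):.3f}")
print(f"Brier={brier_score_loss(y_true, y_pred):.4f}")
print(f"AUPRC={average_precision_score(y_true, y_pred):.3f}")
```

With hard 0/1 predictions the Brier score reduces to the misclassification rate; it becomes more informative if the classifiers emit probabilities.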

  1. Table 3 seems to a) have duplicated rows and b) be missing a set of rows for SV Children. Am I seeing that correctly?

  2. "Phenotype codes" (union of those with any code) vs. "phenotype" definitions (~intersection of those with all of the codes necessary to meet the phenotype). These are different enough that I expect that "within" comparisons of SV vs ST for one of phenotype {codes, definitions} to be more relevant than "between" comparisons. I could be missing something here.

  3. I assume that the experiment 1 and 2 comparisons will be pediatric vs pediatric and adult vs adult only, is that right?

  4. For experiment 3, class imbalance will still be there, so I would add Brier's, AUPRC, F1.
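The codes-vs-definitions distinction in point 2 above can be sketched with plain sets; the patient IDs and criteria below are invented for illustration:

```python
# Illustrative only: a "codes" cohort is the union of patients matching ANY
# qualifying code, while a "definition" cohort is (roughly) the intersection
# of the per-criterion patient sets.
patients_with_dx  = {"p1", "p2", "p3", "p4"}   # have a diagnosis code
patients_with_med = {"p2", "p3"}               # have a qualifying medication
patients_with_lab = {"p3", "p5"}               # have a qualifying lab result

codes_cohort      = patients_with_dx | patients_with_med | patients_with_lab
definition_cohort = patients_with_dx & patients_with_med & patients_with_lab

print(sorted(codes_cohort))       # the broader group
print(sorted(definition_cohort))  # the narrower group
```

Because the definition cohort is always a subset of the codes cohort, within-phenotype comparisons (SV vs. ST on the same cohort type) are the apples-to-apples ones.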

callahantiff commented 5 years ago

@tdbennett - thanks for the great feedback! I have updated the description of the analysis in hopes of making things clearer. If you have a chance to take a peek, I would be very appreciative! 😄

In response to your questions:

1a. In each case, am I reading things right that Controls are those without the phenotype? Classes will be really imbalanced (positive in <1%) if that's the case. Maybe that's why you're focusing on FPR and FNR. 1b. Overall, experiments 1 and 2 will generate 2x2 tables. FNR and FPR are useful metrics to report, but all of the usual classification metrics will be available to you with trivial or no additional coding. I think some overall measures of performance (Brier's, F1, AUPRC [classes unbalanced]) would also be useful to the reader. Readers/reviewers will ask about precision and recall so you may as well have them in a table somewhere, even if it's in a supplement.

You are correct, I suspect that the classes would be extremely unbalanced. I do think that it is a good idea to include additional measures that explicitly account for this. In doing some research, I am thinking that the Matthews Correlation Coefficient may be the best option. What do you think? I am happy to add additional measures like those you mentioned as well as Cohen's Kappa.
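As a small sanity check of what MCC and Cohen's kappa measure, here is a scikit-learn sketch; the ten labels are invented for illustration (2 cases among 10 patients, one missed):

```python
# Both metrics use all four cells of the 2x2, so they are more robust to
# class imbalance than raw accuracy.
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score

y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # 2 cases among 10 patients
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # one case missed

mcc   = matthews_corrcoef(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
print(f"MCC={mcc:.3f}  kappa={kappa:.3f}")
```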

Follow-up Question: I want to verify with you how the confusion matrix would be compiled, given that some phenotypes have both a case and a control group while others have only a case group. This gives us the following two options: the first calculates metrics separately for the case and control groups, and the second calculates metrics for the cases only:


| Option | TP | TN | FP | FN |
|--------|----|----|----|----|
| 1 | # patients in the gold standard cohort (cases and controls) | # patients NOT in the gold standard cohort (all patients who are not a case or a control) | # patients labeled as being in the gold standard cohort who were NOT actually gold standard cohort patients (all patients who are not a case or a control) | # patients labeled as NOT being in the gold standard cohort who actually were (cases and controls) |
| 2 | # patients in the gold standard cohort (cases only) | # patients NOT in the gold standard cohort (controls only) | # patients labeled as being in the gold standard cohort who were NOT actually gold standard cohort patients (controls only) | # patients labeled as NOT being in the gold standard cohort who actually were (cases only) |

The challenge with choosing Option 2 is how to handle phenotypes that do not have a control group. I'm thinking the best solution is to choose Option 1, calculate the metrics in terms of the cases (TP), and then pool all control groups plus 10,000 random patients (to add noise) for calculating TN/FP/FN. These metrics would be calculated separately within each population and for each phenotype. What do you think?
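The pooling idea above could look something like this sketch; the cohort sizes, patient ID scheme, and the classifier output are all hypothetical assumptions:

```python
# Rough sketch of the pooled-negative scheme: cases are the positives; the
# controls plus a random background sample form the negative pool.
import random

random.seed(0)
cases      = {f"case_{i}" for i in range(50)}
controls   = {f"ctrl_{i}" for i in range(200)}
background = {f"pt_{i}" for i in random.sample(range(1_000_000), 10_000)}

negatives = controls | background            # pooled negative set

# Hypothetical classifier output: patients the algorithm labeled positive
# (45 of the cases plus 5 of the controls).
predicted_pos = set(sorted(cases)[:45]) | set(sorted(controls)[:5])

tp = len(predicted_pos & cases)
fp = len(predicted_pos & negatives)
fn = len(cases - predicted_pos)
tn = len(negatives - predicted_pos)

fpr = fp / (fp + tn)
fnr = fn / (fn + tp)
print(tp, fp, fn, tn, round(fpr, 5), round(fnr, 3))
```

One design consequence worth noting: the size of the random background sample directly scales TN, so FPR (but not FNR) depends on how many background patients are pooled in.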

  1. Table 3 seems to a) have duplicated rows and b) be missing a set of rows for SV Children. Am I seeing that correctly?

It was correct, but I can see how it did not come across that way. I completely overhauled all of the code set and experiment tables. What do you think?

  2. "Phenotype codes" (union of those with any code) vs. "phenotype" definitions (~intersection of those with all of the codes necessary to meet the phenotype). These are different enough that I expect that "within" comparisons of SV vs ST for one of phenotype {codes, definitions} to be more relevant than "between" comparisons. I could be missing something here.

I agree! Comparisons within each phenotype will be more meaningful than between.

  3. I assume that the experiment 1 and 2 comparisons will be pediatric vs pediatric and adult vs adult only, is that right?

Yes, absolutely! No between population comparisons will be made.

  4. For experiment 3, class imbalance will still be there, so I would add Brier's, AUPRC, F1.

See response to question 1.

tdbennett commented 5 years ago

Hi @callahantiff, the wiki is clearer now, thanks.

  1. re: metrics like the MCC, etc., I think there are several metrics people use in the setting of unbalanced classes and different investigator groups have their favorites. Reporting several (sounds great to add MCC, Cohen's kappa to the list I gave) avoids the circumstance where you didn't report a reviewer's favorite and also can add robustness to the results, i.e. "all metrics showed better performance of classifier X than Y..."

  2. The updated wiki clarifies the confusion matrix/2x2, thank you. Options 1 and 2 sound good. I think I can see why you chose the SV_Exact_None group as the reference. I expect the Codes classifiers will be a superset of the Definitions classifiers and the All Domains classifiers will be a superset of the Only Condition classifiers.

  3. Related to number 2 above, it still stands out as a ton of comparisons - all of the exact/fuzzy/none comparisons within each condition only/all clinical domains comparison. To avoid getting lost in the minutiae, I think keeping a steady eye on your question(s) will be really important. I think you are doing this, but what do you want to be able to say? Maybe all the ways you've improved on the phenotypes improve performance tremendously. Maybe (null) they don't improve performance sufficiently to justify the effort required.

callahantiff commented 5 years ago

@tdbennett - Thanks for your help in preparing for the talk. It went great! I'm going to note the comments you made above for follow-up in our next meeting.