OHDSI / PheValuator

An R package for evaluating phenotype algorithms.
https://ohdsi.github.io/PheValuator/

Question - Best approach to generate XSpec, XSens cohorts? #20

Closed - SSMK-wq closed this issue 4 years ago

SSMK-wq commented 4 years ago

Hi @jswerdel,

Just to provide context and to be useful for other users, I will start with an example as usual to avoid any confusion. Let's say I have a cohort of 5200 patients in which 90% have T2DM; the remaining 10% are T1DM or unspecified. Please note that I don't have explicit labels as such, but we know that our cohort comprises mostly T2DM patients (more than 90%).

Dataset info

XSpec: >= 5 condition codes for T2DM - gives 57 subjects
XSens: >= 1 condition code for T2DM - gives 985 subjects
Noisy negatives: 4237 subjects (this violates our cohort characteristics described above - we know that 90% are T2DM - but let's discuss below)

Now my questions are

a) As you can see, the above cohort definition for XSpec gives an inaccurate estimate of the T2DM patient count (positive cases) in our dataset. Who decides on the XSpec cohort definition, whether it is >=5 condition codes or >=10 condition codes? If it's going to be a clinician, would they provide me a list of rules for the XSpec cohort definition? So reaching out to a clinician is like asking them to do something like a manual chart review and arrive at rules to get a better estimate for XSpec. Am I right?

Because if we wish to get an accurate estimate of T2DM (positive cases), I might have to again go for a detailed phenotype algorithm which considers all domains to identify whether a patient has T2DM or not. More rules, a better and more accurate estimate of the T2DM population in our db. Right? Is there any better way to do this?

b) So for the XSpec cohort, should I use a rule-based phenotype algorithm (e.g., PheKB) to identify highly likely positive cases? I would identify positive and negative cases using PheKB, meaning I now have a label (better estimates for XSpec because we know that PheKB is validated across sites). Any suggestions here?

c) On what basis, and how, do you usually define XSpec and XSens cohorts at your end? Do you seek input from clinicians or use a phenotype algorithm to define these cohorts? Have you ever found that sticking to condition codes (>=10 or >=5) gives better estimates of positive cases in your population?

c) So now let's say I have the results of a locally developed custom phenotype algorithm and it has returned, for example, 4900 patients as T2DM. To verify/assess the performance of this new local algorithm, should I compare them with PheKB labels and get sensitivity, specificity, etc.?

d) As we don't have ground truth (provided by clinicians), we rely on PheValuator to provide probabilistic values of whether subjects belong to the positive or negative class. Am I right?

e) In a case like (d) above, may I kindly ask how the labels generated using the PheValuator model can be called a probabilistic gold standard? It can only be called a gold standard when clinicians review it. Am I right? I know you use the term probabilistic, but is it a gold standard? Why use the term gold standard at all? Can you kindly help me with this?

f) Target cohort = XSpec (57 subjects) + noisy negatives (4237 subjects)
Outcome cohort = Label 1 (57 subjects), Label 0 (4237 subjects)

But in reality, I know that there are a lot of T2DM (positive - Label 1) cases among these 4237 subjects (who are noisy negatives based on my XSpec and XSens cohort definitions).

g) I understand the number of subjects in each cohort is purely based on the cohort definitions that we have for XSpec, XSens and noisy negatives. Right?

h) I might be wrong, and I kindly request you to correct me here: accurate estimates of the three cohorts may only be possible through a phenotype algorithm (because only phenotype algorithms give better estimates compared to rules like >=5 codes or >=10 codes).

i) So it's like using a phenotype algorithm like PheKB to build cohorts and then assessing the performance of a new phenotype algorithm ("Local algo"). Are we trying to do something like this?

h) To me this approach seems problematic, because there might be scenarios where we are interested in assessing the performance of the PheKB algorithm itself. In a case like this, I cannot create an XSpec cohort based on PheKB and then assess the performance of PheKB against it. That's not useful. Kindly correct me here; I feel that I am kind of stuck in an infinite loop.. Haha

jswerdel commented 4 years ago

As we've discussed before, PheValuator was not designed for a cohort as you described. It requires a dataset where the health outcome of interest (HOI) is somewhat in proportion to what it is in a normal population; say for T2DM, 8% of the population has T2DM and 1% has T1DM. That being said, I will answer the questions as if a normal population is used.

a) We generally keep the xSpecs simple. We feel that anyone who has 10 condition codes in their record has a high probability of having the HOI. We have also used 5 codes and this works fine as well. The definition is often based on what your data will allow - you want to have as many subjects as possible in the xSpec, ideally at least 1500 to build a good model. If 10 codes only gives you 500 subjects, test with 5 codes and see if that is better. We generally do not use clinicians for this decision.

b) You can use any definition for the xSpec that gives high confidence that you have a set of good noisy positives. Just remember that you need to exclude its covariates from the model-building step, so including drugs and measurements in your xSpec definition will weaken the model - these covariates will need to be excluded and won't be available to the model as predictors, which may weaken the model's capability to assess the probability of a subject having the HOI.

c) You would test the performance characteristics of the phenotype algorithm that you used to develop the 5200-person cohort in the first place, on the original dataset from which the subjects were extracted, not on the extracted dataset.

d) That is correct. PheValuator provides an estimate of the probability of a subject having the HOI, similar to a clinician reviewing the data available to him/her in the database record (e.g., by using the patient profile in ATLAS).

e) We call it a probabilistic gold standard; you do not have to use that term.

f/g) Not that you can use PheValuator on this dataset, but in the setup you described the noisy negatives should be those with 0 codes for T2DM. It seems like these should only include your T1DM subjects (though in reality most T1DM subjects in claims/EHR datasets usually have a T2DM code).

h) The xSpec cohort gives a valid estimate of the number of subjects with the HOI. However, it is likely not the one you would use in your study. The performance characteristics of the xSpec usually show a high PPV but a very low sensitivity. So in your study you will likely use a phenotype algorithm with more balanced performance characteristics.

i) The PheKB algorithm is a good algorithm to use for your actual study but may not be good for your xSpec cohort, as its PPV may be lower than that of a 10X condition-code algorithm. I might use the 10X or 5X algorithm to build the model and then test the PheKB algorithm using PheValuator.

j) See above.
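For other readers, a purely illustrative base-R sketch of the ">= N condition codes" xSpec/xSens idea (this is not PheValuator's implementation; the toy data frame and its column name are hypothetical):

```r
# Hypothetical extract: one row per T2DM condition code recorded per person.
t2dm_codes <- data.frame(
  person_id = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4)
)

# Count T2DM condition codes per person
code_counts <- table(t2dm_codes$person_id)

# xSpec: persons with at least 5 T2DM codes (noisy positives)
xspec_ids <- as.numeric(names(code_counts[code_counts >= 5]))

# xSens: persons with at least 1 T2DM code
xsens_ids <- as.numeric(names(code_counts[code_counts >= 1]))

xspec_ids  # persons 1 and 3
xsens_ids  # persons 1 through 4
```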

SSMK-wq commented 4 years ago

Hi @jswerdel ,

Thanks a ton for your responses. I understand PheValuator may not be suitable for our dataset, but may I ask a few quick questions?

a) Aren't we using a probabilistic phenotype to test the rule-based phenotype? Is it only to save the money and time that would otherwise be spent on clinicians?

I did see a user asking a similar question in a YouTube video, but it would be good to hear from you as well.

Am I right to understand that the cohort definitions we use for XSpec and XSens can also be called phenotypes? And ultimately, when used in a prediction model, they provide a likelihood estimate of a subject belonging to the positive or negative class - hence "probabilistic phenotype". Am I right?

So can you help me understand why we use a probabilistic phenotype to test a rule-based phenotype?

Even if we use a probabilistic phenotype, we again convert it into binary labels. Right?

The only difference is that the cohort definition used to get the probabilistic gold standard is different from the phenotype algorithm that we intend to assess.

So basically there are only two ways to test a phenotype algorithm:

a) through ground truth (with clinician input)
b) with a probabilistic gold standard (which can save money and time).

Those are the only options left. Am I right?

jswerdel commented 4 years ago

a) It saves time and money but, most importantly, it provides sensitivity, which chart reviews cannot provide, as the acquisition and review of a set of subjects large enough to determine sensitivity is likely not possible. For example, if the prevalence of the HOI was 1% and the presumed sensitivity was 75%, you would need to review 10K patient records to find 25 false negatives.

Yes, the xSpec and xSens cohorts are developed using phenotype algorithms. We have used the term phenotype for the HOI, e.g., T2DM, and the term phenotype algorithm (or cohort definition) for the logic used to create a cohort from a database, e.g., 1X condition code for T2DM. The xSpec and xSens cohorts are ultimately used to create a probability of the HOI for the subjects in the evaluation cohort. It works similarly to a probabilistic phenotype (see APHRODITE) but is not quite the same.

"Even if we use Probabilistic Phenotype, again we convert them into binary. Right?" - as discussed previously, we use the probabilities to fill the 2x2 table to determine the performance characteristics instead of using a cut-point to make them binary. The tool does allow that option, though.
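To make that last point concrete, a minimal illustrative sketch in base R of filling the 2x2 table with expected counts from probabilities (not the package's actual code; `p_hoi` and `in_pa` are hypothetical vectors):

```r
# Illustrative evaluation cohort: predicted probability of the HOI for each
# subject, and whether the tested phenotype algorithm (PA) included them.
p_hoi <- c(0.95, 0.80, 0.10, 0.60, 0.05, 0.02)
in_pa <- c(TRUE,  TRUE,  TRUE, FALSE, FALSE, FALSE)

# Fill the 2x2 table with expected counts from the probabilities,
# rather than dichotomizing each subject at a cut-point.
tp <- sum(p_hoi[in_pa])        # expected true positives
fp <- sum(1 - p_hoi[in_pa])    # expected false positives
fn <- sum(p_hoi[!in_pa])       # expected false negatives
tn <- sum(1 - p_hoi[!in_pa])   # expected true negatives

round(c(sensitivity = tp / (tp + fn),
        specificity = tn / (tn + fp),
        ppv         = tp / (tp + fp),
        npv         = tn / (tn + fn)), 3)
```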

There are likely other ways to test phenotype algorithms but the two methods you mentioned are about all we have at this point.

SSMK-wq commented 4 years ago

This issue can be closed. Thanks a ton

SSMK-wq commented 4 years ago

Hi @jswerdel,

Can I check the below with you?

As we don't have ground truth (provided by clinicians), we rely on PheValuator to provide probabilistic values of whether subjects belong to the positive or negative class. Am I right?

Though you did answer this above, I have a few quick questions (let's forget all assumptions/notes about my dataset).

1) If clinicians have to label our dataset, they are still going to come up with some rules to classify a patient as having the HOI or not. Am I right? It's not like they start without any preconceived rules and treat it on a patient-by-patient basis. If they have a rule, they will apply this rule to all patients in the database and, voila, we get the ground truth. Or, by ground truth, do you mean that clinicians will take each subject one by one, look at all of their records, and label them as "HOI" or "non-HOI"? Are both approaches valid for generating ground truth? Or does clinical adjudication only mean taking each subject one by one, going through all of his/her records and creating a label, repeating the same for 5k patients?

2) There is no need for PheValuator when I have ground truth. Am I right? I can calculate the performance characteristics easily by matching subject_ids. Right?

3) Let's say I have to assess the PA ("Magic") on an external dataset that doesn't have labels. In such a case, PheValuator could be useful. Am I right? Because we have ground truth, we build a model using PheValuator which outputs probabilistic values as labels. Later I take this model (after checking its performance) and run it on any external dataset (which doesn't have labels currently) and generate labels for it. Am I right to understand this?

4) Though the above point 3 is not the main objective of PheValuator, it could still be used for that. Am I right? Meaning I could build my own model and share it with different sites, and they could use it to generate labels. If I give my model (which was trained on ground truth) to other sites, then all sites will have their labels. Then sensitivity, specificity, PPV and NPV can all be calculated manually by just comparing the subject_ids with true labels against the subject_ids in the phenotype cohort. Am I right?

5) So the main, and only, objective/use of PheValuator is to get the performance characteristics of a phenotype algorithm. Since most datasets (I mean almost all) don't have labels for their data, PheValuator has incorporated the modeling part to make it a kind of one-stop solution addressing both the labeling and the performance-characteristic issue. Am I right?

jswerdel commented 4 years ago

Yes - PheValuator provides an estimate of the "ground truth" for an HOI.

1) Clinicians going through the exercise of evaluating patient records would start with the clinical case definition for the HOI, then create a protocol on how to translate the case definition into something that can be used with the data in the dataset. For example, if a dataset does not have measurements and the case definition uses measurements, some proxy must be used to replace the measurement in the case definition. They would then use this protocol to adjudicate each of the patient records. In practice, due to cost and time constraints, this is usually only done on subjects that were selected by the phenotype algorithm. This produces an estimate of the PPV. It does not provide an estimate of the sensitivity, which is a significant limitation of this method.

2) That is correct, with the caveat that you have the ground truth for all patients, not just those selected by the phenotype algorithm.

3) We recommend building a model for each dataset as they all have different characteristics. Then apply the model from each dataset to a large random sample of subjects from that dataset (and not in the model) to create an evaluation cohort. The evaluation cohort may then be used to test the Magic PA on each of the datasets.

4) I'm a little confused - is your Magic PA so magic that it provides the ground truth, positive and negative, for each subject? If that is the case, you do not need PheValuator. If the Magic PA is a complex heuristic algorithm designed by clinicians to select those with the distinct elements of the Magic PA, then you would still need PheValuator, as the Magic PA is not really magic and will make mistakes.

5) Yes, that sounds right.
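As a purely hypothetical worked example of point 1 (all numbers invented), showing why a chart review restricted to PA-selected subjects yields a PPV but no sensitivity estimate:

```r
# Hypothetical chart review restricted to subjects the phenotype
# algorithm (PA) selected - the usual practice described above.
n_reviewed  <- 150   # PA-positive charts reviewed
n_confirmed <- 128   # confirmed as having the HOI by the clinician

ppv <- n_confirmed / n_reviewed   # ~0.85: PPV is estimable

# Sensitivity = TP / (TP + FN), but the false negatives are subjects the
# PA *missed*; none of them appear among the reviewed PA-positive charts,
# so sensitivity cannot be estimated from this review alone.
ppv
```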

SSMK-wq commented 4 years ago

Hi @jswerdel ,

wrt point 4)

Let me detail the steps to give you a better picture.

1) Let's say I reached out to clinicians and obtained ground truth for a subset of my dataset, e.g., for 1000 records out of 5000 (the dataset size). We give only a subset of our records due to time and cost constraints.
2) Next, I came to know about two rule-based phenotype algorithms (PheKB & Magic) and would like to validate them on my dataset.
3) I decided to apply them to my dataset (implemented in Atlas).
4) Let's say PheKB resulted in 66% as having the HOI and Magic gave 91% as having the HOI.
5) But we know that is a simple accuracy metric which doesn't show the complete performance of the algorithm.
6) So I need other characteristics of the algorithm like sensitivity, specificity, PPV, NPV, etc.
7) Now, as I have ground truth for 1000 patients in my dataset, these two rule-based algorithms (PheKB & Magic) would have assigned some class to these 1000 patients as well (in their result sets). Right? Now I can find out the characteristics (sensitivity, specificity, PPV, NPV) of those two algorithms by comparing the subject_ids with the ground truth (roughly as in the sketch after my questions below). Am I right?
8) Do you think this may not be sufficient to determine the characteristics of the algorithm because we are only looking at a sample of 1000 patients? Why not, and what do you suggest here? Should PheValuator be used here and, if yes, for what?

a) Because when I have ground truth for 1000 patients, nothing can stop me from following the same case definition or approach to label the remaining 4000 patients. Am I right? Is it incorrect to do it that way?

b) Whether it's a subset (1000) with ground truth or the full population (5000) with ground truth, we don't need to use PheValuator to get the characteristics? Either I am happy knowing the algorithm performance assessed on a small sample, or I extend the clinician rules in step 1 to my remaining records, which can ultimately give me a full picture of algorithm performance.

c) I should use PheValuator only if I don't have ground truth for ANY of my records in the dataset. Am I right?

d) The models that we build using PheValuator are only supposed to be applied to the evaluation cohort of my dataset? But why? We have an OMOP CDM dataset. Our external partner also has an OMOP CDM dataset. The model that I develop here at my place can be sent to their place to generate labels. Am I right? But why do you think that each site should have a separate model? Both are EHR data sources.
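For reference, here is roughly what I mean in step 7 above (toy example; all subject IDs and numbers are made up):

```r
# Toy example: performance characteristics of a PA computed by
# comparing subject_ids against a clinician-labelled subset.
ground_truth_hoi <- c(101, 102, 103, 104, 105)   # labelled as having the HOI
labelled_ids     <- 101:140                       # all subjects with ground truth
pa_selected      <- c(101, 102, 103, 110, 111)    # subjects the PA flagged

truth <- labelled_ids %in% ground_truth_hoi
pred  <- labelled_ids %in% pa_selected

tp <- sum(pred & truth);  fp <- sum(pred & !truth)
fn <- sum(!pred & truth); tn <- sum(!pred & !truth)

c(sensitivity = tp / (tp + fn),   # 3/5  = 0.60
  specificity = tn / (tn + fp),   # 33/35 ~ 0.94
  ppv         = tp / (tp + fp),   # 3/5  = 0.60
  npv         = tn / (tn + fn))   # 33/35 ~ 0.94
```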

jswerdel commented 4 years ago
  1. That is correct. But remember that if the prevalence of the HOI is 1% and a guess at the sensitivity is 75%, then in 1000 randomly selected subjects you would expect to find 2.5 false negatives and 7.5 true positives. These are very small numbers from which to calculate your algorithm performance.
  2. see #7 - these do seem small

a) That is true. It may be tough to find clinicians to review 5000 subjects, but it may be worth a try. In that case you will have about 38 TPs and 12 FNs - still kind of small. However, you would then have the complete set exactly labelled, so there would be no need for PheValuator.

b) Yes - you can use the ground truth on the 5000 subjects to then test algorithms, keeping in mind the small numbers used for assessment.

c) You might want to consider using PheValuator if you feel that the numbers are too small (with wide 95% confidence intervals) for the PA evaluation.

d) The models are only meant to be applied to the dataset where the model was built. If another site is using the same dataset then they can use the model on their dataset. When we develop a model on multiple datasets and use each of the individual models to build an evaluation cohort for each of the datasets, we have found very different results from testing the same PA on the different datasets. Consider a hypothetical dataset that has a robust set of conditions, procedures, drug exposures, and measurements vs. one that only has conditions. The first dataset will produce a very robust model with which to create an evaluation cohort to test your PAs. The second will likely produce a less robust model - it will not be able to discriminate between those with the HOI and those without as well as the first model. The main point is that just because you convert a dataset to the OMOP CDM, it does not guarantee that the data is any good.
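A quick illustrative calculation in base R (using the assumed prevalence and sensitivity from the discussion above) of why these counts are small and why the resulting confidence intervals are wide:

```r
# Expected counts in a randomly labelled sample, given an assumed
# prevalence and sensitivity (illustrative numbers from this thread).
prevalence  <- 0.01
sensitivity <- 0.75

for (n in c(1000L, 5000L)) {
  n_hoi <- n * prevalence              # subjects truly having the HOI
  tp    <- n_hoi * sensitivity         # expected true positives
  fn    <- n_hoi - tp                  # expected false negatives
  cat(sprintf("n = %d: ~%.1f true positives, ~%.1f false negatives\n",
              n, tp, fn))
}

# With ~38 TP and ~12 FN (the 5000-subject case), the exact binomial 95% CI
# around the estimated sensitivity (38/50 = 0.76) is still wide, roughly 0.62-0.87.
binom.test(38, 38 + 12)$conf.int
```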

SSMK-wq commented 4 years ago

Hi @jswerdel ,

Can you elaborate on point c)?

For example, in your paper I see figures and tables with values for sensitivity, specificity, PPV, NPV and probabilistic labels, but I don't see any 95% confidence interval range for the PheValuator output (except for the papers/studies that you compare your PheValuator results with). Can you help me understand point c) in ordinary layman's terms (de-jargoned) with an example?

2) I was also reading about APHRODITE based on your suggestion in another post. I see they have ANCHOR terms. But may I ask what the motivation for PheValuator was and how it differs from APHRODITE? What does PheValuator do that APHRODITE doesn't? I see that APHRODITE also has a (semi-supervised) model, etc., but I am trying to understand the difference. Can you help please?

jswerdel commented 4 years ago

In the output file from the latest version there are 95% confidence intervals included. I will update the example file on-line.

The 95% confidence intervals are calculated using the number of subjects in the test. The larger the number of subjects, the narrower the 95% confidence intervals. For example, if the number of subjects included in your phenotype definition is 100 and the PPV is 0.7, the 95% CI will be 0.61-0.79. If the number of subjects is 2000, the 95% CI will be 0.68-0.72.
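Those numbers are consistent with a standard normal-approximation (Wald) interval for a proportion; a minimal illustrative sketch in base R (not necessarily the exact method the package uses):

```r
# Normal-approximation (Wald) 95% CI for a proportion such as PPV.
# Illustrative only; the package may use a different interval.
ppv_ci <- function(ppv, n) {
  se <- sqrt(ppv * (1 - ppv) / n)
  round(c(lower = ppv - 1.96 * se, upper = ppv + 1.96 * se), 2)
}

ppv_ci(0.7, 100)    # ~0.61-0.79
ppv_ci(0.7, 2000)   # ~0.68-0.72
```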

If you test the performance of the phenotype algorithm using chart evaluation and examine 100 charts, the 95% CI will be broader than if you use PheValuator, which might examine, say, 2000 subjects.

2) PheValuator is very different than APHRODITE. It was designed to test phenotype algorithms (heuristic algorithms). APHRODITE was designed to replace the use of heuristic algorithms, i.e., replacing the use of deterministic algorithms with probabilistic algorithms. In other words, APHRODITE was designed to allow you to build cohorts based on probability, usually by designating a cut-point.

cgreich commented 4 years ago

PheValuator is very different than APHRODITE. It was designed to test phenotype algorithms (heuristic algorithms). APHRODITE was designed to replace the use of heuristic algorithms

Sounds very close to me. Except you use one for making cohorts, and the other for checking on cohorts. I think it would make sense to discuss this in the documentation, so people don't get confused. You could even join forces in improving the algorithms, but I am not sure APHRODITE is still actively developed.

SSMK-wq commented 4 years ago

@jswerdel, I understand that PheValuator is a tool which can provide the performance characteristics of a phenotype algorithm. But to verify/assess the PheValuator tool itself, we need to get our data manually labelled - at least a subset of our dataset. Am I right? Or can I trust the results in your paper (where you compared PheValuator performance vs. rule-based algorithm performance from previous studies) and just use PheValuator as-is on our dataset without verification?

jswerdel commented 4 years ago

That would be a good approach and we would like to see those results. But, again, you will likely only get PPV for a small sample set.

The other thing is we describe PheValuator as a replacement for clinical adjudication of the data as found in the dataset. PheValuator can only assess the data in the dataset for determination of the probability of a subject having the HOI. This is similar to a clinician reviewing the ATLAS patient profile while being masked/blinded to the codes used to determine the xSpec. We use this approach as we feel that getting full patient charts for a large set of subjects for every database (as algorithms will work differently depending on the quality of the data in different datasets) is not possible.

SSMK-wq commented 4 years ago

Hi,

Thanks for the response.

1) Am I right to understand that PheValuator uses binary variables (drugs, conditions, measurements, etc.) as features, whereas APHRODITE uses frequency counts of each concept from the drug, measurement and condition domains, etc.?

2) Similar to PheKB, where can I find the OMOP definitions for HOIs? I guess, like PheKB, OMOP also has a repository of phenotype definitions, but I couldn't locate it. Could you kindly share the URL to access the OMOP repository?

3) In addition, I was reading about APHRODITE and a few related papers. I am aware that you aren't the author of APHRODITE, but may I know whether the APHRODITE GitHub is active? I have a few questions on APHRODITE. Will you be able to help? Since it's all kind of related, I thought of checking with you (as you would have tried APHRODITE?).

jswerdel commented 4 years ago

I see that you have inquired on the APHRODITE forum. I think that @jmbanda will respond.

SSMK-wq commented 4 years ago

Thanks @jswerdel for your response. Very much appreciate your time and effort. Will wait to hear from @jmbanda.

jmbanda commented 4 years ago

Thanks @jswerdel. I will indeed respond to @SSMK-wq's post on the APHRODITE GitHub, which is still under active development @cgreich :) I am just slow in rolling out updates, but I have plenty of new stuff coming soon.