Closed SSMK-wq closed 4 years ago
This error may occur if you have not excluded from your models all the covariates that were included in the xSpec and xSens definitions. For instance, in lung cancer, if you used the SNOMED concept Primary malignant neoplasm of lung (concept id: 258369) and its descendants in the xSpec definition, an error like this may occur if you did not set the addDescendantsToExclude parameter to TRUE.
Hi @jswerdel ,
I have already excluded them, but I still see this warning. For example, I copied all the concepts from the lists below. May I know why this happens despite excluding the concepts?
This is how my code looks:
xSpecCohort = 103,
cdmDatabaseSchema = "cdm",
cohortDatabaseSchema = "results",
cohortDatabaseTable = "cohort",
outDatabaseSchema = "temp",
modelOutputFileName = "Test_model_pheno",
xSensCohort = 104,
prevalenceCohort = 104,
excludedConcepts = c(43531588,45769888,37016768,45763582,4221495,37018912,43531578,43531559,43531566,43531653,43531577,43531562,37016163,45769894,45757474,43531616,36712686,36712687,43531597,443732,443767,443733,43531564,45757280,45769906,4177050,4223463,43530690,45769890,37018728,45772019,45769889,37016349,45770880,45757392,45771064,45757447,45757446,45757445,45757444,45757363,45772060,36714116,45769875,4130162,45771072,4140466,45770830,45769905,45757435,43531651,45770881,4222415,45769828,376065,45757450,45770883,45757255,37016354,43530656,45769836,443729,43530689,45757278,37017432,4063043,43531010,4129519,45770831,43530685,45757499,443731,45770928,45757075,4226121,45769872,45769835,36712670,46274058,4142579,45770832,45773064,201826,4230254,4304377,4321756,4196141,4099217,201530,4151282,4099216,4198296,4193704,4200875,4099651,45766052,40482801,45757277,45757449,43531608,
43531588,45769888,37016768,45763582,4221495,37018912,43531578,43531559,43531566,43531653,43531577,43531562,37016163,45769894,45757474,43531616,36712686,36712687,43531597,443732,443767,443733,43531564,45757280,45769906,4177050,4223463,43530690,45769890,37018728,45772019,45769889,37016349,45770880,45757392,45771064,45757447,45757446,45757445,45757444,45757363,45772060,36714116,45769875,4130162,45771072,4140466,45770830,45769905,45757435,43531651,45770881,4222415,45769828,376065,45757450,45770883,45757255,37016354,43530656,45769836,443729,43530689,45757278,37017432,4063043,43531010,4129519,45770831,43530685,45757499,443731,45770928,45757075,4226121,45769872,45769835,36712670,46274058,4142579,45770832,45773064,201826,4230254,4304377,4321756,4196141,4099217,201530,4151282,4099216,4198296,4193704,4200875,4099651,45766052,40482801,45757277,45757449,43531608,
201530,201826,376065,443729,443731,443732,443733,443767,4063043,4099216,4099217,4099651,4129519,4130162,4140466,4142579,4151282,4177050,4193704,4196141,4198296,4200875,4221495,4222415,4223463,4226121,4230254,4304377,4321756,36712670,36712686,36712687,36714116,37016163,37016349,37016354,37016768,37017432,37018728,37018912,40482801,43530656,43530685,43530689,43530690,43531010,43531559,43531562,43531564,43531566,43531577,43531578,43531588,43531597,43531608,43531616,43531651,43531653,45757075,45757255,45757277,45757278,45757280,45757363,45757392,45757435,45757444,45757445,45757446,45757447,45757449,45757450,45757474,45757499,45763582,45766052,45769828,45769835,45769836,45769872,45769875,45769888,45769889,45769890,45769894,45769905,45769906,45770830,45770831,45770832,45770880,45770881,45770883,45770928,45771064,45771072,45772019,45772060,45773064,46274058),
lowerAgeLimit = -500,
upperAgeLimit = 1000,
addDescendantsToExclude = TRUE
Would you share the portion of the output that looks like:
Sometimes this error will occur if there are too few cases. Thanks.
Hi @jswerdel ,
Here it is. I don't know what is different now.
This is likely due to the low number of cases. You generally cannot build a model with only 57 cases. We will usually use at least 1000. In this case, additionally, if this is the T2DM dataset discussed either by you or one of your colleagues, the dataset itself is not amenable to use by PheValuator. The noisy negatives in this case may not be truly noisy negatives: they may also be T2DM subjects, making the model impossible to fit.
Can XSpec and XSens be the same? XSpec cohort_id = 103 and XSens cohort_id = 103.
This gives me enough cases (985), and it works without the warning. But is it right to use it this way? Are there any downsides to it?
In addition, when I ran the createEvalCohort function, I got the output below.
May I ask where those 101 people come from?
My xSpec cohort has 985 people, xSens also has 985 people, and there are 4200+ noisy negatives.
Why is the AUC score NA?
I am just trying to understand the messages, even though I know my dataset is not suitable for PheValuator.
Yes, they can be the same, but the cases from the xSens, say 1X condition code, are not "extremely specific" for the HOI, meaning that many subjects with 1X condition code do not actually have the HOI. They have the code either as a rule-out code or through a coding mistake. That's why we use multiple, e.g., 5X, condition codes to reduce those possibilities.
The evaluation cohort (in Step 2) only brings in subjects that were not in the set of subjects used to build the model (in Step 1). It also leaves in those in the xSpec cohort by necessity (a PLP software limitation); however, these subjects will be removed at phenotype algorithm evaluation time (Step 3). I think that the only subjects getting into the evaluation cohort are those from the xSpec that will eventually get removed. If that is the case, you will have no subjects to evaluate your phenotype algorithms, which demonstrates why you cannot use PheValuator on this dataset ("even though I know my dataset is not suitable").
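The three steps described above can be pictured with plain set operations. This is only a toy sketch with made-up subject IDs and hypothetical variable names; it does not reproduce PheValuator's or PLP's actual implementation.

```r
# Toy illustration only: made-up subject IDs, not PheValuator internals.
population     <- 1:10        # everyone in a tiny, hypothetical database
xSpec          <- c(1, 2, 3)  # cases used to build the model (Step 1)
noisyNegatives <- 4:10        # non-cases used to build the model (Step 1)
modelSubjects  <- union(xSpec, noisyNegatives)

# Step 2: keep subjects not used to build the model, plus the xSpec subjects
evalCohort <- union(setdiff(population, modelSubjects), xSpec)

# Step 3: xSpec subjects are dropped before evaluating phenotype algorithms
evaluable <- setdiff(evalCohort, xSpec)

length(evalCohort)  # 3: only the xSpec subjects are in the evaluation cohort
length(evaluable)   # 0: nobody is left to evaluate
```

When the xSpec and noisy negatives together cover the whole population, as in this toy case, nothing remains for evaluation once the xSpec subjects are removed.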
Hi,
Yes, I understand. Thanks for your help. The model was built using xSpec (985) and noisy negatives (4200+), which together equal my full population.
For evaluation, PheValuator excludes all subjects that were used during model building, leaving me with no subjects to test this model on.
But my question is: why do I see the message "prediction for 101 people" when there are no subjects at all? How did the model arrive at that figure of 101, or at an AUC value of 83.00 as shown in the screenshot above? Is it even possible?
It looks like it only used 4179 - 57 = 4122 noisy negatives so not all the noisy negatives were used. My guess is that the 101 subjects are composed of 57 noisy positives and 44 noisy negatives. So there are some noisy negatives left but too few to calculate performance characteristics.
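The arithmetic behind this guess can be checked directly. The 57/44 split is the guess stated above, not something read from the package output, and the variable names are hypothetical.

```r
# Counts quoted in the reply above
noisyNegativesUsed <- 4179 - 57  # noisy negatives assumed to be used in the model
guessedEvalSize    <- 57 + 44    # guessed noisy positives + noisy negatives

noisyNegativesUsed  # 4122
guessedEvalSize     # 101
```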
Hi,
No. As we discussed earlier, I modified my xSpec and xSens to be the same. So the createPhenotypeModel function produces the output below.
Q1) May I know where the population size of 5107 comes from? The total number of records in the database is 5222, though.
And createEvaluationCohort produces the output below.
So 5107 - 985 = 4122 is the noisy negatives count.
How and where can I find how many noisy negatives were used during model building? And
why do I see predictions for 101 subjects in the evaluation cohort? Shouldn't it be zero? All subjects must have been used during model building, because we don't set any condition to exclude some of the noisy negatives or anything like that.
This issue can be closed. Thanks
Hi @jswerdel ,
I see in the PheValuator code that there is a minimum number of cases required, based on the prevalence of the HOI in the dataset. I see that the minimum is 1500, but is there a minimum requirement for the number of controls as well? If there are fewer than 100 controls, can PheValuator still work? I know we discussed the case count earlier, but now I would like to know about the control count. Thanks for your help.
There has been lots of research on case counts but none, of which I'm aware, on non-case counts. This is likely because having a good number of non-cases is usually not an issue. But I'm guessing the same counts will apply.
Hi @jswerdel ,
Thanks for the response. PheValuator doesn't impose any limit in the code as far as I inspected. Is there any limit imposed for non-cases (controls) for PheValuator to work? Will it work with 95:5 ratio for cases and controls?
No, I don't think it will work in your instance. My guess is that it would work with a 95:5 ratio of cases:non-cases if the number of non-cases was high (>1500, say) though I don't know this for sure. The idea is that the predictive modeling software needs more data (i.e., more subjects) in order to develop a better predictive model.
Hi @jswerdel ,
Yes, you are right that predictive models need more data (for both cases and non-cases) to perform better. As my objective is first to run PheValuator end-to-end on our dataset, I was trying to understand the sample size constraints.
For example, I referred to your code, which is given below:
if (popPrev >= 0.3) {
  xspecSize <- 4000 # use large xSpec size for higher prevalence values
} else if (popPrev >= 0.2) {
  xspecSize <- 3000
} else if (popPrev >= 0.1) {
  xspecSize <- 2000
} else {
  xspecSize <- 1500 # use smaller xSpec size for lower prevalence values
}
# set the number of noisy negatives in the model either from the prevalence or to 500K max
baseSampleSize <- min(as.integer(xspecSize / popPrev), 500000) # use 500,000 as largest base sample
I understand our dataset is quite different (due to imbalance etc.), but considering the 95% prevalence (as per the current data characteristics), like you said, we might need at least 1500 cases. Yet I see that with just 985 cases I am still able to run PheValuator.
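To make the quoted thresholds concrete, here is a minimal sketch that wraps the snippet above in a function and evaluates it for a few prevalence values. The function name `sampleSizes` is hypothetical and is not part of PheValuator; only the branching and the `baseSampleSize` formula come from the quoted code.

```r
# Hypothetical wrapper around the quoted logic; not a PheValuator function.
sampleSizes <- function(popPrev) {
  if (popPrev >= 0.3) {
    xspecSize <- 4000
  } else if (popPrev >= 0.2) {
    xspecSize <- 3000
  } else if (popPrev >= 0.1) {
    xspecSize <- 2000
  } else {
    xspecSize <- 1500
  }
  # as.integer() truncates toward zero; the base sample is capped at 500,000
  baseSampleSize <- min(as.integer(xspecSize / popPrev), 500000)
  list(xspecSize = xspecSize, baseSampleSize = baseSampleSize)
}

high <- sampleSizes(0.95)   # xspecSize 4000, baseSampleSize 4210
low  <- sampleSizes(0.001)  # xspecSize 1500, baseSampleSize capped at 500000
```

Under this logic, a prevalence of 0.95 falls in the first branch, so the target xSpec size is 4000 rather than 1500, and the base sample is only slightly larger than the xSpec target.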
In addition, for non-cases, as the code above shows, we have an option to set the number of cases that we need. May I check with you whether this was updated recently?
So I can modify the non-cases logic to suit my dataset as is, right?
You cannot adjust the non-cases. The package selects the number of cases to fit the prevalence, then selects the number of non-cases to fill out the cohort population to test. In general, if the number of cases is low in your dataset, you can run the model with fewer than 1500 cases. However, the fewer the cases, the poorer the model will perform, due to the lack of enough information on cases vs. non-cases.
Hello @jswerdel ,
While I tried to execute createPhenotypeModel, I got the warning messages below.
1) Can you help me understand what they mean and why they occur?
One confusing part is that I didn't make any changes to my cohorts (xSpec, xSens) for this function, and I was able to see the AUC etc. when I ran it 2-3 times. But when I restarted R and ran it again, I got the warning messages below and there was no output. Can you help?
In other words, this happens intermittently, not always.
2) Can xSpec and xSens be identical? For example, the 30 subjects present in xSpec are the only subjects present in xSens (meaning xSens minus xSpec = 0 subjects).
I know this is possible in practice, but does the tool allow it?