OHDSI / PheValuator

An R package for evaluating phenotype algorithms.
https://ohdsi.github.io/PheValuator/

Warning - No Non-Null Coefficients #19

Closed SSMK-wq closed 4 years ago

SSMK-wq commented 4 years ago

Hello @jswerdel ,

While I tried to execute createPhenotypeModel, I got the warning messages below.

1) Can you help me understand what they mean and why they occur?

One confusing part is that I didn't make any changes to my cohorts (xSpec, xSens) in this function, and I was able to see the AUC etc. when I ran it 2-3 times. But when I restarted R and ran it again, I got the warning messages below and there was no output. Can you help?

Meaning this happens intermittently, not always.

Warning: No non-zero coefficients
Getting predictions on train set
Warning: Model had no non-zero coefficients so predicted same for all population...
Prediction took 0.006 secs
Warning: Evaluation not possible as prediciton NULL or all the same values

2) Can xSpec and xSens be identical? Meaning, for example, the 30 subjects present in xSpec are the only subjects present in xSens (so xSens minus xSpec = 0 subjects)?

I know this seems practically possible, but does the tool allow it?

jswerdel commented 4 years ago

This error may occur if you have not excluded from your models all the covariates that were included in the xSpec and xSens definitions. For instance, in lung cancer, if you used the SNOMED code Primary malignant neoplasm of lung (concept id: 258369) and its descendants in the xSpec definition, an error like this may occur if you did not set the addDescendantsToExclude parameter to TRUE.
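
For illustration, a call for that lung cancer example might look something like the sketch below (the cohort ids, file name, and connectionDetails are hypothetical placeholders; the parameter names are the ones discussed in this thread):

library(PheValuator)

# Sketch: exclude the xSpec concept and, via addDescendantsToExclude,
# all of its descendants from the diagnostic predictive model.
model <- createPhenotypeModel(connectionDetails = connectionDetails,  # assumed to be set up elsewhere
                              xSpecCohort = 101,                      # hypothetical cohort id
                              cdmDatabaseSchema = "cdm",
                              cohortDatabaseSchema = "results",
                              cohortDatabaseTable = "cohort",
                              outDatabaseSchema = "temp",
                              modelOutputFileName = "lung_cancer_model",
                              xSensCohort = 102,                      # hypothetical cohort id
                              prevalenceCohort = 102,
                              excludedConcepts = c(258369),           # Primary malignant neoplasm of lung
                              addDescendantsToExclude = TRUE)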

SSMK-wq commented 4 years ago

Hi @jswerdel ,

I already excluded them, but I still see this warning. For example, I copied all the concepts from the lists below. May I know why this happens despite excluding the concepts?

This is how my code looks:

createPhenotypeModel(xSpecCohort = 103,
                     cdmDatabaseSchema = "cdm",
                     cohortDatabaseSchema = "results",
                     cohortDatabaseTable = "cohort",
                     outDatabaseSchema = "temp",
                     modelOutputFileName = "Test_model_pheno",
                     xSensCohort = 104,
                     prevalenceCohort = 104,
                     excludedConcepts = c(43531588,45769888,37016768,45763582,4221495,37018912,43531578,43531559,43531566,43531653,43531577,43531562,37016163,45769894,45757474,43531616,36712686,36712687,43531597,443732,443767,443733,43531564,45757280,45769906,4177050,4223463,43530690,45769890,37018728,45772019,45769889,37016349,45770880,45757392,45771064,45757447,45757446,45757445,45757444,45757363,45772060,36714116,45769875,4130162,45771072,4140466,45770830,45769905,45757435,43531651,45770881,4222415,45769828,376065,45757450,45770883,45757255,37016354,43530656,45769836,443729,43530689,45757278,37017432,4063043,43531010,4129519,45770831,43530685,45757499,443731,45770928,45757075,4226121,45769872,45769835,36712670,46274058,4142579,45770832,45773064,201826,4230254,4304377,4321756,4196141,4099217,201530,4151282,4099216,4198296,4193704,4200875,4099651,45766052,40482801,45757277,45757449,43531608,
                                          43531588,45769888,37016768,45763582,4221495,37018912,43531578,43531559,43531566,43531653,43531577,43531562,37016163,45769894,45757474,43531616,36712686,36712687,43531597,443732,443767,443733,43531564,45757280,45769906,4177050,4223463,43530690,45769890,37018728,45772019,45769889,37016349,45770880,45757392,45771064,45757447,45757446,45757445,45757444,45757363,45772060,36714116,45769875,4130162,45771072,4140466,45770830,45769905,45757435,43531651,45770881,4222415,45769828,376065,45757450,45770883,45757255,37016354,43530656,45769836,443729,43530689,45757278,37017432,4063043,43531010,4129519,45770831,43530685,45757499,443731,45770928,45757075,4226121,45769872,45769835,36712670,46274058,4142579,45770832,45773064,201826,4230254,4304377,4321756,4196141,4099217,201530,4151282,4099216,4198296,4193704,4200875,4099651,45766052,40482801,45757277,45757449,43531608,
                                          201530,201826,376065,443729,443731,443732,443733,443767,4063043,4099216,4099217,4099651,4129519,4130162,4140466,4142579,4151282,4177050,4193704,4196141,4198296,4200875,4221495,4222415,4223463,4226121,4230254,4304377,4321756,36712670,36712686,36712687,36714116,37016163,37016349,37016354,37016768,37017432,37018728,37018912,40482801,43530656,43530685,43530689,43530690,43531010,43531559,43531562,43531564,43531566,43531577,43531578,43531588,43531597,43531608,43531616,43531651,43531653,45757075,45757255,45757277,45757278,45757280,45757363,45757392,45757435,45757444,45757445,45757446,45757447,45757449,45757450,45757474,45757499,45763582,45766052,45769828,45769835,45769836,45769872,45769875,45769888,45769889,45769890,45769894,45769905,45769906,45770830,45770831,45770832,45770880,45770881,45770883,45770928,45771064,45771072,45772019,45772060,45773064,46274058),
                     lowerAgeLimit = -500,
                     upperAgeLimit = 1000,
                     addDescendantsToExclude = TRUE)

[screenshot]

jswerdel commented 4 years ago

Would you share the portion of the output that looks like the following: [screenshot]

Sometimes this error will occur if there are too few cases. Thanks.

SSMK-wq commented 4 years ago

Hi @jswerdel ,

Here it is. I don't know what is different now.

[screenshot]

jswerdel commented 4 years ago

This is likely due to the low number of cases. You generally cannot build a model with only 57 cases; we will usually use at least 1000. Additionally, if this is the T2DM dataset discussed either by you or one of your colleagues, the dataset itself is not amenable to use by PheValuator. The noisy negatives in this case may not be truly noisy negatives; they may also be T2DM subjects, making the model impossible to fit.

SSMK-wq commented 4 years ago

Can xSpec and xSens be the same, e.g., xSpec cohort_id = 103 and xSens cohort_id = 103?

This gives me enough cases (985), it works, and I don't see the warning. But is it right to use it this way? Are there any downsides?

In addition, when I ran the createEvalCohort function, I got the output below.

[screenshot]

May I check with you where those 101 people come from?

My xSpec cohort is 985 people, xSens is also 985 people, and the noisy negatives are 4200+.

Why is there an NA for the AUC score?

I am just trying to understand the messages and the output, even though I know my dataset is not suitable for PheValuator.

jswerdel commented 4 years ago

Yes, they can be the same, but then the cases from the xSens, say requiring 1X condition code, are not "extremely specific" for the HOI, meaning that many subjects with 1X condition code do not actually have the HOI. They have the code either as a rule-out code or through a coding mistake. That's why we use multiple condition codes, e.g., 5X, to alleviate those possibilities.
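
As a rough illustration, you could count how many subjects would qualify for a 5X-style xSpec with a query along these lines (a sketch only; connectionDetails, the schema name, and the T2DM concept id 201826 are assumptions for this thread, not a PheValuator API):

library(DatabaseConnector)

# Sketch: count subjects with at least 5 occurrences of the HOI condition
# code -- the candidates for an "extremely specific" (5X) xSpec cohort.
sql <- "SELECT COUNT(*) AS n_subjects
        FROM (
          SELECT person_id
          FROM @cdm_schema.condition_occurrence
          WHERE condition_concept_id = @concept_id
          GROUP BY person_id
          HAVING COUNT(*) >= 5
        ) qualifying;"
connection <- connect(connectionDetails)               # connectionDetails assumed to exist
counts <- renderTranslateQuerySql(connection, sql,
                                  cdm_schema = "cdm",  # assumed CDM schema
                                  concept_id = 201826) # Type 2 diabetes mellitus
disconnect(connection)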

The evaluation cohort (in Step 2) only brings in subjects that were not in the set of subjects used to build the model (in Step 1). It also leaves in those in the xSpec cohort by necessity (a PLP software limitation); however, these subjects will be removed at phenotype algorithm evaluation time (Step 3). I think that the only subjects getting into the evaluation cohort are those from the xSpec, which will eventually get removed. If that is the case, you will have no subjects left to evaluate your phenotype algorithms, which demonstrates why you cannot use PheValuator on this dataset ("even though I know my dataset is not suitable").
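
The set logic can be sketched in a few lines of R (toy subject ids, not PheValuator internals):

# Toy illustration of the Step 1-3 set logic described above.
allSubjects <- 1:5000                         # everyone eligible in the database
xSpec <- 1:985                                # cases used to build the model (Step 1)
noisyNegatives <- 986:5000                    # non-cases used to build the model (Step 1)
modelSubjects <- union(xSpec, noisyNegatives)

# Step 2: evaluation cohort = subjects not used for the model, plus the xSpec
# (the xSpec stays in only because of the PLP limitation mentioned above)
evalCohort <- union(setdiff(allSubjects, modelSubjects), xSpec)

# Step 3: the xSpec subjects are removed before evaluating phenotype algorithms
length(setdiff(evalCohort, xSpec))            # 0 -- nobody left to evaluate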

SSMK-wq commented 4 years ago

Hi,

Yes, I understand. Thanks for your help. The model was built using xSpec (985) and the noisy negatives (4200+), which together equal my full population.

For evaluation, PheValuator excludes all subjects that were used during model building, leaving me with no subjects to test the model on.

But my question is: how do I see a message about a "prediction for 101 people" when there are no subjects at all? How did the model arrive at that figure of 101, or at an AUC value of 83.00, as shown in the screenshot above? Is it even possible?

jswerdel commented 4 years ago

It looks like it only used 4179 - 57 = 4122 noisy negatives, so not all the noisy negatives were used. My guess is that the 101 subjects are composed of 57 noisy positives and 44 noisy negatives. So there are some noisy negatives left, but too few to calculate performance characteristics.
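
The back-of-envelope arithmetic, using the counts from the screenshots (the 44 is a guess, as noted):

# Counts read off the screenshots above.
noisyNegativesUsed <- 4179 - 57    # 4122 noisy negatives went into the model
evalCohortSize <- 57 + 44          # 57 noisy positives + 44 leftover noisy negatives
evalCohortSize                     # 101, matching the "prediction for 101 people" message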

SSMK-wq commented 4 years ago

Hi,

No. As we discussed earlier, I modified my xSpec and xSens to be the same, so the createPhenotypeModel function produces an output like the one below.

[screenshot]

q1) May I know where the population size of 5107 comes from? The total number of records in the db is 5222, though.

And createEvalCohort produces an output like the one below.

[screenshot]

So, if we do 5107 - 985 = 4122, that is the noisy negatives count.

How and where can I find how many noisy negatives were used during model building?

And why do I see a prediction on 101 subjects for the evaluation cohort? Shouldn't it be zero? All subjects must have been used during model building, because we don't set any condition to exclude a few noisy negatives or anything like that.

SSMK-wq commented 4 years ago

This issue can be closed. Thanks

SSMK-wq commented 4 years ago

Hi @jswerdel ,

I see in the PheValuator code that you have a minimum number of cases required, based on the prevalence of the HOI in the dataset. I see that the minimum is 1500, but do you have any minimum requirement for the number of controls as well? If there are fewer than 100 controls, can PheValuator still work? I know we discussed the case count earlier, but now I would like to know about the control count. Thanks for your help.

jswerdel commented 4 years ago

There has been lots of research on case counts but none on non-case counts that I'm aware of. Likely that is because getting a good number of non-cases is usually not an issue. But I'm guessing the same counts will apply.

SSMK-wq commented 4 years ago

Hi @jswerdel ,

Thanks for the response. PheValuator doesn't impose any limit in the code, as far as I inspected. Is there any limit on the number of non-cases (controls) for PheValuator to work? Will it work with a 95:5 ratio of cases to controls?

jswerdel commented 4 years ago

No, I don't think it will work in your instance. My guess is that it would work with a 95:5 ratio of cases to non-cases if the number of non-cases was high (>1500, say), though I don't know this for sure. The idea is that the predictive modeling software needs more data (i.e., more subjects) in order to develop a better predictive model.

SSMK-wq commented 4 years ago

Hi @jswerdel ,

Yes, you are right that predictive models need more data (for cases and non-cases) to perform better. As my objective is to first run PheValuator end-to-end on our dataset, I was trying to understand the sample size constraints.

For example, I referred to your code, given below:

if (popPrev >= 0.3) {
  xspecSize <- 4000  # use large xSpec size for higher prevalence values
} else if (popPrev >= 0.2) {
  xspecSize <- 3000
} else if (popPrev >= 0.1) {
  xspecSize <- 2000
} else {
  xspecSize <- 1500  # use smaller xSpec size for lower prevalence values
}

# set the number of noisy negatives in the model either from the prevalence or to a 500K max
baseSampleSize <- min(as.integer(xspecSize/popPrev), 500000)  # use 500,000 as the largest base sample
I understand our dataset is quite different (due to the imbalance etc.), but considering the 95% prevalence (per the current data characteristics), as you said, we might need at least 1500 cases. Yet I see that with just 985 cases I am still able to run PheValuator.

In addition, for non-cases, as we see from your code above, there seems to be an option to set the number of non-cases that we need. May I check with you whether this was updated recently?

So I can modify the non-cases logic to suit my dataset as-is, right?

jswerdel commented 4 years ago

You cannot adjust the non-cases. The package selects the number of cases to fit the prevalence, then selects the number of non-cases to fill out the cohort population to test. In general, if the number of cases in your dataset is low, you can run the model with fewer than 1500 cases. However, the fewer the cases, the poorer the model will perform, due to a lack of information on cases vs. non-cases.