OHDSI / PheValuator

An R package for evaluating phenotype algorithms.
https://ohdsi.github.io/PheValuator/

Population Size from temp tables? #22

Closed SSMK-wq closed 4 years ago

SSMK-wq commented 4 years ago

Hello @jswerdel,

while I was trying to inspect the other errors, I came across another scenario.

[screenshot: function output showing Cases = 3508 and Population Size = 3849]

As you can see, there are two things I would like to understand:

1) May I know where PheValuator gets the population size from?

I understand that the "Cases" count comes from the XSpec cohort, but where does the "Population Size" count come from?

It is certainly not from the XSens cohort or from the total number of records in our database: the XSens count is ~4100 and the database count is ~5200.

When I looked at the temp table counts, none of the tables contain this count of 3849.

Can I kindly request your help in understanding this? Where does the population size come from?

2) Where can I find information on the minimum number of subjects required to build a model? Is it shown anywhere in the code? I understand that a model built using 50 cases may not be useful, but is this condition defined anywhere in the code?

jswerdel commented 4 years ago

The proportion of cases and non-cases is determined by the prevalence which is calculated using the prevalence cohort (probably the same as your xSens cohort). In your case, based upon previous notes, the prevalence was > 90% which is about the same as 3508/3849. The function adds non-cases in proportion to the percent of non-cases in the population, in this case about 9%.

SSMK-wq commented 4 years ago

Hi @jswerdel ,

No, our record counts are different. Please see the counts of my defined cohorts below; I am sure about these.

- XSpec = 3508 subjects
- XSens = 4786 subjects
- Prevalence = XSens / total = 4786 / 5222 = 91%
- Noisy Negatives = 436 subjects
- Total number of subjects in db = 5222

I understand the Noisy Negatives count is low, but since the issue we are discussing here is a different one, hopefully it does not tie back to my dataset.

But my question is: how did 3849 become the population size?

Even if I add noisy negatives to my XSpec, it is 3508 + 436 = 3944.

Hence I am a bit confused.

> The function adds non-cases in proportion to the percent of non-cases in the population, in this case about 9%.

Can you help me understand this? I guess the detail lies in this message; can you help me see how it became 3849?

jswerdel commented 4 years ago

3508/3849 (population modeled) ~= 4786/5222 (total population) ~= 91%. The modeling process in this case uses all the cases (xSpec = 3508) plus enough non-cases (noisy negatives) to bring the prevalence in that population to about the same as the population prevalence. In your case prevalence ~= 91%, 3508 cases available (xSpec) + 341 non-cases = 3849 total subjects in model population. 341/3849 ~= 9%
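The arithmetic above can be checked in a few lines (plain Python used here just to reproduce the numbers from this thread; PheValuator itself is an R package and its internal sampling code will differ):

```python
# Numbers from this thread
xspec_cases = 3508       # all xSpec subjects, used as cases
noisy_negatives = 341    # non-cases added to match the population prevalence
model_population = xspec_cases + noisy_negatives   # 3849

xsens_subjects = 4786    # prevalence cohort
total_subjects = 5222    # all subjects in the database

model_prevalence = xspec_cases / model_population        # ~0.911
population_prevalence = xsens_subjects / total_subjects  # ~0.917

# Both prevalences are ~91%, so non-cases make up ~9% of the
# model population, as described above.
print(round(model_prevalence, 3), round(population_prevalence, 3))
```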

SSMK-wq commented 4 years ago

Fantastic. This issue can be closed. Thanks.

SSMK-wq commented 4 years ago

Hi,

but a quick question here.

a) I understand that the XSpec cohort identified 3508 subjects as positive cases, and that based on this proportion of positive cases, it pulled in another 9% of subjects as negative cases to match the prevalence of the full population. I understand the train/test split happens within these 3849 subjects. Am I right?

But what about rest of the subjects?

Meaning, my dataset has a total of 5222 records.

So, 5222 - 3849 (population size from previous step) = 1373 subjects

b) Why were these 1373 subjects left out? They weren't part of XSpec, so they could only be in XSens or the Noisy Negatives. In our case, based on a positive population of 91%, we only used another 9% of negative cases to reach 100% (to match the full population prevalence). But these remaining subjects can't be used in the evaluation cohort. Am I right?

c) Because the evaluation cohort will only look for subjects under XSpec which were not used during model building. Am I right? So where do these 1373 subjects go?

jswerdel commented 4 years ago

a) yes. The rest of the subjects were not needed for the modeling. They will likely be used in the evaluation step (step 2) as this step only pulls in subjects not used in the modeling process.

b) They can, and likely will, be used in the evaluation step as these are the only subjects left in your cohort that weren't used in the modeling step.

c) The evaluation cohort is built specifically excluding ANY subjects used in the modeling step, both the noisy positives and the noisy negatives.

SSMK-wq commented 4 years ago

@jswerdel

Can I kindly check with you on the below?

1) Any subject which doesn't fit within the XSpec or XSens cohorts should automatically fall into the Noisy Negatives. Am I right? And for model building we use XSpec and Noisy Negatives only. So am I right to understand that these remaining 1373 are from the XSens cohort? And will they always be only from the XSens cohort, since the model always uses ALL of XSpec and the Noisy Negatives?

2) If you meant to say that the evaluation cohort will contain only subjects from the XSens cohort, then I guess there is a typo in the doc: the CreateEvaluationCohort function has a parameter called XSpec.

jswerdel commented 4 years ago

1) Yes, also realizing that xSpec is a subset of xSens (at least as it was intended for use). The 1373 are any subjects from your overall cohort not used in the modeling process. These would include xSens and xSpec subjects and the rest of the noisy negatives, so long as they were not used in the modeling process. In most cases the model only uses a fraction of the xSpec subjects and noisy negatives in a dataset; your case is special.

2) The evaluation cohort building function requests the xSpec cohort id so it can include a small number of these as cases so the function will work correctly. These are then excluded at algorithm evaluation time (step 3).

SSMK-wq commented 4 years ago

1) May I know why the model leaves out a fraction of the XSpec, XSens, and Noisy Negatives subjects?

When I generate an XSpec cohort in Atlas and it produces 3500 subjects, I expect the model to use all 3500 of them, and similarly for XSens and Noisy Negatives.

On what basis, and why, does it leave out some records? I couldn't find this anywhere in the doc. Can you help me understand this one last thing, or direct me to a resource where I can find information on it?

jswerdel commented 4 years ago

The function was designed to optimize the time for model building. It has been shown that about 1500 cases is a good number for building a model; you do not improve your model performance by adding more cases (or improve it very little). Using fewer subjects in the model reduces the time to build it. The function was designed to use about 1500 cases and then add enough noisy negatives to get the prevalence of the HOI correct. So if the prevalence is 10%, the model would use 1500 cases and 13500 noisy negatives to get a prevalence of 1500/(1500+13500) = 10%. The selection is done by randomly sampling subjects from the xSpec cohort and the noisy negatives.
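The sizing rule described above can be sketched in a few lines (an illustrative Python translation of the arithmetic; `non_cases_needed` is a made-up name, not a PheValuator function, and the real package also involves rounding and random sampling):

```python
def non_cases_needed(cases: int, prevalence: float) -> int:
    """Noisy negatives to add so that cases / (cases + non_cases)
    matches the target prevalence of the health outcome of interest."""
    return round(cases * (1 - prevalence) / prevalence)

# The 10% prevalence example from the comment above:
cases = 1500
negatives = non_cases_needed(cases, 0.10)  # 1500 * 0.9 / 0.1 = 13500
total = cases + negatives                  # 15000
print(negatives, cases / total)            # 13500 0.1
```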

SSMK-wq commented 4 years ago

Oh okay. As you mentioned earlier, PheValuator only works when the data resembles the real population, meaning only when the prevalence of a disease in the database matches that of the real population (irrespective of the prevalence rate in any country). PheValuator always uses only about 1500 cases (an upper ceiling), and the remainder (90% in your example) is filled in with noisy negatives. Hopefully I got this right.

Thanks a ton for patiently answering all my queries; much appreciated. I understand there have been a lot of questions over the past few days. Kindly let me know if you need any support with running tests for PheValuator, SQL, Python, etc. I thank you for your time and patience.

jswerdel commented 4 years ago

Happy to help. Thank you for the kind offer of assistance.

SSMK-wq commented 4 years ago

Hi @jswerdel,

I understand how these numbers are determined, but I have a few questions again, as I see some differences between what is present and what is expected.

a) Total number of records in db = 5222
b) XSpec (5X) definition = 2311
c) XSens (1X) definition = 3089
d) Prevalence = 3089 / 5222 = 59.1%

When I use PheValuator, the first function (createEvaluationCohort) produces the record counts below:

[screenshot: createEvaluationCohort output showing Cases = 2311 and Population Size = 4036]

If I try to calculate the prevalence based on above, it is 2311/4036 = 57.2%

question 1

Why is this prevalence (57.2%) different from the original prevalence of 59.1%? That is, why did PheValuator choose a population size of 4036 instead of 3904? If it had chosen 3904, we would have got 2311/3904, which is approximately the original prevalence of 59.1%. I could then understand that the remaining (3904 - 2311 = 1593) records are pulled in to resemble the original mix of positive and negative classes. So why 4036 instead of 3904?
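The arithmetic behind question 1 can be restated in a few lines (plain Python used only to reproduce the numbers in this comment; PheValuator itself is an R package):

```python
# Numbers from question 1 above
xspec = 2311              # XSpec (5X) cohort
xsens = 3089              # XSens (1X) cohort
db_total = 5222           # total records in the database
model_population = 4036   # Population Size reported by createEvaluationCohort

observed = xspec / model_population   # prevalence in the model population, ~0.5726
expected = xsens / db_total           # prevalence in the database, ~0.5915

# The ~2-percentage-point gap between these two values is what
# question 1 is asking about.
print(round(observed, 4), round(expected, 4))
```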

question 2

Now, under the same function (createEvaluationCohort), I see that the model starts prediction on 820 people, as shown below:

```
2020-04-05 15:39:14 Starting Prediction
2020-04-05 15:39:14 for 820 people
Removing infrequent covariates
```

But if we go by the population size (4036) used by the model in step 1, shouldn't I expect to see 5222 minus 4036 = 1186 records in the evaluation cohort? Shouldn't the prediction be for 1186 people? Why, and on what basis, do we get only 820 records? I know it is a mix of noisy negatives and noisy positives, but I am trying to understand the counts. Where did the rest (1186 - 820 = 366) go? Are they not used in our phenotype assessment at all?

I hope I have put my questions into words properly. Can I kindly request your help with this?