BrianNathanWhite / OpenLong

Shares Synthetic Longitudinal Data And Code For Formatting Real Data

Incidence Variables #6

Open simar7511 opened 3 months ago

simar7511 commented 3 months ago
bcjaeger commented 3 months ago

@BrianNathanWhite, this would be a good question for us to discuss with you. It looks like the ABC data had some variables in the baseline dataset that were 'incident' variables, e.g., 'incidence of MI'. I wasn't sure whether the intention there was to create variables that indicate, e.g., "history of MI". The reason I got confused is that 'incident MI' implies an MI event that occurs in the future, whereas the baseline data are intended to be cross-sectional.

BrianNathanWhite commented 3 months ago

@bcjaeger @simar7511 So I've been trying to get to the bottom of this. When I initially processed the Health ABC data, I did not need to worry about what the coded levels of the various categorical variables meant, as I was simply translating into R the SAS code Jaime provided to process the data.

I was able to track down a PDF of the SAS formats along with a data dictionary for the Health ABC data; however, some of the variables we are interested in appear to be missing.

For instance, the CHDMI variable in the Health ABC data has the format subtitle "Incid MI" with levels 0 through 3; however, CHDMI does not show up in either the data dictionary Excel file or the SAS formats PDF.

I've run into the same issue with a number of the other blank variable entries in the shared key.
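To make the gap concrete, the check described above amounts to a set difference between the columns in the processed data and the names documented in either source. Here is a minimal sketch in Python, assuming the column names and documented names have already been read into plain lists. Only CHDMI comes from this thread; `VAR_A`, `VAR_B`, and the function name `undocumented_variables` are hypothetical placeholders.

```python
def undocumented_variables(data_columns, dictionary_names, format_names):
    """Return columns that appear in neither the data dictionary nor the formats list."""
    # Compare case-insensitively, since SAS names are not case-sensitive.
    documented = {name.strip().upper() for name in dictionary_names} | {
        name.strip().upper() for name in format_names
    }
    return sorted(col for col in data_columns if col.strip().upper() not in documented)


# Hypothetical inputs: only CHDMI is taken from the thread; the rest are placeholders.
data_columns = ["CHDMI", "VAR_A", "VAR_B"]
dictionary_names = ["var_a"]   # names found in the data dictionary Excel file
format_names = ["VAR_B"]       # names found in the SAS formats PDF

missing = undocumented_variables(data_columns, dictionary_names, format_names)
print(missing)
```

Running something like this over the full column list would give us one list of variables to ask Jaime about, rather than finding them one at a time.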

BrianNathanWhite commented 3 months ago

@bcjaeger @simar7511 Perhaps the simplest thing to do about the ambiguous Health ABC variables would be to wait for Jaime to return and see if she can provide this info, since she wrote the original SAS code that pre-processes the Health ABC data and picked the variables to include in the processed data.

More generally, I think the question should shift from "What exactly does this Health ABC variable mean?" to "What variables are we interested in, in general, across the data sets?" For something like incidence of MI, we could start by noting all of the variables that could contain information relevant to that event (which could take many forms depending on how the question was phrased, e.g., ever/never, within the last X years, family history, etc.).

What are your thoughts?
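As a starting point for that inventory, one option is a keyword search over each data set's variable names and labels. The following is a rough Python sketch, assuming each data dictionary has been loaded as a mapping from variable name to label. The `other_study` entries, `VAR_A`, the keyword list, and the function name `find_candidate_variables` are hypothetical; only CHDMI and its "Incid MI" label come from this thread.

```python
import re


def find_candidate_variables(dictionaries, keywords):
    """Map each data set to the variables whose name or label matches any keyword."""
    # Word boundaries keep a short keyword like "MI" from matching inside "family".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    hits = {}
    for dataset, entries in dictionaries.items():
        matched = [
            name
            for name, label in entries.items()
            if pattern.search(name) or pattern.search(label)
        ]
        if matched:
            hits[dataset] = sorted(matched)
    return hits


# Hypothetical dictionaries: variable name -> label.
dictionaries = {
    "health_abc": {"CHDMI": "Incid MI", "VAR_A": "Placeholder label"},
    "other_study": {
        "MI_EVER": "Ever told by a doctor you had a myocardial infarction",
        "FH_MI": "Family history of MI",
    },
}
keywords = ["MI", "myocardial infarction", "heart attack"]

hits = find_candidate_variables(dictionaries, keywords)
print(hits)
```

A search like this only surfaces candidates. Each hit would still need a manual check to decide whether it describes an incident event, a history (ever/never), a recent window, or family history.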