BrianNathanWhite / OpenLong

Shares Synthetic Longitudinal Data And Code For Formatting Real Data

How to code race and ethnicity variables #8

Open bcjaeger opened 5 months ago

bcjaeger commented 5 months ago

Every study probably collects these differently and it will be important to specify how we'd like to create a race and ethnicity variable in each study for harmonization.

BrianNathanWhite commented 5 months ago

I agree. More generally, I think it might be better to abstract the process of picking standardized variables (names, levels, scale, etc.) away from the Health ABC data set. The Health ABC variables are useful starting points, but we shouldn't necessarily adopt its particular variable choices as the standard.

Would it make more sense to collect the variables as they are for the data sets we are interested in first and then see what the intersection is across variables (once again, levels, scale) or should we determine this a priori?

I guess this is the fundamental question to answer so that we can code data_clean and data_derive.
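
To make the "collect first, then intersect" option concrete, here is a minimal sketch in Python, used purely for illustration and not taken from the project's code. The study names, variable names, and levels are all hypothetical placeholders; the only point is that both the shared variables and their shared levels can be computed mechanically once each study's inventory is written down.

```python
# Hypothetical inventories: variable name -> set of levels as collected.
# (An empty set stands for a continuous variable with no discrete levels.)
inventories = {
    "study_1": {"sex": {"female", "male"}, "race": {"A", "B", "C"}, "age": set()},
    "study_2": {"sex": {"female", "male"}, "race": {"A", "B", "D"}},
}

# Variables collected by every study.
shared_vars = set.intersection(*(set(inv) for inv in inventories.values()))

# For each shared variable, the levels that every study records.
shared_levels = {
    var: set.intersection(*(inv[var] for inv in inventories.values()))
    for var in sorted(shared_vars)
}
print(shared_levels)
```

Deciding the standard a priori would instead amount to writing `shared_levels` by hand and mapping each study onto it.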

bcjaeger commented 5 months ago

> Would it make more sense to collect the variables as they are for the data sets we are interested in first and then see what the intersection is across variables (once again, levels, scale) or should we determine this a priori?

@BrianNathanWhite, this is a great idea. May I open up a separate issue to discuss this?

Re: the race and ethnicity variables, I wonder if we can do something programmatic. What do you think of this approach?

For a given study, we will include all race categories that account for

- at least 5% of the study's participants
- at least 250 participants

The categories that do not meet one of these criteria will be grouped into 'other'

E.g., if a study has 1000 participants and race categories A (n=250), B (n=500), C (n=200), and D (n=50), then the derived race variable for that study would include categories A, B, and 'other', where 'other' combines C and D.

I don't feel strongly about the 5% threshold or the n=250 threshold. We can discuss what would make the most sense for those numbers if you all like the logic of this approach.
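
A minimal sketch of this grouping logic, written in Python for illustration only (the project's own implementation may look quite different). It assumes a category is kept only when it meets both the proportion threshold and the count threshold, which is what the example with categories A through D implies. The function name `retained_race_categories` and its parameter names are hypothetical.

```python
def retained_race_categories(counts, min_proportion=0.05, min_n=250):
    """Return the categories to keep; all others would be grouped into 'other'.

    Assumption: a category must satisfy BOTH thresholds to be kept.
    """
    total = sum(counts.values())
    return {
        category
        for category, n in counts.items()
        if n >= min_n and n / total >= min_proportion
    }


# Worked example from the comment above: 1000 participants in one study.
study_counts = {"A": 250, "B": 500, "C": 200, "D": 50}
kept = retained_race_categories(study_counts)
grouped_as_other = set(study_counts) - kept
print(sorted(kept), sorted(grouped_as_other))  # ['A', 'B'] ['C', 'D']
```

Category C illustrates why the combination rule matters: it holds 20% of participants but only 200 people, so it is kept under an "either threshold" rule and grouped into 'other' under the "both thresholds" rule assumed here.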

BrianNathanWhite commented 5 months ago

@bcjaeger Yes, it makes sense to open this as its own issue. Also, I like the programmatic approach. I think it is consistent with the overall philosophy of making the data processing algorithm transparent (no black box), easily debugged, and modifiable (if needed).

bcjaeger commented 5 months ago

> and modifiable (if needed)

This makes me wonder if we should make these two thresholds inputs to the data cleaning function.

I.e., data_clean(race_group_min_proportion = 0.05, race_group_min_n = 250)
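
As a rough illustration of what exposing the thresholds as inputs could look like, here is a sketch in Python (chosen only for illustration; it is not the project's code). The two parameter names come from the call above, but everything else is an assumption: the `records` argument, the `'race'` key, and the rule that a group must meet both thresholds to be kept.

```python
from collections import Counter


def data_clean(records, race_group_min_proportion=0.05, race_group_min_n=250):
    """Hypothetical stand-in for data_clean: recode small race groups as 'other'.

    `records` is a list of dicts with a 'race' key, one per participant.
    Assumption: a group is kept only if it meets both thresholds.
    """
    counts = Counter(r["race"] for r in records)
    total = len(records)
    keep = {
        race
        for race, n in counts.items()
        if n >= race_group_min_n and n / total >= race_group_min_proportion
    }
    # Return new records rather than mutating the caller's data.
    return [
        {**r, "race": r["race"] if r["race"] in keep else "other"}
        for r in records
    ]


# Toy study with thresholds lowered from the defaults so the effect is visible.
toy = [{"id": i, "race": "A"} for i in range(6)] + [{"id": 6, "race": "B"}]
cleaned = data_clean(toy, race_group_min_proportion=0.2, race_group_min_n=2)
print(Counter(r["race"] for r in cleaned))
```

Keeping the thresholds as arguments with defaults means a study that needs different cut-offs can override them at the call site without touching the function body.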