Open andkov opened 8 years ago
SMOKING:
ALSA - Judging from frequencies (90% NA), I'm guessing that PIPCIGAR is not asked unless someone says "yes" to the smoking Q. ...except that then it should be 91%, so maybe there are a few people who said "don't smoke" who in fact use a pipe or have the occasional cigar?
LBLS - I'm somewhat inclined to drop the 6 people who were inconsistent about their smoking responses (rows 1 and 6)
SATSA - do we know the order in which the questions were asked? * Do you want me to try to complete cells G15-G34? *
SHARE - looks fine
TILDA - By "undocumented code" do you mean that people responded something other than 98 (don't know) or 99 (refused)? (not sure how 3726 people don't know about whether they smoke now or not...) You could just drop line 8 (there is only one person). Seems like cell G8 should be "FALSE". I've added this. (do I need to tell you this, or will Excel edits show in GitHub?)
MARITAL:
ALSA - fine.
LBLS - fine
SATSA - OK
SHARE - Just wonder whether "married, living separated from spouse" should instead be sep_divorced. Problem is we don't know why not living together - could be spouse is institutionalized? there are only 19 people in this category. "Separated" does not seem to be an available category for people to select. Either way, if cohabiting is main feature, then sep_divorced would be the more appropriate option I think.
TILDA - fine.
SMOKING
Yes, please save the changes to the .csv files (make sure you don't save them as .xlsx files) and group commits according to construct (e.g. all h-rule-smoking-*-.csv files as one single commit). After you sync your commits, the harmonization definitions will become available from the cloud.
AGE:
ALSA - age in 1992
LBLS - age in 1994
SATSA - age in 1991
SHARE - age in 2004
TILDA - age in 2009
Age in years is fairly uncontroversial. HOWEVER, age in SHARE and TILDA participants is measured a decade later than the other three. I suggest keeping year_born in the dataset and considering the impact of generation. Although the sampling strategies likely differ, we could at least compare smoking rates and models for Sweden (SATSA) and Ireland (TILDA) with SHARE.
AGE:
I agree about keeping year_born. The harmonization report for age brings each data set to have three variables:
  study_name id year_of_wave age_in_years year_born
1       alsa  1         1992           86      1906
2       alsa  2         1992           78      1914
3       alsa  3         1992           89      1903
4       alsa  4         1992           78      1914
5       alsa  5         1992           85      1907
6       alsa  6         1992           92      1900
Some studies record one of these, some another. The minimum requirement from each study is all three.
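A minimal sketch of how the missing member of the triple could be filled in, assuming the simple relation age_in_years = year_of_wave - year_born (which ignores where the birthday falls within the year). The function name complete_age is hypothetical and not part of the project's scripts; the two example rows reuse values from the table above with one cell blanked.

```r
# Hypothetical helper: fill in age_in_years or year_born from the other two columns.
# Assumes age_in_years = year_of_wave - year_born (birthday timing is ignored).
complete_age <- function(d) {
  miss_age  <- is.na(d$age_in_years)
  miss_born <- is.na(d$year_born)
  d$age_in_years[miss_age] <- d$year_of_wave[miss_age] - d$year_born[miss_age]
  d$year_born[miss_born]   <- d$year_of_wave[miss_born] - d$age_in_years[miss_born]
  d  # rows missing both values stay NA
}

# Rows 1 and 2 of the table above, each with one value removed.
ages <- data.frame(
  study_name   = c("alsa", "alsa"),
  id           = c(1, 2),
  year_of_wave = c(1992, 1992),
  age_in_years = c(86, NA),
  year_born    = c(NA, 1914)
)
ages_complete <- complete_age(ages)
print(ages_complete)
```

Keeping all three columns makes it possible to model age and generation separately later on.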
Need to add "anthropometrics" (i.e., BMI)
EDUCATION:
ALSA - under 14 years is "less than high school"; 14 is high school; >14 is "more than high school"
LBLS - up to 11 years should be "less than high school", 12 is HS; 13 or more should be "more than high school"
SATSA - Elementary, Olevel, Vocational/Folk are "less than HS"; Gymnasium is HS; Univ or higher is "more than HS"
SHARE - None of the options appear to represent education beyond High school (lines 7-64 are all "secondary"). Are some options missing?
TILDA - None and everything up to Intermediate are "less than HS"; Leaving Certificate is "HS"; Diploma and above is "more than HS". I have edited, committed and synced this one.
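For the two studies with numeric education values (ALSA and LBLS), the cut points proposed above could be sketched as follows. The names hs_cutoff and harmonize_educ3 are hypothetical, and the sketch assumes whole-number values. The categorical studies (SATSA, SHARE, TILDA) would need a lookup table rather than a cut point.

```r
# Value that counts as "high school" in each study, taken from the comments above.
hs_cutoff <- c(alsa = 14, lbls = 12)

# Hypothetical helper: map a numeric education value to three categories.
harmonize_educ3 <- function(study, value) {
  hs <- hs_cutoff[study]  # per-row cut point, looked up by study name
  out <- ifelse(value < hs, "less than high school",
         ifelse(value == hs, "high school", "more than high school"))
  unname(out)             # NA input stays NA
}

educ3 <- harmonize_educ3(
  study = c("alsa", "alsa", "alsa", "lbls", "lbls", "lbls"),
  value = c(13, 14, 15, 11, 12, 13)
)
print(educ3)
```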
@andkov - Can you please add descriptives for physical activity, perceived health and BMI so I can help you with these?
@ampiccinin, will do, but I can't get started on it until Friday morning. Expect first results by late Friday morning.
Do I understand correctly that the harmonized data are then to be pooled and analyzed within about a week and a half from now? Will that leave enough time? A
Yes, You understand the proposed timetable correctly.
The variables for which harmonization rules have been established and verified can already be combined and analyzed. The more harmonized variables there are in the combined data set, the wider the range of models we can fit with them.
I will be able to spend Friday and Saturday on this. Sunday and Monday I'll have to spend on Amsterdam. So I'll do as much describing and harmonizing as I can so that other members of the exercise team can have richer data to model from. I'll be able to join the modelling effort after the Multistate workshop is over.
OK – thanks – current smoking, sex, age, marital status and education should be just about ready.
Please check, though, as I did not make all the edits I noted in the “Comments”, only the ones with commits. At first I was hesitant to make big changes to what you had already done, but then I realized that was not helpful. …but I did not go back to do the first ones as I was running out of time.
Alcohol is tricky. There is FREQUENCY and QUANTITY. It would actually be helpful to see a cross-tabulation or some other way of looking at the two together.
I’ll aim to do something for Alcohol and Retirement by Friday morning – or afternoon if you will be busy with the other covariates in the morning.
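A cross-tabulation of the two alcohol items could look something like the sketch below. The column names, response labels and rows are all invented for illustration and do not come from any of the five studies.

```r
# Toy responses, invented for illustration only (not study data).
alcohol <- data.frame(
  frequency = c("never", "monthly", "weekly", "weekly", "daily", NA),
  quantity  = c(NA, "1-2 drinks", "1-2 drinks", "3+ drinks", "1-2 drinks", NA)
)

# Two-way table of FREQUENCY by QUANTITY; useNA keeps the missing
# combinations visible, which helps reveal skip patterns.
xt <- table(
  frequency = alcohol$frequency,
  quantity  = alcohol$quantity,
  useNA     = "ifany"
)
print(xt)
```

Keeping the NA row and column shows whether QUANTITY is skipped for people who report never drinking.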
ALCOHOL -
Not at all sure what to do with this:
SATSA - difference between GALCOHOL ("Do you ever drink alcoholic beverages?") and GEVRALK ("Do you ever drink alcoholic drinks?")
back @ampiccinin
Need to add "anthropometrics" (i.e., BMI)
I like the construct "anthropometrics" to capture bmi, weight, height, leg length and such. I just started to use "physique" instead to refer to such measures collectively, but only because I hadn't thought of "anthropometrics", which I now like better. The only downside is typing the 15-letter word "anthropometrics". Can you live with "physique" instead of "anthropometrics"?
Back @andkov : lol - physique is fine.
I only used anthropometrics because I didn't think of physique!!
Current State
In the context of your proposed report structure and my comments on it, the following has been accomplished so far:
Accomplished. The data provisioning documentation is contained in the Ellis Island report, which implements the following:
./data/shared/derived/meta-data-live.csv, which is updated every time the Ellis Island script is executed.
./data/shared/meta-data-map.csv.
They are used by automatic scripts in later harmonization and analysis.
Accomplished partially. To govern the metadata of the variables selected for harmonization (150 items over 5 studies), a metadata table has been created and built into the project's ecosystem. Basically, it's a .csv table that maps existing properties of the variables as observed in the source files onto additional metadata properties (e.g. type, construct, short_lable, new_item_name, etc.). With this number of items, it's impractical to fill in all possible metadata, although that would of course be highly desirable. However, some metadata values are in flux, because as you start exploring the item space, you change your categorization, labeling, or renaming. If you don't plan to edit the csv and just want to look at the available metadata, see this dynamic table. If you want to add or edit metadata, edit this file.
Accomplished partially. This report shows which harmonized variables are available in the pooled dataset at the moment, provided, of course, that the harmonization rules (and prerequisite categorizations of continuous variables) have been established as valid (see my note on harmonization rules).
The structure of the scripts I have developed allows for modular development. The more variables are processed and the more harmonization rules are encoded, the larger the resulting combined dataset becomes.
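One way to picture a harmonization rule encoded as a table is the sketch below: each row of the rule table pairs one combination of raw responses with a harmonized value, and applying the rule is a left join on the raw variables. This is not the project's actual script. The function apply_hrule, the toy rows, and the smoke_now values in the rule table are all placeholders invented for illustration.

```r
# Raw data for one study (toy rows, invented for illustration).
raw <- data.frame(
  id       = 1:4,
  SMOKER   = c("YES", "NO", "NO", "YES"),
  PIPCIGAR = c("NO", "NO", "YES", "YES"),
  stringsAsFactors = FALSE
)

# Step 1: the observed response patterns, which would be exported for inspection.
patterns <- unique(raw[c("SMOKER", "PIPCIGAR")])

# Step 2: a rule table as it might look after manual editing: one row per
# pattern plus a hand-filled harmonized column (values here are placeholders).
hrule <- data.frame(
  SMOKER    = c("YES", "YES", "NO", "NO"),
  PIPCIGAR  = c("NO", "YES", "NO", "YES"),
  smoke_now = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Step 3: hypothetical helper that applies a rule table via a left join,
# keeping the original row order; unmatched patterns get NA.
apply_hrule <- function(data, rule, by) {
  data$.row <- seq_len(nrow(data))
  out <- merge(data, rule, by = by, all.x = TRUE, sort = FALSE)
  out <- out[order(out$.row), ]
  out$.row <- NULL
  rownames(out) <- NULL
  out
}

harmonized <- apply_hrule(raw, hrule, by = c("SMOKER", "PIPCIGAR"))
print(harmonized)
```

Because the rule lives in a table rather than in code, changing a decision (for instance, how to treat pipe or cigar users) only means editing one cell and re-running the script.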
Each study has a group of variables (e.g. "alsa" = c("SMOKER", "PIPCIGAR")) that contributes to computing a particular harmonized variable (e.g. smoke_now). Each of these schema sets will have a particular pattern of possible response values to these variables (e.g. c("SMOKER"="YES", "PIPCIGAR"="NO")), which we export for inspection as .csv tables. We will then manually edit these .csv tables, populating new columns that map values of the harmonized variables to the specific response patterns of the schema set variables. I've built the scripts to import the harmonization algorithms encoded in .csv tables and apply them to compute harmonized variables in a dataset that combines raw and harmonized variables. I think smoking is harmonized quite neatly; for the other harmonized variables, the current values in the .csv maps are only placeholders I was testing the script with.
Progress Report
smoking construct.
education and work_status, but their h-rules will be messier than what I've done so far.
Resume