IALSA / ialsa-2016-groningen

Maelstrom Harmonization Workshop. Assessing the impact of different harmonization procedures on the analysis results from several real datasets.
GNU General Public License v2.0

2016-04-04 with A.Piccinin #8

Open andkov opened 8 years ago

andkov commented 8 years ago

Current State

In the context of your proposed report structure and my comments on it, the following has been accomplished so far:

Section 1: Read in each of five data sets

Accomplished. The data provisioning documentation is contained in the Ellis Island report, which implements the following :

Section 2: Relabel and transform variables (organized by data set; as discussed yesterday)

Accomplished partially.
To govern the metadata of the variables selected for harmonization (150 items over 5 studies), a metadata table has been created and built into the project's ecosystem. Basically, it's a .csv table that maps the existing properties of the variables, as observed in the source files, onto additional metadata properties (e.g. type, construct, short_lable, new_item_name, etc.). With this number of items, it's impractical to fill in all possible metadata, although that would of course be highly desirable. Also, some metadata values are in flux, because as you start exploring the item space, you change your categorization, labeling, or renaming. If you don't plan to edit the csv and just want to look at the available metadata, see this dynamic table. If you want to add or edit metadata, edit this file.
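As a rough illustration of how such a metadata table can be used, here is a minimal sketch in Python (written for illustration, not taken from the project's scripts). The column names follow the properties mentioned above; the rows, the variable EDUC94, and the helper names load_metadata and variables_for are hypothetical.

```python
import csv
import io

# Hypothetical excerpt of a metadata table. Only the column names echo the
# properties named in the comment; the rows are invented for illustration.
META_CSV = """study_name,name,type,construct,new_item_name
alsa,SMOKER,factor,smoking,smoke_now_raw
alsa,PIPCIGAR,factor,smoking,smoke_pipe_raw
lbls,EDUC94,numeric,education,educ_years
"""

def load_metadata(text):
    """Index metadata rows by (study_name, source variable name)."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["study_name"], row["name"]): row for row in reader}

def variables_for(meta, study, construct):
    """List the source variables of one study that belong to a construct."""
    return [name for (s, name), row in meta.items()
            if s == study and row["construct"] == construct]

meta = load_metadata(META_CSV)
print(variables_for(meta, "alsa", "smoking"))
```

Keeping the table as a plain .csv means it can be edited by hand and versioned alongside the scripts that read it.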

Section 3: Combine into single data set (include study level dummy variables)

Accomplished partially. This report shows which harmonized variables are available in the pooled dataset at the moment. This is, of course, provisional on the harmonization rules (and the prerequisite categorizations of continuous variables) having been established as valid (see my note on harmonization rules).

The structure of the scripts I have developed allows for modular development. The more variables are processed and the more harmonization rules are encoded, the larger the resultant combined dataset becomes.
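A minimal sketch of the pooling step with study-level dummy variables, written in Python for illustration only. The function name pool_with_dummies, the study_<name> column naming, and the example rows are assumptions, not the project's actual code or data.

```python
def pool_with_dummies(datasets):
    """Stack per-study rows and add one 0/1 indicator column per study.

    `datasets` maps a study name to a list of row dicts.
    """
    studies = sorted(datasets)
    pooled = []
    for study in studies:
        for row in datasets[study]:
            out = dict(row, study_name=study)
            for s in studies:
                # Indicator is 1 only for the study the row came from.
                out["study_" + s] = int(s == study)
            pooled.append(out)
    return pooled

# Invented rows, for illustration only.
datasets = {
    "alsa": [{"id": 1, "smoke_now": False}],
    "tilda": [{"id": 1, "smoke_now": True}, {"id": 2, "smoke_now": False}],
}
pooled = pool_with_dummies(datasets)
for row in pooled:
    print(row)
```

In a regression one of the indicator columns would normally be dropped to serve as the reference study.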

Section 4: Estimate models (ever smoked as primary outcome to start; logistic regression)

  • not undertaken.

Section 5: Table results and compute odds ratio for covariates

  • not undertaken

    Note on Harmonization rules (h-rules)

Each study has a group of variables (e.g. "alsa" = c("SMOKER", "PIPCIGAR")) that contributes to computing a particular harmonized variable (e.g. smoke_now). Each of these schema sets will have a particular pattern of possible response values to these variables (e.g. c("SMOKER"="YES", "PIPCIGAR"="NO")), which we export for inspection as .csv tables. We will then manually edit these .csv tables, populating new columns that map values of the harmonized variables to the specific response patterns of the schema set variables. I've built the scripts to import the harmonization algorithms encoded in the .csv tables and apply them to compute harmonized variables in a dataset that combines raw and harmonized variables. I think smoking is harmonized quite neatly. As for the other harmonized variables, the current values in the .csv maps are only placeholders I was testing the script with.
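The h-rule mechanism described above can be sketched roughly as follows, in Python for illustration. The table contents (which response pattern maps to which smoke_now value), the helper names load_hrule and apply_hrule, and the choice to return "NA" for patterns missing from the table are all assumptions, not the project's actual rules.

```python
import csv
import io

# Hypothetical h-rule table for one study. SMOKER and PIPCIGAR hold an
# observed response pattern; smoke_now is the manually filled-in column.
HRULE_CSV = """SMOKER,PIPCIGAR,smoke_now
YES,NO,TRUE
YES,YES,TRUE
NO,NA,FALSE
NA,NA,NA
"""

def load_hrule(text, source_vars, target):
    """Map each response pattern (a tuple of values) to the harmonized value."""
    rules = {}
    for row in csv.DictReader(io.StringIO(text)):
        pattern = tuple(row[v] for v in source_vars)
        rules[pattern] = row[target]
    return rules

def apply_hrule(rows, rules, source_vars, target):
    """Return copies of rows with the harmonized variable added.

    Patterns not listed in the rule table get "NA" (an assumption).
    """
    out = []
    for row in rows:
        pattern = tuple(row[v] for v in source_vars)
        out.append(dict(row, **{target: rules.get(pattern, "NA")}))
    return out

source_vars = ["SMOKER", "PIPCIGAR"]
rules = load_hrule(HRULE_CSV, source_vars, "smoke_now")
raw = [
    {"id": 1, "SMOKER": "YES", "PIPCIGAR": "NO"},
    {"id": 2, "SMOKER": "NO", "PIPCIGAR": "NA"},
    {"id": 3, "SMOKER": "NO", "PIPCIGAR": "YES"},  # pattern not in the table
]
harmonized = apply_hrule(raw, rules, source_vars, "smoke_now")
print([r["smoke_now"] for r in harmonized])
```

Because the rules live in the .csv rather than in the code, changing a mapping is an edit to the table and needs no change to the script.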

Progress Report

ampiccinin commented 8 years ago

9

SMOKING:

ALSA - Judging from frequencies (90% NA), I'm guessing that PIPCIGAR is not asked unless someone says "yes" to the smoking Q. ...except that then it should be 91%, so maybe there are a few people who said "don't smoke" who in fact use a pipe or have the occasional cigar?

LBLS - I'm somewhat inclined to drop the 6 people who were inconsistent about their smoking responses (rows 1 and 6)

SATSA - do we know the order in which the questions were asked? * Do you want me to try to complete cells G15-G34? *

SHARE - looks fine

TILDA - By "undocumented code" do you mean that people responded something other than 98 (don't know) or 99 (refused)? (not sure how 3726 people don't know about whether they smoke now or not...) You could just drop line 8 (there is only one person). Seems like cell G8 should be "FALSE". I've added this. (do I need to tell you this, or will Excel edits show in GitHub?)

ampiccinin commented 8 years ago

MARITAL:

ALSA - fine.

LBLS - fine

SATSA - OK

SHARE - Just wonder whether "married, living separated from spouse" should instead be sep_divorced. Problem is we don't know why not living together - could be spouse is institutionalized? there are only 19 people in this category. "Separated" does not seem to be an available category for people to select. Either way, if cohabiting is main feature, then sep_divorced would be the more appropriate option I think.

TILDA - fine.

andkov commented 8 years ago

SMOKING

Yes, please save the changes to the .csv files (make sure you don't save them as .xlsx files) and group commits according to construct (e.g. all h-rule-smoking-*-.csv files as one, single commit). After you sync your commits, the harmonization definitions will become available from the cloud.

ampiccinin commented 8 years ago

11

AGE:

ALSA - age in 1992

LBLS - age in 1994

SATSA - age in 1991

SHARE - age in 2004

TILDA - age in 2009

Age in years is fairly uncontroversial. HOWEVER, age in SHARE and TILDA participants is measured a decade later than the other three. I suggest keeping year_born in the dataset and considering the impact of generation. Although the sampling strategies likely differ, we could at least compare smoking rates and models for Sweden (SATSA) and Ireland (TILDA) with SHARE.

andkov commented 8 years ago

11

AGE: I agree about keeping year_born. The harmonization report for age brings each data set to have three variables:

  study_name id year_of_wave age_in_years year_born
1       alsa  1         1992           86      1906
2       alsa  2         1992           78      1914
3       alsa  3         1992           89      1903
4       alsa  4         1992           78      1914
5       alsa  5         1992           85      1907
6       alsa  6         1992           92      1900

Some studies have one, some another. The minimum requirement from each: these three.
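Since the three variables are tied together by year_born = year_of_wave - age_in_years (which holds for every row printed above), whichever of the last two is missing can be filled in from the others. A minimal Python sketch, with the function name complete_age_fields as an assumption:

```python
def complete_age_fields(record):
    """Fill in age_in_years or year_born from year_of_wave and the other one.

    Uses a plain year difference, so the result can be off by one
    depending on whether the birthday had passed at the interview.
    """
    out = dict(record)
    wave = out["year_of_wave"]
    age = out.get("age_in_years")
    born = out.get("year_born")
    if born is None and age is not None:
        out["year_born"] = wave - age
    elif age is None and born is not None:
        out["age_in_years"] = wave - born
    return out

# First ALSA row from the table above, with year_born left out.
print(complete_age_fields({"study_name": "alsa", "id": 1,
                           "year_of_wave": 1992, "age_in_years": 86}))
```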

ampiccinin commented 8 years ago

Need to add "anthropometrics" (i.e., BMI)

ampiccinin commented 8 years ago

EDUCATION:

ALSA - under 14 years is "less than high school"; 14 is high school; >14 is "more than high school"

LBLS - up to 11 years should be "less than high school", 12 is HS; 13 or more should be "more than high school"

SATSA - Elementary, Olevel, Vocational/Folk are "less than HS"; Gymnasium is HS; Univ or higher is "more than HS"

SHARE - None of the options appear to represent education beyond High school (lines 7-64 are all "secondary"). Are some options missing?

TILDA - None and everything up to Intermediate are "less than HS"; Leaving Certificate is "HS"; Diploma and above are "more than HS". This one I have edited, committed and synced.
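For the two studies where education is given in years (ALSA and LBLS), the cut-offs proposed above could be encoded along these lines. This is a Python sketch for illustration; the names HS_THRESHOLD and educ3 are hypothetical, and only the thresholds (14 for ALSA, 12 for LBLS) come from the comment.

```python
# Value that counts as "high school" in each study, per the comment above.
HS_THRESHOLD = {"alsa": 14, "lbls": 12}

def educ3(study, years):
    """Collapse an education value in years into three categories."""
    if years is None:
        return None  # keep missing values missing
    hs = HS_THRESHOLD[study]
    if years < hs:
        return "less than high school"
    if years == hs:
        return "high school"
    return "more than high school"

for study, years in [("alsa", 13), ("alsa", 14), ("lbls", 11), ("lbls", 13)]:
    print(study, years, educ3(study, years))
```

The categorical studies (SATSA, SHARE, TILDA) would instead go through a lookup table from response label to category.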

ampiccinin commented 8 years ago

@andkov - Can you please add descriptives for physical activity, perceived health and BMI so I can help you with these?

andkov commented 8 years ago

@ampiccinin, will do, but I can't get started on it until Friday morning. Expect first results by late Friday morning.

ampiccinin commented 8 years ago

Do I understand correctly that the harmonized data are then to be pooled and analyzed within about a week and a half from now? Will that leave enough time? A


andkov commented 8 years ago

Yes, you understand the proposed timetable correctly.

The variables for which harmonization rules have been established and verified can already be combined and analyzed. The more harmonized variables are in the combined data set, the wider the range of models we can fit with them.

I will be able to spend Friday and Saturday on this. Sunday and Monday I'll have to spend on Amsterdam. So I'll do as much describing and harmonizing as I can, so that other members of the exercise team have richer data to model from. I'll be able to join the modelling effort after the Multistate workshop is over.

ampiccinin commented 8 years ago

OK – thanks – current smoking, sex, age, marital status and education should be just about ready.

Please check, though, as I did not make all the edits I noted in the “Comments”, only the ones with commits. At first I was hesitant to make big changes to what you had already done, but then I realized that was not helpful. …but I did not go back to do the first ones as I was running out of time.

Alcohol is tricky. There is FREQUENCY and QUANTITY. It would actually be helpful to see a cross-tabulation or some other way of looking at the two together.
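A cross-tabulation of the two alcohol variables could be produced along these lines. This is a Python sketch for illustration; the response labels and rows are invented, and crosstab and print_crosstab are hypothetical helper names.

```python
from collections import Counter

def crosstab(rows, row_var, col_var):
    """Count records for every observed (row_var, col_var) combination."""
    return Counter((r.get(row_var), r.get(col_var)) for r in rows)

def print_crosstab(counts):
    """Print the counts as a tab-separated table; missing cells show 0."""
    row_levels = sorted({r for r, _ in counts}, key=str)
    col_levels = sorted({c for _, c in counts}, key=str)
    print("\t".join([""] + [str(c) for c in col_levels]))
    for r in row_levels:
        print("\t".join([str(r)] + [str(counts[(r, c)]) for c in col_levels]))

# Invented responses, for illustration only.
rows = [
    {"FREQUENCY": "daily", "QUANTITY": "1-2"},
    {"FREQUENCY": "daily", "QUANTITY": "3+"},
    {"FREQUENCY": "weekly", "QUANTITY": "1-2"},
    {"FREQUENCY": "never", "QUANTITY": None},
    {"FREQUENCY": "daily", "QUANTITY": "1-2"},
]
counts = crosstab(rows, "FREQUENCY", "QUANTITY")
print_crosstab(counts)
```

Keeping missing values as their own level shows, for example, whether QUANTITY is skipped for people who report never drinking.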

I’ll aim to do something for Alcohol and Retirement by Friday morning – or afternoon if you will be busy with the other covariates in the morning.

ampiccinin commented 8 years ago

ALCOHOL -

Not at all sure what to do with this:

SATSA - what is the difference between GALCOHOL ("Do you ever drink alcoholic beverages?") and GEVRALK ("Do you ever drink alcoholic drinks?")?

andkov commented 8 years ago

back @ampiccinin

Need to add "anthropometrics" (i.e., BMI)

I like the construct anthropometrics to capture bmi, weight, height, leg length and such. I just started to use physique instead to refer to such measures collectively, but only because I hadn't thought of anthropometrics, which I now like better. The only downside is typing the 15-letter word anthropometrics. Can you live with physique instead of anthropometrics?

ampiccinin commented 8 years ago

Back @andkov : lol - physique is fine.

I only used anthropometrics because I didn't think of physique!!