andkov commented 8 years ago

Current State

In the context of your proposed report structure and my comments on it, the following has been accomplished so far:

Section 1: Read in each of five data sets

Accomplished. The data provisioning is documented in the Ellis Island report, which implements the following:

(1) Reads in raw data files from the candidate studies
(2) Extracts, combines, and exports their metadata (specifically, variable names and labels, if provided) into ./data/shared/derived/meta-data-live.csv, which is updated every time the Ellis Island script is executed.
(3) Augments raw metadata with instructions for renaming and classifying variables. The instructions are provided as manually entered values in ./data/shared/meta-data-map.csv. They are used by automatic scripts in later harmonization and analysis.
(4) Combines unit-level data and metadata into a single DTO that serves as the starting point for all subsequent analyses.
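
Step (3) above amounts to a left join of the manually entered instructions onto the live metadata. A minimal Python sketch of that idea follows; the column names (study_name, name, label, construct, name_new) and the sample rows are assumptions made for illustration, not the actual layout of meta-data-live.csv or meta-data-map.csv.

```python
def augment_metadata(live_rows, map_rows, key=("study_name", "name")):
    """Left-join manual instructions onto the live metadata.

    Variables without a manual entry are kept, with empty instruction fields,
    so newly observed variables stay visible to whoever edits the map.
    """
    manual = {tuple(r[k] for k in key): r for r in map_rows}
    instruction_cols = sorted({c for r in map_rows for c in r} - set(key))
    merged = []
    for row in live_rows:
        extra = manual.get(tuple(row[k] for k in key), {})
        merged.append({**row, **{c: extra.get(c, "") for c in instruction_cols}})
    return merged


# Hypothetical rows: the labels and instruction values are made up.
live = [
    {"study_name": "alsa", "name": "SMOKER", "label": "Smoker"},
    {"study_name": "alsa", "name": "PIPCIGAR", "label": ""},
]
manual_map = [
    {"study_name": "alsa", "name": "SMOKER",
     "construct": "smoking", "name_new": "smoke_now"},
]
meta = augment_metadata(live, manual_map)
```

Keeping unmatched variables in the result (rather than dropping them) is a design choice of this sketch: it makes gaps in the manual map easy to spot.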

Section 2: Relabel and transform variables (organized by data set; as discussed yesterday)

Accomplished partially.
To govern the metadata of the variables selected for harmonization (150 items over 5 studies), a metadata table has been created and built into the project's ecosystem. Basically, it's a .csv table that maps the existing properties of the variables, as observed in the source files, onto additional metadata properties (e.g. type, construct, short_lable, new_item_name, etc.). With this number of items it is impractical, although of course highly desirable, to fill in all possible metadata. Moreover, some metadata values are in flux, because as you start exploring the item space you change your categorization, labeling, or renaming. If you don't plan to edit the csv and just want to look at the available metadata, see this dynamic table. If you want to add or edit metadata, edit this file.

Section 3: Combine into single data set (include study level dummy variables)

Accomplished partially. This report shows which harmonized variables are available in the pooled dataset at the moment. This is, of course, provisional on the harmonization rules (and the prerequisite categorizations of continuous variables) being established as valid (see my note on harmonization rules).

The structure of the scripts I have developed allows for modular development: the more variables are processed and the more harmonization rules are encoded, the larger the resulting combined dataset becomes.

Section 4: Estimate models (ever smoked as primary outcome to start; logistic regression)

Not undertaken.

Section 5: Table results and compute odds ratio for covariates

Not undertaken.

Note on Harmonization rules (h-rules)

Each study has a group of variables (e.g. "alsa" = c("SMOKER", "PIPCIGAR")) that contributes to computing a particular harmonized variable (e.g. smoke_now). Each of these schema sets has a particular pattern of possible response values on these variables (e.g. c("SMOKER"="YES", "PIPCIGAR"="NO")), which we export for inspection as .csv tables. We will then manually edit these .csv tables, populating new columns that map values of the harmonized variable onto each specific response pattern of the schema set variables. I've built the scripts to import the harmonization algorithms encoded in these .csv tables and apply them to compute harmonized variables in a dataset that combines raw and harmonized variables. I think smoking is harmonized quite neatly; for the other harmonized variables, the current values in the .csv maps are only placeholders I was testing the script with.
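
To make the mechanism concrete, here is a minimal Python sketch of applying an h-rule table as a lookup from response pattern to harmonized value. It is an illustration of the idea, not the project's scripts: the CSV layout, the helper names, and above all the smoke_now values assigned to each pattern are assumptions.

```python
import csv
import io

# Hypothetical h-rule table for ALSA smoking. The layout and the smoke_now
# values are illustrative assumptions, not the project's agreed rules.
H_RULE_CSV = """SMOKER,PIPCIGAR,smoke_now
YES,NO,TRUE
YES,YES,TRUE
NO,YES,TRUE
NO,NO,FALSE
NO,NA,FALSE
"""


def load_h_rule(text, source_vars, target):
    """Map each response pattern (a tuple of raw values) to a harmonized value."""
    rules = {}
    for row in csv.DictReader(io.StringIO(text)):
        pattern = tuple(row[v] for v in source_vars)
        rules[pattern] = row[target]
    return rules


def harmonize(records, rules, source_vars, target):
    """Return copies of the records with the harmonized variable added.

    Patterns that have no rule get None, so they can be reviewed by hand.
    """
    out = []
    for rec in records:
        pattern = tuple(rec[v] for v in source_vars)
        out.append({**rec, target: rules.get(pattern)})
    return out


schema_set = ("SMOKER", "PIPCIGAR")
rules = load_h_rule(H_RULE_CSV, schema_set, "smoke_now")
people = [
    {"id": 1, "SMOKER": "YES", "PIPCIGAR": "NO"},
    {"id": 2, "SMOKER": "NO", "PIPCIGAR": "NA"},
    {"id": 3, "SMOKER": "MAYBE", "PIPCIGAR": "NO"},  # pattern with no rule
]
harmonized = harmonize(people, rules, schema_set, "smoke_now")
```

Because the rules live in a table rather than in code, changing a harmonization decision means editing one cell and rerunning, which is the workflow described above.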

Progress Report

I've harmonized smoking, age, sex, and marital status. Please review the reports on the individual items and offer your commentary. The raw form of each item is described in a describe report, while the harmonization procedures are carried out and reported in the harmonize report. The links give an example for the smoking construct.
I've produced the maps of possible responses and corresponding harmonization rules for education and work_status, but their h-rules will be messier than what I've done so far.
I've developed the script to combine the existing harmonized variables into a single combined data set. It is ready for analysis in R or Mplus.
I met briefly with Andrea today and described how these maps could be edited.

Resume

1. Harmonization rules remain to be developed for another five target variables (education, work status, alcohol, activity, health). The translation maps of the first two from this list can already be edited.
2. Alcohol, activity, and health: many items require categorization first, so creating the response maps cannot proceed yet. I'll be able to produce response maps for them sometime on Friday, when I come back to this project.
3. If one agrees with the harmonization rules for the available harmonized variables, one can already use the combined data set for modelling. More harmonized variables will be added, but the computed values will not change (unless the harmonization rule itself is changed). The data are made available in Mplus format, and copious documentation exists about the implemented harmonization procedure, with instructions on how to modify it.
4. I will be comfortable presenting on sections 1, 2, and 3 of the plan you've proposed. There isn't enough time for me to do anything about sections 4 and 5, although I'm sure I'll get some ideas when I get some sleep and clear the backlog on other projects. (In fact, I will require serious assistance to complete harmonization for alcohol, physical activity, and perceived health. There are just not enough hours in the day, even assuming perfect substantive knowledge. Unassisted, however, it is reasonable to expect a combined and harmonized dataset that includes items for smoking, age, sex, marital status, and education level by the end of this Friday, 2016-04-08.) Please let me know about the format of the presentation, how long it's expected to last, and what the audience will be expecting to hear from me. I'll start drafting as soon as I hear from you on this.
5. I have Portland and Amsterdam looming over me, so I can't think of Groningen until Friday. However, I'd be happy to meet with anyone to describe how one can contribute to harmonization rules development and how to start modeling with the available data.

ampiccinin commented 8 years ago

9

SMOKING:

ALSA - Judging from frequencies (90% NA), I'm guessing that PIPCIGAR is not asked unless someone says "yes" to the smoking Q. ...except that then it should be 91%, so maybe there are a few people who said "don't smoke" who in fact use a pipe or have the occasional cigar?

LBLS - I'm somewhat inclined to drop the 6 people who were inconsistent about their smoking responses (rows 1 and 6)

SATSA - do we know the order in which the questions were asked? * Do you want me to try to complete cells G15-G34? *

SHARE - looks fine

TILDA - By "undocumented code" do you mean that people responded something other than 98 (don't know) or 99 (refused)? (not sure how 3726 people don't know about whether they smoke now or not...) You could just drop line 8 (there is only one person). Seems like cell G8 should be "FALSE". I've added this. (do I need to tell you this, or will Excel edits show in GitHub?)

ampiccinin commented 8 years ago

MARITAL:

ALSA - fine.

LBLS - fine

SATSA - OK

SHARE - Just wonder whether "married, living separated from spouse" should instead be sep_divorced. Problem is we don't know why not living together - could be spouse is institutionalized? there are only 19 people in this category. "Separated" does not seem to be an available category for people to select. Either way, if cohabiting is main feature, then sep_divorced would be the more appropriate option I think.

TILDA - fine.

andkov commented 8 years ago

SMOKING

See my issue on smoking, #2. I've linked the images to Maelstrom pages or meta documents at the OBiBa wiki.

Yes, please save the changes to the .csv files (make sure you don't save them as .xlsx files) and group commits according to construct (e.g. all h-rule-smoking-*-.csv files as one single commit). After you sync your commits, the harmonization definitions will become available from the cloud.

ampiccinin commented 8 years ago

11

AGE:

ALSA - age in 1992

LBLS - age in 1994

SATSA - age in 1991

SHARE - age in 2004

TILDA - age in 2009

Age in years is fairly uncontroversial. HOWEVER, age in SHARE and TILDA participants is measured a decade later than the other three. I suggest keeping year_born in the dataset and considering the impact of generation. Although the sampling strategies likely differ, we could at least compare smoking rates and models for Sweden (SATSA) and Ireland (TILDA) with SHARE.

andkov commented 8 years ago

11

AGE: I agree about keeping year_born. The harmonization report for age brings each data set to the same three variables (year_of_wave, age_in_years, year_born):

  study_name id year_of_wave age_in_years year_born
1       alsa  1         1992           86      1906
2       alsa  2         1992           78      1914
3       alsa  3         1992           89      1903
4       alsa  4         1992           78      1914
5       alsa  5         1992           85      1907
6       alsa  6         1992           92      1900

Some studies provide one of these, some another. The minimum requirement is that each study ends up with all three.
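
The relation among the three variables can be sketched as follows. This is a minimal Python illustration (the function name is hypothetical), and it treats age as a simple difference of calendar years, so it ignores whether the birthday had already passed at the time of the interview.

```python
def complete_age_fields(year_of_wave, age_in_years=None, year_born=None):
    """Fill in whichever of age_in_years / year_born a study does not provide.

    Assumes age_in_years = year_of_wave - year_born (calendar-year difference).
    """
    if age_in_years is None and year_born is None:
        raise ValueError("need at least one of age_in_years or year_born")
    if age_in_years is None:
        age_in_years = year_of_wave - year_born
    if year_born is None:
        year_born = year_of_wave - age_in_years
    return {
        "year_of_wave": year_of_wave,
        "age_in_years": age_in_years,
        "year_born": year_born,
    }


# First two ALSA rows from the table above, each derived from a different input.
row1 = complete_age_fields(1992, age_in_years=86)
row2 = complete_age_fields(1992, year_born=1914)
```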

ampiccinin commented 8 years ago

Need to add "anthropometrics" (i.e., BMI)

ampiccinin commented 8 years ago

EDUCATION:

ALSA - under 14 years is "less than high school"; 14 is high school; >14 is "more than high school"

LBLS - up to 11 years should be "less than high school", 12 is HS; 13 or more should be "more than high school"

SATSA - Elementary, Olevel, Vocational/Folk are "less than HS"; Gymnasium is HS; Univ or higher is "more than HS"

SHARE - None of the options appear to represent education beyond High school (lines 7-64 are all "secondary"). Are some options missing?

TILDA - None and everything up to Intermediate is "less than HS"; Leaving Certificate is "HS"; Diploma and above is "more than HS". This one I have edited, committed and synced.
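
For the two studies with a numeric education item, the cutpoints proposed above could be encoded as in this minimal Python sketch. It takes the numbers at face value (14 for ALSA, 12 for LBLS); the category labels and names are hypothetical, and the actual rules would live in the h-rule .csv maps.

```python
# High-school cutpoints as proposed in this comment; names are hypothetical.
HS_CUTPOINT = {"alsa": 14, "lbls": 12}


def education_level(value, hs_value):
    """Classify a numeric education value relative to the high-school cutpoint."""
    if value is None:
        return None  # missing stays missing
    if value < hs_value:
        return "less_than_hs"
    if value == hs_value:
        return "high_school"
    return "more_than_hs"


alsa_levels = [education_level(v, HS_CUTPOINT["alsa"]) for v in (13, 14, 15)]
lbls_levels = [education_level(v, HS_CUTPOINT["lbls"]) for v in (11, 12, 13)]
```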

ampiccinin commented 8 years ago

@andkov - Can you please add descriptives for physical activity, perceived health and BMI so I can help you with these?

andkov commented 8 years ago

@ampiccinin, will do, but I can't get started on it until Friday morning. Expect first results by late Friday morning.

ampiccinin commented 8 years ago

Do I understand correctly that the harmonized data are then to be pooled and analyzed within about a week and a half from now? Will that leave enough time? A

andkov commented 8 years ago

Yes, You understand the proposed timetable correctly.

The variables for which harmonization rules have been established and verified can already be combined and analyzed. The more harmonized variables there are in the combined data set, the wider the range of models we can fit with them.

I will be able to spend Friday and Saturday on this. Sunday and Monday I'll have to spend on Amsterdam. So I'll do as much describing and harmonizing as I can so that other members of the exercise team have richer data to model from. I'll be able to join the modelling effort after the Multistate workshop is over.

ampiccinin commented 8 years ago

OK – thanks – current smoking, sex, age, marital status and education should be just about ready.

Please check, though, as I did not make all the edits I noted in the “Comments”, only the ones with commits. At first I was hesitant to make big changes to what you had already done, but then I realized that was not helpful. …but I did not go back to do the first ones as I was running out of time.

Alcohol is tricky. There is FREQUENCY and QUANTITY. It would actually be helpful to see a cross-tabulation or some other way of looking at the two together.

I’ll aim to do something for Alcohol and Retirement by Friday morning – or afternoon if you will be busy with the other covariates in the morning.
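
A cross-tabulation of the two alcohol items can be sketched in a few lines of Python. The variable names and response values below are made up for illustration and are not taken from any of the five studies.

```python
from collections import Counter


def crosstab(records, row_var, col_var):
    """Count how often each (row value, column value) pair occurs."""
    return Counter((r.get(row_var), r.get(col_var)) for r in records)


# Made-up responses; None stands for a missing answer.
sample = [
    {"frequency": "weekly", "quantity": "1-2"},
    {"frequency": "weekly", "quantity": "1-2"},
    {"frequency": "daily", "quantity": "3+"},
    {"frequency": "never", "quantity": None},
]
table = crosstab(sample, "frequency", "quantity")
```

Keeping missing answers as their own cell shows which quantity questions were skipped for a given frequency response, which is useful when deciding how to combine the two items.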

ampiccinin commented 8 years ago

ALCOHOL -

Not at all sure what to do with this:

SATSA - what is the difference between GALCOHOL ("Do you ever drink alcoholic beverages?") and GEVRALK ("Do you ever drink alcoholic drinks?")?

andkov commented 8 years ago

back @ampiccinin

Need to add "anthropometrics" (i.e., BMI)

I like the construct anthropometrics to capture bmi, weight, height, leg length and such. I had just started to use physique to refer to such measures collectively, but only because I hadn't thought of anthropometrics, which I now like better. The only downside is typing the 15-letter word anthropometrics. Can you live with physique instead of anthropometrics?

ampiccinin commented 8 years ago

Back @andkov : lol - physique is fine.

I only used anthropometrics because I didn't think of physique!!

IALSA / ialsa-2016-groningen

2016-04-04 with A.Piccinin #8
