harmonize: SMOKING - Githubissues

andkov commented 8 years ago

back at @ampiccinin

ALSA - Judging from frequencies (90% NA), I'm guessing that PIPCIGAR is not asked unless someone says "yes" to the smoking Q. ...except that then it should be 91%, so maybe there are a few people who said "don't smoke" who in fact use a pipe or have the occasional cigar?

Yes, there are 41 of such individuals.
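
A check like this can be sketched as a cross-tabulation of the two items. The data below are invented, and `SMOKER` is a hypothetical name for the ALSA smoking question (only `PIPCIGAR` appears in this thread), so treat it as an illustration rather than the code that produced the count of 41.

```r
# Toy data standing in for ALSA; SMOKER is a hypothetical column name,
# PIPCIGAR is the item discussed above. All values are invented.
alsa_toy <- data.frame(
  SMOKER   = c("Yes", "Yes", "No", "No", "No", "No"),
  PIPCIGAR = c("Yes", "No", NA, NA, "Yes", NA),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two items, keeping NA as its own level
profile_counts <- table(
  SMOKER   = alsa_toy$SMOKER,
  PIPCIGAR = alsa_toy$PIPCIGAR,
  useNA    = "ifany"
)
print(profile_counts)

# Flag respondents who said "No" to smoking but still have a PIPCIGAR answer
unexpected <- alsa_toy$SMOKER == "No" & !is.na(alsa_toy$PIPCIGAR)
n_unexpected <- sum(unexpected)
```

Rows flagged in `unexpected` are the ones that would be documented and dropped under the rule discussed in the next comment.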

ampiccinin commented 8 years ago

@andkov – as long as you document it, it would be OK to drop small groups like this that don’t readily fit with the focus of the question and can’t be harmonized with the other datasets.

andkov commented 8 years ago

@ampiccinin - ok. Let's put these documentation notes into issues like these, dealing with harmonization rules.

andkov commented 8 years ago

back @ampiccinin

LBLS - I'm somewhat inclined to drop the 6 people who were inconsistent about their smoking responses (rows 1 and 6)

The following responses were judged to be inconsistent and were removed from the computation of the harmonized variable.

andkov commented 8 years ago

back at @ampiccinin

SATSA - do we know the order in which the questions were asked?

I can only speculate that the order is as given in the documentation

ampiccinin commented 8 years ago

Unfortunately, no, @andkov . Those are alphabetic, as far as I know.

I could try to track down the original data entry sheet on NACDA.

ampiccinin commented 8 years ago

@andkov - Argh! Sorry – I put them in as comments, rather than issues, didn’t I?

Since at this point it will take you just as long to read them as one as the other, I will not re-type what I wrote, but will respond to the issues you started this morning so that next comments are sorted by issue (assuming I understand correctly).

Will you be starting an alcohol issue soon?

As I see new issues appear in emailed notification, I will use this as a prompt that the descriptives are available. In the meantime I will move on to other topics since I don’t know what else I could do for this at the moment.

andkov commented 8 years ago

back @ampiccinin

TILDA - By "undocumented code" do you mean that people responded something other than 98 (don't know) or 99 (refused)? (not sure how 3726 people don't know about whether they smoke now or not...) You could just drop line 8 (there is only one person).

The undocumented code is the value in the original data.

> dto[["unitData"]][["tilda"]] %>% 
+   dplyr::group_by_("BH002") %>% 
+   dplyr::summarise(count = n())
Source: local data frame [3 x 2]

               BH002 count
              (fctr) (int)
1  UNDOCUMENTED CODE  3727
2                Yes  1564
3 No, I have stopped  3213

Either the Maelstrom documentation about this item is incorrect, or wrong data has been passed down to the participants.

ampiccinin commented 8 years ago

Looks like more info than we want right now. Stick with simpler dichotomy: current smoker/not
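
A minimal sketch of such a dichotomy, using the BH002 labels printed above on an invented toy vector. The lookup name `smoke_now_rule` is hypothetical, and mapping the undocumented code to NA is an assumption made for illustration, not a decision recorded in this thread.

```r
# Toy vector using the BH002 labels shown in the output above; values are invented.
bh002 <- c("Yes", "No, I have stopped", "UNDOCUMENTED CODE", "Yes")

# Hypothetical lookup: response label -> current smoker (TRUE/FALSE).
# Sending the undocumented code to NA is an assumption, not a rule from this thread.
smoke_now_rule <- c(
  "Yes"                = TRUE,
  "No, I have stopped" = FALSE,
  "UNDOCUMENTED CODE"  = NA
)

# Look up each response by its label and drop the names from the result
smoke_now <- unname(smoke_now_rule[bh002])
print(table(smoke_now, useNA = "ifany"))
```

Keeping the rule as a named vector mirrors the idea of storing harmonization rules in a table: changing a decision means editing one entry rather than the recoding logic.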

andkov commented 8 years ago

@ampiccinin

Categorization

Before we can encode unique combinations of responses to categorical variables, we need to have those categorical variables. There are two continuous variables related to smoking:

> dto[["metaData"]] %>% dplyr::filter(study_name=="share", name=="BR0030") %>% dplyr::select(name,label)
    name                 label
1 BR0030 how many years smoked
> dto[["metaData"]] %>% dplyr::filter(study_name=="tilda", name=="BH003") %>% dplyr::select(name,label)
   name                                             label
1 BH003 bh003  How old were you when you stopped smoking?

Similar to harmonization rules, we can encode the decisions about how each continuous variable should be split into categories in an additional column of a table (.csv file).

Procedure

Edit the files in ./data/meta/c-rules/ to correct the categorization rule (edit the new column) or to provide an alternative categorization rule (create a new column with a distinct, descriptive name).

The categorical variable created with this procedure will then be passed down to the data schema definition to create response profiles, so that harmonization rules can be declared.
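
As a rough sketch of what applying such a categorization rule could look like: the table layout, column names (`lower`, `category`), and cut points below are illustrative assumptions, not the actual contents of the files in ./data/meta/c-rules/.

```r
# Hypothetical c-rule table, as it might be read from a .csv file.
# Column names and cut points are illustrative assumptions.
c_rule <- data.frame(
  lower    = c(0, 10, 30),  # inclusive lower bound of each category, in years
  category = c("under 10 years", "10 to 29 years", "30 years or more"),
  stringsAsFactors = FALSE
)

# Toy values for a continuous item such as BR0030 (how many years smoked)
years_smoked <- c(2, 10, 29, 45, NA)

# Assign each value to the interval [lower, next lower); the last one is open-ended
years_smoked_cat <- cut(
  years_smoked,
  breaks = c(c_rule$lower, Inf),
  labels = c_rule$category,
  right  = FALSE
)
print(table(years_smoked_cat, useNA = "ifany"))
```

An alternative rule would be another table (or column) with different bounds and labels, leaving the `cut()` call unchanged.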

andkov commented 8 years ago

@ampiccinin, to clarify the workflow: the h-rules for smoking for SHARE and TILDA will have to be revised after the categorization rule has been established. Unfortunately, I don't see a workaround or a way to automate this process.

ampiccinin commented 8 years ago

@andkov

We don’t need that variable – we can just use GEVERSMK:

YES No (never or quit)

From: Andriy V. Koval [mailto:notifications@github.com] Sent: Friday, April 8, 2016 9:02 AM To: IALSA/ialsa-2016-groningen ialsa-2016-groningen@noreply.github.com Cc: Andrea Piccinin piccinin@uvic.ca Subject: Re: [IALSA/ialsa-2016-groningen] harmonize: SMOKING (#9)


ampiccinin commented 8 years ago

@andkov - Don’t need to bother with time since quit. Just ignore.

andkov commented 8 years ago

back @ampiccinin

We don’t need that variable – we can just use GEVERSMK:

The variable GEVERSMK is not in SHARE or TILDA. I assume you confused it with the variable GEVRSMK in SATSA. In which case, I don't understand your comment.

andkov commented 8 years ago

back @ampiccinin

Don’t need to bother with time since quit. Just ignore.

The variable

dto[["metaData"]] %>%
 dplyr::filter(study_name=="tilda", name=="BH003") %>%  
 dplyr::select(name,label)

   name                                             label
1 BH003 bh003  How old were you when you stopped smoking?

has been excluded from the data schema variables used to compute the harmonized variables that operationalize the construct smoking.

ampiccinin commented 8 years ago

Back @andkov – I just meant we can rely on the categorical variables only in each dataset (BR0020) (BEHSMOKER, BH002)

andkov commented 8 years ago

@ampiccinin, allow me to clarify things for myself. I interpret your comment

I just meant we can rely on the categorical variables only in each dataset (BR0020) (BEHSMOKER, BH002)

as the following decision :

The continuous variables

> dto[["metaData"]] %>% dplyr::filter(study_name=="share", name=="BR0030") %>% dplyr::select(name,label)
    name                 label
1 BR0030 how many years smoked
> dto[["metaData"]] %>% dplyr::filter(study_name=="tilda", name=="BH003") %>% dplyr::select(name,label)
   name                                             label
1 BH003 bh003  How old were you when you stopped smoking?

are excluded from the data schema variables for the harmonized operationalization of the construct smoking.

The instructions for the exercise ask to provide the reasons for excluding the proposed variables from the use in harmonized variable computation. Please document the reason for each (please edit this comment) :

variable BR0030 of SHARE is excluded because....
variable BH003 of TILDA is excluded because...

ampiccinin commented 8 years ago

@andkov – yes, excluded.

Instructions for what exercise?

Edits:

variable BR0030 of SHARE is excluded because, for “current smoker”, we do not need to know how long they smoked, only that they do not smoke now.
variable BH003 of TILDA is excluded because, for “current smoker”, we do not need to know WHEN they quit, only that they DID quit (or never smoked).

andkov commented 8 years ago

@ampiccinin

instructions for what exercise?

I went back to re-read the instructions that were passed down to all teams and see that this is actually a false memory. The instructions ask us to document which studies are to be included, to explain the inclusion and exclusion criteria, and to say whether sub-selection of certain datasets is needed and why.

It doesn't say explicitly to document what variables from the source data sets should be included into Data Schema variables and which should not. But it seems that it's implied. In my opinion, we nevertheless should provide such argumentation so that we don't come back to making the same decisions again. The describe reports will contain info on ALL variables, however the harmonize reports will determine which of those will be included in computing harmonized variables.

Let's type up these decisions in the issues and I'll transfer them to reports as text when I update them.

andkov commented 8 years ago

@ampiccinin , @smhofer

After implementing the suggested corrections to the harmonization rules for smoking, I have updated the harmonization report for smoking to reflect them. Please review the harmonization rules for this construct, consulting the report for details when necessary.

In a comment below, please put "viewed and agreed". Hearing from both @ampiccinin and @smhofer will indicate to me that we are ready to close this issue and accept the current state of harmonization of this variable as stable.

IALSA / ialsa-2016-groningen

harmonize: SMOKING #9
