Open andkov opened 8 years ago
back at @ampiccinin
ALSA - Judging from frequencies (90% NA), I'm guessing that PIPCIGAR is not asked unless someone says "yes" to the smoking Q. ...except that then it should be 91%, so maybe there are a few people who said "don't smoke" who in fact use a pipe or have the occasional cigar?
Yes, there are 41 such individuals.
@andkov – as long as you document it, it would be OK to drop small groups like this that don’t readily fit with the focus of the question and can’t be harmonized with the other datasets.
@ampiccinin - ok. Let's put these documentation notes into issues like these, dealing with harmonization rules.
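As a rough sketch of how such a small inconsistent group could be flagged and recorded before it is dropped, assuming a hypothetical column name `SMOKER` for the ALSA smoking question (only `PIPCIGAR` is named in this thread) and made-up values:

```r
# Toy data: SMOKER is a hypothetical name for the ALSA smoking question;
# PIPCIGAR is the item discussed above. All values are invented.
alsa <- data.frame(
  id       = 1:6,
  SMOKER   = c("Yes", "No", "No", "Yes", "No", NA),
  PIPCIGAR = c("Pipe", NA, "Cigar", NA, NA, NA),
  stringsAsFactors = FALSE
)

# A respondent is flagged when they answered "No" to smoking
# but still have a non-missing pipe/cigar response.
alsa$inconsistent <- !is.na(alsa$SMOKER) &
  alsa$SMOKER == "No" &
  !is.na(alsa$PIPCIGAR)

# Keep a record of who is dropped so the decision can be documented.
dropped  <- alsa[alsa$inconsistent, c("id", "SMOKER", "PIPCIGAR")]
alsa_use <- alsa[!alsa$inconsistent, ]

print(dropped)
```

Keeping `dropped` as its own table makes it easy to paste the excluded response profiles into an issue like this one.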
back @ampiccinin
LBLS - I'm somewhat inclined to drop the 6 people who were inconsistent about their smoking responses (rows 1 and 6)
The following responses were judged inconsistent and were removed from the computation of the harmonized variable.
back at @ampiccinin
SATSA - do we know the order in which the questions were asked?
I can only speculate that the order is as given in the documentation
Unfortunately, no, @andkov . Those are alphabetic, as far as I know.
I could try to track down the original data entry sheet on NACDA.
@andkov - Argh! Sorry – I put them in as comments, rather than issues, didn’t I?
Since at this point it will take you just as long to read them as one as the other, I will not re-type what I wrote, but will respond to the issues you started this morning so that next comments are sorted by issue (assuming I understand correctly).
Will you be starting an alcohol issue soon?
As I see new issues appear in emailed notification, I will use this as a prompt that the descriptives are available. In the meantime I will move on to other topics since I don’t know what else I could do for this at the moment.
back @ampiccinin
TILDA - By "undocumented code" do you mean that people responded something other than 98 (don't know) or 99 (refused)? (not sure how 3726 people don't know about whether they smoke now or not...) You could just drop line 8 (there is only one person).
The undocumented code is the value in the original data.
dto[["unitData"]][["tilda"]] %>%
  dplyr::group_by_("BH002") %>%
  dplyr::summarise(count = n())
Source: local data frame [3 x 2]

               BH002 count
              (fctr) (int)
1  UNDOCUMENTED CODE  3727
2                Yes  1564
3 No, I have stopped  3213
Either the Maelstrom documentation about this item is incorrect, or wrong data has been passed down to the participants.
Looks like more info than we want right now. Stick with simpler dichotomy: current smoker/not
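A minimal sketch of that dichotomy for `BH002`, using the three levels from the frequency table above and made-up data. Mapping the undocumented code to `NA` is only a placeholder assumption, since its meaning has not been settled in this thread:

```r
# Levels copied from the frequency table above; the data are invented.
bh002 <- factor(c("Yes", "No, I have stopped", "UNDOCUMENTED CODE", "Yes"))

# Hypothetical lookup: TRUE = current smoker, FALSE = not a current smoker.
# The undocumented code is left as NA until its meaning is known.
current_smoker_map <- c(
  "Yes"                = TRUE,
  "No, I have stopped" = FALSE,
  "UNDOCUMENTED CODE"  = NA
)

# Translate each response through the lookup table.
smoke_now <- unname(current_smoker_map[as.character(bh002)])

print(table(smoke_now, useNA = "ifany"))
```

If the undocumented code turns out to mean something specific, only the lookup vector needs to change.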
@ampiccinin
Before we can encode unique combinations of responses to categorical variables, we need to have those categorical variables. There are two continuous variables related to smoking:
> dto[["metaData"]] %>% dplyr::filter(study_name=="share", name=="BR0030") %>% dplyr::select(name,label)
name label
1 BR0030 how many years smoked
> dto[["metaData"]] %>% dplyr::filter(study_name=="tilda", name=="BH003") %>% dplyr::select(name,label)
name label
1 BH003 bh003 How old were you when you stopped smoking?
Similar to harmonization rules, we can encode the decisions about how a continuous variable should be split up in an additional column of a table (.csv file).
Edit the files in ./data/meta/c-rules/ to provide a correction to the categorization rule (edit the new column) or to provide an alternative categorization rule (create a new column with a distinct, descriptive name).
The categorical variable created with this procedure will then be passed down to the data schema definition to create response profiles so that harmonization rules can be declared.
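To illustrate the idea, here is a sketch of applying a categorization rule to a continuous variable. The rule layout (`lower`, `label`), the cut points, and the helper `apply_c_rule()` are all hypothetical; the actual files in ./data/meta/c-rules/ may be organized differently:

```r
# Hypothetical c-rule: lower bounds and labels for each category.
# In practice the rule would be read from a .csv file; it is built inline
# here so the sketch is self-contained. Cut points are invented.
c_rule <- data.frame(
  lower = c(0, 10, 30),
  label = c("0-9 years", "10-29 years", "30+ years"),
  stringsAsFactors = FALSE
)

# Toy values standing in for a variable such as BR0030 (years smoked).
years_smoked <- c(2, 10, 45, NA, 29)

apply_c_rule <- function(x, rule) {
  # findInterval() gives the index of the last lower bound <= x
  # (0 if x is below every bound, NA if x is missing).
  idx <- findInterval(x, rule$lower)
  idx[which(idx == 0)] <- NA
  factor(rule$label[idx], levels = rule$label)
}

years_smoked_cat <- apply_c_rule(years_smoked, c_rule)
print(table(years_smoked_cat, useNA = "ifany"))
```

An alternative rule would simply be another table (or column) with different bounds and labels, passed to the same helper.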
@ampiccinin, to clarify the workflow: the h-rules for smoking for SHARE and TILDA will have to be revised after the categorization rule has been established. Unfortunately, I don't see a workaround or a way to automate this process.
@andkov
We don’t need that variable – we can just use GEVERSMK:
YES No (never or quit)
From: Andriy V. Koval [mailto:notifications@github.com]
Sent: Friday, April 8, 2016 9:02 AM
To: IALSA/ialsa-2016-groningen
Cc: Andrea Piccinin
Subject: Re: [IALSA/ialsa-2016-groningen] harmonize: SMOKING (#9)
@andkov - Don’t need to bother with time since quit. Just ignore.
back @ampiccinin
We don’t need that variable – we can just use GEVERSMK:
The variable GEVERSMK is not in SHARE or TILDA. I assume you confused it with the variable GEVRSMK in SATSA. In which case, I don't understand your comment.
back @ampiccinin
Don’t need to bother with time since quit. Just ignore.
The variable
dto[["metaData"]] %>%
dplyr::filter(study_name=="tilda", name=="BH003") %>%
dplyr::select(name,label)
name label
1 BH003 bh003 How old were you when you stopped smoking?
has been excluded from the data schema variables used for the harmonized variables operationalizing the construct smoking.
Back @andkov – I just meant we can rely on the categorical variables only in each dataset (BR0020) (BEHSMOKER, BH002)
@ampiccinin, allow me to clarify things for myself. I interpret your comment
I just meant we can rely on the categorical variables only in each dataset (BR0020) (BEHSMOKER, BH002)
as the following decision :
The continuous variables
> dto[["metaData"]] %>% dplyr::filter(study_name=="share", name=="BR0030") %>% dplyr::select(name,label)
name label
1 BR0030 how many years smoked
> dto[["metaData"]] %>% dplyr::filter(study_name=="tilda", name=="BH003") %>% dplyr::select(name,label)
name label
1 BH003 bh003 How old were you when you stopped smoking?
are excluded from the data schema variables for the harmonized operationalization of the construct smoking.
The instructions for the exercise ask us to provide the reasons for excluding the proposed variables from the computation of harmonized variables. Please document the reason for each (please edit this comment):
BR0030 of SHARE is excluded because....
BH003 of TILDA is excluded because...
@andkov – yes, excluded.
Instructions for what exercise?
Edits:
@ampiccinin
instructions for what exercise?
I went back to re-read the instructions that were passed down to all teams and see that this is actually a false memory. The instructions ask to document which studies are to be included, to explain inclusion and exclusion criteria, and also to say whether sub-selection of certain datasets is needed and why.
They don't say explicitly to document which variables from the source data sets should be included in the Data Schema variables and which should not, but it seems to be implied. In my opinion, we should nevertheless provide such argumentation so that we don't have to revisit the same decisions. The describe reports will contain info on ALL variables; the harmonize reports, however, will determine which of those are included in computing harmonized variables.
Let's type up these decisions in the issues and I'll transfer them to reports as text when I update them.
@ampiccinin , @smhofer
After implementing the suggested corrections to the harmonization rules for smoking, I updated the harmonization report for smoking to reflect them. Please review the harmonization rules for this construct, consulting the report for details when necessary.
In a comment below, please put "viewed and agreed". Hearing from both @ampiccinin and @smhofer will indicate to me that we are ready to close this issue and accept the current state of harmonization of this variable as stable.
h-rules