alapo / Literate-Programming

This is the repository for the Literate Programming Workshop

Problem with turning a numeric variable into a factor #13

Closed (chelseachristie closed this issue 3 years ago)

chelseachristie commented 3 years ago

Hi Andrew,

I'm trying out the skills from module 05 with some birth data that I imported from a Stata file. I have a four-level categorical variable for maternal age (age 15-19, 20s, 30s, or age 40+) that was a "numeric" type, and I tried to make it a "factor" instead. The conversion looked like it worked, but now I can't group infant birth weight (tgrams) by this maternal age variable (mom_age_cat). It looks like all the mom_age_cat values are now missing, but the dataset I imported had no missing data on this variable...

df <- rio::import("data/NC_birth_data_PGME_edited.dta") 
str(df)
'data.frame':   800 obs. of  15 variables:
 $ plural     : num  1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "plural"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ sex        : num  1 2 1 1 1 1 2 2 2 2 ...
  ..- attr(*, "label")= chr "sex"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ mage       : num  32 32 27 27 25 28 25 15 37 21 ...
  ..- attr(*, "label")= chr "mage"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ weeks      : num  40 37 39 39 39 43 39 42 41 39 ...
  ..- attr(*, "label")= chr "weeks"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ marital    : num  1 1 1 1 1 1 1 2 1 1 ...
  ..- attr(*, "label")= chr "marital"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ racemom    : num  1 1 1 1 1 1 1 1 8 1 ...
  ..- attr(*, "label")= chr "racemom"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ hispmom    : chr  "N" "N" "N" "N" ...
  ..- attr(*, "label")= chr "hispmom"
  ..- attr(*, "format.stata")= chr "%9s"
 $ gained     : num  38 34 12 15 32 32 75 25 31 28 ...
  ..- attr(*, "label")= chr "gained"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ smoke      : num  0 0 0 0 0 0 0 0 0 0 ...
  ..- attr(*, "label")= chr "smoke"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ drink      : num  0 0 0 0 0 0 0 0 0 0 ...
  ..- attr(*, "label")= chr "drink"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ tounces    : num  111 116 138 136 121 117 143 113 139 120 ...
  ..- attr(*, "label")= chr "tounces"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ tgrams     : num  3147 3289 3912 3856 3430 ...
  ..- attr(*, "label")= chr "tgrams"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ low        : num  0 0 0 0 0 0 0 0 0 0 ...
  ..- attr(*, "label")= chr "low"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ premie     : num  0 0 0 0 0 0 0 0 0 0 ...
  ..- attr(*, "label")= chr "premie"
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ mom_age_cat: num  2 2 1 1 1 1 1 0 2 1 ...
  ..- attr(*, "format.stata")= chr "%10.0g"
  ..- attr(*, "labels")= Named num [1:4] 0 1 2 3
  .. ..- attr(*, "names")= chr [1:4] "Aged 15-19" "Aged 20-29" "Aged 30-39" "Aged 40+"

Trying to turn it into a factor

df$mom_age_cat <- factor(df$mom_age_cat, levels=c("15-19", "20-29", "30-39", "40+"), ordered=TRUE) # Data is imported type "character" we need to change that to factors before running statistics.
levels(df$mom_age_cat) # Confirm that our data is ordered properly.

[1] "15-19" "20-29" "30-39" "40+"  
df %>%
+   group_by(mom_age_cat) %>%
+   get_summary_stats(tgrams, type = "mean_sd")
# A tibble: 1 x 5
  mom_age_cat variable     n  mean    sd
  <ord>       <chr>    <dbl> <dbl> <dbl>
1 NA          tgrams     800 3299.  639.

ggboxplot(df, x = "mom_age_cat", y = "tgrams")

[image: output of the ggboxplot call above]

chelseachristie commented 3 years ago

I fixed it! I think what happened was that I was mixing up labels and levels. The 'levels' of the variable are the stored values 0, 1, 2, 3, and the 'labels' are "15-19", "20-29", "30-39", and "40+". When I used the code below to turn the variable into a factor, it worked.

df$mom_age_cat <- factor(df$mom_age_cat,
                         levels = c(0, 1, 2, 3),
                         labels = c("15-19", "20-29", "30-39", "40+"),
                         ordered = TRUE)

And then I could make a boxplot of birthweight stratified by maternal age category:

ggboxplot(df, x = "mom_age_cat", y = "tgrams")

[image: boxplot of tgrams by mom_age_cat]
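
For anyone who hits the same problem, here is a minimal sketch of the levels/labels mix-up on a toy vector. The `codes` values are made up for illustration and are not taken from the birth data; only base R's `factor()` is used. In `factor()`, `levels` has to name values that actually occur in the data, while `labels` only renames those levels afterwards.

```r
# Toy stand-in for the imported column: numeric codes 0-3 (made-up values).
codes <- c(2, 2, 1, 1, 0, 3)

# None of these strings match the stored values 0, 1, 2, or 3,
# so every element is turned into NA.
wrong <- factor(codes,
                levels = c("15-19", "20-29", "30-39", "40+"),
                ordered = TRUE)
all(is.na(wrong))  # TRUE

# Match on the stored codes with `levels`, then rename them with `labels`.
right <- factor(codes,
                levels = c(0, 1, 2, 3),
                labels = c("15-19", "20-29", "30-39", "40+"),
                ordered = TRUE)
table(right, useNA = "ifany")
```

This would explain the single NA group in the summary table from the first comment: every row ended up in the same missing category.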

alapo commented 3 years ago

Great job!!!

Happy to keep helping with troubleshooting. Kudos for trying it out on your own dataset!

Andrew

alapo commented 3 years ago

Just revisiting this, and I thought of another way to fix it. This coding of 0, 1, 2, etc. is very common when importing a file from SPSS. You can "recode" these variables within the column, which can simplify your life. If you want an example of this, you can upload the file on the Discussion page with Data Wrangling challenges and I'll work on it this afternoon.

Andrew
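
As a rough illustration of what recoding within the column could look like, here is a sketch in base R. This is one possible reading of "recode", not necessarily the approach Andrew had in mind. The toy `codes` vector and the names `lookup`, `recoded`, `value_labels`, and `from_attr` are all hypothetical; the `"labels"` attribute is set by hand to mimic what `str(df)` shows for `mom_age_cat`.

```r
# Toy stand-in for the imported column, carrying the same kind of "labels"
# attribute that str(df) reports for mom_age_cat (values are made up).
codes <- c(2, 2, 1, 1, 0, 3)
attr(codes, "labels") <- c("Aged 15-19" = 0, "Aged 20-29" = 1,
                           "Aged 30-39" = 2, "Aged 40+" = 3)

# Option 1: recode through a named lookup vector.
# The names are the stored codes, the values are the new category names.
lookup <- c("0" = "15-19", "1" = "20-29", "2" = "30-39", "3" = "40+")
recoded <- factor(unname(lookup[as.character(codes)]),
                  levels = unname(lookup),
                  ordered = TRUE)

# Option 2: reuse the value labels that came with the file,
# so the category names do not have to be retyped.
value_labels <- attr(codes, "labels")
from_attr <- factor(codes,
                    levels = unname(value_labels),
                    labels = names(value_labels),
                    ordered = TRUE)

table(recoded)
table(from_attr)
```

Option 2 only works if the import keeps the `"labels"` attribute on the column, as it appears to in the `str(df)` output above; it is worth checking with `attributes(df$mom_age_cat)` before relying on it.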