GerkeLab / fcds

Process data from the Florida Cancer Data System
https://gerkelab.github.io/fcds/

Discrepancy in counting female population when using age_adjust() for "Cervix Uteri" #94

Closed vickyliao92 closed 3 years ago

vickyliao92 commented 3 years ago

The female population count does not match the SEER population data when I calculate the age-adjusted rate for Cervix Uteri. When I ran

female_rate <- fcds %>% filter(age_group!="Unknown", cancer_site_group == "Cervix Uteri", year_group == "2012-2016") %>%
  count_fcds(county_name, sex) %>%
  complete_age_groups(county_name, sex, tidyr::nesting(year_group, year)) %>%
  group_drop(county_name) %>%
  age_adjust() %>%
  mutate(rate = round(rate/5, 2), n = n/5)

I received this result: [image: results table for Cervix Uteri]

As a comparison, I changed the cancer site to another female cancer site (e.g. "Ovary"), and the results table showed a different female population count.

female_rate <- fcds %>% filter(age_group!="Unknown", cancer_site_group == "Ovary", year_group == "2012-2016") %>%
  count_fcds(county_name, sex) %>%
  complete_age_groups(county_name, sex, tidyr::nesting(year_group, year)) %>%
  group_drop(county_name) %>%
  age_adjust() %>%
  mutate(rate = round(rate/5, 2), n = n/5)

[image: results table for Ovary]

I also checked the 2014 female population using the SEER population data and got 10169610 (the same as what's shown in the second results table). Over time I noticed that the female population count matched the SEER data when I used other female cancer sites but did not match when I used Cervix Uteri.

gadenbuie commented 3 years ago

library(tidyverse)
library(fcds)

fcds <- fcds_load()

The change in population means that a group (here, a county) doesn't have any cases and is therefore missing from the counts. No cases of Cervix Uteri were registered in two counties for 2012-2016:

fcds %>%
  filter(
    cancer_site_group %in% c("Cervix Uteri", "Ovary"),
    year_group == "2012-2016"
  ) %>% 
  distinct(cancer_site_group, county_name) %>% 
  count(cancer_site_group)
#> # A tibble: 2 x 2
#>   cancer_site_group     n
#>   <fct>             <int>
#> 1 Cervix Uteri         65
#> 2 Ovary                67

Side note: there are 67 counties in Florida (plus one Unknown county in the FCDS data).

fcds:::usaboundaries_counties_fl$name %>% length()
#> [1] 67
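
To see why a county with no cases shrinks the population total, here is a toy sketch in base R. The county names, counts, and populations are invented, and the merge() calls are only a stand-in for however population gets attached to the counts; this is not the fcds internals.

```r
# Toy data: hypothetical counties and populations, not FCDS or SEER values.
population <- data.frame(
  county = c("A", "B", "C"),
  pop    = c(1000, 2000, 500)
)

# Case counts as they come out of a plain count: county "C" had no cases,
# so it has no row at all.
cases <- data.frame(
  county = c("A", "B"),
  n      = c(4, 10)
)

# Attaching population only to the rows that exist silently loses county "C".
observed_only <- merge(cases, population, by = "county")
sum(observed_only$pop)
#> [1] 3000

# Adding an explicit zero-count row for "C" restores the full denominator.
cases_complete <- merge(population["county"], cases, by = "county", all.x = TRUE)
cases_complete$n[is.na(cases_complete$n)] <- 0
all_counties <- merge(cases_complete, population, by = "county")
sum(all_counties$pop)
#> [1] 3500
```

The case total is the same in both versions (14); only the population that accompanies it changes.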

The recommended approach is to set discard_unseen_levels = FALSE in count_fcds() so that all counties are kept even if no cases are registered for the demographic or time period. With this setting, all factor levels of the variables being counted are kept rather than silently dropped. Then, when you pass the result to complete_age_groups(), the "missing" age groups are added. They're not actually missing, because they really have a count of 0.

fcds %>%
  filter(
    cancer_site_group %in% c("Cervix Uteri", "Ovary"),
    year_group == "2012-2016"
  ) %>% 
  count_fcds(
    cancer_site_group, 
    county_name = TRUE, 
    sex = TRUE, 
    discard_unseen_levels = FALSE
  ) %>%
  complete_age_groups(
    county_name, sex, tidyr::nesting(year_group, year),
    cancer_site_group = c("Cervix Uteri", "Ovary")
  ) %>%
  ungroup() %>% 
  distinct(cancer_site_group, county_name) %>% 
  count(cancer_site_group)
#> # A tibble: 2 x 2
#>   cancer_site_group     n
#>   <chr>             <int>
#> 1 Cervix Uteri         68
#> 2 Ovary                68
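
The idea of keeping unseen levels can be illustrated outside of fcds with plain dplyr, whose count() has a .drop argument that plays a similar role. This is only an analogy on made-up data, assuming the grouping variable is a factor; it does not show how count_fcds() is implemented.

```r
library(dplyr)

# Hypothetical data: county is a factor with three levels, but only two
# of them appear in the rows.
toy <- tibble(
  county = factor(c("A", "A", "B"), levels = c("A", "B", "C"))
)

# Default behaviour: the unseen level "C" is dropped from the result.
dropped <- toy %>% count(county)
nrow(dropped)
#> [1] 2

# With .drop = FALSE the unseen level is kept with an explicit count of 0.
kept <- toy %>% count(county, .drop = FALSE)
kept$n
#> [1] 2 1 0
```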

This is all a bit complicated, but the goal is to be able to flexibly tell the difference between groups that are missing because they shouldn't be included in a count and groups that are missing because they should be counted as zero. The biggest thing to keep in mind is that all groups that define the exposure group should be present in the data, with a count of 0 where no cases were registered, when you get to age_adjust().

fcds %>%
  filter(
    age_group != "Unknown",
    cancer_site_group == "Cervix Uteri",
    year_group == "2012-2016"
  ) %>%
  count_fcds(county_name = TRUE, sex = TRUE, discard_unseen_levels = FALSE) %>%
  complete_age_groups(county_name, sex, tidyr::nesting(year_group, year)) %>%
  group_drop(county_name) %>%
  age_adjust() %>%
  mutate(rate = round(rate / 5, 2), n = n / 5)
#> # A tibble: 3 x 6
#> # Groups:   sex, year_group, year [3]
#>   sex     year_group year      n population   rate
#>   <fct>   <fct>      <chr> <dbl>      <dbl>  <dbl>
#> 1 Male    2012-2016  2014    0      9728137   0   
#> 2 Female  2012-2016  2014  947.    10169610   8.64
#> 3 Unknown 2012-2016  2014    0.2          0 NaN
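
For readers unfamiliar with the calculation, the general form of direct age standardization can be sketched in a few lines of base R. The age groups, counts, populations, and standard population below are invented, and this is not a reproduction of age_adjust(), whose internals and standard population are not shown in this thread.

```r
# Hypothetical age groups, case counts, populations, and standard population.
age <- data.frame(
  age_group = c("0-39", "40-64", "65+"),
  n         = c(0, 30, 20),
  pop       = c(50000, 30000, 20000),
  std_pop   = c(55000, 30000, 15000)
)

# Weight of each age group in the standard population (weights sum to 1).
age$w <- age$std_pop / sum(age$std_pop)

# Age-specific rates per 100,000, then their weighted sum.
age$rate <- age$n / age$pop * 1e5
adjusted_rate <- sum(age$w * age$rate)
adjusted_rate

# The zero-count age group adds nothing to the rate, but its population
# still belongs in the total reported alongside the rate.
total_pop <- sum(age$pop)
total_pop
```

In this toy case, dropping the "0-39" row would leave the adjusted rate unchanged but would cut the reported population in half, which is the kind of mismatch described in this issue.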