Closed: gwenbeebe closed this issue 3 years ago
Note: this also applies to pe_hohs_entered
To keep my deduplicating thoughts together, I'm going to expand this question to "and when?"
We currently create pe_hohs_served_leavers by adding an exit filter to pe_hohs_served. This makes sense, but it means we are filtering pe_hohs_served, deduplicating, and then filtering for pe_hohs_served_leavers after that deduplication.
If we instead create pe_hohs_served_leavers all at once, as shown below, we gain a few more people because we apply those filters before the deduplication. Is the current approach what we want, or do we want to be catching those additional leavers? I'm not sure which is closest to the intent of the metric.
pe_hohs_served_leavers <- co_hohs_served %>%
  filter(served_between(., hc_project_eval_start, hc_project_eval_end) &
           exited_between(., hc_project_eval_start, hc_project_eval_end)) %>%
  select("PersonalID", "ProjectID", "EnrollmentID") %>%
  inner_join(pe_coc_funded, by = "ProjectID") %>%
  left_join(Client, by = "PersonalID") %>%
  left_join(
    Enrollment %>%
      select(-UserID, -DateCreated, -DateUpdated, -DateDeleted, -ExportID),
    by = c(
      "PersonalID",
      "EnrollmentID",
      "ProjectID",
      "ProjectType",
      "ProjectName"
    )
  ) %>%
  select(all_of(vars_we_want)) %>%
  arrange(PersonalID, AltProjectID, desc(EntryDate)) %>%
  distinct(PersonalID, AltProjectName, .keep_all = TRUE) # no dupes w/in a project
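For contrast, here is a minimal sketch of the current order of operations described above, assuming pe_hohs_served has already been deduplicated with the same distinct() call; this is an illustration of the approach, not the exact code in the repo:

# Current approach (sketch): add the exit filter to the already-deduplicated
# pe_hohs_served. Because distinct() has already collapsed each person to one
# row per project, a leaver enrollment can be dropped when distinct() happened
# to keep a different enrollment for that same person/project.
pe_hohs_served_leavers <- pe_hohs_served %>%
  filter(exited_between(., hc_project_eval_start, hc_project_eval_end))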
Just checking back in on this: it looks like pe_hohs_served_leavers and pe_hohs_served both have de-duping logic in them, but it's commented out. Are we testing something with that commenting? It looks like with that commented out we aren't de-duping them before creating the summaries, so I feel like I'm missing something.
Closed per conversation with Genelle (comment in Slack on 5/2).
Right now, most of our pe_ cohorts are getting deduplicated by person and alternate project ID. However, pe_adults_entered contains a few duplicates (my data files aren't from today, but I'm showing five duplicates). If we deduplicate it, we see different results in our summary_pe_adults_entered dataframe because our summarizing is done with n() instead of distinct counts. This leaves me with two questions: should we be deduplicating pe_adults_entered like the other cohorts, and do we want to use distinct counts instead of n() as a de-duping failsafe, or does it make more sense to keep going by row counts?
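To illustrate the difference, here is a toy sketch; the cohort, values, and grouping column are made up for the example, and the only point is how n() and n_distinct() diverge when a duplicate slips through:

library(dplyr)

# Hypothetical cohort: PersonalID 1 appears twice in the same project.
toy_cohort <- tibble::tribble(
  ~PersonalID, ~AltProjectName,
  1,           "Project A",
  1,           "Project A",   # duplicate enrollment row
  2,           "Project A"
)

toy_cohort %>%
  group_by(AltProjectName) %>%
  summarise(
    by_rows   = n(),                    # counts the duplicate: 3
    by_people = n_distinct(PersonalID)  # ignores the duplicate: 2
  )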