COHHIO / COHHIO_HMIS

Code for pulling in HMIS data, writing it out to reports
GNU Affero General Public License v3.0

Which `pe_` cohorts should be deduplicated? (and when?) #151

Closed gwenbeebe closed 3 years ago

gwenbeebe commented 3 years ago

Right now, most of our pe_ cohorts are getting deduplicated by person and alternate project ID. However, pe_adults_entered contains a few duplicates (my data files aren't from today, but I'm showing five duplicates). If we deduplicate it, we see different results in our summary_pe_adults_entered dataframe because our summarizing is done with n() instead of distinct counts. This leaves me with two questions:

gwenbeebe commented 3 years ago

Note: this also applies to pe_hohs_entered
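
To illustrate the n() versus distinct-count difference described above, here is a minimal sketch on made-up data. The `pe_toy` tibble and its values are hypothetical and are not taken from the real cohorts; only the column names mirror the ones used in this repo.

```r
library(dplyr)

# Hypothetical toy cohort: person "A" has two enrollments in project 1
pe_toy <- tibble(
  PersonalID   = c("A", "A", "B", "C"),
  AltProjectID = c(1, 1, 1, 2),
  EnrollmentID = c(101, 102, 103, 104)
)

summary_toy <- pe_toy %>%
  group_by(AltProjectID) %>%
  summarise(
    rows   = n(),                     # counts every row, duplicates included
    people = n_distinct(PersonalID),  # counts each person once per project
    .groups = "drop"
  )

summary_toy
```

For project 1 the two columns disagree whenever a person appears more than once, so a summary built on n() only matches a distinct count if the cohort was deduplicated first.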

gwenbeebe commented 3 years ago

To keep my deduplicating thoughts together, I'm going to expand this question to "and when?"

We currently create pe_hohs_served_leavers by adding an exit filter to pe_hohs_served. This makes sense, but we are filtering pe_hohs_served, deduplicating, and then filtering for pe_hohs_served_leavers after that deduplication.

If we instead create pe_hohs_served_leavers all at once, as shown below, we gain a few more people because we apply those filters before the deduplication. Is the current approach what we want, or do we want to catch those additional leavers? I'm not sure which is closest to the intent of the metric.

```r
pe_hohs_served_leavers <- co_hohs_served %>%
  filter(served_between(., hc_project_eval_start, hc_project_eval_end) &
           exited_between(., hc_project_eval_start, hc_project_eval_end)) %>%
  select("PersonalID", "ProjectID", "EnrollmentID") %>%
  inner_join(pe_coc_funded, by = "ProjectID") %>%
  left_join(Client, by = "PersonalID") %>%
  left_join(
    Enrollment %>%
      select(-UserID,-DateCreated,-DateUpdated,-DateDeleted,-ExportID),
    by = c(
      "PersonalID",
      "EnrollmentID",
      "ProjectID",
      "ProjectType",
      "ProjectName"
    )
  ) %>%
  select(all_of(vars_we_want)) %>%
  arrange(PersonalID, AltProjectID, desc(EntryDate)) %>%
  distinct(PersonalID, AltProjectName, .keep_all = TRUE) # no dupes w/in a project
```
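
A small sketch of why the order matters, using made-up data. Everything here is hypothetical: `served_toy`, the `dedupe()` helper, and the use of `!is.na(ExitDate)` as a stand-in for the `exited_between()` filter are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data: person "A" has two enrollments in the same project.
# The most recent one is still open; the older one has an exit.
served_toy <- tibble(
  PersonalID     = c("A", "A", "B"),
  AltProjectName = c("P1", "P1", "P1"),
  EntryDate      = as.Date(c("2019-06-01", "2019-01-01", "2019-02-01")),
  ExitDate       = as.Date(c(NA, "2019-03-01", "2019-04-01"))
)

# Keep one row per person per project, preferring the latest entry
dedupe <- function(df) {
  df %>%
    arrange(PersonalID, AltProjectName, desc(EntryDate)) %>%
    distinct(PersonalID, AltProjectName, .keep_all = TRUE)
}

# Dedupe first, then filter: A's surviving row has no exit, so A drops out
leavers_dedupe_first <- served_toy %>%
  dedupe() %>%
  filter(!is.na(ExitDate))

# Filter first, then dedupe: A's older, exited enrollment is kept
leavers_filter_first <- served_toy %>%
  filter(!is.na(ExitDate)) %>%
  dedupe()

nrow(leavers_dedupe_first)
nrow(leavers_filter_first)
```

In this toy case, filtering first keeps person "A" as a leaver on the strength of an older enrollment, while deduplicating first only looks at each person's most recent enrollment. Which of the two is right depends on what the metric is meant to count.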

gwenbeebe commented 3 years ago

Just checking back in on this--it looks like pe_hohs_served_leavers and pe_hohs_served both have de-duping logic in them, but it's commented out. Are we testing something with that commenting? It looks like with that logic commented out we aren't deduping them before creating the summaries, so I feel like I'm missing something.

gwenbeebe commented 3 years ago

Closed per conversation with Genelle in Slack on 5/2