Closed: gwenbeebe closed this issue 3 years ago
Note: this also applies to pe_hohs_entered
To keep my deduplicating thoughts together, I'm going to expand this question to "and when?"
We currently create pe_hohs_served_leavers by adding an exit filter to pe_hohs_served. This makes sense, but it means we are filtering pe_hohs_served, deduplicating, and then filtering for pe_hohs_served_leavers after that deduplication.
If we instead create pe_hohs_served_leavers all at once, as shown below, we gain a few more people because we apply those filters before the deduplication. Is the current approach what we want, or do we want to be catching those additional leavers? I'm not sure which is closest to the intent of the metric.
pe_hohs_served_leavers <- co_hohs_served %>%
  filter(served_between(., hc_project_eval_start, hc_project_eval_end) &
           exited_between(., hc_project_eval_start, hc_project_eval_end)) %>%
  select("PersonalID", "ProjectID", "EnrollmentID") %>%
  inner_join(pe_coc_funded, by = "ProjectID") %>%
  left_join(Client, by = "PersonalID") %>%
  left_join(
    Enrollment %>%
      select(-UserID, -DateCreated, -DateUpdated, -DateDeleted, -ExportID),
    by = c(
      "PersonalID",
      "EnrollmentID",
      "ProjectID",
      "ProjectType",
      "ProjectName"
    )
  ) %>%
  select(all_of(vars_we_want)) %>%
  arrange(PersonalID, AltProjectID, desc(EntryDate)) %>%
  distinct(PersonalID, AltProjectName, .keep_all = TRUE) # no dupes w/in a project
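For contrast, here is a minimal sketch of the current order of operations described above, assuming pe_hohs_served has already been deduplicated with the same distinct() call; this is an illustration of the approach, not the exact code in the repo:

# Current approach (sketch): add the exit filter to the already-deduplicated
# pe_hohs_served. Because distinct() has already collapsed each person to one
# row per project, a leaver enrollment can be dropped when distinct() happened
# to keep a different enrollment for that same person/project.
pe_hohs_served_leavers <- pe_hohs_served %>%
  filter(exited_between(., hc_project_eval_start, hc_project_eval_end))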
Just checking back in on this: it looks like pe_hohs_served_leavers and pe_hohs_served both have de-duping logic in them, but it's commented out. Are we testing something with that commenting? It looks like with that commented out we aren't de-duping them before creating the summaries, so I feel like I'm missing something.
Closed per conversation with Genelle (comment in Slack on 5/2).
Right now, most of our pe_ cohorts are getting deduplicated by person and alternate project ID. However, pe_adults_entered contains a few duplicates (my data files aren't from today, but I'm showing five duplicates). If we deduplicate it, we see different results in our summary_pe_adults_entered dataframe because our summarizing is done with n() instead of distinct counts. This leaves me with two questions: should we be deduplicating pe_adults_entered like the other cohorts, and do we want to use distinct counts instead of n() as a de-duping failsafe, or does it make more sense to keep going by row counts?
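To illustrate the difference, here is a toy sketch; the cohort, values, and grouping column are made up for the example, and the only point is how n() and n_distinct() diverge when a duplicate slips through:

library(dplyr)

# Hypothetical cohort: PersonalID 1 appears twice in the same project.
toy_cohort <- tibble::tribble(
  ~PersonalID, ~AltProjectName,
  1,           "Project A",
  1,           "Project A",   # duplicate enrollment row
  2,           "Project A"
)

toy_cohort %>%
  group_by(AltProjectName) %>%
  summarise(
    by_rows   = n(),                    # counts the duplicate: 3
    by_people = n_distinct(PersonalID)  # ignores the duplicate: 2
  )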