JGCRI / gcamdata

The GCAM data system
https://jgcri.github.io/gcamdata/

dplyr now has optional immediate error on multiple-match and unmatched-key joins #1246

Open bpbond opened 1 year ago

bpbond commented 1 year ago

FYI dplyr 1.1.0 provides a way to immediately error if a join returns more than one row from y, or if there's no match:

Multiple matches in equality joins like this one are typically unexpected (even though they are baked in to SQL) so we’ve also added a new warning to alert you when this happens. If multiple matches are expected, you can explicitly set multiple = "all" to silence this warning. This also serves as a code “sign post” for future readers of your code to let them know that this is a join that is expected to increase the number of rows in the data. If multiple matches aren’t expected, you can also set multiple = "error" to immediately halt the analysis.

https://www.tidyverse.org/blog/2023/01/dplyr-1-1-0-joins/#inequality-joins
Update: https://www.tidyverse.org/blog/2023/03/dplyr-1-1-1/

multiple: Handling of rows in x with multiple matches in y. For each row of x:
"all", the default, returns every match detected in y. This is the same behavior as SQL.
"any" returns one match detected in y, with no guarantees on which match will be returned. It is often faster than "first" and "last" if you just need to detect if there is at least one match.
"first" returns the first match detected in y.
"last" returns the last match detected in y.

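A minimal sketch of how the `multiple` options behave, assuming dplyr >= 1.1.1 is installed. The tables `x` and `y` and their columns are made up for illustration and are not gcamdata data. The last call uses the `relationship` argument, which the 1.1.1 update linked above introduced as the way to error on unexpected multiple matches.

```r
library(dplyr)  # assumption: dplyr >= 1.1.1

# Hypothetical toy tables: "USA" appears twice in the lookup table y.
x <- tibble(region = c("USA", "Canada"), value = c(1, 2))
y <- tibble(region = c("USA", "USA", "Canada"), code = c("a", "b", "c"))

# multiple = "all": every match is returned, so the USA row of x is duplicated.
all_matches <- left_join(x, y, by = "region", multiple = "all")

# multiple = "first": only the first match in y is kept for each row of x.
first_match <- left_join(x, y, by = "region", multiple = "first")

# relationship = "many-to-one": error if any row of x matches more than one row of y.
# tryCatch() turns the error into a marker string so the script keeps running.
strict <- tryCatch(
  left_join(x, y, by = "region", relationship = "many-to-one"),
  error = function(e) "error"
)
```

Here `all_matches` has three rows, `first_match` has two, and `strict` holds the marker string because the duplicated "USA" key triggers the error.
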
unmatched: How should unmatched keys that would result in dropped rows be handled?
"drop" drops unmatched keys from the result.
"error" throws an error if unmatched keys are detected.

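A small sketch of `unmatched`, again assuming dplyr >= 1.1.0 and made-up tables. As I read the dplyr documentation, `unmatched` only checks the side whose rows could be dropped: for a left join that is `y`, not `x`. So the example uses an inner join with a length-2 `unmatched` value, which the documentation allows for inner joins, to error on keys of `x` that have no match while tolerating extra keys in `y`.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0

# Hypothetical toy tables: "Mexico" has no match in y, "Canada" has no match in x.
x <- tibble(region = c("USA", "Mexico"), value = c(1, 2))
y <- tibble(region = c("USA", "Canada"), code = c("a", "c"))

# Default (unmatched = "drop"): unmatched rows on both sides vanish silently.
dropped <- inner_join(x, y, by = "region")

# unmatched = c("error", "drop"): error if a key in x is unmatched,
# but quietly drop unmatched keys in y.
# tryCatch() turns the error into a marker string so the script keeps running.
checked <- tryCatch(
  inner_join(x, y, by = "region", unmatched = c("error", "drop")),
  error = function(e) "error"
)
```

Here `dropped` keeps only the "USA" row, and `checked` holds the marker string because "Mexico" has no match in `y`. Whether this matches the exact semantics of the existing gcamdata helpers would need checking against their implementations.
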
When gcamdata is ready to move to dplyr 1.1, I think this should allow us to remove both left_join_keep_first_only and left_join_error_no_match.