Open moodymudskipper opened 1 year ago
We have dm::enum_pk_candidates(), but this only looks at simple keys.
A more efficient strategy for compound keys might be to look at the columns from the left (first column, then the first two, etc.).
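To make the left-to-right idea concrete, here is a minimal sketch in plain Python (chosen only for illustration; it is not {dm} code). The names `is_unique_key` and `leftmost_prefix_key` and the toy data are hypothetical, and a table is assumed to be a list of dicts.

```python
def is_unique_key(rows, cols):
    """Return True if the values in `cols` identify every row uniquely."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def leftmost_prefix_key(rows, columns):
    """Try the first column, then the first two, and so on.

    Returns the shortest leading run of columns that is unique,
    or None if even the full set of columns has duplicate rows.
    """
    for width in range(1, len(columns) + 1):
        prefix = columns[:width]
        if is_unique_key(rows, prefix):
            return prefix
    return None


# Hypothetical toy data: order lines keyed by (order_id, line_no).
columns = ["order_id", "line_no", "sku"]
rows = [
    {"order_id": 1, "line_no": 1, "sku": "A"},
    {"order_id": 1, "line_no": 2, "sku": "B"},
    {"order_id": 2, "line_no": 1, "sku": "A"},
]
prefix_key = leftmost_prefix_key(rows, columns)
```

This only needs at most one uniqueness check per column, which is where the efficiency comes from, but it relies on key columns being placed first in the table.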
More efficient assuming some common practice, but less robust: we might get a 5-variable compound key where variables 3 and 5 would suffice. Or maybe we do a first pass as you suggest, find our 5-variable compound key, and then do a second pass trying to remove variables one by one, maybe starting from the second-to-last and going back to the first? This would eliminate variables 4, 2, and 1 in turn in this case. This will find a single solution, but that seems a reasonable compromise. The function might do the second pass optionally.
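The second pass could look roughly like the sketch below, again in plain Python for illustration only. `shrink_key` and the toy table are hypothetical; the data is built so that columns c and e (variables 3 and 5) suffice, matching the case described above.

```python
def is_unique_key(rows, cols):
    """Return True if the values in `cols` identify every row uniquely."""
    keys = [tuple(row[c] for c in cols) for row in rows]
    return len(set(keys)) == len(keys)


def shrink_key(rows, key):
    """Second pass: try dropping columns one at a time.

    Walks from the second-to-last column back to the first. The last
    column is never tried, because the left-to-right first pass only
    added it after the shorter prefix had failed.
    """
    kept = list(key)
    for col in reversed(key[:-1]):
        trial = [c for c in kept if c != col]
        if is_unique_key(rows, trial):
            kept = trial  # still unique without `col`, so drop it
    return kept


# Hypothetical table where the first pass would return all 5 columns,
# although (c, e) alone is already unique.
rows = [
    {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1},
    {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2},
    {"a": 1, "b": 1, "c": 2, "d": 1, "e": 1},
    {"a": 1, "b": 1, "c": 2, "d": 1, "e": 2},
]
reduced = shrink_key(rows, ["a", "b", "c", "d", "e"])
```

Here d, b, and a are dropped in that order while c is kept. The result is one minimal key, not necessarily the only one, and it depends on the order in which removals are tried.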
I had this use case and wondered if it'd be relevant as a {dm} utility
Given an unknown data frame, what columns could form a primary key?
I have an implementation here for local data frames but this could be adapted. We test combinations of columns, starting with the columns that have the most distinct values, and dismiss irrelevant combinations right away (e.g. 2 columns with 2 distinct values each cannot be a PK for a dataset of 10 rows because 2*2 < 10). We try first with 1 column, then 2, then 3, and stop there by default. We can choose to return early, as soon as we find enough candidates, or to give all possible candidates for the minimum required number of columns. There's a progress bar, which might be improved a bit.
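For readers without access to that implementation, the search described above can be approximated as follows. This is a rough Python sketch of the idea and not the linked code: `pk_candidates`, its `max_cols` argument, and the toy data are hypothetical, and the early-return and progress-bar options are left out.

```python
from itertools import combinations
from math import prod


def pk_candidates(rows, columns, max_cols=3):
    """Return all unique column combinations of the smallest size that works."""
    n_rows = len(rows)
    # Count distinct values per column once, up front.
    n_distinct = {c: len({row[c] for row in rows}) for c in columns}
    # Most distinct values first, so promising combinations come early.
    ordered = sorted(columns, key=lambda c: n_distinct[c], reverse=True)

    for size in range(1, max_cols + 1):
        found = []
        for combo in combinations(ordered, size):
            # Pruning: the combination can produce at most this many
            # distinct tuples, so it cannot cover n_rows if the product is smaller.
            if prod(n_distinct[c] for c in combo) < n_rows:
                continue
            if len({tuple(row[c] for c in combo) for row in rows}) == n_rows:
                found.append(combo)
        if found:
            return found  # stop at the minimum number of columns
    return []


# Hypothetical toy data: 6 rows, 2 countries, 3 years, 2 flags.
columns = ["country", "year", "flag"]
values = [
    ("FR", 2020, "a"), ("FR", 2021, "a"), ("FR", 2022, "b"),
    ("DE", 2020, "a"), ("DE", 2021, "a"), ("DE", 2022, "b"),
]
rows = [dict(zip(columns, v)) for v in values]
candidates = pk_candidates(rows, columns)
```

With this data every single column is pruned without scanning the rows (at most 3 distinct values for 6 rows), and so is (country, flag) since 2*2 < 6. Only (year, country) and (year, flag) are actually checked, and only the first is unique.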
Created on 2023-03-01 with reprex v2.0.2