davidski commented 5 years ago

Is it possible, perhaps outside of the dm ecosystem, to model relationships between tables that aren't based on FK-PK relationships. Think of a situation where there is a flights and an airports table, where some airport codes may not exist in the flights table (or the other way around). These would definitely be messy data relationship, but do occur fairly frequently in the data I work with. My current approach is to write lots of dplyr join statements (and try to remember the relationships) or to build views in a lightweight system like sqlite and query the views. It would be nice to be able to codify these relationships directly within R.

Thanks for the time and the package!

krlmlr commented 5 years ago

Thanks for your feedback!

We can model weak constraints as part of a dm -- the caller needs to use cdm_add_fk(check = FALSE) when adding (and this is the default as of b23b969d). When copying a dm to a database, I haven't found a good way to model this relationship so that it can be learned again with cdm_learn_from_db(), but these weak constraints could also be learned from the data if needed.

56 is about showing, at a glance, which keys satisfy data integrity constraints.

Are you suggesting to draw weak constraints differently in the diagram?

krlmlr commented 3 years ago

I think the next step is to implement dm_repair_constraints():

create dummy dimensions
alternatively: remove dangling facts
remove duplicate primary entries, perhaps rule-based by ordering in a column
optionally create a dm that contains the missing/added parts

moodymudskipper commented 3 years ago

Please let me know if I understand this correctly.

The following example shows unsatisfied constraints :

dm_nycflights13() %>%
  dm_examine_constraints() 
#> ! Unsatisfied constraints:
#> * Table `flights`: foreign key tailnum into table `planes`: values of `flights$tailnum` not in `planes$tailnum`: N725MQ (6), N537MQ (5), N722MQ (5), N730MQ (5), N736MQ (5), ...

In this case dm_repair_constraints() would either (depending on a parameter) :

add rows to parent tables, updating the primary keys, filling other columns with NA (default)
remove rows from child tables where a primary key is not found

Another issue that might come up is duplicate primary entries.

We might optionally return a new dm that contains only data related to missing/added parts, rather than the full dm (default)

If I understand the above correctly we might modify dm_examine_constraints() so it contains one more column with the structured info that was formatted into the problem column, and it would be called by dm_repair_constraints(), would this work ?

krlmlr commented 3 years ago

Good idea, I think this could compose nicely with dm_rows_*(). The return for dm_examine_constraints() could gain a new column that contains a "repair plan": a named list of dm objects:

plan <- structure(
  list(
    insert = dm(...),
    update = dm(...),
    delete = dm(...),
    ...
  ),
  class = "dm_repair_plan"
)

dm_apply_repair_plan <- function(dm, plan, in_place = NA, progress = NA) {
  # Emit message only once (also fix in the current implementation?)
  progress <- check_progress(progress)

  out <- dm
  if (!is.null(plan$insert)) {
    out <- dm_rows_insert(dm, plan$insert, in_place = in_place, progress = progress)
  }
  if (!is.null(plan$update)) {
    out <- dm_rows_update(dm, plan$update, in_place = in_place, progress = progress)
  }
  ...
  out # with visibility depending on in_place
}

One thing that's going to be a bit difficult is the disambiguation of duplicate primary keys. Maybe this has to be a two-step process: examine + clean PK, examine + clean FK.

Let's start with FK constraints first, and see how it feels.

moodymudskipper commented 3 years ago

I have issues with dm_rows_delete(), it seems it can only remove based on a primary key and only on the table that contains this primary key.

In my use case using dm_nycflights13() I need to be able to delete rows from the flights table, but it has no primary key. I was expecting that dm_rows_delete() would either:

Work without primary key, by doing an anti_join between the common tables of both dm arguments.
Propose an option, if not the default, to remove rows in fact tables where foreign keys match removed values from dimension table.

Do you have any pointer ? I might be missing something.

moodymudskipper commented 3 years ago

As far as I can tell, check_pk checks only if there are duplicate or missing primary keys, this can never happen if the source is remote right ? So for this part I can assume that the source is local and use functions that are not translated into SQL such as rle ?

moodymudskipper commented 3 years ago

Re : Emit message only once

Is there a pattern you're already using for this? I would create an environment object globals in the package containing an item warn_progress set to FALSE, and when displaying the message the first time it would be toggled to TRUE, would this work ?

krlmlr commented 3 years ago

Maybe (in a separate PR):

library(dm)

dm <-
  dm_nycflights13() %>%
  dm_add_pk(flights, c(year, month, day, carrier, flight), check = TRUE)

dm %>%
  dm_draw()

^{Created on 2021-05-28 by the reprex package (v2.0.0)}

Otherwise, the canonical solution in the database world seems to be that we assume that the primary key consists of all columns.

There can be duplicate PK on the database if the constraint is defined only locally in the dm. Most operations in dm are backend-agnostic, with very few exceptions.

To avoid repeated messages, we could do once:

progress <- check_progress(..., progress = progress)

and then pass the resulting progress to the child functions. check_progress() should be fast if the input is TRUE/FALSE and never return NA . The same function could also be used by child functions.

library(rlang)
library(progress)
bench::mark(is_installed("progress"))
#> # A tibble: 1 x 6
#>   expression                    min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 is_installed("progress")   7.61µs   9.78µs    84544.        0B     16.9

^{Created on 2021-05-28 by the reprex package (v2.0.0)}

To make it even faster, check_progress() could return structure(TRUE, checked = TRUE) or some other sentinel to avoid calling is_installed() repeatedly. Do we ever need nested progress bars?

moodymudskipper commented 3 years ago

Regarding the repeated error message for progress I thought you meant once per session, but I see now that you meant once per function call, got it now!

krlmlr commented 1 year ago

The now closed PR seemed to be one of the next actions back then.

cynkra / dm

`dm_repair_constraints()` (OP: Supporting relationships with weaker constraints) #4

56 is about showing, at a glance, which keys satisfy data integrity constraints.