dolthub / dolt

Dolt – Git for Data
https://www.dolthub.com
Apache License 2.0
18.01k stars 516 forks source link

Foreign key on different database #3022

Open julienperret opened 2 years ago

julienperret commented 2 years ago

The following does not work:

Using dolt sql:

CREATE DATABASE parent_schema;
CREATE TABLE parent_schema.parent(
  id INT,
  PRIMARY KEY(id)
);

CREATE DATABASE child_schema;
CREATE TABLE child_schema.child(
  id INT,
  parent_id INT,
  PRIMARY KEY(id),
  CONSTRAINT parent_ref FOREIGN KEY(parent_id) REFERENCES parent_schema.parent(id)
);

It produces the error:

table not found: parent

But still creates a table without the foreign key:

describe child_schema.child;

Produces:

+-----------+------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------+------+------+-----+---------+-------+ | id | int | NO | PRI | | | | parent_id | int | YES | | | | +-----------+------+------+-----+---------+-------+

Am I missing something?

Tried with dolt version 0.36.0 and 0.37.6.

PS: Yes, my example is very much inspired by #1104

timsehn commented 2 years ago

@Hydrocharged committed to have this feature supported in foreign keys v2 delivered by end of month.

Hydrocharged commented 2 years ago

I spoke to the team and it looks like I made an error in judgement.

The SQL engine that Dolt uses is go-mysql-server(GMS), and I'm working on a full foreign key rewrite that will bring cross database support for foreign keys. Dolt, as an integrator of GMS, will syntactically support cross-database foreign keys, however how Dolt manages multiple databases within a server context is at odds with the referential integrity provided by foreign keys.

In simple terms, Dolt stores each database as its own repository, and creating a foreign key across repositories will cause issues with many of our versioning features as many of them do not support multiple databases. Also, due to their separate commit histories, there is no way to enforce referential integrity when changing to a different branch, reverting a commit, etc. So although there will be a "technical" form of support in Dolt, it may be in our best interest to disable cross database support for foreign keys.

If you do not require that each database belong to its own repository, then we may be able to have a special database statement that essentially creates a new namespace within the same commit history, but that's something I'd have to discuss with the team (assuming it is something you would want).

zachmu commented 2 years ago

There are a couple ways we could move forward with this requirement, but both of them require more work than Daylon originally thought.

1) Support multiple DBs in the same commit history. This is on our road map here: https://github.com/dolthub/dolt/issues/3043 2) Defer cross-DB foreign key enforcement unless all necessary DBs are present in a local checkout. We would need to think about how this works with remotes, when to do the enforcement during clones / pulls, etc. The problem is that many dolt command line commands (e.g. dolt pull) currently operate on a single database, and in order to enforce cross-DB foreign key constraints we would need to expand these operations to include many more databases, figure out what to do if a violation were introduced, etc. 3) Create cross-db foreign key constraints, but do not enforce them. They would be informational only.

If you could tell us a little more about your use case, it would help us plan and decide how to tackle this request.

Thanks,

Zach

julienperret commented 2 years ago

The use case in an application that allows to edit multiple independant datasets. The datasets being independant, we would rather keep the commit graphs independant.

The need for a foreign key is to have a parent/"common" dabatase (with its own commit graph) is simply to have an overview of all datasets and a shared primary key for datasets. It might not be the best way to do this with dolt though...

We are open to suggestions!

Thanks, Julien

zachmu commented 2 years ago

Thanks for getting back to me. I can think of a couple ways forward.

Depending on the size of the common database and how often you expect it to change, a reasonable solution might be to just embed it in each child database, which would mean that the foreign keys are all within the same database. Then ship updates to the children periodically by dropping and re-creating the copy using the data from the parent main.

If that doesn't work for you, a solution relatively near to hand would be to support cross-db foreign keys as informational only during database operations, and to implement an out-of-band way to check the consistency if all dbs are present. We have similar functionality already with the dolt constraints verify command, which is used to verify if foreign key constraints are actually valid in the database (since it's possible to disable them during a session).

The other way forward is to put everything in the same database / commit graph so the constraints are self-contained. This again depends on the size of your data and what workflow you are optimizing for that makes multiple DBs attractive to begin with. Some ideas of how that could work:

1) Have a branch or a fork called main that contains the common table data, and no other data 2) Have a branch or a fork for each child database, based off the parent 3) When the common DB changes, merge those changes into each child

Whether to use forks or branches depends on the size of the and how you intend to structure your application. With branch, every client gets all the data when they clone or pull. With forks, they get only the data on that fork when they clone or pull. Either solution lets people manage multiple branches based off whatever they consider their base branch.

Let me know what else I can tell you. Also feel free to bug me more on discord. Let's figure out how to get you up and running.