TonicAI / condenser

Condenser is a database subsetting tool
https://www.tonic.ai
MIT License
312 stars 48 forks source link

subset_downstream generates incorrect SELECT queries if target table has multiple fk references #24

Closed mjvotaw closed 2 weeks ago

mjvotaw commented 2 years ago

I noticed this just now while testing this tool against a client's database. They have several tables that each contain 2 columns that are referenced in foreign key relationships. Consider the following tables:

CREATE TABLE IF NOT EXISTS public."user"
(
    id integer NOT NULL,
    other_key uuid NOT NULL,
    CONSTRAINT user_pkey PRIMARY KEY (id),
    CONSTRAINT user_unique_key UNIQUE (other_key)
)

CREATE TABLE IF NOT EXISTS public.some_table
(
    id integer NOT NULL,
    user_id integer,
    CONSTRAINT some_entry_pkey PRIMARY KEY (id),
    CONSTRAINT "some_entry-user-pk" FOREIGN KEY (user_id)
        REFERENCES public."user" (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
)

CREATE TABLE IF NOT EXISTS public.another_table
(
    id integer NOT NULL,
    user_uuid uuid,
    CONSTRAINT another_table_pkey PRIMARY KEY (id),
    CONSTRAINT another_table_user_uuid_fkey FOREIGN KEY (user_uuid)
        REFERENCES public."user" (other_key) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
)

subset_downstream generates the following query:

SELECT "user_id" FROM "public"."some_table" WHERE ("user_id") NOT IN (SELECT "other_key" FROM "public"."user")

Which then throws the following error:

psycopg2.errors.UndefinedFunction: operator does not exist: integer = uuid
LINE 1: ..._id" FROM "public"."some_table" WHERE ("user_id") NOT IN (SE...
                                                             ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

It looks like the issue stems from the fact that pk_columns is only set once, instead of being set for each object in referencing_tables. This might be a naive solution, but simply adding pk_columns = r['target_columns'] before the query is built results in the data being properly copied to the destination db.

theaeolianmachine commented 2 years ago

Thanks for sending this our way — it's actually a little bit more complicated than however. Notably, you could have a world in which different foreign key constraints have different ways of uniquely identifying a row in a target table. For example:

Unfortunately condenser is not built for these more complicated schemas, unlike Tonic's built-in subsetter which handles your case, the above case, and an additional variety of others. We'd happily look at a pull request however!

keenanwells-tatari commented 1 year ago

Been assessing the tool for use with our stack and also ran into this issue. Just wondering if this issue also occurs with the premium database subsetter? Thanks!

bricct commented 1 year ago

Yes! The premium database subsetter does indeed handle this case, as well as many other more complicated situations with compound keys, cycles, etc.

If you're interested in exploring our premium tool, you can test it out in a free trial at this link.