TonicAI / condenser

Condenser is a database subsetting tool
https://www.tonic.ai
MIT License

Iterative subset crashes #9

Closed: heatherbooker closed this issue 5 years ago

heatherbooker commented 5 years ago

I tried out python iterative_subset.py since python direct_subset.py wasn't working with my setup, and I'm getting the following error:

Traceback (most recent call last):
  File "iterative_subset.py", line 11, in <module>
    all_tables = list_all_tables(source_dbc.get_db_connection())
NameError: name 'source_dbc' is not defined

I'm wondering if it needs this (and surrounding) setup code which exists in the direct_subset script?

https://github.com/TonicAI/condenser/blob/30e12bd9aea4a083306e57f250b1bf91f61c0287/direct_subset.py#L12
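
For reference, the kind of setup I mean looks roughly like the sketch below; the DbConnect helper and the config accessor names are guessed from the traceback and from direct_subset.py's pattern, so treat them as assumptions:

# Hypothetical sketch of the setup iterative_subset.py seems to be missing;
# DbConnect and get_source_db_connection_info are assumed from direct_subset.py.
import config_reader
from db_connect import DbConnect  # assumed module/class name

config_reader.initialize()  # load config.json
source_dbc = DbConnect(config_reader.get_source_db_connection_info())
# list_all_tables comes from the same import the traceback shows
all_tables = list_all_tables(source_dbc.get_db_connection())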

heatherbooker commented 5 years ago

Update: if I move the two all_tables lines to after the point where source_dbc is defined, it gets mad if any field is missing from the config file (which the direct subset script didn't do). By "gets mad" I mean:

Traceback (most recent call last):
  File "iterative_subset.py", line 66, in <module>
    lower_limit, lower_limit_norm, upper_limit, upper_limit_norm = compute_fast_limits()
  File "iterative_subset.py", line 40, in compute_fast_limits
    desired_result = config_reader.get_target_percent()
  File "/Users/heather.booker/dev/condenser/config_reader.py", line 27, in get_target_percent
    return _config['desired_result']['percent']
KeyError: 'desired_result'

And similarly with excluded_tables.

Maybe this is intended and I would have experienced it later if the direct_subset script had gotten further? I'm not sure. ><
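
(For what it's worth, a missing-key-tolerant version of that lookup would be a small change in config_reader; a sketch only, assuming _config is a plain dict and making up a fallback value:)

# Sketch: a tolerant variant of config_reader.get_target_percent;
# the fallback of 100 (i.e. "keep everything") is an assumption.
def get_target_percent():
    return _config.get('desired_result', {}).get('percent', 100)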

acolombi commented 5 years ago

I just pushed a bunch of updates. Maybe give it another whirl after pulling the latest? Of note: we deprecated iterative_subset.py, so I think our best bet is to debug your issue with direct_subset.py.

So what's the problem you're seeing with direct_subset.py? If you can send a copy of your schema, that might help us diagnose more quickly. The easiest way to do that is with pg_dump (https://www.postgresql.org/docs/9.6/app-pgdump.html). The command you would use looks like:

$ pg_dump -h <host> -U <user> -p <port> -d <database> --schema-only -f <output_file>.sql

If you don't want to paste the schema here I can give you an email address to send it to. Thanks.

heatherbooker commented 5 years ago

Ah ok cool. I think the reason it wasn't working for me was that I failed to read the "Known Issues" section, which says:

Only works with bigint, non-compound primary and foreign keys.

:)

Any plans in the future for that to change?

acolombi commented 5 years ago

Is it the compound part or the int part that's a challenge for you? (Or, I guess, both?)

heatherbooker commented 5 years ago

Probably both but definitely at least the compound part!

heatherbooker commented 5 years ago

But should I be able to run it ignoring the tables involving those keys? Like

# config.json
{
    "passthrough_threshold": 50,
    "excluded_tables": [
        {"schema": "public","tables": [ "nos", "nopes", "no_nopes"] }
    ],
    "passthrough_tables": [
        {"schema": "public","tables": [ "nos", "nopes", "no_nopes"] }
    ],
    "desired_result": { "percent": 5, "schema": "public", "table": "potatoes" },
 }

With the above, I expect it to ignore nos, nopes, and no_nopes, but it still chokes on them with "Exception: Creating tables failed." In create_error.txt:

psql:/me/condenser/SQL/dump_create.sql:5788:
ERROR: column "nopes.none_of_this" must appear in the GROUP BY clause
       or be used in an aggregate function
LINE 3:     nopes.none_of_this,

acolombi commented 5 years ago

Yes, excluded_tables should ignore the tables, if possible. It might be the case that the tables with compound keys are required: for example, if you subset on table user, which has a FK to address, and address uses a compound PK. In that scenario you can't take any of the user table without also subsetting address, so excluding the table would fail. (Though one way you can handle that is to specify that relationship in the dependency_breaks, e.g. [{"parent": "public.user", "child": "public.address"}].)
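
For concreteness, in config.json that break would look something like this (a sketch following the example above):

"dependency_breaks": [
    {"parent": "public.user", "child": "public.address"}
]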

I just pushed some changes that might address the issue you're seeing. It's hard to know without a better idea of what's in your schema. Any chance you can share it?

Otherwise my guess is that it has something to do with using a VIEW that contains a GROUP BY in your schema. This was an issue we addressed in our commercial offering so I just ported those changes over.

acolombi commented 5 years ago

Hi @heatherbooker , have you had a chance to check out the new version? Thanks.

heatherbooker commented 5 years ago

Hi @acolombi, sorry for the hold-up. Unfortunately I can't share the schema! If I try the new version, it seems that the config format changed to accept target but the README and example.config.json did not, just a heads up!

If I combine the schema and table keys in my config file with target, it gets farther now! It hits:

Beginning subsetting with these direct targets:
...
Beginning greedy downstream subsetting with these tables:
...

And then gets trapped on:

Traceback (most recent call last):
  File "/me/condenser/subset.py", line 121, in __subset_greedily
    database_helper.run_query('INSERT INTO "{}"."{}" {}'.format(schema_name(target), table_name(target), query), destination_conn)
  File "/me/condenser/database_helper.py", line 87, in run_query
    cur.execute(query)
psycopg2.ProgrammingError: operator does not exist: integer = character varying
LINE 1: ...long_table_name"."foreign_key_type_integer" IN (SELECT...
                                                             ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "direct_subset.py", line 30, in <module>
    s.run_middle_out()
  File "/me/condenser/subset.py", line 81, in run_middle_out
    self.__subset_greedily(t, processed_tables, relationships)
  File "/me/condenser/subset.py", line 126, in __subset_greedily
    database_helper.run_query('DROP TABLE IF EXISTS "{}"."{}"'.format(self.temp_schema, temp_target_name), destination_conn)
  File "/me/condenser/database_helper.py", line 87, in run_query
    cur.execute(query)

In any case, it's very possible that there are some dependencies that I should be breaking but am not, but I'm probably not going to end up going through and finding them. I'm not sure if the above is helpful to you but I really appreciate the help and fixes you've been providing through this!
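
(For context, that operator does not exist error is Postgres refusing to compare an integer column against character varying values without a cast; a minimal reproduction with made-up tables:)

CREATE TABLE parent (id integer);
CREATE TABLE child (parent_ref character varying);
SELECT * FROM parent WHERE id IN (SELECT parent_ref FROM child);
-- ERROR:  operator does not exist: integer = character varying
-- An explicit cast works, assuming the varchar values are numeric:
SELECT * FROM parent WHERE id IN (SELECT parent_ref::integer FROM child);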

acolombi commented 5 years ago

Ah yes, thanks for the feedback. The README and example have been updated.

It should give a pretty clear error message when there are unbroken circular dependencies. It'll say something like:

$ python direct_subset.py 
Traceback (most recent call last):
  File "direct_subset.py", line 30, in <module>
    s.run_middle_out()
  File "./condenser/subset.py", line 59, in run_middle_out
    order = get_topological_order_by_tables(relationships, connected_tables)
  File "./condenser/topo_orderer.py", line 6, in get_topological_order_by_tables
    return list(toposort(topsort_input))
  File "~/.pyenv/versions/p35/lib/python3.5/site-packages/toposort.py", line 81, in toposort
    raise CircularDependencyError(data)
toposort.CircularDependencyError: Circular dependencies exist among these items: {'public.events':{'public.users'}, 'public.ignore_me':{'public.users'}, 'public.places':{'sub.regions'}, 'public.users':{'sub.regions'}, 'sub.regions':{'public.users'}}
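
In that example the cycle is public.users <-> sub.regions, and breaking either edge of it in dependency_breaks clears the error; a sketch, with the parent/child orientation following the user/address example earlier:

"dependency_breaks": [
    {"parent": "public.users", "child": "sub.regions"}
]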

So I think the error is not a circular dependency you're forgetting to break...

Honestly, though, I don't know what it is. Tracing through the code, I'm not sure what would cause an error like the one you're experiencing. Is there any chance you can share just the schema surrounding the table and foreign key identified in the exception you got?

Another option is you could try our commercial product, which also features subsetting (among other abilities, including data masking and data synthesis), but has a more robust implementation. (This Python code is mostly a playground for trying ideas out.)

Thanks.

heatherbooker commented 5 years ago

Ah yeah, I was thinking about the dependencies, in terms of excluding tables with compound keys, because of what you mentioned earlier:

It might be the case that the tables with compound keys are required: for example, if you subset on table user, which has a FK to address, and address uses a compound PK. In that scenario you can't take any of the user table without also subsetting address, so excluding the table would fail. (Though one way you can handle that is to specify that relationship in the dependency_breaks, e.g. [{"parent": "public.user", "child": "public.address"}].)

Actually, now that I think about it: for this line in the output, the first part isn't actually a table name - it seems to be something prepended to it, i.e. the table is called long_table_name but it shows up as ...mp_long_table_name:

psycopg2.ProgrammingError: operator does not exist: integer = character varying
LINE 1: ...mp_long_table_name"."foreign_key_type_integer" IN (SELECT...

As for the schema, long_table_name does have an index (though not on just the foreign_key_type_integer shown in the error output): UNIQUE CONSTRAINT, btree (foreign_key_type_integer, other_fk_to_different_table_also_type_integer) - so maybe that has something to do with what it's catching on?
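
(If it helps anyone later, a quick way to double-check the declared types on both sides of that failing IN is the standard information_schema; the table/column names here are the anonymized ones from the error output:)

SELECT table_schema, table_name, column_name, data_type
FROM information_schema.columns
WHERE table_name = 'long_table_name'
  AND column_name = 'foreign_key_type_integer';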

heatherbooker commented 5 years ago

Anyway I'm gonna close since I don't want either of us to sink too much time into this, but thanks so much for your help!!