JordanMarr / SqlHydra

SqlHydra is a suite of NuGet packages for working with databases in F#, including code generation tools and query expressions.
MIT License

Filters for large databases #62

Closed reaching-for-libra closed 8 months ago

reaching-for-libra commented 10 months ago

I'm working with a very large database (Oracle, in this case), from which I want to pull only a few tables. After setting filters in the .toml file and trying to generate the library, I'm finding that it takes far too long.

I wonder if the filtering logic could be partially applied earlier, when the GetSchema() methods are called. GetSchema() has an overload for specifying restrictions that might offer some performance gains here.

Here's a link to the Oracle docs, but I believe it's the same idea in each of the providers SqlHydra is using.

https://docs.oracle.com/cd/E85694_01/ODPNT/ConnectionGetSchema3.htm

JordanMarr commented 10 months ago

How long does it take? How many tables do you have?

It looks like the current filtering could be more efficient because it is loading all the columns for tables before the tables are filtered.

reaching-for-libra commented 10 months ago

I checked some of the unfiltered results for the GetSchema calls:

Tables -> ~50,000 (returned this data quickly)
Views -> ~40,000 (returned quickly)
Columns -> ~4,000,000 (returned after about 45 minutes)

I kicked off the SqlHydra tool and had to kill it after 3 hours.

Practically speaking, I think I'll just need to ask the db admins to create a user that can only access the tables that I'm interested in. Not ideal for dev, in my case, but not unreasonable.

~BTW - I'm finding now that I misunderstood how filters work - in my case, I just wanted 2 tables in scope. I added those to the Include section of the config, but I guess there's no way in the exclude to exclude everything except those two. That would be why the file generation was especially long in my case. Am I overlooking a simple way to do it without adding a bunch of individual schema/table names/wildcards?~

Edit: Disregard that last part; I incorrectly used a "." instead of a "/" between the schema and table name.

JordanMarr commented 10 months ago

It looks like GetSchema for all the db providers accepts a restrictionValues: string array argument. For Oracle, it looks like:

restrictions[0] = schema/owner restrictions
restrictions[1] = table restrictions
restrictions[2] = column restrictions

I would assume that these would accept multiple args, but the example doesn't show what the delimiter is. I'm also not sure if all the providers use the same format.
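For illustration, here's a rough sketch of what a restricted GetSchema call might look like from F# against an Oracle connection. The restriction positions follow the list above, but they can differ per provider and collection, and the schema/table names are just placeholders:

```fsharp
open Oracle.ManagedDataAccess.Client

let getRestrictedColumns (connectionString: string) =
    use conn = new OracleConnection(connectionString)
    conn.Open()
    // Restriction positions assumed per the Oracle docs linked above: owner, table, column.
    // A null entry means "no restriction" for that position.
    let restrictions = [| "SOME_SCHEMA"; "SOME_TABLE"; null |]
    conn.GetSchema("Columns", restrictions)
```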

I don't think the current [filters] include / exclude in the TOML file can be repurposed for this use case. The TOML config would probably need to be extended with a new section. Maybe something like:

[schema_restrictions]
schema = [ "dbo" ]
tables = [ "Customers", "Orders" ]
columns = [ ]

There's still a lot of ambiguity though.

However, I will at least move the table filters to happen earlier so that columns are not processed for tables that have been filtered out. That way, even if it takes 45 minutes to pull down all the columns, it should still be able to process them quickly (based on the filtered set of tables).
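Roughly the idea (the types and function below are only a sketch, not SqlHydra's actual internals):

```fsharp
// Hypothetical record shapes, for illustration only.
type Table = { Schema: string; Name: string }
type Column = { TableSchema: string; TableName: string; Name: string }

// Apply the table filter first, then keep only columns whose parent table survived.
let filterSchema (isTableIncluded: Table -> bool) (tables: Table list) (columns: Column list) =
    let keptTables = tables |> List.filter isTableIncluded
    let tableKeys = keptTables |> List.map (fun t -> t.Schema, t.Name) |> Set.ofList
    let keptColumns =
        columns |> List.filter (fun c -> Set.contains (c.TableSchema, c.TableName) tableKeys)
    keptTables, keptColumns
```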

reaching-for-libra commented 10 months ago

That sounds like an improvement for sure.

Regarding the restrictions arguments - when I played with it last week, I found that each kind of schema ("Tables", "Columns", etc.) has a different parameter set - but none of them are wildcard based, so it would have to be done iteratively. It's also different per provider (I only looked at SQL Server and Oracle, but they had different parameter sets).

For a giant database, it's probably not unreasonable to just provide a concrete list without wildcards, and it could be a big savings if collected/filtered at the beginning. However, for smaller databases, it's probably not worth the effort.

JordanMarr commented 10 months ago

Perhaps a better implementation would be to allow the user to configure a simple array of restrictions. The config wouldn’t try to prompt for tables, schemas, etc. That way the user would be responsible for passing in the proper restrictions per the provider they are using.

reaching-for-libra commented 10 months ago

Sounds reasonable - let a user do things the current way, or else opt in to the advanced filtering if they need it

JordanMarr commented 10 months ago

I started to implement this and then realized that the GetSchema restrictions do not allow more than one "Tables", "Views" or "Columns" filter. I feel that this may not provide enough benefit to warrant the added implementation complexity. (Although, it could arguably be useful for filtering to a specific schema.) However, I did change the tables to be filtered before the columns are calculated, so that will at least provide some optimization for you.

JordanMarr commented 10 months ago

The more I think about it, it's probably worth having this feature just to be able to pre-filter by schema / owner.

reaching-for-libra commented 10 months ago

Thanks for giving this some attention, I'll check it out when I find the time!

JordanMarr commented 10 months ago

These changes are now available in v2.1.0. https://github.com/JordanMarr/SqlHydra/releases/tag/v2.1.0

Maybe you can take advantage of the new restrictions to filter to a specific owner. If not, then hopefully the table filtering optimization will help.

reaching-for-libra commented 10 months ago

Hey, this was a great improvement!

You noted the limitations in the config docs. Just a thought - it might be worth expanding the change to make the restrictions configuration a collection. In the schema providers, the GetSchema(..) calls could then happen once per record in the collection, instead of just once, and sTables / sViews / sColumns would be the collected results.

Hopefully that makes sense. This would allow for multiple tables and different schemas.
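Something along these lines (the function name and shape here are made up, just to illustrate the idea):

```fsharp
open System.Data
open System.Data.Common

/// Runs GetSchema once per restriction set and merges the resulting DataTables.
/// An empty list falls back to a single unrestricted call.
let getSchemaMerged (conn: DbConnection) (collection: string) (restrictionSets: string[] list) =
    match restrictionSets with
    | [] -> conn.GetSchema(collection)
    | first :: rest ->
        let merged = conn.GetSchema(collection, first)
        rest |> List.iter (fun r -> merged.Merge(conn.GetSchema(collection, r)))
        merged

// sTables / sViews / sColumns would then be the merged results, e.g.:
// let sTables = getSchemaMerged conn "Tables" restrictionSets
```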

JordanMarr commented 10 months ago

I did consider that as an option. However, I don't think there are enough users with schemas large enough to require this workaround; it's hard for me to justify the additional cyclomatic complexity it would impose. OTOH, adding basic support for restrictions was pretty easy with little to no downsides (even if no one ends up using it -- which I think is likely).

Other solutions:

Your schema is by far the largest I have heard of! Mind if I ask how it came to be so large?

reaching-for-libra commented 10 months ago

It's a standard Oracle EBS database, with some additional tables added over the years. For development purposes, the developers on my team have db access for pretty much all of the tables, so that's usually used early in development. Once things move along further, there's always going to be a user created with access only to a small set of tables that are needed.

So yea - outside of the box is certainly a sensible thing, as far as I'm concerned