Closed nj1973 closed 1 month ago
It looks like we make a fresh connection per schema, which could equate to a lot of connections on some systems. If we provide a list of allowed schemas then the filter is only applied to the source, and we still get a per-schema connection for every schema on the target:
```python
source_table_map = get_table_map(source_client, allowed_schemas=allowed_schemas)
target_table_map = get_table_map(target_client)
```
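To illustrate the cost, here is a minimal sketch (hypothetical helper and parameter names, not the real DVT API) of how a per-schema dictionary query multiplies when no schema filter is applied to one side:

```python
def get_table_map(schemas, allowed_schemas=None, query_log=None):
    """Build a {key: {"schema_name": ..., "table_name": ...}} map.

    Simulates one dictionary query per schema; allowed_schemas limits
    which schemas are queried at all.
    """
    table_map = {}
    for schema in schemas:
        if allowed_schemas and schema not in allowed_schemas:
            continue
        if query_log is not None:
            query_log.append(schema)  # stands in for a real dictionary query
        table_map[f"{schema}.t1"] = {"schema_name": schema, "table_name": "t1"}
    return table_map

all_schemas = ["public", "sales", "hr", "audit"]
queries = []
get_table_map(all_schemas, allowed_schemas=["sales"], query_log=queries)
print(queries)  # → ['sales']: one schema queried instead of four
```

Without the `allowed_schemas` argument, every schema in the target is queried, which is the unnecessary work described above.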
I expect this is because of the inexact name matching we use for `--score-cutoff`.
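For context on why inexact matching prevents reusing the source schema list on the target: with a cutoff below 1, differently named objects can still match. A minimal sketch using Python's `difflib` (the tool may use a different matcher internally):

```python
from difflib import SequenceMatcher

def match_score(source_name, target_name):
    """Similarity score in [0, 1] between two identifiers."""
    return SequenceMatcher(None, source_name, target_name).ratio()

# With a cutoff below 1.0, source schema "sales" can match target
# schema "sales_v2", so the target schema list cannot be assumed
# identical to the source's.
cutoff = 0.7
print(match_score("sales", "sales_v2") >= cutoff)  # → True
```

Only when the cutoff is exactly 1 (exact matching) is it safe to filter the target by the schemas matched on the source, which is what the reverted commit below relied on.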
I've added an exploratory commit to the feature branch which adds a target schema filter based on the matched source tables. This helps us avoid running unnecessary dictionary queries in the target schema. However, it does not address the fact that we don't appear to be re-using connections for PostgreSQL. I still need to research that.
My previous assertion that "we make a fresh connection per schema" was incorrect. SQLAlchemy is caching the connections, so we only take a single connection. The point that we sometimes run more dictionary queries than necessary still stands, but that doesn't explain the reported `FATAL: remaining connection slots are reserved for non-replication superuser connections` message.
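The pooling behaviour behind that correction can be modelled simply: a closed/released connection goes back to the pool and is handed out again on the next checkout, so sequential per-schema queries share one physical connection. This is a toy model of that behaviour, not SQLAlchemy's implementation:

```python
class SimplePool:
    """Toy connection pool: released connections are reused."""

    def __init__(self):
        self._idle = []
        self.created = 0  # number of physical connections ever opened

    def connect(self):
        if self._idle:
            return self._idle.pop()  # reuse an idle connection
        self.created += 1
        return {"id": self.created}  # open a new physical connection

    def release(self, conn):
        self._idle.append(conn)

pool = SimplePool()
for _ in range(10):
    conn = pool.connect()
    # ... run a per-schema dictionary query here ...
    pool.release(conn)
print(pool.created)  # → 1: ten checkouts, one physical connection
```

Connection exhaustion only appears when many connections are checked out concurrently without being released, which matches the later finding that the problem lies in multi-table validation rather than `find-tables`.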
There was a misunderstanding in the initial issue report: the connection-management problem occurs when validating multiple tables in a single command, not with `find-tables`.
I've backed out the optimization I made on this branch because I don't fully understand its side effects. I'll keep the tests though. The diff for the reverted commit is below for reference:
`data_validation/__main__.py`

```diff
@@ -462,7 +462,15 @@ def find_tables_using_string_matching(args):
     allowed_schemas = cli_tools.get_arg_list(args.allowed_schemas)
     source_table_map = get_table_map(source_client, allowed_schemas=allowed_schemas)
-    target_table_map = get_table_map(target_client)
+    target_schema_filter = None
+    if score_cutoff == 1:
+        # No fuzzy matching therefore the schemas matched in the source will apply to the target.
+        target_schema_filter = list(
+            set(_["schema_name"] for _ in source_table_map.values())
+        )
+    target_table_map = get_table_map(
+        target_client, allowed_schemas=target_schema_filter
+    )
```
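For reference, the reverted filter derivation can be reproduced standalone: when `score_cutoff == 1` it collects the distinct schema names from the matched source tables (sample map data invented here for illustration):

```python
# Invented sample data in the shape produced by get_table_map().
source_table_map = {
    "public.orders": {"schema_name": "public", "table_name": "orders"},
    "public.items": {"schema_name": "public", "table_name": "items"},
    "sales.leads": {"schema_name": "sales", "table_name": "leads"},
}
score_cutoff = 1

target_schema_filter = None
if score_cutoff == 1:
    # No fuzzy matching therefore the schemas matched in the source
    # will apply to the target.
    target_schema_filter = list(
        set(_["schema_name"] for _ in source_table_map.values())
    )

print(sorted(target_schema_filter))  # → ['public', 'sales']
```

Three source tables in two schemas reduce to a two-schema filter, so the target side would run dictionary queries against only those schemas instead of all of them.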
Capturing the issue report from the end user: they were testing `find-tables` to get all the tables so that the user needn't explicitly type all the table names, noticed that many PostgreSQL connections were opened, and ran into the below error: