GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0

find-tables exhausted PostgreSQL connections #1194

Closed · nj1973 closed 1 month ago

nj1973 commented 1 month ago

Capturing an issue report from an end user.

Testing find-tables to get all the tables so that the user needn't explicitly type all the table names. Noticed that many PostgreSQL connections were opened and ran into the error below:

File "/usr/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "127.0.0.1", port 5432 failed: FATAL:  remaining connection slots are reserved for non-replication superuser connections
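
For context, this FATAL message is what PostgreSQL raises once every non-reserved connection slot (max_connections minus superuser_reserved_connections) is in use. A minimal diagnostic sketch, assuming psycopg2 and a placeholder DSN, to confirm how close the server is to the limit:

    import psycopg2

    # Placeholder DSN; adjust for the environment under test.
    conn = psycopg2.connect("host=127.0.0.1 port=5432 dbname=postgres user=postgres")
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM pg_stat_activity")
        in_use = cur.fetchone()[0]
        cur.execute("SHOW max_connections")
        max_conns = int(cur.fetchone()[0])
        cur.execute("SHOW superuser_reserved_connections")
        reserved = int(cur.fetchone()[0])
    print(f"{in_use} of {max_conns - reserved} non-reserved slots in use")
    conn.close()
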
nj1973 commented 1 month ago

It looks like we make a fresh connection per schema, which could equate to a lot of connections on some systems. If we provide a list of allowed schemas it is only applied to the source, and we still get a per-schema connection for every schema on the target.

    source_table_map = get_table_map(source_client, allowed_schemas=allowed_schemas)
    target_table_map = get_table_map(target_client)

I expect this is because of the inexact name matching we use for --score-cutoff.
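
To illustrate (the matcher DVT actually uses may differ), score-based matching means a source table can match a target table whose name is not identical, so with a cutoff below 1 the target scan cannot be pre-filtered to the source schema names. A minimal sketch using the standard library's difflib:

    from difflib import SequenceMatcher

    def name_score(source_name: str, target_name: str) -> float:
        """Similarity in [0, 1]; 1.0 means an exact match."""
        return SequenceMatcher(None, source_name, target_name).ratio()

    # With a cutoff below 1, "sales.orders" can match "sales_old.orders"
    # even though the schemas differ, so every target schema must be scanned.
    print(name_score("sales.orders", "sales_old.orders"))  # ~0.83
    print(name_score("sales.orders", "sales.orders"))      # 1.0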

nj1973 commented 1 month ago

I've added an exploratory commit to a feature branch which adds a target schema filter based on the matched source tables. This helps us avoid running unnecessary dictionary queries in the target schema. However, it does not address the fact that we don't appear to be reusing connections for PostgreSQL. I still need to research that.

nj1973 commented 1 month ago

My previous assertion that "we make a fresh connection per schema" was incorrect. SQLAlchemy is caching the connections, so we only take a single connection. The point that we sometimes run more dictionary queries than necessary still stands, but that doesn't explain the reported FATAL: remaining connection slots are reserved for non-replication superuser connections message.
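
As a minimal sketch of that pooling behaviour, assuming SQLAlchemy 1.4+ with its default QueuePool and a placeholder URL: repeated connect/close cycles check the same connection in and out of the pool rather than opening new server connections:

    from sqlalchemy import create_engine, text

    # Placeholder URL; PostgreSQL engines default to QueuePool.
    engine = create_engine("postgresql+psycopg2://user:pass@127.0.0.1:5432/db")

    for _ in range(10):
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))

    # A single server connection was opened and reused throughout.
    print(engine.pool.status())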

nj1973 commented 1 month ago

There was a misunderstanding in the initial issue report: the problem with connection management occurs when validating multiple tables in a single command, not with find-tables.
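
Not the project's code, but a hedged sketch of the direction a fix could take: share one capped, pooled engine across all table validations in a command, so the number of server connections is bounded regardless of how many tables are validated. The table names and URL below are placeholders:

    from concurrent.futures import ThreadPoolExecutor
    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql+psycopg2://user:pass@127.0.0.1:5432/db",
        pool_size=5,     # at most 5 persistent server connections
        max_overflow=0,  # never burst beyond pool_size
    )

    def validate(table: str) -> int:
        # Checks a connection out of the shared pool and returns it on exit.
        with engine.connect() as conn:
            return conn.execute(text(f"SELECT count(*) FROM {table}")).scalar()

    with ThreadPoolExecutor(max_workers=5) as pool:
        counts = list(pool.map(validate, ["t1", "t2", "t3"]))
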

I've backed out the optimization I made on this branch because I don't fully understand its side effects. I'll keep the tests, though. The diff for the reverted commit is below for reference:

data_validation/__main__.py
@@ -462,7 +462,15 @@ def find_tables_using_string_matching(args):

     allowed_schemas = cli_tools.get_arg_list(args.allowed_schemas)
     source_table_map = get_table_map(source_client, allowed_schemas=allowed_schemas)
-    target_table_map = get_table_map(target_client)
+    target_schema_filter = None
+    if score_cutoff == 1:
+        # No fuzzy matching therefore the schemas matched in the source will apply to the target.
+        target_schema_filter = list(
+            set(_["schema_name"] for _ in source_table_map.values())
+        )
+    target_table_map = get_table_map(
+        target_client, allowed_schemas=target_schema_filter
+    )
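
For reference, the reverted hunk's filter logic in standalone form (a sketch only; score_cutoff and the table-map shape, dict values carrying a schema_name key, are as in the hunk above):

    def derive_target_schema_filter(source_table_map, score_cutoff):
        # With fuzzy matching (score_cutoff < 1) a source table may match a
        # target table in a differently named schema, so no filter is safe.
        if score_cutoff != 1:
            return None
        # Exact matching: only schemas seen in the source map can appear.
        return list({entry["schema_name"] for entry in source_table_map.values()})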