dbcodeio / public

The Power of Databases, The Convenience of VS Code: All in One Place
19 stars 1 forks source link

Duplicate tables in Redshift Connection #18

Closed sgpeter1 closed 1 month ago

sgpeter1 commented 1 month ago

Not sure what could be causing this, but there are duplicate tables listed with a Redshift connection. Sometimes one with 0 rows, sometimes one with partial rows.

Also it took a very, very long time to load the schema/table explorer.

image

mikeburgh commented 1 month ago

That's odd, can you try run the following SQL and just confirm the tables column does not contain multiple tables ?

;WITH schemas AS (
    select oid as schema_id, nspname as schema_name
    from pg_catalog.pg_namespace
)
,tables AS (
    select pgc.oid as table_id
           ,table_name
           ,table_type
           ,table_schema
           ,obj_description(pgc.oid) as comment
       ,reltuples as rows
    FROM information_schema.tables t
    JOIN pg_catalog.pg_class pgc ON t.table_name = pgc.relname
    Where table_type = 'BASE TABLE'
)
select schema_id as id, schema_name as name,
    array(

            select table_name::text
            from tables
            where tables.table_schema = x.schema_name
            order by table_name
    ) as tables
from schemas as x 
order by schema_name;

It's a cut down version of the introspection script that is relevant for the tables, there is no joining so not sure what would cause them to duplicate.

The speed is due to the introspection SQL that is run when you connect to gather all the schema's and data for the database you connect to. It happens on each connect, but the first is the worst, after that it happens in the background and updates once complete.

sgpeter1 commented 1 month ago

Yes, there are duplicate identical listings in the result SQL!

mikeburgh commented 1 month ago

So just to confirm, the rows are duplicates ? So the same schema is listed twice with the same data…

Or just within the tables column at the end there is duplicate table names ?

sgpeter1 commented 1 month ago

If I just run the tables CTE, I see this duplication. Despite this, I don't see duplication in other SQL tools. I suppose they must be using a different introspection approach? image

mikeburgh commented 1 month ago

Ahh thanks, perfect… it’s the rows column redshift must have multiple rows in that table for each table

mikeburgh commented 1 month ago

Okay there is nothing in the postgres docs about why there would be two rows for each table in there.

I am hoping it's something specific like the index and table share the same name, hence in the first script below I filtered by kind

Can you try these two and let me know if either stop the duplication, and show the right approx counts for the rows (eg not the 0.001

select pgc.oid as table_id
           ,table_name
           ,table_type
           ,table_schema
           ,obj_description(pgc.oid) as comment
       ,reltuples as rows
    FROM information_schema.tables t
    JOIN pg_catalog.pg_class pgc ON t.table_name = pgc.relname and pgc.relkind = 'r'
    Where table_type = 'BASE TABLE'
select pgc.oid as table_id
       ,table_name
       ,table_type
       ,table_schema
       ,obj_description(pgc.oid) as comment
       ,pgc.reltuples as rows
FROM information_schema.tables t
JOIN 
    (
        select Max(reltuples) as reltuples , oid, relname
        from pg_catalog.pg_class  
        group by oid, relname
    )  pgc 
    on t.table_name = pgc.relname
Where table_type = 'BASE TABLE'
sgpeter1 commented 1 month ago

Unfortunately, both queries have duplication. Part of the issue is the the table_id is unique between the duplicates.

Maybe it's a Redshift-specific issue? The Redshift docs have examples of catalog queries that might be helpful. Incidentally, I don't have access to the STV_TBL_PERM, but I think you might be able to sidestep that table, especially given the note that "Amazon Redshift table names are stored in both PG_TABLES and STV_TBL_PERM; where possible, use PG_TABLES to return Amazon Redshift table names."

mikeburgh commented 1 month ago

Ahh, oid's are duplicated...

Okay one more try.. can you run this:

select oid, relname, relnamespace
from pg_catalog.pg_class 

When the tablenames (relname in that query) are duplicated with the different oid's are the relnamespace also different ?

Failing that I can drop the oid, we loose comments for the table though.

sgpeter1 commented 1 month ago

Yes, the relnamespace is different, unfortunately.

mikeburgh commented 1 month ago

Actually that's good news.. and it matches what I just replicated...

I think you might have (or redshift has done it for you) the same table names in different schemas... and this query was not handling that...

Last check... This query does not duplicate them right (or well it will, but they will show up in a different table_schema)

select pgc.oid as table_id
           ,table_name
           ,table_type
           ,table_schema
           ,obj_description(pgc.oid) as comment
       ,reltuples as rows
    FROM information_schema.tables t
    join pg_catalog.pg_namespace n on (table_schema = nspname)
    JOIN pg_catalog.pg_class pgc ON n.oid = pgc.relnamespace and t.table_name = pgc.relname and pgc.relkind = 'r' 
    Where table_type = 'BASE TABLE'
sgpeter1 commented 1 month ago

No, the only columns that are different here for the duplicated records are table_id and rows.

mikeburgh commented 1 month ago

Okay, can you try Version 1.2.2 and let me know if it solves the duplication ?

sgpeter1 commented 1 month ago

Duplication resolved! Thank you.

mikeburgh commented 1 month ago

Thanks for all the testing, and patience!