BritishGeologicalSurvey / etlhelper

ETL Helper is a Python ETL library to simplify data transfer into and out of databases.
https://britishgeologicalsurvey.github.io/etlhelper/
GNU Lesser General Public License v3.0

Add describe_columns function #127

Closed volcan01010 closed 2 years ago

volcan01010 commented 2 years ago

Summary

As an ETLHelper user, I would like to get metadata on a table e.g. column names and types so that I can programmatically generate SQL queries.

Description

Hard-coded SQL queries can be cumbersome when tables have many columns. etlhelper provides `load` to programmatically generate insert queries, and `copy_table_rows` to programmatically copy the rows from a `SELECT * FROM table` query, but these only cover the simplest cases.

A describe_columns(table, conn) function would return metadata about a database table, with column names and types as a minimum. These data can then be used to programmatically build a query. For example, a geospatial insert query could wrap the geometry column in ST_GeomFromText(%(geom)s, 4326) to convert WKT into a geometry column.
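For illustration, the kind of query generation this enables could be sketched as follows (the Column tuple, table name and metadata here are hypothetical stand-ins for what describe_columns would return):

```python
from collections import namedtuple

# Hypothetical column metadata; a real describe_columns(table, conn)
# would query the database for these values.
Column = namedtuple("Column", ["name", "type"])
columns = [Column("id", "integer"), Column("name", "text"),
           Column("geom", "geometry")]

# Build a named-parameter INSERT, wrapping geometry columns in
# ST_GeomFromText so that WKT input is converted on insert.
placeholders = []
for col in columns:
    if col.type == "geometry":
        placeholders.append(f"ST_GeomFromText(%({col.name})s, 4326)")
    else:
        placeholders.append(f"%({col.name})s")

names = ", ".join(col.name for col in columns)
query = f"INSERT INTO my_table ({names}) VALUES ({', '.join(placeholders)})"
print(query)
```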

Implementation

The SQL query used to select the metadata differs between database types. As such, it will need to be defined on the DbHelper classes. The queries will use the following internal tables / features:

The describe_columns(table, conn) function can live in a new module (utils). It should execute the SQL query from the DbHelper and return a list of Column named tuples with attributes for at least "name" and "type" ~~(and possibly "is_primary_key", "not_null", "default_value", which are provided in SQLite)~~.

The function should also raise an exception if the table name is ambiguous, e.g. if the same table name appears in different schemas. One way to detect this would be to check for non-unique column names in the result. This means that the function should accept both simple and schema-qualified names.
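A minimal sketch of that duplicate-name check (check_unambiguous and the sample rows are illustrative, not part of etlhelper):

```python
from collections import Counter

def check_unambiguous(rows):
    """Raise if any column name repeats, indicating the unqualified
    table name matched tables in more than one schema."""
    counts = Counter(name for _, name in rows)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            f"Table name is ambiguous; duplicate columns: {duplicates}")

# One row per (schema, column); the same table exists in two schemas.
rows = [("public", "id"), ("public", "value"),
        ("staging", "id"), ("staging", "value")]
try:
    check_unambiguous(rows)
except ValueError as exc:
    print(exc)
```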

Acceptance criteria

Calling describe_columns(table, conn) against a table returns data for the following connection types:

Bad column names raise an exception:

Schema qualified names can be used:

Additional criteria

volcan01010 commented 2 years ago

SQLite

```sql
SELECT name, type FROM pragma_table_info('my_table');
```
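This can be tried against an in-memory database (pragma_table_info requires SQLite >= 3.16; the table here is just an example):

```python
import sqlite3

# In-memory demo of the pragma_table_info query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT)")
rows = conn.execute(
    "SELECT name, type FROM pragma_table_info('my_table')").fetchall()
print(rows)  # [('id', 'INTEGER'), ('name', 'TEXT')]
```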
volcan01010 commented 2 years ago

PostgreSQL

```sql
SELECT
    pg_attribute.attname AS name,
    pg_catalog.format_type(pg_attribute.atttypid, pg_attribute.atttypmod) AS type
FROM
    pg_catalog.pg_attribute
INNER JOIN
    pg_catalog.pg_class ON pg_class.oid = pg_attribute.attrelid
INNER JOIN
    pg_catalog.pg_namespace ON pg_namespace.oid = pg_class.relnamespace
WHERE
    pg_attribute.attnum > 0
    AND NOT pg_attribute.attisdropped
    AND pg_namespace.nspname ~ COALESCE(NULL, '^.*$')
    AND pg_class.relname = 'my_table'
ORDER BY
    attnum ASC;
```

This is based on https://stackoverflow.com/a/58319308/3508733, but with a COALESCE statement added to allow for an unspecified schema name. This may also require a check for duplicate column names.
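The COALESCE trick can be illustrated in Python: an explicit schema is anchored with ^ and $, while an unspecified schema falls back to a match-anything pattern (schema_pattern is an illustrative helper, not etlhelper API):

```python
import re

def schema_pattern(schema=None):
    # Mirrors COALESCE(<schema regex>, '^.*$') in the PostgreSQL query:
    # an explicit schema is anchored, otherwise any schema matches.
    return f"^{schema}$" if schema else "^.*$"

# With no schema given, every schema matches.
assert re.search(schema_pattern(), "public")
assert re.search(schema_pattern(), "staging")
# With a schema given, only that exact schema matches.
assert re.search(schema_pattern("public"), "public")
assert not re.search(schema_pattern("public"), "public_archive")
```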

volcan01010 commented 2 years ago

Oracle:

```sql
SELECT column_name AS name, data_type AS type FROM ALL_TAB_COLUMNS
WHERE TABLE_NAME = 'DIC_SEN_SENSOR'
  AND REGEXP_LIKE(owner, COALESCE('^BGS$', '.*'))
ORDER BY COLUMN_ID;
```

In Oracle, owner is the schema name. Note the ^ and $ anchors on the regexp, which make it match only the exact given string.

volcan01010 commented 2 years ago

@ximenesuk - can you do an MS SQL Server one when you are back, please?

volcan01010 commented 2 years ago

column_types() may be a better name for the function.

volcan01010 commented 2 years ago

As part of this ticket, we should find a way to enforce the presence of the query, e.g. as an abstract property. Some details here: https://stackoverflow.com/questions/45248243/most-pythonic-way-to-declare-an-abstract-class-property

The `__init_subclass__` check seems like a good option. We already specify Python >= 3.6.
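A minimal sketch of the `__init_subclass__` approach (the class and attribute names here are illustrative, not etlhelper's actual API):

```python
class DbHelper:
    # Subclasses must define the metadata query; the check runs at
    # class definition time, before any instance exists.
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if not hasattr(cls, "table_info_query"):
            raise NotImplementedError(
                f"{cls.__name__} must define 'table_info_query'")

class SQLiteDbHelper(DbHelper):
    table_info_query = "SELECT name, type FROM pragma_table_info(?)"

try:
    class BrokenDbHelper(DbHelper):  # no table_info_query defined
        pass
except NotImplementedError as exc:
    print(exc)  # BrokenDbHelper must define 'table_info_query'
```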

volcan01010 commented 2 years ago

SQL server options: https://stackoverflow.com/questions/2418527/sql-server-query-to-get-the-list-of-columns-in-a-table-along-with-data-types-no

volcan01010 commented 2 years ago

After trying to use this code in a real-world project, it may be better to return table information as nested dictionaries e.g.

```python
columns = {"id": {"type": "int"}, "name": {"type": "text"}}
```

This is because we often want to access information for a specific column, so it is handy to have the column name as the key. Alternatively, the data could be converted to the correct form via a dictionary comprehension:

```python
columns = describe_columns(...)
column_types = {col.name: col.type for col in columns}
print(column_types['id'])  # 'integer'
```
volcan01010 commented 2 years ago

For reference (and possible future enhancements), below are links to how SQLAlchemy gets the column information for each database type. The methods are more comprehensive, involving Python functions, sub-functions and sometimes multiple queries of the database. These may be relevant if we decide that we want to include more granular information in our queries.

volcan01010 commented 2 years ago

Also interesting is that ETL Helper only needs to add MySQL in order to support the same range of databases as SQLAlchemy does. https://github.com/sqlalchemy/sqlalchemy/tree/main/lib/sqlalchemy/dialects

volcan01010 commented 2 years ago

> As part of this ticket, we should find a way to enforce the presence of the query, e.g. as an abstract property. Some details here: https://stackoverflow.com/questions/45248243/most-pythonic-way-to-declare-an-abstract-class-property
>
> The `__init_subclass__` check seems like a good option. We already specify Python >= 3.6.

I had a look at this, but `__init_subclass__` works at the class level, before an instance has been created. The DbHelper classes only populate many of their attributes on instantiation. This is because we need to be able to import the driver module within a try/except block.
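Given that constraint, one alternative would be a check at instantiation time, in the base class __init__. A minimal sketch with illustrative names (not etlhelper's actual classes, and not the approach settled on below):

```python
class DbHelper:
    def __init__(self):
        # Attributes such as the driver module and queries are only set
        # during instantiation (inside a try/except around the driver
        # import), so the earliest reliable check is here.
        if not hasattr(self, "table_info_query"):
            raise NotImplementedError(
                f"{type(self).__name__} does not define 'table_info_query'")

class SQLiteDbHelper(DbHelper):
    def __init__(self):
        self.table_info_query = "SELECT name, type FROM pragma_table_info(?)"
        super().__init__()

class BrokenDbHelper(DbHelper):
    pass

SQLiteDbHelper()  # fine
try:
    BrokenDbHelper()
except NotImplementedError as exc:
    print(exc)  # BrokenDbHelper does not define 'table_info_query'
```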

When new DbHelpers are added, an integration test should be included for each to match the existing types. This will catch the case where the query has not been defined.

volcan01010 commented 2 years ago

Note: @ximenesuk and I decided that the function should be called table_info to be in line with the PRAGMA command used for SQLite.