SQLite
SELECT name, type FROM pragma_table_info('my_table');
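For illustration, the pragma query can be run from Python with the standard library's sqlite3 module. This is a minimal sketch; the database filename and table name are placeholders.

import sqlite3

# Sketch: query column names and types via SQLite's pragma_table_info
# table-valued function. Assumes "example.db" contains "my_table".
with sqlite3.connect("example.db") as conn:
    cursor = conn.execute(
        "SELECT name, type FROM pragma_table_info('my_table')")
    for name, col_type in cursor:
        print(name, col_type)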
PostgreSQL
SELECT
  pg_attribute.attname AS name,
  pg_catalog.format_type(pg_attribute.atttypid, pg_attribute.atttypmod) AS type
FROM
  pg_catalog.pg_attribute
INNER JOIN
  pg_catalog.pg_class ON pg_class.oid = pg_attribute.attrelid
INNER JOIN
  pg_catalog.pg_namespace ON pg_namespace.oid = pg_class.relnamespace
WHERE
  pg_attribute.attnum > 0
  AND NOT pg_attribute.attisdropped
  AND pg_namespace.nspname ~ COALESCE(NULL, '^.*$')
  AND pg_class.relname = 'my_table'
ORDER BY
  attnum ASC;
This is based on https://stackoverflow.com/a/58319308/3508733, but with a COALESCE expression added to allow for an unspecified schema name. This may also require a check for duplicate column names.
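To show how the COALESCE trick could be used in practice, here is a sketch of binding an optional schema name as a query parameter with psycopg2. The describe_columns_pg function name and connection details are hypothetical.

import psycopg2

QUERY = """
SELECT
  pg_attribute.attname AS name,
  pg_catalog.format_type(pg_attribute.atttypid, pg_attribute.atttypmod) AS type
FROM pg_catalog.pg_attribute
INNER JOIN pg_catalog.pg_class ON pg_class.oid = pg_attribute.attrelid
INNER JOIN pg_catalog.pg_namespace ON pg_namespace.oid = pg_class.relnamespace
WHERE
  pg_attribute.attnum > 0
  AND NOT pg_attribute.attisdropped
  AND pg_namespace.nspname ~ COALESCE(%(schema)s, '^.*$')
  AND pg_class.relname = %(table)s
ORDER BY attnum ASC;
"""

def describe_columns_pg(conn, table, schema=None):
    # When schema is None, COALESCE falls through to '^.*$' and the
    # regexp matches any schema; otherwise the name is anchored with
    # ^ and $ so only an exact match passes.
    schema_regex = f"^{schema}$" if schema is not None else None
    with conn.cursor() as cursor:
        cursor.execute(QUERY, {"schema": schema_regex, "table": table})
        return cursor.fetchall()

# Example usage (placeholder connection string):
# conn = psycopg2.connect("dbname=mydb user=me")
# print(describe_columns_pg(conn, "my_table", schema="public"))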
Oracle
SELECT column_name AS name, data_type AS type
FROM ALL_TAB_COLUMNS
WHERE TABLE_NAME = 'DIC_SEN_SENSOR'
  AND REGEXP_LIKE(owner, COALESCE('^BGS$', '.*'))
ORDER BY COLUMN_ID;
In Oracle, owner is the schema name. Note the ^ and $ anchors on the regexp to match only the given string.
@ximenesuk - can you do an MS SQL Server one when you are back, please?
column_types() may be a better name for the function.
As part of this ticket, we should find a way to enforce the presence of the query, e.g. as an abstract property. Some details here: https://stackoverflow.com/questions/45248243/most-pythonic-way-to-declare-an-abstract-class-property
The __init_subclass__ check seems like a good option. We already specify Python >= 3.6.
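For reference, a minimal sketch of what the __init_subclass__ check could look like. The attribute and class names here are illustrative, not the real etlhelper API.

class DbHelper:
    # Sketch: enforce a class attribute on subclasses via
    # __init_subclass__ (available from Python 3.6).
    table_info_query = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.table_info_query is None:
            raise NotImplementedError(
                f"{cls.__name__} must define 'table_info_query'")


class SQLiteDbHelper(DbHelper):
    table_info_query = "SELECT name, type FROM pragma_table_info(?)"


# A subclass that forgets the attribute fails at class-definition time:
# class BrokenDbHelper(DbHelper):
#     pass
# -> NotImplementedError: BrokenDbHelper must define 'table_info_query'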
After trying to use this code in a real-world project, I think it may be better to return table information as nested dictionaries, e.g.
columns = {"id": {"type": "int"}, "name": {"type": "text"}}
This is because we often want to access information for a specific column, so it is handy to have the column name as the key. Alternatively, the data could be converted to the correct form via a dictionary comprehension:
columns = describe_columns(...)
column_types = {col.name: col.type for col in columns}
print(column_types['id']) # 'integer'
For reference (and possible future enhancements), below are links to how SQLAlchemy gets the column information for each database type. The methods are more comprehensive, involving Python functions, sub-functions and sometimes multiple queries of the database. These may be relevant if we decide that we want to include more granular information in our queries.
It is also interesting that ETL Helper only needs to add MySQL in order to support the same range of databases as SQLAlchemy does. https://github.com/sqlalchemy/sqlalchemy/tree/main/lib/sqlalchemy/dialects
I had a look at this, but __init_subclass__ works at the class level, before an instance has been instantiated. The DbHelper classes only populate many of their attributes on instantiation, because we need to be able to import the driver module within a try/except block.
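For context, an illustrative sketch of that pattern follows; the class and attribute names are examples rather than the exact etlhelper code.

import warnings


class OracleDbHelper:
    # Sketch: the driver is imported inside a try/except at
    # instantiation, so attributes that depend on it only exist after
    # __init__ has run. A class-level __init_subclass__ check would
    # therefore run too early to see them.
    def __init__(self):
        try:
            import cx_Oracle
            self.connect = cx_Oracle.connect
            self.sql_exceptions = (cx_Oracle.DatabaseError,)
        except ImportError:
            warnings.warn("cx_Oracle is not installed; Oracle "
                          "connections will not work")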
When new DbHelpers are added, an integration test should be included for each, matching those for the existing connection types. This will catch the case where the query has not been defined.
Note: @ximenesuk and I decided that the function should be called table_info, to be in line with the PRAGMA command used for SQLite.
Summary
As an ETLHelper user, I would like to get metadata on a table, e.g. column names and types, so that I can programmatically generate SQL queries.
Description
Hard-coded SQL queries can be cumbersome in cases where tables have many columns.
etlhelper provides load to programmatically generate insert queries, and copy_table_rows programmatically copies the rows from SELECT * FROM table; however, these only cover the simple cases. A describe_columns(table, conn) function would return metadata about a database table, with column names and types as a minimum. These data can then be used to programmatically build a query. For example, a geospatial insert query could wrap the geometry column in ST_GeomFromText(%(geom)s, 4326) to convert WKT into a geometry column.
Implementation
The SQL query used to select the metadata is different for each database type, so it will need to be defined on the DbHelper classes. The queries will use internal tables / features such as those shown above (pragma_table_info for SQLite, the pg_catalog tables for PostgreSQL and ALL_TAB_COLUMNS for Oracle).
The describe_columns(table, conn) function can live in a new module (utils). It should execute the SQL query from the DbHelper and return a list of Column named tuples with attributes for at least "name" and "type" ~~(and possibly "is_primary_key", "not_null" and "default_value", which are provided in SQLite)~~. The function should also raise an exception if the table name is ambiguous, e.g. if the same table appears in different schemas. One way to detect this would be to look for non-unique column names. This means that the function should accept both simple and schema-qualified names.
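To illustrate the intended use, here is a hedged sketch of building an insert query from the returned Column tuples. The build_insert helper is hypothetical; the geometry handling follows the ST_GeomFromText example in the description.

from collections import namedtuple

Column = namedtuple("Column", ["name", "type"])


def build_insert(table, columns, geom_column=None):
    # Build a parameterized INSERT from column metadata, wrapping the
    # geometry column in ST_GeomFromText so WKT input is converted
    # into a geometry value on the database side.
    names = ", ".join(col.name for col in columns)
    placeholders = []
    for col in columns:
        if col.name == geom_column:
            placeholders.append(f"ST_GeomFromText(%({col.name})s, 4326)")
        else:
            placeholders.append(f"%({col.name})s")
    return f"INSERT INTO {table} ({names}) VALUES ({', '.join(placeholders)})"


columns = [Column("id", "integer"), Column("geom", "geometry")]
print(build_insert("my_table", columns, geom_column="geom"))
# INSERT INTO my_table (id, geom) VALUES (%(id)s, ST_GeomFromText(%(geom)s, 4326))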
Acceptance criteria
- Calling describe_columns(table, conn) against a table returns data for each supported connection type (SQLite, PostgreSQL, Oracle, MS SQL Server)
- Bad column names raise an exception
- Schema-qualified names can be used
Additional criteria
- describe_columns is available at the top-level etlhelper import