capitalone / giraffez

User-friendly Teradata client for Python
https://capitalone.github.io/giraffez
Apache License 2.0
108 stars 31 forks source link

Duplicate column titles/names will only return one of the columns #46

Open theianrobertson opened 6 years ago

theianrobertson commented 6 years ago

This is a corner case and I'm open to a "won't fix", but came across this and didn't expect the behaviour today: two examples where selecting columns that render out as having the same name will silently squash one of them in output:

Example 1 - Titles:

import giraffez

table_name = 'schema.tester'

drop_sql = f"DROP TABLE {table_name}"
create_sql = f"""CREATE MULTISET TABLE {table_name} (
    field_1 DECIMAL(18,0) TITLE 'Number',
    field_2 DECIMAL(12,0) TITLE 'Number')"""
insert_sql = f"INSERT INTO {table_name} SELECT 1 AS field_1, 2 AS field_2"
read_sql = f"SELECT field_1, field_2 FROM {table_name}"

with giraffez.Cmd() as cmd:
    if cmd.exists(table_name):
        cmd.execute(drop_sql)
    cmd.execute(create_sql)
    cmd.execute(insert_sql)
    res = list(cmd.execute(read_sql).to_dict())
    print(res)
    cmd.execute(drop_sql)

This code results in [{'number': 2.0}], because Column.name uses self.title over self.original_name (here). Note, if you BulkExport out the data, the export.to_str() still works fine, but the to_dict() will overwrite (because it's a dict)

Example 2 - duplicate column names in multiple tables:

create_a = "CREATE MULTISET VOLATILE TABLE tbl_a AS (SELECT 1 AS fld_a) WITH DATA ON COMMIT PRESERVE ROWS"
create_b = "CREATE MULTISET VOLATILE TABLE tbl_b AS (SELECT 1 AS fld_a) WITH DATA ON COMMIT PRESERVE ROWS"
read_sql = "SELECT tbl_a.fld_a, tbl_b.fld_a FROM tbl_a INNER JOIN tbl_b ON tbl_a.fld_a = tbl_b.fld_a"

with giraffez.Cmd() as cmd:
    cmd.execute(create_a)
    cmd.execute(create_b)
    res = cmd.execute(read_sql)
    for row in res:
        print(row)

This code results in {'fld_a': 1}.

If you select the same field name in one query/table (e.g. SELECT 1 AS field_1, 2 AS field_1), Teradata will error out with 3863: Duplicate definition of field.

Wondering if giraffez should exception out when there are duplicates in the list of column names? Or maybe just when calling to_dict() or to_list() on Cmd or Export, which would produce unexpected results? (especially if you're exporting, the fields will still be in export.columns.names but would have fewer records in to_dict() items). Both of these could be fixed by the user providing an alias, but as a user I'd like giraffez to be "noisier" if there was going to be an issue.

ChrisRx commented 6 years ago

For example 2, I believe it could either error with duplicate column definition (probably the safest, easiest and most consistent with Teradata behavior way of solving it) or it could automatically rename duplicate field names for all to_dict() results should they overlap. While the latter might be a nice-to-have, I think it may introduce more problems than it solves, where just implementing what Teradata already does in the form of a duplicate definition of field error seems to be what I would favor.

Example 1 is a bit harder to determine the best course of action. Title has always been preferred to imitate how Teradata's various tools will prefer using the explicitly set name of a column rather than the original name. I wasn't aware one could define a table with the same title for multiple columns, so that definitely complicates things. Is this a case you run into a lot (or ever) where the title is necessary to be the same? And if so, could you provide any context on why one might need this? What was the behavior you would assume would take place in this case, and what would be the behavior you think you would want?

theianrobertson commented 6 years ago

I think raising an exception would work in both cases. The silence is what makes this a bug to me. I can't think of a case where I couldn't just add an alias in the SQL to solve for the exception and re-execute, but without an exception there could be unintended downstream impacts.

(to your point, example 1 is not something that I'd come across a lot - but it's where I first came across it when running a select a, b from table where a and b both had the same title and I didn't have control over the table definition.)

ChrisRx commented 6 years ago

I think for the example 2 that might be pretty easy to implement (pull requests welcome!). With example 1, is the expectation that if there is a pre-existing table that has been defined in this manner that every time it is queried it will raise the exception should the aliases not be set for duplicates?