chrthomsen / pygrametl

Official repository for pygrametl - ETL programming in Python
http://pygrametl.org
BSD 2-Clause "Simplified" License
289 stars 41 forks source link

Lookupatts cannot be NoneType even if defaultid is set #71

Open fromm1990 opened 1 month ago

fromm1990 commented 1 month ago

Setting the defaultidvalue does not prevent insert when lookupatt is None

PostgreSQL throws the following error:

Exception has occurred: NotNullViolation null value in column "country_name" of relation "country_dim" violates not-null constraint

Expected behaviour: defaultidvalue is used when lookupatt(s) is None.

CREATE TABLE IF NOT EXISTS dim.country_dim(
    country_id               SMALLINT     PRIMARY KEY,
    country_name             VARCHAR(56)  UNIQUE NOT NULL,
    alpha2                   CHAR(2)      UNIQUE NOT NULL,
    alpha3                   CHAR(3)      UNIQUE NOT NULL
);

INSERT INTO dim.country_dim(country_id, country_name, alpha2, alpha3)
    VALUES (-1, 'unknown', 'XX', 'XXX');
from pygrametl import ConnectionWrapper
from pygrametl.tables import CachedDimension

cw = ConnectionWrapper(...)

dim = CachedDimension(
        name="dim.country_dim",
        key="country_id",
        attributes=[
            "country_name",
            "alpha2",
            "alpha3"
        ],
        lookupatts=["country_name"],
        targetconnection=cw,
        defaultidvalue=-1,
)
rows = [
    {"country_name": "Australia", "alpha2": "AU", "alpha3": "AUS"},
    {"country_name": "Brazil", "alpha2": "BR", "alpha3": "BRA"},
    {"country_name": "Canada", "alpha2": "CA", "alpha3": "CAN"},
    {"country_name": "Denmark", "alpha2": "DK", "alpha3": "DNK"},
    {"country_name": None, "alpha2": None, "alpha3": None},
]

for row in rows:
    dim.ensure(row)
fromm1990 commented 1 month ago

The error persist even with the lookupatt being set to "unknown".

rows = [
    {"country_name": "Australia", "alpha2": "AU", "alpha3": "AUS"},
    {"country_name": "Brazil", "alpha2": "BR", "alpha3": "BRA"},
    {"country_name": "Canada", "alpha2": "CA", "alpha3": "CAN"},
    {"country_name": "Denmark", "alpha2": "DK", "alpha3": "DNK"},
    {"country_name": "unknown", "alpha2": None, "alpha3": None},
]

Only by removing the defaultidvalue parameter does the error disappear.

dim = CachedDimension(
        name="dim.country_dim",
        key="country_id",
        attributes=[
            "country_name",
            "alpha2",
            "alpha3"
        ],
        lookupatts=["country_name"],
        targetconnection=cw
)

Am I using the defaultidvalue parameter incorrectly?

chrthomsen commented 1 month ago

Hi,

ensure tries to lookup the given row and if that does not succeed, ensure will insert the row. It does not make special assumptions about what to do if (some of) the lookupatts contain None. So for the first example, it tries to insert the given row (which happens to contain None for country_name), but this gets rejected by PostgreSQL (due to a NOT NULL violation). I'm not sure if it would make sense to have special behavior in pygrametl for None values in lookupatts. Should the special behavior only apply if they are all None or should it also apply when some of them are?

In the second example, ensure does lookup a row with "country_name"=="unknown" and that will find your preloaded member. But due to tables.py#L332, that value will not be returned when defaultidvalue has the same value. Instead, pygrametl will attempt to insert the row and thus there is another NOT NULL violation (due to alpha2/alpha3).

To solve your current problem, I guess you could make your code only call ensure if there are non-None values for all lookupatts (and when they contain None, you can make your fact row point to the preloaded (-1, "unknown", ...) dimension member.

And if dim.country_dim is preloaded, you can use lookup instead of ensure.

Best regards, Christian

fromm1990 commented 1 month ago

@chrthomsen thanks for the quick response.

My current "workaround" is simply to substitute None values in lookupatt (country_name) with the "unknown" string. However this does only works if i remove the ´defaultidvalue´ parameter from the CachedDimension constructor. Otherwise I receive the NOT NULL violation on alpha2 and alpha3 (which is a bit surprising to me).

To sum it up, it seems that the ´defaultidvalue´ changes the insert behavior of the ensure method which I did not expect. This may be intended behavior, but surprising to me non the less : )

It's not at all a dealbreaker, just something i found interesting.