chrthomsen / pygrametl

Official repository for pygrametl - ETL programming in Python
http://pygrametl.org
BSD 2-Clause "Simplified" License

Prevent type two updates if only the fromatt is different but other attributes are equal #44

Closed iiLaurens closed 2 years ago

iiLaurens commented 2 years ago

When a row is exactly equal but with a different fromatt (because some underlying data-generating process was run again today), the SlowlyChangingDimension will still try to insert a new row.

For example, suppose the row already in the database is:

{'rowid': 6006,
 'lookupatt': 'da17b169-9bce-4205-9a73-b8c645e215e6',
 'fromatt': datetime.datetime(2022, 5, 2, 22, 0, tzinfo=datetime.timezone.utc),
 'toatt': None,
 'value': 'Foo'
}

And a new row is evaluated against it:

{'lookupatt': 'da17b169-9bce-4205-9a73-b8c645e215e6',
 'fromatt': datetime.datetime(2022, 5, 9, 22, 0, tzinfo=datetime.timezone.utc),
 'toatt': None,
 'value': 'Foo',
}

In this case, the existing row is closed and a new row is inserted, even though nothing changed in the actual data (the value column). In my opinion, no update should be issued at all in this case. Do you think an update like this could be caught and intercepted?
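To make the requested check concrete, here is a minimal sketch in plain Python. It is not pygrametl's implementation; the function name `data_changed` and the set of ignored attributes are assumptions made only for illustration, using the two rows from above.

```python
import datetime

# Attributes that describe the version itself rather than the data
# (an assumed set, based on the example rows in this issue).
VERSIONING_ATTS = ('rowid', 'fromatt', 'toatt')


def data_changed(existing, new, ignore=VERSIONING_ATTS):
    """Return True if any attribute outside `ignore` differs between the rows."""
    atts = (set(existing) | set(new)) - set(ignore)
    return any(existing.get(att) != new.get(att) for att in atts)


utc = datetime.timezone.utc
existing = {'rowid': 6006,
            'lookupatt': 'da17b169-9bce-4205-9a73-b8c645e215e6',
            'fromatt': datetime.datetime(2022, 5, 2, 22, 0, tzinfo=utc),
            'toatt': None,
            'value': 'Foo'}
new = {'lookupatt': 'da17b169-9bce-4205-9a73-b8c645e215e6',
       'fromatt': datetime.datetime(2022, 5, 9, 22, 0, tzinfo=utc),
       'toatt': None,
       'value': 'Foo'}

# Only fromatt differs, so no new version would be needed.
print(data_changed(existing, new))  # False
```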

chrthomsen commented 2 years ago

I can see that in the described case it would make sense not to compare the from dates. In other scenarios, such as the example with applying tests to web pages downloaded on a given date, it does make sense to compare the dates.

Wouldn't it be possible to get the behavior you need by setting the srcdateatt to None such that the dates are not compared? You would then also need to give a fromfinder, but in your case that could just be a lambda expression that returns the fromatt from the row (and that would only be called when a new version is to be added).
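A rough sketch of that suggestion follows. The attribute names are taken from the rows in this issue; the table name `dim`, the helper names `fromatt_from_row` and `build_dimension`, and the exact set of constructor arguments are assumptions, so check them against the documentation of the pygrametl version in use. The fromfinder is written as a named function rather than a lambda only so that it can carry a docstring.

```python
import datetime


def fromatt_from_row(targetconnection, row, namemapping):
    """fromfinder: take the start date of a new version from the row itself.

    Assumes the fromfinder is called with (targetconnection, row,
    namemapping), where namemapping maps dimension attribute names to the
    names used in the row.
    """
    return row[namemapping.get('fromatt') or 'fromatt']


def build_dimension(connection_wrapper):
    """Create the dimension with date comparison disabled (hypothetical setup)."""
    # Imported here so the fromfinder above can be tried without pygrametl.
    from pygrametl.tables import SlowlyChangingDimension

    return SlowlyChangingDimension(
        name='dim',                   # hypothetical table name
        key='rowid',
        attributes=['lookupatt', 'fromatt', 'toatt', 'value'],
        lookupatts=['lookupatt'],
        fromatt='fromatt',
        toatt='toatt',
        srcdateatt=None,              # do not compare dates from the source
        fromfinder=fromatt_from_row,  # only used when a new version is added
        targetconnection=connection_wrapper,
    )


row = {'lookupatt': 'da17b169-9bce-4205-9a73-b8c645e215e6',
       'fromatt': datetime.datetime(2022, 5, 9, 22, 0,
                                    tzinfo=datetime.timezone.utc),
       'toatt': None,
       'value': 'Foo'}
print(fromatt_from_row(None, row, {}))
```

With a dimension built this way, the row would be passed to `scdensure` as usual, and the fromatt in the row should only matter when a new version actually has to be added.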

Does that solve your issue - or did I misunderstand something?