chrthomsen / pygrametl

Official repository for pygrametl - ETL programming in Python
http://pygrametl.org
BSD 2-Clause "Simplified" License
289 stars 41 forks source link

allowsideeffectsonrows option added #43

Closed chrthomsen closed 2 years ago

chrthomsen commented 2 years ago

SlowlyChangingDimension now has an optional argument deciding if scdensure has side-effects on its argument

skejserjensen commented 2 years ago

The implementation for SlowlyChangingDimension looks good to me. I updated the docstring for SlowlyChangingDimension.__init__() so the order of the parameters in the docstring matches the actual parameters. Also, the addition of allowsideeffectsonrows had not been added to CHANGELOG.rst. However, should allowsideeffectsonrows not be supported by all classes that modifies the rows? As far as I am aware that is TypeOneSlowlyChangingDimension, SlowlyChangingDimension, and SnowflakedDimension.

chrthomsen commented 2 years ago

You're right that it is surprising to only do this for a single method in one class. And on second thoughts, the change proposed in this PR is probably not useful. Generally, pygrametl has side-effects or argument rows from scdensure and insert (and thus also ensure). In particular, the key is added if it is not already present (and scdensure adds version info as well.) This is done to make the row hold all values needed to insert the row into the database.

This PR then attempted to introduce the option allowsideeffectsonrows to make copies of rows before they are modified. This should then be done both in insert and scdensure. However, that is probably not not going to work as we sometimes (in SnowflakedDimension) need to have access to the side-effects made by the different participating Dimension instances.

So in the cases where side-effects are not tolerable, the simplest seems to be that the user makes a copy of the row before passing it to insert or (scd)ensure: keyval = dim.insert(row.copy())