TresAmigosSD / SMV

Spark Modularized View
Apache License 2.0

Should output module always run? #1551

Closed ninjapapa closed 5 years ago

ninjapapa commented 5 years ago

Currently, output modules are re-run the same way as other modules: if persisted data with the same hash-of-hash exists, the module will NOT rerun. However, this may not be the desired behavior, since a user may have deleted the output file/table and expect that rerunning the output module will recover it.
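To make the problem concrete, here is a minimal sketch of the behavior described above. It is not SMV code: `persisted`, `output_files`, and `run_output_module` are hypothetical stand-ins for the persisted-data lookup and the published file/table.

```python
persisted = {}     # hypothetical store of persisted results, keyed by hash-of-hash
output_files = {}  # hypothetical stand-in for the published file/table


def run_output_module(name, hash_of_hash):
    """Run an output module like a regular module: skip on a persisted hit."""
    if hash_of_hash in persisted:
        return False  # nothing is re-run, so nothing is re-published
    persisted[hash_of_hash] = "data-for-" + name
    output_files[name] = persisted[hash_of_hash]  # the publish step
    return True


ran_first = run_output_module("report", "abc123")   # runs and publishes
del output_files["report"]                          # user deletes the output
ran_second = run_output_module("report", "abc123")  # skipped: same hash-of-hash
```

After the second call the output is still missing, because the persisted hit short-circuits the publish step.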

AliTajeldin commented 5 years ago

In the old "output" module paradigm, modules were not "run" in the sense of recomputing the DF; they were re-run only to publish the output. Now that "output" modules are pure publish, we should re-run them every time. As a user, that is what I would expect, since there is no way to compare the published result with the current output to determine whether we need to re-publish.

ninjapapa commented 5 years ago

@AliTajeldin makes sense.

ninjapapa commented 5 years ago

Original idea

Put the real write operation into a "post_run" method. Later figured out that it can just go in the _post_action method, since that method is always called after the run method, even for ephemeral modules.
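A rough sketch of that original idea, under stated assumptions: the `execute` driver and the `EphemeralOutput` class are hypothetical, and only the `_post_action` name comes from this thread. The point is just that a hook invoked after `run` fires whether or not the module is ephemeral.

```python
class EphemeralOutput:
    """Hypothetical module whose write operation lives in _post_action."""

    def __init__(self):
        self.writes = 0

    def run(self):
        return "data"

    def _post_action(self, data):
        # Stand-in for the real write/publish operation.
        self.writes += 1


def execute(module, is_ephemeral):
    """Hypothetical driver: the hook is called after run in both cases."""
    data = module.run()
    if not is_ephemeral:
        pass  # a non-ephemeral module would persist `data` here
    module._post_action(data)
    return data


mod = EphemeralOutput()
execute(mod, is_ephemeral=True)
execute(mod, is_ephemeral=False)
```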

However with more thought, it is not ideal:

Will do the following

Explicit Approach

The current entry point into a module when running it is _get_data. That entry point needs to become _do_it. For regular modules, _do_it can call _get_data; for output modules, _do_it just calls doRun directly, with the write operation still in doRun.

Within the output module's _do_it, _run_ancestor_and_me_postAction will be called, since the write operation guarantees an action.
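The explicit approach could look roughly like the toy model below. This is a sketch, not SMV's implementation: the method names `_do_it`, `_get_data`, `doRun`, and `_run_ancestor_and_me_postAction` come from this thread, while the classes, the `PERSISTED` and `PUBLISHED` dicts, `hash_of_hash()`, and the counters are assumptions made for illustration. The body of `_run_ancestor_and_me_postAction` is a placeholder, since its real behavior is not described here.

```python
PERSISTED = {}  # hypothetical persisted-data store keyed by hash-of-hash
PUBLISHED = {}  # hypothetical stand-in for the output file/table


class Module:
    """Toy regular module; not SMV's actual class."""

    def __init__(self, name):
        self.name = name
        self.run_count = 0

    def hash_of_hash(self):
        # Assumption: a stable key identifying this module version.
        return "hash-" + self.name

    def doRun(self):
        self.run_count += 1
        return "data-from-" + self.name

    def _get_data(self):
        # Regular path: reuse persisted data when the hash-of-hash matches.
        key = self.hash_of_hash()
        if key not in PERSISTED:
            PERSISTED[key] = self.doRun()
        return PERSISTED[key]

    def _do_it(self):
        # New entry point; regular modules delegate to _get_data.
        return self._get_data()


class OutputModule(Module):
    """Toy output module that publishes on every run."""

    def __init__(self, name, upstream):
        super().__init__(name)
        self.upstream = upstream
        self.post_action_calls = 0

    def doRun(self):
        self.run_count += 1
        data = self.upstream._do_it()
        PUBLISHED[self.name] = data  # the write operation stays in doRun
        return data

    def _run_ancestor_and_me_postAction(self):
        # Placeholder: only records that the hook was invoked.
        self.post_action_calls += 1

    def _do_it(self):
        # Output modules bypass _get_data, so a persisted hit never skips them.
        data = self.doRun()
        self._run_ancestor_and_me_postAction()
        return data
```

With this shape, running the output module twice recomputes the upstream module only once but publishes both times, so a deleted output is restored on the next run.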