Open collado-mike opened 2 years ago
Wasn't the running
state introduced for stream processing? In such cases, OL sends start
event and then there is a need for event type sent multiple times while the job is running.
I agree with @pawel-big-lebowski, Marquez should have some awareness of the current state of a job run (if only for historical run analysis of state transition times). Also, job versions are created based on a COMPLETE
event an important detail, which, as you mention @collado-mike, can be determined by the started_at
and ended_at
columns in the runs
table. The RUNNING
state was introduced in Marquez for streaming jobs (but also services), which is not yet officially supported by OpenLineage, but removing it seems premature (at least to me). I think it'd be helpful to have the OpenLineage spec define how out of order events should be handled? And are you suggesting @collado-mike that we drop the run_states
table? I don't think Marquez should manage the run states for jobs (ie state machine like for "allowed" transitions), but rather accept what the current job state is and handle it according. So, with that, some follow up questions:
run_state
table with the current run state for a given job?Marquez should have some awareness of the current state of a job run (if only for historical run analysis of state transition times).
current state awareness and historical analysis are different things. It's specifically the column that tracks current state that I feel is a problem, and specifically because we can't always rely on state transition events to come in the correct order. Because we change the current status column based on the last received event, we are frequently seeing jobs marked as RUNNING
even though we received a COMPLETE
or FAILED
event. See the code at https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/service/RunService.java#L83-L84 , which blindly updates the state of the run even if the prior state was terminal.
To track state changes for historical analysis, we don't need a current state. We only need state transition times, which I'm not suggesting we eliminate. The run_states
table is used to differentiate between completed and failed jobs (the Run's endRunState
column tells us which state the run terminated with), so I'm not proposing we eliminate that. The currentRunState
column, however, doesn't tell us anything we can't derive from the other columns.
It's specifically the column that tracks current state that I feel is a problem, and specifically because we can't always rely on state transition events to come in the correct order.
Ahh, well that clears things up; thanks for linking to the relevant code, I had a completely different understanding of the change being proposed. Yeah, the RunRow.currentRunState
column was introduced to avoid joining on the runs
table and quickly respond back with the current run metadata (similar to Dataset.currentVersionUuid
and JobRow.currentVersionUuid
). So, I agree the run state can be derived (and with better accuracy) using other columns (i.e. we do keep a reference to RunRow. startRunStateUuid
and RunRow.endRunStateUuid
). Given the blind update logic of run states, the change makes sense to me. Would we drop RunRow.currentRunState
?
Yeah, I think removing the column and updating the existing queries to derive the current state from the start/endRunStateUuid columns is all we'd need to do.
Currently, the Run status column is entirely dependent on the order in which OpenLineage events are processed. OpenLineage events may come out of order, a start and complete event may be processed at the same time, or a start event could be retried after the complete event has been processed. All these have the potential to update the run status to the wrong value (it may be complete, but marked as running).
Rather than updating the current run status based on the last OpenLineage event processed, we can largely rely on the
started_at
andended_at
columns to know if a run is currently running or if it has completed. We do need to know if a run has completed successfully or failed. I think we can update the logic to only set the status field when a run has ended, meaning we'll never mark the task as running after it has completed.