MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0
1.77k stars 315 forks source link

Eliminate RUNNING state in run status column #2264

Open collado-mike opened 1 year ago

collado-mike commented 1 year ago

Currently, the Run status column is entirely dependent on the order in which OpenLineage events are processed. OpenLineage events may come out of order, a start and complete event may be processed at the same time, or a start event could be retried after the complete event has been processed. All these have the potential to update the run status to the wrong value (it may be complete, but marked as running).

Rather than updating the current run status based on the last OpenLineage event processed, we can largely rely on the started_at and ended_at columns to know if a run is currently running or if it has completed. We do need to know if a run has completed successfully or failed. I think we can update the logic to only set the status field when a run has ended, meaning we'll never mark the task as running after it has completed.

pawel-big-lebowski commented 1 year ago

Wasn't the running state introduced for stream processing? In such cases, OL sends start event and then there is a need for event type sent multiple times while the job is running.

wslulciuc commented 1 year ago

I agree with @pawel-big-lebowski, Marquez should have some awareness of the current state of a job run (if only for historical run analysis of state transition times). Also, job versions are created based on a COMPLETE event an important detail, which, as you mention @collado-mike, can be determined by the started_at and ended_at columns in the runs table. The RUNNING state was introduced in Marquez for streaming jobs (but also services), which is not yet officially supported by OpenLineage, but removing it seems premature (at least to me). I think it'd be helpful to have the OpenLineage spec define how out of order events should be handled? And are you suggesting @collado-mike that we drop the run_states table? I don't think Marquez should manage the run states for jobs (ie state machine like for "allowed" transitions), but rather accept what the current job state is and handle it according. So, with that, some follow up questions:

collado-mike commented 1 year ago

Marquez should have some awareness of the current state of a job run (if only for historical run analysis of state transition times).

current state awareness and historical analysis are different things. It's specifically the column that tracks current state that I feel is a problem, and specifically because we can't always rely on state transition events to come in the correct order. Because we change the current status column based on the last received event, we are frequently seeing jobs marked as RUNNING even though we received a COMPLETE or FAILED event. See the code at https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/service/RunService.java#L83-L84 , which blindly updates the state of the run even if the prior state was terminal.

To track state changes for historical analysis, we don't need a current state. We only need state transition times, which I'm not suggesting we eliminate. The run_states table is used to differentiate between completed and failed jobs (the Run's endRunState column tells us which state the run terminated with), so I'm not proposing we eliminate that. The currentRunState column, however, doesn't tell us anything we can't derive from the other columns.

wslulciuc commented 1 year ago

It's specifically the column that tracks current state that I feel is a problem, and specifically because we can't always rely on state transition events to come in the correct order.

Ahh, well that clears things up; thanks for linking to the relevant code, I had a completely different understanding of the change being proposed. Yeah, the RunRow.currentRunState column was introduced to avoid joining on the runs table and quickly respond back with the current run metadata (similar to Dataset.currentVersionUuid and JobRow.currentVersionUuid). So, I agree the run state can be derived (and with better accuracy) using other columns (i.e. we do keep a reference to RunRow. startRunStateUuid and RunRow.endRunStateUuid). Given the blind update logic of run states, the change makes sense to me. Would we drop RunRow.currentRunState?

collado-mike commented 1 year ago

Yeah, I think removing the column and updating the existing queries to derive the current state from the start/endRunStateUuid columns is all we'd need to do.