MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0
1.78k stars 320 forks source link

Add `CTE` `latest_runs` to `JOIN` `ON` `jobs.uuid` and apply formatting on sql #2928

Closed wslulciuc closed 1 month ago

wslulciuc commented 1 month ago

This PR adds the CTE following on JobDao.findAll():

latest_runs AS (
  SELECT r.*
  FROM runs AS r
    INNER JOIN (
      SELECT job_uuid, MAX(transitioned_at) AS latest_transitioned_at
      FROM AS runs
      GROUP BY job_uuid
    ) AS latest_run
    ON r.job_uuid = latest_run.job_uuid AND r.transitioned_at = latest_run.latest_transitioned_at
)

The CTE latest_runs ensures we can list job runs that are in a RUNNING state as wells as in a done state (COMPLETED/FAILED). A job version is created, currently, only when a run is in a done state. Therefore, the following JOIN will not return results for runs with state RUNNING, only the done state as they will have job versions associated with the run completion:

 LEFT JOIN runs r
          ON r.uuid = jv.latest_run_uuid

To resolve this issue, and handle runs in a RUNNING state, we replace the JOIN above with :

 LEFT JOIN latest_runs r
        ON r.job_uuid = j.uuid

That is, JOINing the latest runs with the job associated with that given run.

SQL Perf

I'll start by saying, this SQL needs major simplifications, but that's for another PR.

Perf Analysis

EXPLAIN plan

Limit  (cost=77.16..77.17 rows=1 width=726)
  CTE jobs_view_page
    ->  Hash Right Join  (cost=1.29..2.58 rows=5 width=473)
          Hash Cond: (p.uuid = j_2.parent_job_uuid)
          ->  Seq Scan on jobs p  (cost=0.00..1.18 rows=18 width=31)
          ->  Hash  (cost=1.23..1.23 rows=5 width=293)
                ->  Seq Scan on jobs j_2  (cost=0.00..1.23 rows=5 width=293)
                      Filter: ((is_hidden IS FALSE) AND (symlink_target_uuid IS NULL) AND ((namespace_name)::text = 'namespace1538412707'::text))
  CTE job_versions_temp
    ->  Seq Scan on job_versions j_3  (cost=0.00..2.77 rows=49 width=659)
          Filter: ((namespace_name)::text = 'namespace1538412707'::text)
  ->  Sort  (cost=71.81..71.81 rows=1 width=726)
        Sort Key: j.updated_at DESC
        ->  Nested Loop Left Join  (cost=66.08..71.80 rows=1 width=726)
              Join Filter: (j.uuid = j_1.uuid)
              ->  Nested Loop Left Join  (cost=12.40..17.94 rows=1 width=694)
                    Join Filter: (job_facets.run_uuid = jv.latest_run_uuid)
                    ->  Nested Loop  (cost=4.74..10.24 rows=1 width=678)
                          Join Filter: (j.uuid = r.job_uuid)
                          ->  Hash Join  (cost=4.57..8.80 rows=1 width=32)
                                Hash Cond: ((r.job_uuid = runs.job_uuid) AND (r.transitioned_at = (max(runs.transitioned_at))))
                                ->  Seq Scan on runs r  (cost=0.00..3.94 rows=54 width=24)
                                      Filter: ((current_run_state)::text = ANY ('{RUNNING,COMPLETED,FAILED,ABORTED}'::text[]))
                                ->  Hash  (cost=4.30..4.30 rows=18 width=24)
                                      ->  HashAggregate  (cost=3.95..4.12 rows=18 width=24)
                                            Group Key: runs.job_uuid
                                            ->  Seq Scan on runs  (cost=0.00..3.63 rows=63 width=24)
                          ->  Hash Right Join  (cost=0.16..1.38 rows=5 width=678)
                                Hash Cond: (jv.uuid = j.current_version_uuid)
                                ->  CTE Scan on job_versions_temp jv  (cost=0.00..0.98 rows=49 width=32)
                                ->  Hash  (cost=0.10..0.10 rows=5 width=662)
                                      ->  CTE Scan on jobs_view_page j  (cost=0.00..0.10 rows=5 width=662)
                    ->  HashAggregate  (cost=7.66..7.68 rows=1 width=48)
                          Group Key: job_facets.run_uuid
                          ->  Sort  (cost=7.52..7.54 rows=8 width=425)
                                Sort Key: job_facets.lineage_event_time
                                ->  Nested Loop  (cost=0.31..7.40 rows=8 width=425)
                                      ->  Hash Join  (cost=0.16..1.38 rows=5 width=16)
                                            Hash Cond: (jv2.uuid = j2.current_version_uuid)
                                            ->  CTE Scan on job_versions_temp jv2  (cost=0.00..0.98 rows=49 width=32)
                                            ->  Hash  (cost=0.10..0.10 rows=5 width=16)
                                                  ->  CTE Scan on jobs_view_page j2  (cost=0.00..0.10 rows=5 width=16)
                                      ->  Index Scan using job_facets_run_uuid_index on job_facets  (cost=0.14..1.15 rows=6 width=425)
                                            Index Cond: (run_uuid = jv2.latest_run_uuid)
              ->  HashAggregate  (cost=53.68..53.74 rows=5 width=48)
                    Group Key: j_1.uuid
                    ->  Hash Join  (cost=25.91..52.01 rows=333 width=48)
                          Hash Cond: (jtm.tag_uuid = t.uuid)
                          ->  Hash Join  (cost=1.29..26.50 rows=333 width=32)
                                Hash Cond: (jtm.job_uuid = j_1.uuid)
                                ->  Seq Scan on jobs_tag_mapping jtm  (cost=0.00..22.00 rows=1200 width=32)
                                ->  Hash  (cost=1.23..1.23 rows=5 width=16)
                                      ->  Seq Scan on jobs j_1  (cost=0.00..1.23 rows=5 width=16)
                                            Filter: ((namespace_name)::text = 'namespace1538412707'::text)
                          ->  Hash  (cost=16.50..16.50 rows=650 width=48)
                                ->  Seq Scan on tags t  (cost=0.00..16.50 rows=650 width=48)
netlify[bot] commented 1 month ago

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
Latest commit b482060c77a934fce750bb91c0f2d38b1d0079be
Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/670fbef8c5ca4b0008ae3672
codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 81.12%. Comparing base (05d16aa) to head (b482060).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2928 +/- ## ========================================= Coverage 81.12% 81.12% Complexity 1505 1505 ========================================= Files 268 268 Lines 7358 7358 Branches 330 330 ========================================= Hits 5969 5969 Misses 1228 1228 Partials 161 161 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.