DependencyTrack / hyades

Incubating project for decoupling responsibilities from Dependency-Track's monolithic API server into separate, scalable services.
https://dependencytrack.github.io/hyades/latest
Apache License 2.0
59 stars 18 forks source link

Convert `DIRECT_DEPENDENCIES` columns to `JSONB` #1416

Open nscuro opened 1 month ago

nscuro commented 1 month ago

Current Behavior

The DIRECT_DEPENDENCIES column in the PROJECT and COLUMN tables are currently of type TEXT. For the COMPONENT table, there is a GIN_TRGM_OPS index on the column:

https://github.com/DependencyTrack/hyades/blob/96af03d19b004a2c41763dabbb7d082180c99e3c/commons-persistence/src/main/resources/schema.sql#L809-L810

In order to perform dependency graph traversal, queries are using LIKE '%...%' conditions such as:

"COMPONENT"."DIRECT_DEPENDENCIES" LIKE '%48d6c78d-5099-48c8-9bab-2038a5f177e4%'

While this works, there are more efficient ways to achieve the desired outcome in PostgreSQL.

Proposed Behavior

Convert the DIRECT_DEPENDENCIES column in the PROJECT and COMPONENT tables to JSONB.

Drop the existing COMPONENT_DIRECT_DEPENDENCIES_GIN_IDX index, and create a new one using JSONB_PATH_OPS, for example:

CREATE INDEX "COMPONENT_DIRECT_DEPENDENCIES_JSONB_IDX" ON "COMPONENT"
USING GIN("DIRECT_DEPENDENCIES" JSONB_PATH_OPS);

It will then be possible to leverage that index using the JSONB contains operator:

SELECT "NAME"
  FROM "COMPONENT"
 WHERE "DIRECT_DEPENDENCIES" @> '[{"uuid": "92efe3bd-808d-413a-993e-689c1f35713a"}]'::jsonb;

[!WARNING] To make DataNucleus work with JSONB columns, insert-function and update-function extensions must be configured: https://github.com/DependencyTrack/hyades/issues/1074#issuecomment-2198219236

[!WARNING] There might be multiple queries across the code base that perform LIKE queries on the DIRECT_DEPENDENCIES column. Some may use JDOQL (i.e. directDependencies.contains("...")). In particular the latter is problematic, since JDOQL doesn't support the @> JSON operator. Track those queries down, and investigate whether they can be refactored accordingly.

Demo

[!NOTE] The test database contains 1000 projects and TODO components. Generated testdata BOMs were used to populate the database.

Query plan using existing TEXT column:

+--------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
|Bitmap Heap Scan on "COMPONENT"  (cost=621.77..665.18 rows=11 width=14) (actual time=7.739..17.003 rows=3 loops=1)                                |
|  Recheck Cond: ("DIRECT_DEPENDENCIES" ~~ '%ed521419-f3ac-4331-ac93-7219b5420133%'::text)                                                         |
|  Rows Removed by Index Recheck: 14                                                                                                               |
|  Heap Blocks: exact=17                                                                                                                           |
|  ->  Bitmap Index Scan on "COMPONENT_DIRECT_DEPENDENCIES_GIN_IDX"  (cost=0.00..621.77 rows=11 width=0) (actual time=7.551..7.551 rows=17 loops=1)|
|        Index Cond: ("DIRECT_DEPENDENCIES" ~~ '%ed521419-f3ac-4331-ac93-7219b5420133%'::text)                                                     |
|Planning Time: 0.427 ms                                                                                                                           |
|Execution Time: 17.213 ms                                                                                                                         |
+--------------------------------------------------------------------------------------------------------------------------------------------------+

Size of index COMPONENT_DIRECT_DEPENDENCIES_GIN_IDX:

SELECT pg_size_pretty(pg_relation_size(indexrelid)) "Index Size"
  FROM pg_stat_all_indexes i
 INNER JOIN pg_class c
    ON i.relid=c.oid
 WHERE i.relname='COMPONENT'
   AND indexrelname = 'COMPONENT_DIRECT_DEPENDENCIES_GIN_IDX';

106 MB

Query plan using JSONB column:

+----------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
|Bitmap Heap Scan on "COMPONENT"  (cost=41.29..10042.07 rows=3823 width=14) (actual time=0.130..0.135 rows=3 loops=1)                                |
|  Recheck Cond: ("DIRECT_DEPENDENCIES" @> '[{"uuid": "ed521419-f3ac-4331-ac93-7219b5420133"}]'::jsonb)                                              |
|  Heap Blocks: exact=3                                                                                                                              |
|  ->  Bitmap Index Scan on "COMPONENT_DIRECT_DEPENDENCIES_JSONB_IDX"  (cost=0.00..40.33 rows=3823 width=0) (actual time=0.010..0.010 rows=3 loops=1)|
|        Index Cond: ("DIRECT_DEPENDENCIES" @> '[{"uuid": "ed521419-f3ac-4331-ac93-7219b5420133"}]'::jsonb)                                          |
|Planning Time: 0.100 ms                                                                                                                             |
|Execution Time: 0.163 ms                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------+

Size of index COMPONENT_DIRECT_DEPENDENCIES_JSONB_IDX:

SELECT pg_size_pretty(pg_relation_size(indexrelid)) "Index Size"
  FROM pg_stat_all_indexes i
 INNER JOIN pg_class c
    ON i.relid=c.oid
 WHERE i.relname='COMPONENT'
   AND indexrelname = 'COMPONENT_DIRECT_DEPENDENCIES_JSONB_IDX';

21 MB

Checklist

nscuro commented 1 month ago

For JDOQL queries, we could create a DataNucleus extension to support the @> operator: https://www.datanucleus.org/products/accessplatform_6_0/extensions/extensions.html#rdbms_sql_method