DependencyTrack / dependency-track

Dependency-Track is an intelligent Component Analysis platform that allows organizations to identify and reduce risk in the software supply chain.
https://dependencytrack.org/
Apache License 2.0

Decouple metrics update into standalone service #1210

Open stevespringett opened 2 years ago

stevespringett commented 2 years ago

Updating metrics takes a considerable amount of time and takes resources away from the DT server while it is processing and analyzing SBOMs.

In order to improve performance of the DT ApiServer and the metric updates themselves, a separate service should be created to perform this task.

The service may or may not use the same stack. Non-Java stack options being considered include Erlang, Python, and Go. Go may have the least overhead, resulting in a smaller container and faster execution.

This service will run independently in a separate container. There will be no need for the DT ApiServer and the MetricsService to communicate with each other.

The MetricsUpdate service will update portfolio, project, component, and vulnerability metrics on a periodic basis, similar to how it is done now.

Additionally, whenever the DT ApiServer consumes or processes a BOM, it will write to a new field indicating that the project is in need of a metric update. The MetricsUpdate service will periodically (every 5 minutes?) query the projects table for projects in need of updating and perform the necessary updates.
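A minimal sketch of what that polling loop could look like, assuming plain JDBC and a hypothetical METRICS_DIRTY column on the PROJECT table (neither the column nor the service exists today; DT_JDBC_URL, updateProjectMetrics, and clearDirtyFlag are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical standalone MetricsUpdate service: every 5 minutes it polls the
// PROJECT table for rows flagged as needing a metrics refresh and updates them.
public class MetricsUpdateService {

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(MetricsUpdateService::pollAndUpdate, 0, 5, TimeUnit.MINUTES);
    }

    static void pollAndUpdate() {
        // METRICS_DIRTY is the assumed "project needs a metric update" field.
        String sql = "SELECT ID FROM PROJECT WHERE METRICS_DIRTY = TRUE";
        try (Connection conn = DriverManager.getConnection(System.getenv("DT_JDBC_URL"));
             PreparedStatement stmt = conn.prepareStatement(sql);
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                long projectId = rs.getLong("ID");
                updateProjectMetrics(conn, projectId); // recompute and persist metrics
                clearDirtyFlag(conn, projectId);       // mark the project as up to date
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    static void updateProjectMetrics(Connection conn, long projectId) { /* placeholder */ }

    static void clearDirtyFlag(Connection conn, long projectId) { /* placeholder */ }
}
```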

nscuro commented 2 years ago

Some thoughts:

A first MVP could be built that implements the "one shot" mode:

Eventually there could be multiple MVP implementations with different tech stacks before a final decision is made.

hostalp commented 2 years ago

Also note that even the current implementation could likely be improved. Currently it seems to struggle here when filling up some collection object:

at java.util.AbstractCollection.addAll(java.base@11.0.12/Unknown Source)
at java.util.HashSet.<init>(java.base@11.0.12/Unknown Source)
at org.datanucleus.ExecutionContextImpl.preCommit(ExecutionContextImpl.java:4158)
at org.datanucleus.ExecutionContextThreadedImpl.preCommit(ExecutionContextThreadedImpl.java:546)
at org.datanucleus.ExecutionContextImpl.transactionPreCommit(ExecutionContextImpl.java:729)
at org.datanucleus.TransactionImpl.internalPreCommit(TransactionImpl.java:397)
at org.datanucleus.TransactionImpl.commit(TransactionImpl.java:287)
at org.datanucleus.api.jdo.JDOTransaction.commit(JDOTransaction.java:99)

Which points to https://github.com/datanucleus/datanucleus-core/blob/datanucleus-core-5.2.7/src/main/java/org/datanucleus/ExecutionContextImpl.java#L4158. This takes longer and longer for each processed project during the task run (1st project starting with just a few ms, then it gradually goes up to some 20 s or so) which suggests that there's something not quite right in the current implementation (perhaps the L1 cache size keeps increasing a lot during the task run).

ruckc commented 2 years ago

Shouldn't the metrics on the dashboard just be a rollup of all the projects? Based on my understanding of MetricUpdateTask.java's loop, if it wasn't for updating the individual project metrics, the actual metrics could be pulled from a database query aggregating all the project data in a single query. In turn, that means that you could have a view generate the metrics (for H2), and for other databases you could actually make a materialized view that gets automatically updated when projects are updated.
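A rough sketch of that view idea, assuming H2-style SQL and illustrative PROJECTMETRICS column names (not the exact DT schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Creates a PORTFOLIO_ROLLUP view that aggregates the latest metrics row of
// every project, so the dashboard could read current portfolio numbers with a
// single SELECT instead of relying on a scheduled Java task. Names are illustrative.
public class PortfolioRollupView {

    public static void main(String[] args) throws Exception {
        String ddl =
            "CREATE VIEW PORTFOLIO_ROLLUP AS " +
            "SELECT SUM(pm.CRITICAL) AS CRITICAL, SUM(pm.HIGH) AS HIGH, " +
            "       SUM(pm.MEDIUM) AS MEDIUM, SUM(pm.LOW) AS LOW, " +
            "       COUNT(*) AS PROJECTS " +
            "FROM PROJECTMETRICS pm " +
            "WHERE pm.LAST_OCCURRENCE = " +
            "  (SELECT MAX(LAST_OCCURRENCE) FROM PROJECTMETRICS WHERE PROJECT_ID = pm.PROJECT_ID)";
        try (Connection conn = DriverManager.getConnection(System.getenv("DT_JDBC_URL"));
             Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
            // The dashboard could then simply run: SELECT * FROM PORTFOLIO_ROLLUP
        }
    }
}
```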

I have been trying to make use of Dependency-Track for some bulk project analytics, but this scalability issue is a concern... I've had the bulk MetricsUpdateTask running for 7 days before I started investigating.

nscuro commented 2 years ago

@hostalp This takes longer and longer for each processed project during the task run (1st project starting with just a few ms, then it gradually goes up to some 20 s or so) which suggests that there's something not quite right in the current implementation (perhaps the L1 cache size keeps increasing a lot during the task run).

This was indeed an issue related to DataNucleus' L1 cache and should be fixed in v4.5 by recreating the underlying PersistenceManager for every project (see #1481). Testing showed fairly significant performance gains. According to the DN docs, PersistenceManagers are supposed to be somewhat short-lived, so that resources can be freed accordingly. There are more potential tweaks that we could apply that have to do with how DT currently performs INSERTs and UPDATEs under the hood.
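For illustration, the per-project PersistenceManager pattern roughly looks like the following (method names are placeholders, not the actual DT code):

```java
import java.util.List;
import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;

// Each project gets a fresh, short-lived PersistenceManager whose L1 cache is
// discarded on close, so per-project processing time stays roughly constant
// instead of growing over the course of the portfolio run.
public class PortfolioMetricsRun {

    void updatePortfolio(PersistenceManagerFactory pmf, List<String> projectUuids) {
        for (String uuid : projectUuids) {
            final PersistenceManager pm = pmf.getPersistenceManager();
            try {
                updateMetricsForProject(pm, uuid);
            } finally {
                pm.close(); // frees the L1 cache accumulated for this project
            }
        }
    }

    void updateMetricsForProject(PersistenceManager pm, String uuid) {
        // placeholder: compute and persist metrics for a single project
    }
}
```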

@ruckc Shouldn't the metrics on the dashboard just be a rollup of all the projects? Based on my understanding of MetricUpdateTask.java's loop, if it wasn't for updating the individual project metrics, the actual metrics could be pulled from a database query aggregating all the project data in a single query.

This is true. The main bottleneck is not the portfolio metrics update itself (the "sum up"), but updating metrics for all projects and their components. Because this happens in a single thread, execution time grows with every component being added.

The entire process of updating portfolio metrics can be accelerated by performing project metrics updates concurrently. There are a few ways this could be achieved today (using DT's event system, or CompletableFutures). But this has the potential to slow down DT as a whole, blocking BOM ingestion and other analysis tasks.
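A sketch of the CompletableFuture variant, with a deliberately small worker pool to limit the additional load on the API server and database (pool size and method names are assumptions):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Fans project metrics updates out over a bounded pool so one slow project no
// longer serializes the whole portfolio run, then rolls the results up once.
public class ConcurrentMetricsUpdate {

    void updatePortfolio(List<String> projectUuids) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // bounded to cap extra load
        try {
            List<CompletableFuture<Void>> futures = projectUuids.stream()
                    .map(uuid -> CompletableFuture.runAsync(() -> updateProjectMetrics(uuid), pool))
                    .collect(Collectors.toList());
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            rollUpPortfolioMetrics(); // sum project metrics into portfolio metrics
        } finally {
            pool.shutdown();
        }
    }

    void updateProjectMetrics(String uuid) { /* placeholder: metrics for one project */ }

    void rollUpPortfolioMetrics() { /* placeholder: aggregate into portfolio metrics */ }
}
```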

ruckc commented 2 years ago

@nscuro - from what I've analyzed in trying to break the task down into manageable pieces, it seems like updating each project's metrics should be done in separate tasks, and this task should really just be done at the database level with an aggregated UPDATE statement. One could take this logic even further and make the dashboard read off of a database view (or materialized view) to keep the dashboard metrics consistent with all the projects at that point in time.
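As a rough sketch of that write path, the portfolio snapshot could be produced by one aggregated statement, again assuming illustrative PROJECTMETRICS / PORTFOLIOMETRICS tables rather than the exact DT schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Writes a new portfolio metrics snapshot with a single INSERT ... SELECT,
// letting the database do the aggregation instead of iterating projects in Java.
public class DatabaseLevelRollup {

    public static void main(String[] args) throws Exception {
        String sql =
            "INSERT INTO PORTFOLIOMETRICS (CRITICAL, HIGH, MEDIUM, LOW, PROJECTS, LAST_OCCURRENCE) " +
            "SELECT SUM(pm.CRITICAL), SUM(pm.HIGH), SUM(pm.MEDIUM), SUM(pm.LOW), COUNT(*), CURRENT_TIMESTAMP " +
            "FROM PROJECTMETRICS pm " +
            "WHERE pm.LAST_OCCURRENCE = " +
            "  (SELECT MAX(LAST_OCCURRENCE) FROM PROJECTMETRICS WHERE PROJECT_ID = pm.PROJECT_ID)";
        try (Connection conn = DriverManager.getConnection(System.getenv("DT_JDBC_URL"));
             Statement stmt = conn.createStatement()) {
            int rows = stmt.executeUpdate(sql); // single statement, no per-project round trips
            System.out.println("Inserted portfolio snapshot rows: " + rows);
        }
    }
}
```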

Barring doing it at the database level, just iterating through the projects/components and pulling the current database state, instead of doing all the processing in the separate project update tasks, would handle this issue.

Breaking the metrics update into a standalone service doesn't directly fix the point-in-time data accuracy, nor the single-threaded nature of the current update-and-aggregate-everything task.

nscuro commented 2 years ago

Your input is greatly appreciated @ruckc, thank you. I agree that handling this directly in the database would be desirable.

A few issues I see with this:

That being said, I think it's definitely something we should investigate before introducing additional services, which would be even more complex than just maintaining database views.

I also agree with your statement that just introducing another service alone does not solve the problem. Performing the task's current logic in a more lightweight and concurrent way, however, which requires decoupling from the API server, does. Granted, it won't be as efficient as doing it directly in the database.