DependencyTrack / hyades

Incubating project for decoupling responsibilities from Dependency-Track's monolithic API server into separate, scalable services.
https://dependencytrack.github.io/hyades/latest
Apache License 2.0

Proposal: Move the Repository MetaAnalyzer to Quarkus Application #69

Closed mehab closed 1 year ago

mehab commented 2 years ago

Current Implementation

Currently, the Dependency-Track API server has a RepositoryMetaAnalyzer class that analyses a given component based on its purl type to obtain a meta model for that component. If the meta model's latestVersion field is not null, a metaComponent is created with the latest information from the model and synchronised with the database. The RepositoryMetaAnalyzer actively listens for RepositoryMetaEvent.
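For orientation, a minimal sketch of that flow, using simplified stand-in types rather than the real Dependency-Track classes (IMetaAnalyzer, MetaModel, RepositoryMetaComponent), could look like this:

```java
import java.util.Date;

public class RepositoryMetaAnalysisSketch {

    // Simplified stand-ins for the real Dependency-Track model classes.
    record Component(String uuid, String purl) {}
    record MetaModel(String latestVersion, Date publishedTimestamp) {}
    record MetaComponent(String purl, String latestVersion, Date lastCheck) {}

    // Stand-in for the purl-type-specific analyzers (Maven, Golang, ...).
    interface MetaAnalyzer {
        MetaModel analyze(Component component);
    }

    void process(Component component, MetaAnalyzer analyzer) {
        MetaModel model = analyzer.analyze(component);
        // Only when the repository reported a latest version is a meta component
        // created and synchronised with the database.
        if (model.latestVersion() != null) {
            MetaComponent meta =
                    new MetaComponent(component.purl(), model.latestVersion(), new Date());
            persist(meta);
        }
    }

    void persist(MetaComponent meta) {
        // In the real API server this is a database synchronisation; here it is a stub.
        System.out.println("Persisting " + meta);
    }
}
```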

RepositoryMetaEvent: This is an event type in the DT API server that can be triggered from three different places:

In the RepositoryMetaAnalyzer, there are two workflows:

Drawbacks of this approach

Portfolio analysis, as described above, is already a heavy task that consumes a lot of API server resources. The events containing individual components also keep the API server busy: consider a large SBOM with 10,000 components being uploaded, resulting in an event for the BOMUploadProcessingTask. A RepositoryMetaEvent is then triggered for every component in the SBOM.

Proposal

We propose moving the RepositoryMetaAnalyzer out to the external Quarkus application that leverages Kafka, which we have already used to offload the VulnerabilityAnalysis from the DT API server. This is the initial proposed design.

Analysing each component

Every time a component is created or updated in the UI via the componentResource, or a new BOM is uploaded resulting in a BOMUploadProcessingTask, the proposal is to produce the components as events on a Kafka topic (maybe the same topic already used for vulnerability analysis?). If a component does not have a purl associated with it, we do not need to send it to the Kafka topic: in the current implementation, when the model retrieved after analysis has a null latestVersion field (which is the case when the component has no purl), the meta analysis has no effect anyway.
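As a rough sketch of the producing side (the topic name component-meta-analysis and the JSON payload shape are placeholders, not a decided schema):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ComponentEventProducer {

    private final KafkaProducer<String, String> producer;

    public ComponentEventProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    /**
     * Publishes one event per component. Components without a purl are skipped,
     * since the meta analysis cannot do anything useful with them anyway.
     * Topic name and payload shape are assumptions for illustration.
     */
    public void publish(String componentUuid, String purl) {
        if (purl == null || purl.isBlank()) {
            return; // no purl -> no repository meta analysis
        }
        String payload = "{\"uuid\":\"" + componentUuid + "\",\"purl\":\"" + purl + "\"}";
        producer.send(new ProducerRecord<>("component-meta-analysis", componentUuid, payload));
    }
}
```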

For each analyser type, we should determine the supported repository type and then load all repositories of that type into a GlobalKTable to be used for the analysis. The table would use the repositoryType as the key and the list of repositories of that type as the value. A repository might also require a username and password; these are expected to be provided as environment variables at application start.
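A minimal Kafka Streams sketch of building such a table, assuming a hypothetical repository-config topic keyed by repository type, with a JSON-encoded list of repositories as the value:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;

public class RepositoryConfigTable {

    /**
     * Builds a GlobalKTable keyed by repository type (e.g. "MAVEN", "GOLANG"),
     * with the value being a JSON-encoded list of repositories of that type.
     * Topic name and serialization are assumptions for this sketch.
     */
    public Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        GlobalKTable<String, String> repositories = builder.globalTable(
                "repository-config",
                Consumed.with(Serdes.String(), Serdes.String()));

        // The analyzer topology can later join the component stream against this
        // table to look up which repositories to query for a given purl type.
        return builder.build();
    }
}
```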

These events contain the component UUID and purl at a minimum. A primary meta analyser consumer maps the incoming events to the different meta analyser consumers based on the purl type. Once the analysis is done by the corresponding meta analyser implementation, such as Maven or Golang, the fields of a metaComponent object are updated from the meta model. This marks the completion of the meta analysers' work. The output containing the metaComponent can then be sent back to the DT API server to be saved to the database.
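A rough Kafka Streams sketch of that routing step; the topic names, the String payloads, and the switch on the purl prefix are illustrative stand-ins for the real analyzer dispatch:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class MetaAnalyzerTopology {

    /**
     * Consumes component events (key = component UUID, value = purl), dispatches
     * them to a purl-type-specific analyzer, and writes the resulting meta
     * information to a result topic that the API server consumes and persists.
     */
    public Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> components = builder.stream(
                "component-meta-analysis",
                Consumed.with(Serdes.String(), Serdes.String()));

        components
                .mapValues(purl -> analyzeLatestVersion(purl))
                .filter((uuid, latestVersion) -> latestVersion != null)
                .to("component-meta-analysis-result",
                        Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }

    // Stand-in for the dispatch to the Maven/Golang/... analyzer implementations.
    private String analyzeLatestVersion(String purl) {
        if (purl.startsWith("pkg:maven/")) {
            return queryMavenRepositories(purl);   // hypothetical helper
        } else if (purl.startsWith("pkg:golang/")) {
            return queryGoModuleProxy(purl);       // hypothetical helper
        }
        return null; // unsupported purl type -> nothing to report
    }

    private String queryMavenRepositories(String purl) { return null; }
    private String queryGoModuleProxy(String purl) { return null; }
}
```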

RepositoryMetaEvent from Task Scheduler

When the repository meta event is triggered by the task scheduler, we are essentially doing a portfolio analysis: get all the components of all active projects in the DT API server, then send each of these components as the same kind of event as above to be analysed.
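Purely for illustration, this scheduler-triggered path could reuse the producer sketched above; fetchComponentsOfActiveProjects is a hypothetical query helper standing in for the real database lookup:

```java
import java.util.List;

public class PortfolioMetaAnalysisTask {

    private final ComponentEventProducer producer;

    public PortfolioMetaAnalysisTask(ComponentEventProducer producer) {
        this.producer = producer;
    }

    /** Invoked by the existing task scheduler (e.g. once every 24 hours). */
    public void run() {
        for (ComponentRef component : fetchComponentsOfActiveProjects()) {
            // Each component becomes an individual event, exactly like the
            // component-created / BOM-upload path above.
            producer.publish(component.uuid(), component.purl());
        }
    }

    record ComponentRef(String uuid, String purl) {}

    // Hypothetical query; in the API server this would go through the existing persistence layer.
    private List<ComponentRef> fetchComponentsOfActiveProjects() {
        return List.of();
    }
}
```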

[Screenshot: initial proposed design diagram, 2022-10-28]
syalioune commented 2 years ago

Thanks for the detailed design. A few remarks below:

> Every time a component is created or updated in the UI via the componentResource, or a new BOM is uploaded resulting in a BOMUploadProcessingTask, the proposal is to produce the components as events on a Kafka topic (maybe the same topic already used for vulnerability analysis?)

I would recommend using the same topic for two reasons:

> For each analyser type, we should determine the supported repository type and then load all repositories of that type into a GlobalKTable to be used for the analysis. The table would use the repositoryType as the key and the list of repositories of that type as the value. A repository might also require a username and password; these are expected to be provided as environment variables at application start.

I believe a database would be a better fit for this requirement (storing repository configuration and sensitive data). The data ownership is blurry if the repository configuration is stored in both the API Server and the Quarkus application, whereas it should be the responsibility of the application containing the analyzer. Moreover, by using environment variables for credentials, you prevent them from being updated from the frontend.

Offloading the API Server database does not necessarily imply that other modules should not have their own databases and expose APIs for the API server to consume.

We could have something like the picture below, using an appropriate DB for each module, with clear responsibilities and separation of concerns. The analyzer module (the current Quarkus app) can have its own database and expose a subset of the API Server REST API, allowing updates and reads from the Frontend.

[Diagram: DT_modularisation, proposed per-module database and API split]
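To make the idea concrete, a hypothetical JAX-RS resource in the analyzer module, owning its repository configuration and exposing it to the frontend, could look roughly like this (paths, types, and the in-memory store are illustrative only):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.PUT;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

@Path("/api/v1/repository")
@Produces(MediaType.APPLICATION_JSON)
@Consumes(MediaType.APPLICATION_JSON)
public class RepositoryConfigResource {

    public record RepositoryConfig(String identifier, String type, String url, boolean enabled) {}

    // Stand-in for the analyzer module's own database.
    private static final Map<String, RepositoryConfig> STORE = new ConcurrentHashMap<>();

    @GET
    @Path("/{type}")
    public List<RepositoryConfig> getByType(@PathParam("type") String type) {
        return STORE.values().stream()
                .filter(repo -> repo.type().equalsIgnoreCase(type))
                .toList();
    }

    @PUT
    public RepositoryConfig createOrUpdate(RepositoryConfig config) {
        STORE.put(config.identifier(), config);
        return config;
    }
}
```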

If all DB choices end up being SQL, this allows for some flexibility:

To be discussed :smiley:

nscuro commented 2 years ago

For the initial PoC work, and to reduce the number of new components being added to the system, we're thinking about connecting the external services to the existing DB, but in "read-only mode". For the time being, all they need is access to the configurations managed by the API server, so the addition of dedicated DBs isn't really justifiable right now.

For large-scale deployments, read-replicas of the database could be set up, and the external services could be pointed towards them instead of the "master" DB server. In that case, frequent, high-volume reads of the external services shouldn't impact performance on the API server side.
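For illustration only, an external service reading configuration from such a replica could make the read-only intent explicit at the JDBC level as sketched below; the URL, credentials, and the queried table and columns are placeholders, and in a Quarkus service this would normally be expressed through the datasource configuration instead:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ReadOnlyConfigReader {

    public static void main(String[] args) throws SQLException {
        // Placeholder URL pointing at a read replica rather than the primary DB server.
        String jdbcUrl = "jdbc:postgresql://db-replica:5432/dtrack";

        try (Connection connection = DriverManager.getConnection(jdbcUrl, "dtrack-ro", "secret")) {
            // The service never writes to this schema, so mark the connection read-only.
            connection.setReadOnly(true);

            // Placeholder query for repository configuration owned by the API server.
            try (Statement statement = connection.createStatement();
                 ResultSet resultSet = statement.executeQuery("SELECT type, url FROM repository")) {
                while (resultSet.next()) {
                    System.out.println(resultSet.getString(1) + " -> " + resultSet.getString(2));
                }
            }
        }
    }
}
```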

mehab commented 2 years ago
> • The event model is the same. It does not make sense to have the exact same event in different topics. The whole benefit of Kafka is to be able to plug different consumers in to process the same event 😉

I was doubting this because, at the time the portfolio analysis is triggered (that is, once every 24 hours), having the same topic for both vulnerability analysis and the repo meta analyser would mean that vulnerability analysis would also start running for the whole portfolio every 24 hours. Is this okay to have?

mehab commented 2 years ago

I like the diagram that @syalioune shared, but maybe that is future work? The current plan is as per what @nscuro mentioned above.

syalioune commented 2 years ago

> For the initial PoC work, and to reduce the number of new components being added to the system, we're thinking about connecting the external services to the existing DB, but in "read-only mode". For the time being, all they need is access to the configurations managed by the API server, so the addition of dedicated DBs isn't really justifiable right now.

I agree with the step-by-step approach. The schema was more to illustrate where I was going (a man gotta dream :grin:), hence the flexibility list part. The main point being: the external services need a database (at least some of them); it can't just be replaced by a KTable and property files.

In the long run, having multiple components but a central API Server (serving all APIs) and a central database (holding all tables) personally feels a bit off from an architecture standpoint. For example, since the analyzer logic would be decoupled from portfolio management, I would expect the Analyzer component to expose a REST API allowing ad-hoc analysis to be performed without going through the main component. Said API could be called directly from the frontend.

syalioune commented 2 years ago
> • The event model is the same. It does not make sense to have the exact same event in different topics. The whole benefit of Kafka is to be able to plug different consumers in to process the same event 😉

> I was doubting this because, at the time the portfolio analysis is triggered (that is, once every 24 hours), having the same topic for both vulnerability analysis and the repo meta analyser would mean that vulnerability analysis would also start running for the whole portfolio every 24 hours. Is this okay to have?

Currently, RepositoryMetaEvent and PortfolioVulnerabilityAnalysisEvent are each fired once every 24 hours, so I personally think it is okay. This should be validated by Niklas. A broader question would also be: is it acceptable to have a single periodicity for both analyses (vuln and repo meta)? With this design, the same event would trigger both.

nscuro commented 2 years ago

Initially I was in favor of reusing the component-analysis topic (thus the generic name). But @mehab has a point, and I think that kicking off both types of analysis at the same time is not always desired and can put the system under unnecessary stress. After all, all the events emitted during analysis still need to be ingested by the API server.

mehab commented 1 year ago

The implementation is complete and the changes are present in the main branch of this code repo: https://github.com/mehab/DTKafkaPOC/tree/main/repository-meta-analyzer. The modular approach suggested by @syalioune has also been implemented :)