Refactor to support DT clusters for high availability (HA) and high performance

stevespringett commented 3 years ago

Current Behavior:

In DT 3.8, the frontend was separated from the server. In v4.0 it was further decoupled and the UI was completely removed from the server by default. The server was rebranded API Server with the intent that other server-side components will be available in the future.

The current architecture of the API Server is monolithic which relies on an async event driven queue and task execution subsystem. Under heavly loads, the system can underperform and in some situations, bouncing the app is required.

Proposed Behavior:

Decouple various types of workers into their own projects, that can be deployed and scaled independently. The microservice architecture is not the appropriate approach for DT. But an architecture that incorporates the following will likely be ideal:

Frontend
API Server
Distributed event and task execution queue
Specialized worker nodes

SPIKE

Investigate the feasibility of using Redis and Redisson
Ideally, have Redis deployed by default, or the option to specify an external Redis instance
Experiment with decoupled persistence and model libraries
Experiment with CLI worker nodes that respond to events
Experiment with cluster-wide singletons where there should only be one instance running (NVD and NPM mirroring)
Experiment with secure key replication across cluster, or the use of Docker secrets (or K8s Secrets).

stevespringett commented 3 years ago

See also: #218

lihaoran93 commented 3 years ago

Is there any other temporary way to solve it? The UpdatePortfoliometrics Task has been running for eight hours with 1,100 items(project)。

stevespringett commented 3 years ago

@lihaoran93 If running on VMs or Docker, likely culprits are underpowered machines. Make sure you're using machines optimized for CPU and RAM and that you've given enough of both to the server. You'll also want to look at your database server, especially if its on a VM or using something like RDS. These can be underpowered as well. 1100 projects isn't that many, so the fact its taking that long leads me to believe there's a performance bottleneck somewhere on the hosts.

lihaoran93 commented 3 years ago

Thanks, I'll check the database and CPU。

spmishra121 commented 2 years ago

Hi @stevespringett, Is there any clear steps are defined to implement DT clusters for HA?

spmishra121 commented 2 years ago

Can we go with below steps for the current available version?

Create 2 mysql db in cluster as primary and secondary node.
Use primary node db in docker-compose.yml file
Install 2 instance of DT on different machine.
Configure both db to sync.

fuentecilla86 commented 2 years ago

Hi,

I am playing with HA locally (not with database yet) but I had a problem with two DT servers running. This is my docker-compose.yml:

version: '3.7'

volumes:
  dependency-track:

services:
  dtrack-apiserver:
    image: dependencytrack/apiserver:4.3.6
    environment:
    # Database Properties
    - ALPINE_DATABASE_MODE=external
    - ALPINE_DATABASE_URL=jdbc:postgresql://db:5432/dtrack
    - ALPINE_DATABASE_DRIVER=org.postgresql.Driver
    - ALPINE_DATABASE_USERNAME=dtrack
    - ALPINE_DATABASE_PASSWORD=dtrack
    - ALPINE_DATABASE_POOL_ENABLED=true
    - ALPINE_DATABASE_POOL_MAX_SIZE=20
    - ALPINE_DATABASE_POOL_MIN_IDLE=10
    - ALPINE_DATABASE_POOL_IDLE_TIMEOUT=300000
    - ALPINE_DATABASE_POOL_MAX_LIFETIME=600000
    depends_on:
      - postgres
    deploy:
      replicas: 2
      resources:
        limits:
          memory: 12288m
        reservations:
          memory: 8192m
      restart_policy:
        condition: on-failure
    # ports:
    #   - '8081:8080'
    volumes:
      - 'dependency-track:/data'
    # restart: unless-stopped
    restart: on-failure

  dtrack-frontend:
    image: dependencytrack/frontend:4.3.1
    depends_on:
      - dtrack-apiserver
    environment:
      - API_BASE_URL=http://localhost:8081
    ports:
      - "8080:8080"
    restart: unless-stopped

  db:
    image: postgres:14.2
    expose:
      - "5432"
    environment:
      - POSTGRES_USER=dtrack
      - POSTGRES_PASSWORD=dtrack
      - POSTGRES_DB=dtrack
    volumes:
      - ./docker/postgresql:/var/lib/postgresql

  nginx:
    image: nginx:latest
    volumes:
      - ./docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - dtrack-apiserver
    ports:
      - "8081:8081"

nginx.conf

user  nginx;

events {
    worker_connections   1000;
}
http {
        server {
              listen 8081;
              location / {
                proxy_pass http://dtrack-apiserver:8080;
              }
        }
}

The problem is when I run it, the two servers start trying to write in the same folder /data and the second one crash because of it. Is there any way to control this? Is there something that I'm not paying attention?

LazyAnnoyingStupidIdiot commented 1 year ago

So, a few questions.

If it has issue with the concurrency of the data in the disk, can I just start 1 instance and add the second instance later after it's finished initialising?
Or, can I run this in an AWS fargate container with a postgredb backing? I'm assuming everything in the data directory can be regenerated when the api container is replaced?

nscuro commented 1 year ago

@LazyAnnoyingStupidIdiot

If it has issue with the concurrency of the data in the disk, can I just start 1 instance and add the second instance later after it's finished initialising?

That will not work, because some data that is initialized immediately on startup will also be periodically refreshed / updated afterwards. Lucene search indexes for example (located in /data/index) are updated frequently throughout the application's lifetime.

Or, can I run this in an AWS fargate container with a postgredb backing? I'm assuming everything in the data directory can be regenerated when the api container is replaced?

The /data directory contains keys for secrets encrytion (secret.key), as well as JWT signing / validation (public.key, private.key). While those can be re-generated, it will invalidate all previously issues JWTs, and requires re-cryption of secrets, like API keys for OSS Index, GitHub, Snyk, etc.

LazyAnnoyingStupidIdiot commented 1 year ago

@nscuro thank you for the answers. Very much appreciated.

I see you have mentioned lucene search index. That means NAS (EFS on AWS) would not be too great an idea either?

I'm really hoping for a set up where I don't have to use EC2 instance and its disk storage, but from the look of things this is unavoidable :/

DependencyTrack / dependency-track