DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Reindexing is disruptive in stable deployments #6026

Open hannes-ucsc opened 8 months ago

hannes-ucsc commented 8 months ago

Users complain that deployments are slow during reindexing, that the index is incomplete and that they can't download manifests. To address this we would need to perform a reindex in the background, so to speak, ideally without doubling the infrastructure cost.

dsotirho-ucsc commented 8 months ago

Assignee to consider next steps.

hannes-ucsc commented 8 months ago

In stable zones (e.g. prod), manage two blue/green deployments of Azul.

indexer.1.azul.data.humancellatlas.org
service.1.azul.data.humancellatlas.org

indexer.2.azul.data.humancellatlas.org
service.2.azul.data.humancellatlas.org

Above are the canonical domain names of the indexer and service in both of the deployments. Note that the hosted zone is azul.data.humancellatlas.org for both deployments.

Additionally, there will be alias A records that point to one of the deployments:

indexer.azul.data.humancellatlas.org -> indexer.1.azul.data.humancellatlas.org
service.azul.data.humancellatlas.org -> service.1.azul.data.humancellatlas.org

or

indexer.azul.data.humancellatlas.org -> indexer.2.azul.data.humancellatlas.org
service.azul.data.humancellatlas.org -> service.2.azul.data.humancellatlas.org

Note that alias A records are different from CNAME records. The latter require an extra DNS round-trip. The former don't. Users won't typically be aware of the canonical domain names. Publicly, we'll communicate only the alias record names.
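As a rough sketch of what switching the alias records could look like, the function below builds a change batch in the shape accepted by the Route 53 `ChangeResourceRecordSets` API. The function name `alias_change_batch` and the hosted zone ID passed to it are hypothetical, and the sketch assumes that the alias and canonical names live in the same hosted zone, as described above.

```python
def alias_change_batch(active: int,
                       hosted_zone_id: str,
                       zone: str = 'azul.data.humancellatlas.org') -> dict:
    """
    Build a Route 53 change batch that points the public alias A records
    at the canonical names of deployment number `active` (1 or 2).
    """
    if active not in (1, 2):
        raise ValueError(f'No such deployment: {active}')
    changes = []
    for component in ('indexer', 'service'):
        changes.append({
            # UPSERT creates the record if absent, otherwise overwrites it
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': f'{component}.{zone}.',
                'Type': 'A',
                'AliasTarget': {
                    # Assumption: the target record is in the same hosted
                    # zone, so the zone's own ID is used here
                    'HostedZoneId': hosted_zone_id,
                    'DNSName': f'{component}.{active}.{zone}.',
                    'EvaluateTargetHealth': False,
                },
            },
        })
    return {'Comment': f'Activate deployment {active}', 'Changes': changes}
```

The result could be passed as the `ChangeBatch` argument of boto3's `change_resource_record_sets`. Putting both records into one batch means the indexer and service aliases are submitted as a single change rather than two separate ones.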

Each ACM certificate in these two deployments will list the canonical domain name as the common name and the alias record name as the subject alternative name.
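To make the naming concrete, here is a minimal sketch that derives the certificate names for one component of one deployment. The helper `certificate_names` is hypothetical. The dictionary keys mirror the parameters of the ACM `RequestCertificate` API, and the choice of DNS validation is an assumption that the proposal does not state.

```python
def certificate_names(component: str,
                      deployment: int,
                      zone: str = 'azul.data.humancellatlas.org') -> dict:
    """
    Return the names for the certificate of the given component
    ('indexer' or 'service') in the given deployment (1 or 2).
    """
    canonical = f'{component}.{deployment}.{zone}'
    alias = f'{component}.{zone}'
    return {
        # The canonical name becomes the certificate's common name
        'DomainName': canonical,
        # The publicly communicated alias name is the alternative name
        'SubjectAlternativeNames': [alias],
        # Assumption: DNS validation, not stated in the proposal
        'ValidationMethod': 'DNS',
    }
```

With this, a certificate for the service in deployment 1 would be valid for both service.1.azul.data.humancellatlas.org and service.azul.data.humancellatlas.org, so it keeps working no matter which deployment the alias currently points to.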

Each deployment will have its own ES domain.

Each deployment can be either online or on standby. While a deployment is on standby, its ES domain (the expensive part of the infrastructure) will not exist. During a reindex, both deployments will be online. One of them, the active one, will serve users and host the current version of the code and index. The other, inactive one will host the next version of the code and run the reindex. When the reindex is done, the alias records will be switched, making the previously inactive deployment the active one. After validation, the now-inactive deployment will be put into standby to save cost.
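The lifecycle described above can be modeled as a small state machine. This is only an illustrative sketch: the class and method names are hypothetical, and the methods merely track state where a real script would create or destroy infrastructure.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    standby = 'standby'  # the ES domain does not exist
    online = 'online'    # the ES domain exists


@dataclass
class Deployment:
    name: str
    state: State
    active: bool = False  # True if the alias records point here


class BlueGreen:

    def __init__(self) -> None:
        self.deployments = [
            Deployment('prod1', State.online, active=True),
            Deployment('prod2', State.standby),
        ]

    @property
    def active(self) -> Deployment:
        return next(d for d in self.deployments if d.active)

    @property
    def inactive(self) -> Deployment:
        return next(d for d in self.deployments if not d.active)

    def begin_reindex(self) -> None:
        # Bring the inactive deployment online so it can run the reindex
        self.inactive.state = State.online

    def switch(self) -> None:
        # Only a deployment that is online can start serving users
        if self.inactive.state is not State.online:
            raise RuntimeError('Cannot activate a deployment on standby')
        old, new = self.active, self.inactive
        old.active, new.active = False, True

    def retire_inactive(self) -> None:
        # After validation, put the now-inactive deployment into standby
        self.inactive.state = State.standby
```

A full cycle would then be `begin_reindex()`, followed by `switch()` once the reindex is done, followed by `retire_inactive()` after validation. Both deployments are online only between the first and the last call, which is the window during which the infrastructure cost is doubled.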

A script will be used to manage the alias records and to destroy the ES domain of a deployment in standby.

There will be one branch for each deployment: prod1 and prod2 in this example.

Catalogs are orthogonal to this. During the addition of a catalog for, say, a new HCA release, we'll take the inactive deployment out of standby, push the catalog change to the corresponding branch (prod1 or prod2) and kick off a reindex of that catalog. When the reindex is complete, we'll activate that deployment and deactivate the other one. Once we are sure that the reindex was successful, we'll put the deactivated deployment into standby. When the review of the new release by the wranglers is complete, we'll delete the catalog for the previous release from the currently active deployment without switching active deployments, just like we do today.
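The catalog workflow can be written down as an ordered checklist. The sketch below is hypothetical throughout: the function name and the catalog names in the usage note are placeholders, and the steps are plain strings, not calls into any real tooling.

```python
def catalog_rollout_steps(active: str,
                          new_catalog: str,
                          old_catalog: str) -> list[str]:
    """
    List the steps for adding `new_catalog`, given the branch name of the
    currently active deployment ('prod1' or 'prod2').
    """
    branches = ('prod1', 'prod2')
    if active not in branches:
        raise ValueError(f'Unknown deployment: {active}')
    # The inactive deployment is whichever of the two is not active
    inactive = branches[1 - branches.index(active)]
    return [
        f'take {inactive} out of standby',
        f'push the change adding catalog {new_catalog} to branch {inactive}',
        f'reindex catalog {new_catalog} in {inactive}',
        f'activate {inactive} and deactivate {active}',
        f'confirm the reindex succeeded, then put {active} into standby',
        # By now, {inactive} is the active deployment, so no switch is needed
        f'after wrangler review, delete catalog {old_catalog} from {inactive}',
    ]
```

For example, `catalog_rollout_steps('prod1', 'new-release', 'old-release')` yields six steps, all of which except the switch itself operate on prod2 or leave prod1 untouched until it goes into standby.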