dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

S3 archive errors with global namespaces in multiple regions #440

Open dhiaayachi opened 2 months ago

dhiaayachi commented 2 months ago

Expected Behavior

S3 archival should work with global namespaces in a multi-cluster deployment that spans regions, based on this excerpt from the docs:

Archival is supported in Global Namespaces (Namespaces that span multiple clusters). When Archival is running in a Global Namespace, it first runs on the active cluster; later it runs on the standby cluster. Before archiving, a history check is done to see what has been previously archived.

Actual Behavior

A global namespace was created with S3 archival enabled in region A (both the active cluster and the archive bucket are in region A). After failing the namespace over to the cluster in region B (which has its own archive bucket in region B), archival in region B fails with errors similar to the following. Note that the log is from cluster B but contains the archival URI of the bucket in region A (the default for cluster A).

{
    "id": "<redacted>",
    "content": {
        "timestamp": "2023-02-26T01:06:25.798Z",
        "tags": [
            "region:<region-b>"
        ],
        "service": "temporal",
        "attributes": {
            "msg": "failed to archive target",
            "shard-id": 24,
            "level": "error",
            "logger": "temporal",
            "source": "stdout",
            "error": "BadRequest: Bad Request\n\tstatus code: 400, request id: <redacted>, host id: <redacted>",
            "archival-caller-service-name": "history",
            "archival-URI": "s3://<region-a-bucket>",
            "target": "history",
            "caller": "log/with_logger.go:72",
            "stacktrace": "go.temporal.io/server/common/log.(*withLogger).Error\n\t/go/pkg/mod/go.temporal.io/server@v1.20.0/common/log/with_logger.go:72\ngo.temporal.io/server/common/log.(*withLogger).Error\n\t/go/pkg/mod/go.temporal.io/server@v1.20.0/common/log/with_logger.go:72\ngo.temporal.io/server/service/history/archival.(*archiver).recordArchiveTargetResult\n\t/go/pkg/mod/go.temporal.io/server@v1.20.0/service/history/archival/archiver.go:244\ngo.temporal.io/server/service/history/archival.(*archiver).archiveHistory\n\t/go/pkg/mod/go.temporal.io/server@v1.20.0/service/history/archival/archiver.go:191\ngo.temporal.io/server/service/history/archival.(*archiver).Archive.func2\n\t/go/pkg/mod/go.temporal.io/server@v1.20.0/service/history/archival/archiver.go:162",
            "archival-request-namespace-id": "<redacted>",
            "service": "temporal",
            "archival-request-workflow-id": "<redacted>",
            "archival-request-run-id": "<redacted>",
            "archival-request-namespace": "<redacted>",
            "archival-request-close-failover-version": 2,
            "timestamp": 1677373585798,
            "ts": 1677373585.797902
        }
    }
}

Steps to Reproduce the Problem

  1. Deploy a multi-cluster setup with a separate S3 bucket in each cluster's region
  2. Create a global namespace with archival enabled, pointing to the bucket in region A
  3. Fail over the global namespace to the cluster in region B

Specifications

Comments

I believe this is due to the various S3 calls using the same underlying AWS session, which is configured at launch with the cluster's default archival region. That is, cluster A uses an S3 client that points to region A and cluster B uses an S3 client that points to region B, but after failover, cluster B attempts to interact with S3 objects in the other region.
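
To make the suspected cause concrete, here is a minimal sketch of one possible direction for a fix: pick the client per archival URI instead of reusing the one created at launch. This is not Temporal's actual archiver code. `regionClient`, `clientCache`, `forURI`, and the `lookupRegion` hook are hypothetical names used only for illustration, and `regionClient` is a stand-in for a real S3 client. If the archiver is built on AWS SDK for Go v1, `s3manager.GetBucketRegion` might be one way to back the lookup hook, but that is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// regionClient is a hypothetical stand-in for an S3 client bound to one region.
type regionClient struct {
	Region string
}

// clientCache hands out one client per bucket region, rather than a single
// client fixed to the cluster's default archival region at startup.
type clientCache struct {
	mu      sync.Mutex
	clients map[string]*regionClient
	// lookupRegion is a hypothetical hook that maps a bucket name to its region.
	lookupRegion func(bucket string) (string, error)
}

func newClientCache(lookup func(bucket string) (string, error)) *clientCache {
	return &clientCache{
		clients:      make(map[string]*regionClient),
		lookupRegion: lookup,
	}
}

// bucketFromURI extracts the bucket name from an archival URI such as s3://my-bucket.
func bucketFromURI(uri string) (string, error) {
	u, err := url.Parse(uri)
	if err != nil {
		return "", fmt.Errorf("parse archival URI %q: %w", uri, err)
	}
	if u.Scheme != "s3" || u.Host == "" {
		return "", fmt.Errorf("not an S3 archival URI: %q", uri)
	}
	return u.Host, nil
}

// forURI returns a client for the region that owns the bucket in the URI,
// creating and caching it on first use.
func (c *clientCache) forURI(uri string) (*regionClient, error) {
	bucket, err := bucketFromURI(uri)
	if err != nil {
		return nil, err
	}
	region, err := c.lookupRegion(bucket)
	if err != nil {
		return nil, fmt.Errorf("resolve region for bucket %q: %w", bucket, err)
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	if cl, ok := c.clients[region]; ok {
		return cl, nil
	}
	cl := &regionClient{Region: region}
	c.clients[region] = cl
	return cl, nil
}
```

With something along these lines, cluster B would resolve `s3://<region-a-bucket>` to a region A client after failover instead of sending the request through its region B client. Whether this matches how the archiver should behave for global namespaces is a separate question for the maintainers.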