DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Transient failure of OpenAPIIntegrationTest #5179

Closed by achave11-ucsc 1 year ago

achave11-ucsc commented 1 year ago

https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/67174

achave11-ucsc commented 1 year ago

This was a false positive: the assertions were failing in the IT on develop. My changes for 'Send GitLab host logs to CloudWatch' #3894 weren't in the latest develop, and the WAF rule set might not have been updated.

achave11-ucsc commented 1 year ago

@hannes-ucsc: "The failing IT emitted 139 requests in a time frame of one minute. We may need to adjust the WAF rate limit. Repeat this analysis for the most recent IT run in prod."

achave11-ucsc commented 1 year ago

@hannes-ucsc: "@dsotirho-ucsc to determine if the rate limit applies for the entire WAF in aggregate, or per resource fronted by the WAF."

achave11-ucsc commented 1 year ago

Spike conclusion:

The most recent IT run on GL prod on 4/20 yielded, at its peak, 179 requests in a time frame of five minutes. Note that the following CW query results are grouped by bin(1m).

[
    {
        "bin": "2023-04-20 23:45:00.000",
        "status": "200",
        "count(*)": "27"
    },
    {
        "bin": "2023-04-20 23:45:00.000",
        "status": "301",
        "count(*)": "1"
    },
    {
        "bin": "2023-04-20 23:46:00.000",
        "status": "302",
        "count(*)": "13"
    },
    {
        "bin": "2023-04-20 23:46:00.000",
        "status": "301",
        "count(*)": "31"
    },
    {
        "bin": "2023-04-20 23:46:00.000",
        "status": "200",
        "count(*)": "8"
    },
    {
        "bin": "2023-04-20 23:47:00.000",
        "status": "302",
        "count(*)": "12"
    },
    {
        "bin": "2023-04-20 23:47:00.000",
        "status": "200",
        "count(*)": "13"
    },
    {
        "bin": "2023-04-20 23:47:00.000",
        "status": "301",
        "count(*)": "21"
    },
    {
        "bin": "2023-04-20 23:47:00.000",
        "status": "404",
        "count(*)": "2"
    },
    {
        "bin": "2023-04-20 23:48:00.000",
        "status": "200",
        "count(*)": "47"
    },
    {
        "bin": "2023-04-20 23:49:00.000",
        "status": "200",
        "count(*)": "4"
    }
]
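
To make the 179 figure easy to check, here is a minimal sketch that sums the result rows above per one-minute bin. The tuples are copied from the query output (minute, status, count); the function and variable names are made up for illustration and are not part of Azul.

```python
from collections import defaultdict

# (minute, status, count) copied from the 4/20 CW query results above.
ROWS_2023_04_20 = [
    ("23:45", "200", 27), ("23:45", "301", 1),
    ("23:46", "302", 13), ("23:46", "301", 31), ("23:46", "200", 8),
    ("23:47", "302", 12), ("23:47", "200", 13), ("23:47", "301", 21),
    ("23:47", "404", 2),
    ("23:48", "200", 47),
    ("23:49", "200", 4),
]


def requests_per_minute(rows):
    """Sum the counts of all status codes within each one-minute bin."""
    totals = defaultdict(int)
    for minute, _status, count in rows:
        totals[minute] += count
    return dict(totals)


per_minute = requests_per_minute(ROWS_2023_04_20)
total = sum(per_minute.values())
peak_minute, peak_count = max(per_minute.items(), key=lambda item: item[1])

print(per_minute)
print(f"total over {len(per_minute)} minutes: {total}")
print(f"busiest minute: {peak_minute} with {peak_count} requests")
```

The total across the five bins is 179, and the busiest single minute (23:46) accounts for 52 of those requests.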

Additionally, the IT run on 4/06 yielded, at its peak, 150 requests in a time frame of five minutes:

[
    {
        "b": "2023-04-06 22:39:00.000",
        "status": "200",
        "count(*)": "29"
    },
    {
        "b": "2023-04-06 22:39:00.000",
        "status": "301",
        "count(*)": "7"
    },
    {
        "b": "2023-04-06 22:39:00.000",
        "status": "302",
        "count(*)": "5"
    },
    {
        "b": "2023-04-06 22:40:00.000",
        "status": "200",
        "count(*)": "16"
    },
    {
        "b": "2023-04-06 22:40:00.000",
        "status": "302",
        "count(*)": "15"
    },
    {
        "b": "2023-04-06 22:40:00.000",
        "status": "301",
        "count(*)": "22"
    },
    {
        "b": "2023-04-06 22:40:00.000",
        "status": "404",
        "count(*)": "2"
    },
    {
        "b": "2023-04-06 22:41:00.000",
        "status": "200",
        "count(*)": "44"
    },
    {
        "b": "2023-04-06 22:42:00.000",
        "status": "200",
        "count(*)": "6"
    },
    {
        "b": "2023-04-06 22:43:00.000",
        "status": "200",
        "count(*)": "4"
    }
]

achave11-ucsc commented 1 year ago

This happened again, during sandbox run https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/67196.

hannes-ucsc commented 1 year ago

OK, so this looks to be a regression from 92d583e3.

The fix will be to adjust the rate limit to 200 per minute, or 1000 per 5 minutes.
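
As a rough sanity check on those numbers, the sketch below compares the request counts reported earlier in this thread against the proposed limit. It treats the limit as a flat per-minute budget scaled to the length of each observation, which is a simplification of how a rate-based rule evaluates its window; the names are illustrative only.

```python
WINDOW_MINUTES = 5
LIMIT_PER_WINDOW = 1000  # proposed limit: 1000 requests per 5 minutes

# Equivalent per-minute budget: 1000 / 5 = 200
LIMIT_PER_MINUTE = LIMIT_PER_WINDOW // WINDOW_MINUTES

# (label, requests, minutes) as reported in the comments above.
OBSERVED = [
    ("failing IT", 139, 1),
    ("prod IT 4/20", 179, 5),
    ("prod IT 4/06", 150, 5),
]


def headroom(requests, minutes):
    """Requests left under the scaled budget for an observation of this length."""
    return LIMIT_PER_MINUTE * minutes - requests


for label, requests, minutes in OBSERVED:
    print(f"{label}: {requests} requests in {minutes} min, "
          f"headroom {headroom(requests, minutes)}")
```

Under this simplified reading, the 139 requests in one minute stay 61 below the 200 per minute budget, and both prod runs stay more than 800 below the five-minute limit.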

I did some reading and I think the rate limit is per WAF and client IP, so it applies to all resources protected by a WAF (or Web ACL, as Amazon calls it) in aggregate. If we wanted to apply different rates to indexer and service we could scope down the rate-based rule to the Host header.
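
For illustration, scoping a rate-based rule down to the Host header could look roughly like the sketch below. It builds plain dictionaries in the shape of AWS WAFv2 rule definitions (`RateBasedStatement` with a `ScopeDownStatement`). The host names, rule names, and the second limit are hypothetical, and how Azul actually provisions its Web ACL is not shown here.

```python
import json


def rate_rule(name, priority, host, limit):
    """Build a WAFv2-style rate-based rule that only counts requests whose
    Host header equals `host`. All argument values are illustrative."""
    return {
        "Name": name,
        "Priority": priority,
        "Action": {"Block": {}},
        "Statement": {
            "RateBasedStatement": {
                "Limit": limit,              # evaluated per client IP
                "AggregateKeyType": "IP",
                "ScopeDownStatement": {
                    "ByteMatchStatement": {
                        "SearchString": host,
                        "FieldToMatch": {"SingleHeader": {"Name": "host"}},
                        "TextTransformations": [
                            {"Priority": 0, "Type": "LOWERCASE"}
                        ],
                        "PositionalConstraint": "EXACTLY",
                    }
                },
            }
        },
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }


# Hypothetical hosts and limits: one rule per component behind the same Web ACL.
rules = [
    rate_rule("service-rate", 0, "service.example.org", 1000),
    rate_rule("indexer-rate", 1, "indexer.example.org", 500),
]

print(json.dumps(rules[0], indent=2))
```

Without the scope-down statement, a single rule of this kind would count requests to every host behind the Web ACL together, which matches the aggregate behavior described above.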

We need a fix for this before we can promote to production.

hannes-ucsc commented 1 year ago

Cancelling spike for @dsotirho-ucsc.

hannes-ucsc commented 1 year ago

No demo. Passing IT will prove efficacy of fix.