achave11-ucsc closed 1 year ago
This was a false positive, since it was the assertions that were failing in the IT run on develop. My changes in 'Send GitLab host logs to CloudWatch' (#3894) weren't in the latest develop, and the WAF rule set might not have been updated.
@hannes-ucsc: "The failing IT emitted 139 requests in a time frame of one minute. We may need to adjust the WAF rate limit. Repeat this analysis for the most recent IT run in prod."
@hannes-ucsc: "@dsotirho-ucsc to determine if the rate limit applies for the entire WAF in aggregate, or per resource fronted by the WAF."
Spike conclusion:
The most recent IT run on GL prod, on 4/20, yielded 179 requests in a time frame of five minutes at its peak. Note that the following CW query results are aggregated over bin(1m).
[
{
"bin": "2023-04-20 23:45:00.000",
"status": "200",
"count(*)": "27"
},
{
"bin": "2023-04-20 23:45:00.000",
"status": "301",
"count(*)": "1"
},
{
"bin": "2023-04-20 23:46:00.000",
"status": "302",
"count(*)": "13"
},
{
"bin": "2023-04-20 23:46:00.000",
"status": "301",
"count(*)": "31"
},
{
"bin": "2023-04-20 23:46:00.000",
"status": "200",
"count(*)": "8"
},
{
"bin": "2023-04-20 23:47:00.000",
"status": "302",
"count(*)": "12"
},
{
"bin": "2023-04-20 23:47:00.000",
"status": "200",
"count(*)": "13"
},
{
"bin": "2023-04-20 23:47:00.000",
"status": "301",
"count(*)": "21"
},
{
"bin": "2023-04-20 23:47:00.000",
"status": "404",
"count(*)": "2"
},
{
"bin": "2023-04-20 23:48:00.000",
"status": "200",
"count(*)": "47"
},
{
"bin": "2023-04-20 23:49:00.000",
"status": "200",
"count(*)": "4"
}
]
Additionally, the IT run on 4/06 yielded, at its peak, 150 requests in a time frame of five minutes:
[
{
"b": "2023-04-06 22:39:00.000",
"status": "200",
"count(*)": "29"
},
{
"b": "2023-04-06 22:39:00.000",
"status": "301",
"count(*)": "7"
},
{
"b": "2023-04-06 22:39:00.000",
"status": "302",
"count(*)": "5"
},
{
"b": "2023-04-06 22:40:00.000",
"status": "200",
"count(*)": "16"
},
{
"b": "2023-04-06 22:40:00.000",
"status": "302",
"count(*)": "15"
},
{
"b": "2023-04-06 22:40:00.000",
"status": "301",
"count(*)": "22"
},
{
"b": "2023-04-06 22:40:00.000",
"status": "404",
"count(*)": "2"
},
{
"b": "2023-04-06 22:41:00.000",
"status": "200",
"count(*)": "44"
},
{
"b": "2023-04-06 22:42:00.000",
"status": "200",
"count(*)": "6"
},
{
"b": "2023-04-06 22:43:00.000",
"status": "200",
"count(*)": "4"
}
]
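To double-check the totals quoted above, the per-status rows can be summed per one-minute bin. The following is a minimal sketch in Python with the rows copied from the two query results; the function name `summarize` and the variable names are made up for illustration and are not part of Azul.

```python
from collections import defaultdict

def summarize(rows, bin_key):
    """Sum the string-valued "count(*)" column per time bin.

    Returns (total requests, busiest bin, requests in busiest bin).
    """
    per_bin = defaultdict(int)
    for row in rows:
        per_bin[row[bin_key]] += int(row["count(*)"])
    total = sum(per_bin.values())
    peak_bin = max(per_bin, key=per_bin.get)
    return total, peak_bin, per_bin[peak_bin]

# Rows copied from the 4/20 result (bin column is named "bin").
rows_0420 = [
    {"bin": "2023-04-20 23:45:00.000", "status": "200", "count(*)": "27"},
    {"bin": "2023-04-20 23:45:00.000", "status": "301", "count(*)": "1"},
    {"bin": "2023-04-20 23:46:00.000", "status": "302", "count(*)": "13"},
    {"bin": "2023-04-20 23:46:00.000", "status": "301", "count(*)": "31"},
    {"bin": "2023-04-20 23:46:00.000", "status": "200", "count(*)": "8"},
    {"bin": "2023-04-20 23:47:00.000", "status": "302", "count(*)": "12"},
    {"bin": "2023-04-20 23:47:00.000", "status": "200", "count(*)": "13"},
    {"bin": "2023-04-20 23:47:00.000", "status": "301", "count(*)": "21"},
    {"bin": "2023-04-20 23:47:00.000", "status": "404", "count(*)": "2"},
    {"bin": "2023-04-20 23:48:00.000", "status": "200", "count(*)": "47"},
    {"bin": "2023-04-20 23:49:00.000", "status": "200", "count(*)": "4"},
]

# Rows copied from the 4/06 result (bin column is named "b").
rows_0406 = [
    {"b": "2023-04-06 22:39:00.000", "status": "200", "count(*)": "29"},
    {"b": "2023-04-06 22:39:00.000", "status": "301", "count(*)": "7"},
    {"b": "2023-04-06 22:39:00.000", "status": "302", "count(*)": "5"},
    {"b": "2023-04-06 22:40:00.000", "status": "200", "count(*)": "16"},
    {"b": "2023-04-06 22:40:00.000", "status": "302", "count(*)": "15"},
    {"b": "2023-04-06 22:40:00.000", "status": "301", "count(*)": "22"},
    {"b": "2023-04-06 22:40:00.000", "status": "404", "count(*)": "2"},
    {"b": "2023-04-06 22:41:00.000", "status": "200", "count(*)": "44"},
    {"b": "2023-04-06 22:42:00.000", "status": "200", "count(*)": "6"},
    {"b": "2023-04-06 22:43:00.000", "status": "200", "count(*)": "4"},
]

summary_0420 = summarize(rows_0420, "bin")
summary_0406 = summarize(rows_0406, "b")
print(summary_0420)  # (179, '2023-04-20 23:46:00.000', 52)
print(summary_0406)  # (150, '2023-04-06 22:40:00.000', 55)
```

Both five-minute totals match the figures quoted above (179 and 150). The busiest single minute in these two runs carried 52 and 55 requests respectively.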
This happened again, during the sandbox run https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/67196.
OK, so this looks to be a regression from 92d583e3.
The fix will be to adjust the rate limit to 200 per minute, or 1000 per 5 minutes.
I did some reading and I think the rate limit is per WAF and client IP, so it applies to all resources protected by a WAF (or Web ACL, as Amazon calls it) in aggregate. If we wanted to apply different rates to indexer and service we could scope down the rate-based rule to the Host header.
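As a rough illustration of the scope-down idea, here is a sketch in Python that builds a rate-based rule in the JSON shape used by the AWS WAFv2 API (`RateBasedStatement` with an optional `ScopeDownStatement` matching the Host header). It is not Azul's actual configuration: the helper name `rate_based_rule`, the rule names, the host names, the `Block` action and the assumption of a five-minute evaluation window are all hypothetical.

```python
# Assumption: the rule is evaluated over a five-minute window, so a budget
# of 200 requests per minute translates to a limit of 1000.
REQUESTS_PER_MINUTE = 200
WINDOW_MINUTES = 5
LIMIT = REQUESTS_PER_MINUTE * WINDOW_MINUTES

def rate_based_rule(name, priority, limit, host=None):
    """Build a WAFv2-style rate-based rule, aggregated per client IP.

    If `host` is given, the rule only counts requests whose Host header
    equals it; otherwise it counts all requests reaching the Web ACL.
    """
    statement = {"Limit": limit, "AggregateKeyType": "IP"}
    if host is not None:
        statement["ScopeDownStatement"] = {
            "ByteMatchStatement": {
                "SearchString": host.encode(),
                "FieldToMatch": {"SingleHeader": {"Name": "host"}},
                "TextTransformations": [{"Priority": 0, "Type": "LOWERCASE"}],
                "PositionalConstraint": "EXACTLY",
            }
        }
    return {
        "Name": name,
        "Priority": priority,
        "Statement": {"RateBasedStatement": statement},
        "Action": {"Block": {}},  # hypothetical choice of action
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }

# One aggregate rule for everything behind the Web ACL ...
aggregate = rate_based_rule("rate_limit", 0, LIMIT)

# ... or separate rules per host (host names are placeholders).
service = rate_based_rule("rate_limit_service", 1, LIMIT, "service.example.org")
indexer = rate_based_rule("rate_limit_indexer", 2, LIMIT, "indexer.example.org")
```

Without the scope-down statement, one counter per client IP covers all resources behind the Web ACL. With it, each host gets its own rule and therefore its own counter and limit.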
We need a fix for this before we can promote to production.
Cancelling spike for @dsotirho-ucsc.
No demo. Passing IT will prove efficacy of fix.
https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/67174