Switch BwdServer to Cloud Run

pbiggar commented 1 year ago

[x] get bwdserver standing up
- [x] yaml
- [x] envvars
- [x] secrets
[x] come up with testing plan
[ ] get it deployed
- [x] ensure it's not world accessible (internal only)
- [x] deploy via terraform
- [x] create terraform file for deployment
- [x] create terraform file for envvars
- [x] add terraform to docker file
- [x] add bash script to run terraform with the deploy vars in question
- [x] handle envvar situation
- [ ] add deployment via CI
- [x] check whether shipit is appropriate
- [x] merge
[ ] make production ready
- [x] should we be using 2nd generation?
- [x] add startup CPU boost
- [ ] ensure performance is up to scratch
- [ ] SQL statements
  - [ ] see if connecting to google cloud private IP is faster
  - [ ] on hold for now since we have to have some downtime to add a private IP. Will see where things sit a little later
  - [ ] ask GCP folks about faster internal tools
- [ ] compute
  - [ ] run benchmarks with FizzBuzz to measure performance
- [ ] httpclient
  - [ ] test default latency matches GKE version
- [ ] disable access to metadata
- [ ] test resolving DNS to exclude internal metadata
- [ ] or look at restricting on the socket connection level
- [ ] ask cloud run or support on best approach
- [ ] use healthchecks and startupcheck
- [ ] limits/resources
- [ ] check it works
- [ ] Check this issue
- [ ] ensure rollbars arrive
- [ ] ensure pubsub events arrive
- [ ] ensure pusher events arrive
- [ ] ensure honeycomb events arrive
- [ ] check that it works (check http params (path, headers, url) arrive ok)
- [ ] check httpclient works
- [ ] check performance is similar/the same (within a few ms)
[ ] expose publicly
- [ ] load balancer
- [ ] cloud armor
- [ ] bwdserver cert
- [ ] check it works
- [ ] check that managed certs work
- [ ] use separate target proxy from bwdserver
- [ ] check that we are able to get a HTTP request instantly once deployed (no 502s in between)
[ ] figure out deployment/migration strategy (https://cloud.google.com/run/docs/migrate/from-kubernetes#strategy)
[ ] migrate cert-manager certs
[ ] switch over DNS

optimizations:

[ ] enable http2/h2c
[ ] do some warmup
[ ] remove extra observability root span

pbiggar commented 1 year ago

Performance testing

Request to BwdServer, which is an ASP.NET/F#/dotnet 6 server. Requests fetch a "hello world" program from the DB, then execute it in an interpreter, saving metrics about the execution in the DB as JSON.

"response" is the HTTP response, which includes times up to "execute_handler". "execute_handler" is our interpreter. "custom_domain", "getTLID", "get oplists cache", "get oplist" and "get secrets" are SQL selects. "pusher" is an HTTP call to pusher.com. "traceResultHook" are SQL updates.

Cloud Run deployment: (avg latency 1800ms)

response 26/71/30/33/38
custom domain: 12/11/2/3/9
getTLID: 2/8/2/2/2
get oplist cache: 3/9/11/4/10
get oplist: 2/13/8/2/2
get secrets: 3/19/2/9/6
execute_handler: 4/9/4/13/9
traceResultHook: 10/9/2/11/8
traceResultHook: 11/26/7/40/30
pusher: 29/19/18/20/23

K8s deployment: (avg latency 500ms)

response 10/9/9/8/7
custom domain: 1/2/1/1/1
getTLID: 1/2/1/1/1
get oplist cache: 1/1/1/1/1
get oplist: 1/1/1/1/1
get secrets: 2/2/1/1/1
execute_handler: 2/1/2/2/2
traceResultHook: 3/3/3/3/3
traceResultHook: 4/4/5/5/4
pusher: 17/18/18/18/20
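
To make the gap between the two runs easier to read, here is a rough sketch that averages a few of the spans above and prints the Cloud Run to K8s ratio. It is plain Python, not part of BwdServer; the dictionary and function names are made up for illustration, and only a subset of the spans is copied in.

```python
from statistics import mean

# Span timings in ms, copied from the two runs above (five requests each).
# Only a few spans are included; the names CLOUD_RUN and K8S are illustrative.
CLOUD_RUN = {
    "response": "26/71/30/33/38",
    "custom domain": "12/11/2/3/9",
    "execute_handler": "4/9/4/13/9",
    "pusher": "29/19/18/20/23",
}
K8S = {
    "response": "10/9/9/8/7",
    "custom domain": "1/2/1/1/1",
    "execute_handler": "2/1/2/2/2",
    "pusher": "17/18/18/18/20",
}


def parse_samples(row):
    """Turn '26/71/30' into [26, 71, 30]."""
    return [int(part) for part in row.split("/")]


def span_means(timings):
    """Average each span's samples, in ms."""
    return {name: mean(parse_samples(row)) for name, row in timings.items()}


def slowdown(slow, fast):
    """Ratio of mean span time in `slow` to mean span time in `fast`."""
    slow_means, fast_means = span_means(slow), span_means(fast)
    return {name: slow_means[name] / fast_means[name] for name in slow_means}


for name, ratio in slowdown(CLOUD_RUN, K8S).items():
    print(f"{name}: {ratio:.1f}x")
```

On these samples the SQL and interpreter spans are several times slower on Cloud Run, while pusher (an external HTTP call) is close to parity.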

pbiggar commented 1 year ago

Testing:

pusher messages didn't make it (traces did make it to the DB though)

pbiggar commented 1 year ago

Pushed CPU to 4:

response 11/17/13/16
custom domain: 2/4/2/2
getTLID: 2/2/1/2
get oplist cache: 2/3/1/2
get oplist: 1/2/1/2
get secrets: 1/3/2/2
execute_handler: 3/3/4/4
traceResultHook: 3/4/5/4
traceResultHook: 5/6/8/25
pusher: 20/21/19/21

pbiggar commented 1 year ago

I speculate that the extra second of latency is due to not using a Global Load Balancer, and will probably be fixed once we do that. Unfortunately that exposes the service to the world, so I will need to either solve the metadata issue or find some other way to prevent users from running code over here until I solve that.

pbiggar commented 1 year ago

I chatted to a few people about this on twitter, and got some good leads:

raw compute should be very good, but things might be slightly slower than on GKE, where we use Tau machines, which have great performance.
network should also be good.
some GCP folks have offered to take a look at our setup in the new year

So the fact that things should be fast is a good reason to dig in a little bit. Initial thoughts:

I think the connection to Cloud SQL goes through the public internet, which is slower than it should be. We can use internal IP addresses and a "Serverless VPC connector" to skip meaningful steps when hitting our DB
Benchmarking a bigger thing (even fizzbuzz) will give more performance info, which will allow me to give GCP folks a more accurate sense of things

pbiggar commented 1 year ago

I've been looking into networking around Cloud Run and GCP in general.

Cloud Run does not run in our VPC, and so does not need to be firewalled off from our resources.

Cloud SQL also does not run in our VPC, but there's work needed to route to it from Cloud Run.

[ ] I don't think we need to connect to our VPC
[x] look at whether .NET HttpClient can limit access to various internal IPs (10. or 192.168.)
[ ] connect to DB via private IP and Serverless VPC Access
- [ ] look into Shared VPC
[ ] Cloud Run may have even better networking available (send email to Cloud Run team)
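
For the "limit access to internal IPs" item, the check itself is small. The sketch below is in Python rather than .NET, so it only illustrates the logic and is not the HttpClient change. The blocked ranges are an assumption: the two prefixes mentioned above, plus 172.16.0.0/12, loopback, and the link-local range that contains the GCP metadata server (169.254.169.254). The function names and the injectable `resolver` parameter are hypothetical.

```python
import ipaddress
import socket

# Assumed deny-list: RFC 1918 private ranges, loopback, and link-local
# (169.254.0.0/16 includes the GCP metadata server at 169.254.169.254).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
        "127.0.0.0/8",
        "169.254.0.0/16",
    )
]


def is_blocked(ip):
    """True if `ip` falls inside any blocked IPv4 range."""
    addr = ipaddress.ip_address(ip)
    # Treat IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) as their IPv4 form.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return any(
        addr in net for net in BLOCKED_NETWORKS if net.version == addr.version
    )


def default_resolver(host):
    """Resolve `host` to the list of IP strings the OS would return."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def host_is_allowed(host, resolver=default_resolver):
    """Allow a host only if every address it resolves to is outside the deny-list."""
    return not any(is_blocked(ip) for ip in resolver(host))
```

One caveat: checking the hostname and then connecting separately leaves a DNS rebinding gap, so a real implementation would need to apply the check to the address the socket actually connects to. That is the "socket connection level" option from the checklist.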

pbiggar commented 1 year ago

[ ] Is it currently safe to expose the service to the internet via a load balancer? We can limit the load balancer to only allow the paul-test-cloudrun host or similar.

pbiggar commented 1 year ago

[ ] Check which IP address is currently used for BwdServer, APIServer and QueueWorker requests.
[ ] What IP is used for Cloud Run? Is it backward compatible for users?
[ ] Seems like CloudRun doesn't have an external IP address for making HTTP requests by default, so we may need to do something
[ ] worth looking briefly at latency to external sites or APIs to ensure it's roughly the same
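
For the last item, a small timing helper run from both environments would be enough for a rough comparison. This is a sketch in Python using only the standard library; the function names are made up, and the target URL in the usage note is a placeholder, not one of our endpoints.

```python
import time
import urllib.request
from statistics import median


def time_call_ms(fn, runs=5):
    """Call `fn` `runs` times and return each wall-clock duration in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return samples


def fetch(url, timeout=10):
    """GET `url` and read the whole body, so the timing covers the full response."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def summarize(samples):
    """Min / median / max of the samples, in ms."""
    return {"min": min(samples), "median": median(samples), "max": max(samples)}
```

Usage would look like `summarize(time_call_ms(lambda: fetch("https://example.com/")))`, run once from a Cloud Run instance and once from a GKE pod. The first sample includes DNS and TLS setup, so the median is probably the fairest number to compare.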

pbiggar commented 1 year ago

Including some traces here because they're interesting