Warning on internal RSG can cause trouble

Sytten commented 3 years ago

We had an outage two weeks ago on Cloud Run, and that is how I learned of the existence of RSG. We experienced latency spikes of up to 20-30s between our service and our database. We were able to limit the problem by reducing the concurrency level to our DB connection pool size and scaling out the service. I think this would be valuable information to add to the FAQ, to slowly demystify the black box. AWS is usually pretty open about the components that make up their services, but I really had to push just to get a root cause for our issue.

The root cause was a recent change that caused the egress traffic load to be unevenly distributed between the RSG (Remote Socket Gateway) canary job and the main job. This overloaded the canary job, which resulted in increased egress internet networking latency.

The software engineers detected the problem and drained the RSG, which provides a proxy for App Engine Standard, Cloud Run, and Cloud Functions apps to make outbound TCP and UDP socket connections. The issue has since been mitigated globally for GAE and Cloud Run.

The product engineering team has added an internal process and validations to avoid similar situations in the future, so there is no risk of recurrence.
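
To illustrate why capping the concurrency level at the DB connection pool size helped, here is a rough back-of-the-envelope sketch. It is a simplified model and not a description of how Cloud Run or RSG behave internally. It assumes every in-flight request needs exactly one connection, each query takes the same time, and waiting requests are served in waves of `pool_size`. The function name and the numbers are hypothetical.

```python
import math


def worst_case_wait(concurrency: int, pool_size: int, query_seconds: float) -> float:
    """Worst-case time a request waits for a free DB connection.

    Simplifying assumptions (not taken from the outage report):
    - each in-flight request holds one connection for query_seconds
    - requests beyond pool_size queue and are served in waves of pool_size
    """
    if concurrency <= 0 or pool_size <= 0:
        raise ValueError("concurrency and pool_size must be positive")
    # Number of full waves that must finish before the last request gets a connection.
    waves_ahead = math.ceil(concurrency / pool_size) - 1
    return waves_ahead * query_seconds


# Hypothetical numbers: slow 3s queries during the incident, pool of 10 connections.
print(worst_case_wait(concurrency=80, pool_size=10, query_seconds=3.0))  # 21.0
print(worst_case_wait(concurrency=10, pool_size=10, query_seconds=3.0))  # 0.0
```

Under these assumptions, once concurrency matches the pool size no request queues for a connection inside the instance, so added network latency is not multiplied by connection waits. Scaling out the service then makes up for the lower per-instance concurrency.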

ahmetb commented 3 years ago

I don't think such service warnings are a fit for this repo. I did not follow this outage very closely, however, serverless platforms need to just work and the entire point is to not worry about what's inside the black box. If you are impacted by what's in the black box, we're likely already working on preventing this from happening again. Hope that helps.

Sytten commented 3 years ago

I don't agree with this vision considering the level of detail already contained in the FAQ (an ABI contract running on gVisor is as important a contract as a network architecture) and Cloud Run is not only used by beginners that don't care how the thing runs. But it's your doc, your choice :) It would do GCP good to be a bit more transparent on how stuff is built/run.

ahmetb commented 3 years ago

I have worked on Cloud Run for over a year now and I am hearing about RSG for the first time here. I don't think those details were supposed to be shared with customers in the first place, as our container contract is documented, as you said in your comment. Problems happening anywhere else are bugs, and we conduct work and post-mortems to prevent them from happening again.

I would also like to remind you that this is a community-maintained repository. So if you are looking for implementation details here, this is not the right place.

ahmetb / cloud-run-faq

Warning on internal RSG can cause trouble #136