503 error returned from Cloud Run running "gcr.io/endpoints-release/endpoints-runtime-serverless:2.23" as base image

GoogleCloudPlatform / esp-v2

A service proxy that provides API management capabilities using Google Service Infrastructure.

https://cloud.google.com/endpoints/

Apache License 2.0


503 error returned from Cloud Run running "gcr.io/endpoints-release/endpoints-runtime-serverless:2.23" as base image #476

Open mejuhi opened 3 years ago

mejuhi commented 3 years ago

This issue has re-occurred in our prod environment.

The architecture is configured to connect Cloud Run and Cloud Functions through Cloud Endpoints, using the following documentation.

The ESPv2 image is running on Cloud Run. The base image used by Cloud Run is gcr.io/endpoints-release/endpoints-runtime-serverless:2.23, which gets built using the following script.

Cloud Run is running with the following configuration:

CPU allocated: 2
Memory allocated: 2048
Concurrency: 80
Request timeout: 800 seconds
Auto-scaling: Min instances: 5
Auto-scaling: Max instances: 1,000

Frequency: We received a few errors yesterday and more a few weeks ago (when the base image used was gcr.io/endpoints-release/endpoints-runtime-serverless:2.22).

No significant logging can be seen in Cloud Trace either. Attaching a screenshot.

Logs from the Cloud Run running ESPv2

It would be really helpful if you could provide some information or an explanation of why this issue is resurfacing. Thank you!

qiwzhang commented 3 years ago

May I summarize the issue as:

ESPv2 deployed in Cloud Run intermittently could not talk to the backend deployed in Cloud Function.
The error frequency is: a few in a day.
With ESPv2 02.22 or later, ESPv2 has a default retry policy of: retry_num: 1

I have re-opened the bug for the original issue.

qiwzhang commented 3 years ago

If your traffic is low (less than 1 request per second), can you turn on debug mode? We hope to get detailed debug info when the problem happens again.

You need to use ESPv2 flag --enable_debug

If you don't use any other ESPv2_ARGS flags, you can set it on the command line in "gcloud run deploy ESPV2_SERVICE" with the flag

--set-env-vars=ESPv2_ARGS=--enable_debug  \

Otherwise, you have to modify the downloaded gcloud_build_image script as follows:

cat <<EOF > Dockerfile
FROM ${BASE_IMAGE}

ENV ENDPOINTS_SERVICE_PATH /etc/endpoints/service.json
COPY service.json \${ENDPOINTS_SERVICE_PATH}

# The leading ^++^ selects ++ as the separator between flags, so commas inside a flag value are kept.
ENV ESPv2_ARGS ^++^--cors_preset=basic++--cors_allow_method="GET,PUT,POST"++--enable_debug

ENTRYPOINT ["/env_start_proxy.py"]
EOF

Please see the details on how to set ESPv2_ARGS.
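To illustrate how several flags end up packed into a single ESPv2_ARGS value like the one in the Dockerfile above, here is a minimal Python sketch. It assumes the `^++^` prefix declares `++` as the separator between flags (the form used in the example above), so that commas inside a flag value are not treated as separators. The helper name `build_espv2_args` is hypothetical and is not part of ESPv2 or gcloud.

```python
def build_espv2_args(flags, delimiter="++"):
    """Join ESPv2 flags into one ESPv2_ARGS value (hypothetical helper).

    A single flag is returned unchanged. Several flags are joined with
    `delimiter` and prefixed with ^delimiter^, mirroring the Dockerfile example.
    """
    if not flags:
        raise ValueError("at least one flag is required")
    for flag in flags:
        # The delimiter must not appear inside a flag, or the value would be split wrongly.
        if delimiter in flag:
            raise ValueError(f"flag {flag!r} contains the delimiter {delimiter!r}")
    if len(flags) == 1:
        return flags[0]
    return f"^{delimiter}^" + delimiter.join(flags)


flags = [
    "--cors_preset=basic",
    '--cors_allow_method="GET,PUT,POST"',
    "--enable_debug",
]
print(build_espv2_args(flags))
print(build_espv2_args(["--enable_debug"]))
```

Generating the value this way keeps the debug flag easy to add or remove without hand-editing the delimiter-separated string.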

mejuhi commented 3 years ago

Since this issue is occurring in our production environment, it won't be feasible for us to add additional arguments. Can you provide us with some more information on "With ESPv2 02.22 or later, ESPv2 has default retry policy policy as: retry_num: 1" and how we can increase the number of retries?

mejuhi commented 3 years ago

After looking a bit further, we discovered the additional flag backend_retry_num, which defaults to 1 and temporarily solved the original issue for us.

Do you think increasing the value of the backend_retry_num flag would help us solve this issue again, since the probability of a request failing on every retry becomes lower?
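The intuition behind this question can be made concrete with a small calculation. This is only a sketch under strong assumptions: each attempt fails independently with the same probability, the failure is of a kind the proxy actually retries, and backend_retry_num counts retries in addition to the initial attempt. Correlated failures (for example, a backend outage) would not follow this model. The failure rate below is a made-up illustrative value, not a measurement from this issue, and `failure_probability` is a hypothetical helper.

```python
def failure_probability(p_attempt_fails, retry_num):
    """Probability that the initial attempt and every retry all fail.

    Assumes independent attempts with an identical failure probability.
    """
    if not 0.0 <= p_attempt_fails <= 1.0:
        raise ValueError("p_attempt_fails must be between 0 and 1")
    if retry_num < 0:
        raise ValueError("retry_num must be non-negative")
    # One initial attempt plus retry_num retries must all fail.
    return p_attempt_fails ** (retry_num + 1)


p = 0.01  # illustrative per-attempt failure rate, not a measured value
for retry_num in (0, 1, 2, 3):
    prob = failure_probability(p, retry_num)
    print(f"retry_num={retry_num}: P(request fails) = {prob:.2e}")
```

Under these assumptions each extra retry multiplies the end-to-end failure probability by the per-attempt rate, so a higher value helps only if the 503 comes from a transient, retried failure.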

nareddyt commented 3 years ago

For this issue, it's not clear if ESPv2 is on the request path. ESPv2 creates spans in a specific format, documented here. The trace screenshot you provided in the issue does not match the ESPv2 format.

Are you sure this HTTP 503 is not from a different service your backend uses? If it's from your backend, then modifying ESPv2 flags will not help.

mejuhi commented 3 years ago

The attached trace screenshot is from the Cloud Run service running the ESPv2 image, which is used to call the Cloud Function for backend processing. I'm not sure why the format differs from what is documented, but I have confirmed that the screenshot was taken from the Cloud Run service using the ESP image where we saw the error.

Cloud Endpoint -> Cloud Run(running ESPv2 image) -> Cloud Function

I will try increasing backend_retry_num to 2 and check whether it solves the issue.