Open mejuhi opened 3 years ago
May I summarize the issue as:
I have re-opened the bug for the origin issue
If your taffic is not big, less then 1 request per second. can you turn on debug mode? Hope to get detail debug info when the problem happen again.
You need to use ESPv2 flag --enable_debug
If you don't use any other ESPV2_ARGS flag, you can set it in CLI "gcloud run deploy ESPV2_SERIVCE" with flag
--set-env-vars=ESPv2_ARGS=--enable_debug \
Or you have to modify the download the gcloud_build_image script as
cat <<EOF > Dockerfile
FROM BASE_IMAGE
ENV ENDPOINTS_SERVICE_PATH /etc/endpoints/service.json
COPY service.json \ENDPOINTS_SERVICE_PATH
ENV ESPv2_ARGS ^++^--cors_preset=basic++--cors_allow_method="GET,PUT,POST"++--enable_debug
ENTRYPOINT ["/env_start_proxy.py"]
EOF
Please see details on how to set ESPV2_ARGS
Since this issue is occurring in our production environment, it wont be feasible for us to add additional arguments. Can you provide us with some more information on _"With ESPv2 02.22 or later, ESPv2 has default retry policy policy as: retrynum: 1" and how can we increase the number of retry policy
After looking bit further, we discovered additional flag of backend_retry_num
which defaults to 1 solved the original issue for us temporarily.
Do you think increasing the value of backend_retry_num
flag to a higher number will help us to solve this issue again because of probability of request failing multiple times in reties becomes less
For this issue, it's not clear if ESPv2 is on the request path. ESPv2 creates spans in a specific format, documented here. The trace screenshot you provided in the issue does not match the ESPv2 format.
Are you sure this HTTP 503 is not from a different service your backend uses? If it's from your backend, then modifying ESPv2 flags will not help.
The trace screenshot attached is from the cloud run running from ESPv2 image and it is being used to call cloud function for backend processing. Not sure why there is a different format then documented. But i have confirmed the creenshot is picked up from the cloud run using esp image where we saw the error
Cloud Endpoint -> Cloud Run(running ESPv2 image) -> Cloud Function
I will try increasing the backend_retry_num
to 2 and check if it solves issue.
This issue has re-occurred in our prod environment.
Architecture is configured to connect cloud run and cloud functions using Cloud Endpoint using the following documentation.
ESPv2 image is running on Cloud run. Base image used by cloud run is
gcr.io/endpoints-release/endpoints-runtime-serverless:2.23
which gets build using the following scriptCloud run is running with following configuration:
CPU allocated: 2 Memory allocated: 2048 Concurrency: 80 Request timeout: 800 seconds Auto-scaling: Min instances: 5 Auto-scaling: Max instances: 1,000
Frequency: We have received few errors yesterday and more few weeks ago (when base image used was
gcr.io/endpoints-release/endpoints-runtime-serverless:2.22
)No significant logging can be seen on cloud trace as well. Attaching screenshot
Logs from the Cloud Run running ESPv2
It will be really helpful if you can provide with some information/explanation of this issue resurfacing. Thankyou!