firebase / firebase-functions

Firebase SDK for Cloud Functions
https://firebase.google.com/docs/functions/
MIT License
Error: The request was aborted because there was no available instance. #962

Closed dimavedenyapin closed 3 years ago

dimavedenyapin commented 3 years ago

This is happening in a production Firebase environment with a Blaze subscription. I started seeing the error "The request was aborted because there was no available instance." on 22nd August at 10pm GMT+8. The error occurs across all functions when I make 100+ invocations, and when it appears it affects all other functions as well (see screenshot). It can happen with any function, with or without the maxInstances parameter set.

All functions are deployed in us-central1. Quotas don't seem to be reaching their limits.

Related issues

[REQUIRED] Version info

node: v12.22.3

firebase-functions: 3.14.1

firebase-tools: 9.16.0

firebase-admin: 9.11.0

[REQUIRED] Test case

Firebase pubsub listener

[REQUIRED] Steps to reproduce

Send 100+ messages to the firebase pubsub

[REQUIRED] Expected behavior

Functions execute.

[REQUIRED] Actual behavior

Functions fail with the message: "The request was aborted because there was no available instance." [screenshot]

Were you able to successfully deploy your functions?

successfully deployed

google-oss-bot commented 3 years ago

I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.

MrAlek commented 3 years ago

We're also experiencing this since August 23rd 6 AM CEST running on Node 14 instances. Same error, across multiple functions on europe-west1.

satvikreddy commented 3 years ago

We are on a Blaze subscription. We've also been experiencing this since August 22nd, running on Node 12 instances in asia-south1. Same error across almost all functions. Both onCall and onRequest functions are failing.

We tried upgrading to Node 14 and setting minInstances=1 and maxInstances=5. The error still persists.

[screenshot]

dimavedenyapin commented 3 years ago

This is still happening for me. It starts at ~10pm GMT+8 and ends after midnight. Initially I thought of migrating to another region, but it seems Asia, Europe and US regions are all affected.

Found that it was happening before in July: https://stackoverflow.com/questions/68284263/google-cloud-function-the-request-was-aborted-because-there-was-no-available-ins

taeold commented 3 years ago

Thanks for reporting the issue here. I noticed that several GCP support tickets have been opened regarding this issue, and I will post any relevant updates here for a wider audience as the GCF team makes progress on it.

dvrfluxchat commented 3 years ago

It's happening in production. We've also been experiencing this since August 23rd, running on Node 12 instances in asia-south1.

It doesn't correlate with load: I run an hourly Pub/Sub function which also fails frequently. Critical messages are being dropped because of this.

duck-dev-go commented 3 years ago

[screenshot]

We are also getting this error in europe-west1. There are just 2 people testing the application right now and this already happens. In about a week it will need to scale to thousands of people. Is there anything I can do about this?

stoy commented 3 years ago

Same error for me. Functions randomly return "The request was aborted because there was no available instance." Until now, I never had this error.

wneild commented 3 years ago

Same issue since ~2021-08-26 20:00 BST.

Node 14, firebase-functions 3.14.1, firebase-admin 9.10.0

It is happening with very minor spikes of fewer than 100 requests across all function deployments.

I'd also like to understand why this very impactful issue being reported by many people isn't reflected on https://status.cloud.google.com/ as being investigated.

Edit: Looks like this is already being tracked here https://issuetracker.google.com/issues/194948300

robin-whg commented 3 years ago

Same here. The requests take way too long. A simple function with a single Firestore write takes up to 6000ms if it isn't aborted. Europe West 1

dimmetrius commented 3 years ago

Same issue since 2021-08-23. Pub/Sub-triggered functions show this error message.

damienromito commented 3 years ago

Same issue, since this morning, and it's getting worse !!! I tried to

I sent a bug report to Firebase...

t0mstah commented 3 years ago

Same issue.

taeold commented 3 years ago

Hi everyone.

Google Cloud Functions (GCF) users as a whole are reporting the same issue described here, and https://issuetracker.google.com/issues/153207649#comment3 is the official response from the GCF team.

tl;dr The GCF Node.js runtime used to silently drop requests when instances couldn't be scaled fast enough to respond to demand. Now it logs the failed request in your project's logs, hence the sudden appearance of the issue (release note). For Pub/Sub-triggered functions, this error is usually handled gracefully by the automatic retry mechanism in the GCF infrastructure. The same can't be said of HTTP-triggered functions, where the failed request is lost unless the client already had a retry mechanism in place.
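
For HTTP-triggered functions, a client-side retry of the kind mentioned above could look roughly like the following TypeScript sketch. The helper names (backoffDelayMs, retryWithBackoff) and the default attempt count and delays are illustrative assumptions, not part of firebase-functions or any Google SDK.

```typescript
// Hypothetical helpers for illustration only; not an official API.

/** Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at capMs. */
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

/** Resolves after `ms` milliseconds. */
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/**
 * Runs `operation` until it resolves or `maxAttempts` attempts have failed.
 * Waits with exponential backoff between attempts and rethrows the last error.
 */
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 5, // assumed default
  baseMs = 200,    // assumed default
  capMs = 5000,    // assumed default
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // No point in sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs, capMs));
      }
    }
  }
  throw lastError;
}
```

A client would wrap its HTTP call, for example retryWithBackoff(() => callMyFunction()), where callMyFunction is a hypothetical function that rejects when the request fails. Retrying is only safe if the endpoint tolerates receiving the same request more than once, and adding random jitter to the delay may help avoid synchronized retries.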

To reduce the occurrence of the once-invisible but now visible "aborted because there was no available instance" errors, the recommendations in https://cloud.google.com/functions/docs/troubleshooting#scalability apply.

I hope this clears up the confusion a bit. I'll leave this ticket open to answer any follow up questions, but since this problem is directly related to Google Cloud Functions and not specific to Firebase Functions, please consider reaching out to GCP support with your project-specific questions.

larssn commented 3 years ago

@taeold But could you clarify what happens to event-driven functions, such as Firebase/Firestore triggers? The docs here guarantee at-least-once execution.

I'm hoping that is still the case. Only a few of our functions are idempotent enough to warrant enabling the retry policy.
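
Because at-least-once delivery means the same event can arrive more than once, one common way to make a handler safe to retry is to record each event ID and skip duplicates. The TypeScript sketch below illustrates the idea; handleOnce and processedEventIds are hypothetical names, and the in-memory Set is only a stand-in for a durable store (for example a database keyed by event ID), since function instances do not share memory.

```typescript
// Hypothetical sketch: in a real deployment this would be a durable store
// shared by all instances, not a per-process Set.
const processedEventIds = new Set<string>();

/**
 * Runs `sideEffect` at most once per `eventId`.
 * Returns true if the side effect ran, false if the event was a duplicate.
 */
function handleOnce(eventId: string, sideEffect: () => void): boolean {
  if (processedEventIds.has(eventId)) {
    return false; // duplicate delivery: skip
  }
  sideEffect();
  // Record the ID only after success, so a failed attempt can still be retried.
  processedEventIds.add(eventId);
  return true;
}
```

In an event-triggered function the ID would typically come from the event context supplied by the runtime. With a real database, the check and the write should happen in a single transaction so that two concurrent deliveries cannot both pass the check.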

taeold commented 3 years ago

@larssn At-least once guarantee applies to all event-driven functions. Are you seeing events from Firebase/Firestore being dropped on your project?

larssn commented 3 years ago

It's not possible to see what data exactly is affected, due to the nature of the error message. I'm just hoping that our triggers are retried.

If so, then I'm wondering why it's necessary to show it as an error, couldn't an "info"-level message suffice?

taeold commented 3 years ago

@larssn Same thoughts on the error message. I think this is a first step toward exposing the log to GCF users, and the team is working on a fix to output the log as a warning if the request is going to be retried.

dgobaud commented 3 years ago

We are also seeing this (https://github.com/firebase/firebase-functions/issues/965). It started in US central on August 24.

It was flooding our logs, so we had to exclude all such messages from alerts.

Google case 28781796

dimavedenyapin commented 3 years ago

@taeold thanks for your research and updates 👍 I am glad to hear that the at-least-once guarantee still applies to all event-driven functions.

Looking forward to the fix for the log output.

npicouet commented 3 years ago

I don't really understand how it's only a silent vs. logging problem. Our app has been running without issue for months, and only NOW are we getting these issues that are actually preventing some functions from running. It feels to me like something actually is going on or has changed, because we have not changed our backend for a while and it's been running smoothly until now.

deepak786 commented 3 years ago

Earlier there were timeout errors but now we are getting these errors. [screenshot]

taeold commented 3 years ago

@npicouet @deepak786 Are these coming from HTTP functions or event-triggered functions? I've been told that these errors are "okay" for event-triggered functions in the sense that Google Cloud infrastructure will retry the failed function invocations. So your events aren't actually being dropped and your application is still guaranteed at-least-once delivery of messages.

I'm sounding like a parrot, and I'm sorry for doing this over and over again - if you are seeing serious issues, please go ahead and contact Google Cloud Support. I'm not aware of any outage that explains the issues being reported here.

deepak786 commented 3 years ago

In this screenshot, all the functions are event-triggered functions.

pedronieto84 commented 3 years ago

In my case the function that has that error message is a function triggered by a "firestore trigger" ".onCreate".

Gbuomprisco commented 3 years ago

This started happening today to me, HTTP Functions

markgoho commented 3 years ago

Adding a log that shows the instance not being available, followed by the retries and the eventual successful run of a scheduled cloud function: [screenshot]

Is everyone only seeing the failures with no successful run?

duck-dev-go commented 3 years ago

I have an HTTP function and when I literally call it twice at the same time, it already shows this error. This happens every time. This could never work for thousands of users.

petermiles commented 3 years ago

Started happening to us yesterday for HTTP functions as well. Across our environments, this is happening quite often.

sanekyy commented 3 years ago

Same issue :(

ChrisForeman commented 3 years ago

Started at 4:30PM EST for me. I'll be in hot water if this persists 😬

I'm confused because the bug tracker says:

"If you have an HTTP triggered function, for this error in particular, in the past you would always receive an error message that was sent back to the calling client but it was not logged within the customer's/user's project"

but the only reason I checked the logs to find this error was because I couldn't understand why all of a sudden our production client app kept getting an error > 50% of the time.

hatboysam commented 3 years ago

I am also experiencing this issue.

taeold commented 3 years ago

Friendly reminder - please reach out to Google Cloud Support if your HTTPS (not event triggered) functions are having problems.

We are keeping this issue open to make it easy for users experiencing the issue described in the original post to understand what's going on and to make the right next steps. Unfortunately, this issue cannot be fixed by code changes in this repository.

gbourne1 commented 3 years ago

Same here - multiple functions keep getting: "The request was aborted because there was no available instance"

Hivemind9000 commented 3 years ago

Happening to me too since the 31st of August, but has dramatically increased in the past few hours (all functions seem to be failing now). Almost all of my cloud functions are failing, and most are non-event triggered (http onRequest or onCall).

All running on Node 10, with no maximum instances set.

Have logged an issue with Google here:

https://issuetracker.google.com/u/1/issues/199180393

Quick Fix:

For anyone having this issue, a full redeploy of our cloud functions seems to have resolved the issue for now (errors gone, monitoring service reporting all functions as up/ok). I will monitor and report back if they reappear.

sebagutierrez commented 3 years ago

We've had this issue for a while now. It seems to be happening more often lately.

TrustyTechSG commented 3 years ago

I'm having the same issue here. It started about a week ago; I had never seen this error in the past year of using Firebase Functions.

mnahta commented 3 years ago

It only started happening a few hours ago on Cloud Functions. I redeployed, but that doesn't seem to help.

ToeFungi commented 3 years ago

Our cloud functions have been able to scale with no issue for over a year. Nothing has changed within our infrastructure but as of late last night 2021-09-07, our functions have begun to fail. This doesn't appear to be related to traffic, cold starts or long running executions. A request will be made to a function and it will fail on every request for several minutes. It will then begin to work and another function will begin to fail.

There definitely seems to be something larger going on here than just revealing logs to Stackdriver.

sanketplus commented 3 years ago

My theory is that they reduced the tolerance for cold starts. For example: earlier they were OK waiting 2s for a cold start, but now they throw this error after just 1s. If you look at (or create) a function with few or no dependencies, basically a helloWorld function, it will not be affected by this.

This issue also coincides with the announcement of min instances for Cloud Functions. Support recommended using this new beta feature but did not have an answer for the cause of this issue.

So they likely changed some configuration in the backend and this is an effect of that. A lot more people are seeing production impact because of this here: https://issuetracker.google.com/issues/194948300 Hope they find and fix this soon 🤞

charlierushton commented 3 years ago

Happening to me too. We had no issues with scaling from the start of our project in March 2020 until the start of last week; since then this issue has happened every few minutes.

Wtrapp commented 3 years ago

Same issue for us. We haven't made any changes to cloud functions.

taeold commented 3 years ago

Hot off the press - Google Cloud Functions in us-central1 did report some problem ~2021-09-08:

https://status.cloud.google.com/incidents/16SSwVXrYSLjy8fEMvyZ

The status report claims that the issue only affected functions deployed in us-central1 and that it is now resolved. If you are still seeing issues, please contact Google Cloud Support.

tolypash commented 3 years ago

I am getting the same issue, after 1 year of using Cloud Functions with no problems (europe-west3).

Now this has been happening out of nowhere for the past 3 weeks. I contacted Firebase support and they asked me to reproduce the problem, but it is impossible to reproduce on demand because it is so random. I explained to them that this happens for all functions, for a few minutes each time (as if it's an outage). Not sure what to do right now :/

sanketplus commented 3 years ago

@tolypash not exactly the solution you'd like, but consider moving out of Google's ecosystem. This is a lesson learned the hard way. Until now I had only heard about poor customer support, but with this issue I have witnessed it.

larssn commented 3 years ago

@sanketplus This is not the place for that.

This thread is dedicated to figuring out if there's a problem, and how to resolve it.

sanketplus commented 3 years ago

@larssn I get what you're saying. OP is saying support is not being helpful, and that was also the case for me when I was helping someone navigate this issue. I think the suggestion of moving away is pragmatic, more so when you have real customer impact that is making you lose money. You do not want to bet your company and its revenue on a cloud company that is having a hard time determining whether there is a problem at all.

Anyway, that is my personal take on this. You are welcome to disagree with it :)

Hivemind9000 commented 3 years ago

@tolypash The problem seemed to go away for 4-5 days (after Google said they found/fixed the issue causing the high frequency of errors) but is now back this morning. That said I've only had 5 errors this morning from approx 5000 calls, and nothing for the past 6 hours (from another 5000 calls) so not as frequent/severe as before. It may have been a transient load issue in the US central region.

I think it might be helpful to read this thread from the Google team:

https://issuetracker.google.com/issues/194948300#comment24 (posted today)
https://issuetracker.google.com/issues/194948300
https://cloud.google.com/functions/docs/concepts/exec#execution_guarantees

My reading is that these errors were occasionally occurring before (primarily due to load spikes), but were not being surfaced as errors. As noted in the issue tracker thread, event-driven functions will retry on error, while HTTP ones don't: it's up to your client to perform retries on errors. At the error rate I am currently experiencing, I think the retry strategy is reasonable (necessary even, since Google states they don't guarantee HTTP executions, and the Internet in general is a relatively unreliable network).

I guess it depends on how many errors you are seeing, and the error ratio compared to successful GCF executions?

u007 commented 3 years ago

I'm having the same issue in asia-northeast1.

YUTOPASO commented 3 years ago

I am getting the same issue on asia-northeast1.

All functions failed between 15:00 and 16:00 today.

My Cloud Function max_instances is set to no limit.