firebase / firebase-functions

Firebase SDK for Cloud Functions
https://firebase.google.com/docs/functions/
MIT License

Error: The request was aborted because there was no available instance. #962

Status: Closed (dimavedenyapin closed this issue 3 years ago)

dimavedenyapin commented 3 years ago

This is happening in a production Firebase environment on a Blaze subscription. I've been seeing the error The request was aborted because there was no available instance. since 22nd August, 10pm GMT+8. The error happens across all functions when I make 100+ invocations, and when it appears it affects all other functions as well (see screenshot). It can happen with any function, whether or not the maxInstances parameter is set.

All functions are deployed in us-central1. Quotas don't appear to be anywhere near their limits.

Related issues

[REQUIRED] Version info

node: v12.22.3

firebase-functions: 3.14.1

firebase-tools: 9.16.0

firebase-admin: 9.11.0

[REQUIRED] Test case

A Firebase Pub/Sub-triggered function.

[REQUIRED] Steps to reproduce

Send 100+ messages to the Firebase Pub/Sub topic.

[REQUIRED] Expected behavior

Functions execute.

[REQUIRED] Actual behavior

Functions fail with the message: The request was aborted because there was no available instance. (screenshot)

Were you able to successfully deploy your functions?

Successfully deployed.

YUTOPASO commented 3 years ago

This error is still going on. (screenshot)

YongjinK commented 3 years ago

@taeold You are lying. This issue is not about an invisible warning. Our game service has been making 100,000 HTTPS calls every day without error for a month. We have been experiencing the same issue reported here since yesterday (asia-northeast1). We have had to handle 6+ purchase failure cases over the past 24 hours because of this issue. Be honest, Google. Tell us what you are doing. US -> EU -> South-east Asia (Singapore) -> North-east Asia (Japan). Something is happening.

dgobaud commented 3 years ago

We used to have a similar problem using Firebase for HTTP serving: not errors, but cold starts causing HTTP requests to take 10+ seconds, meaning our app would often hang on loading and look like it had crashed. It seems a change has turned what used to be cold starts into errors.

The problem with Firebase is that there is no way to control cold starts, unlike with AWS Lambda. Lambda is much smarter about scaling up and down and sending requests to existing instances, whereas with Firebase it is more random. E.g. having a pinger keep an instance alive doesn't really do anything useful.

The solution is to stop using Firebase for HTTP... it really is very bad for it. Switch to App Engine and you can control the scaling a lot more and avoid these problems.

sanketplus commented 3 years ago

+1 @dgobaud, we also tried to keep the functions warm; it isn't helping anymore. Similar to @YongjinK's experience, the failures started suddenly with no change on our side. Something was definitely changed on Google's side, and they are failing to acknowledge it and work on it.

Though a few individuals think this is not a solution, I +1 the suggestion to stop using Cloud Functions, especially when it is resulting in prolonged customer impact.

charlierushton commented 3 years ago

Hi all,

I last saw this issue about a week ago and posted in this thread, and I haven't seen it since.

All I did was edit the function in the Google Cloud Console (not the Firebase Console) and set the minimum instances to 1, rather than leaving min & max empty; see the screenshot below.

After setting this, I haven't seen this error in the functions log. Could be coincidence, I don't know, but this seems to have worked for me. I will post again if it occurs again.

(screenshot)

charlierushton commented 3 years ago

Apologies, my last post is not true. I have just filtered my log and can see that other users of my cloud functions are encountering this error, but not as often as when I first posted.

(screenshot)

sanketplus commented 3 years ago

@charlierushton that solution will work indeed. But then you start paying north of 6 USD per instance. And this is a pre-GA feature, so no stability is guaranteed.

This is what customer support suggested, which defeats the purpose of having serverless functions and paying only for what you use.

src: https://firebase.google.com/docs/functions/manage-functions#min-max-instances

Note: A minimum number of instances kept running incur billing costs at idle rates. Typically, to keep one idle function instance warm costs less than $6.00 a month. The Firebase CLI provides a cost estimate at deployment time for functions with reserved minimum instances. Refer to Cloud Functions Pricing to calculate costs.
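
For reference, the minimum (and maximum) instance count can also be set in code rather than in the console. A minimal sketch, assuming a gen-1 firebase-functions release recent enough to expose minInstances in RuntimeOptions (the 3.14.1 listed above may predate it); the function name and handler are placeholders:

```ts
import * as functions from "firebase-functions";

// Keep one warm instance for a latency-sensitive HTTP function.
// Note: the reserved instance is billed at idle rates (roughly US$6/month).
export const criticalEndpoint = functions
  .runWith({ minInstances: 1, maxInstances: 10 })
  .https.onRequest(async (req, res) => {
    // hypothetical handler body
    res.status(200).send("ok");
  });
```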

dgobaud commented 3 years ago

It seems minInstances is a new feature to help with the cold start problem. I'm guessing something has still changed when scaling up, so cold starts now cause errors instead of just very slow responses.

gbourne1 commented 3 years ago

It looks like the issue is back - dozens of "The request was aborted because there was no available instance."

Wtrapp commented 3 years ago

Yep. This issue is back for us too on us-central

hatboysam commented 3 years ago

I’m getting dozens of these errors again for HTTP functions, it seems even worse than before.

charlierushton commented 3 years ago

@sanketplus you are correct. I checked my billing account this morning to find £20 billed for setting min instances to 1 for 7 days. I set min instances back to zero very quickly.

sceee commented 3 years ago

Can also confirm, last night this error occurred multiple times again on a production function running in us-central1.

The days before that (from last Sunday), those errors did not occur.

akshatflx commented 3 years ago

Have been seeing a lot of such errors in us-central1 since yesterday evening (Sep-14, GMT).

charlierushton commented 3 years ago

I can confirm that after setting min instances back to zero, the issue has returned; I'm seeing a lot of errors in my logs. I'm also receiving a lot of phone calls and support tickets from customers reporting issues they have not had before.

We are losing money processing refunds and spending time on customer support for an issue that is out of our hands.

The times customers reported the issue line up with the times the error appears in the logs. Surely there must be an answer from Google soon?

taeold commented 3 years ago

I hear your frustrations and am truly sorry for all the disruptions.

This issue is affecting all Google Cloud Functions customers, not just Firebase users, and this GitHub issue isn't a communication channel monitored by the Google Cloud Functions team that is actively managing customer issues.

Please follow https://issuetracker.google.com/issues/194948300 for the latest updates, and if you have specific issues in your project you'd like to escalate, please contact GCP support.

We are keeping this GitHub issue open to help redirect users to the appropriate support channels. I'll do my best to escalate reports made here, but again, GCP support is the best way to get the attention of "Google".

shauryaaher commented 3 years ago

I'm at an intermediate level with Cloud Functions for Firebase. Even though I haven't faced this issue, I'd still like to know what to do if I face it in the future.

BTW, my Functions are at us-central1.

Hivemind9000 commented 3 years ago

We had a resurgence of issues yesterday on US Central: about 12 throughout the day (from 15k requests). Implementing retries for HTTP requests helped ensure that we didn't have a service disruption for our customers. Today we've only had 2 errors show up in the past 6 hours (from about 4k requests).

It seems that Google has been struggling intermittently over the past few weeks to keep up with server demand, and I am not confident it will go away completely anytime soon. I recommend that everyone implement a decent retry strategy in your client (if you haven't already). As stated in the docs, HTTP requests are not guaranteed to go through, while event-driven functions will be retried at least once (you can set a flag in the function config to keep retrying for up to 7 days, if needed).

Also consider setting minimum instances to ensure your critical functions will not have to cold start (and therefore request a server from the pool). As mentioned above, this comes at a cost of about US$6/month/function. It might be worth it for some (we are doing this for a few of our most latency-sensitive functions).

Some further reading:

For Pub/Sub / background / event-triggered functions, requests running into this issue will be retried automatically, hence there is no loss of information in terms of requests reaching GCF.

For HTTP-triggered functions, the client is responsible for retrying a request (with the recommended exponential backoff + retry methods) that ends with this error message.

References:

[1] https://cloud.google.com/functions/docs/concepts/exec#execution_guarantees
[2] https://cloud.google.com/functions/docs/bestpractices/retries#semantics_of_retry

Edit: This morning we only had 6 errors, but we're now seeing hundreds of errors on US Central (still happening as of 6pm Australian time).
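
Since client-side retries come up repeatedly in this thread, here is a minimal sketch of a retry wrapper with jittered exponential backoff, assuming a client that can use the Fetch API; the function URL, status handling, and attempt count are illustrative placeholders, not an official recommendation:

```ts
// Retry an HTTPS function call with jittered exponential backoff.
// 429 ("maxInstances saturated") and 5xx ("no available instance")
// are treated as retryable, per the statuses discussed in this thread.
async function callWithRetry(
  url: string,
  init?: RequestInit,
  maxAttempts = 5
): Promise<Response> {
  let delayMs = 500;
  let lastResponse: Response | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    lastResponse = await fetch(url, init);
    const retryable = lastResponse.status === 429 || lastResponse.status >= 500;
    if (!retryable) return lastResponse;
    // Wait with jitter before the next attempt, doubling the delay each time.
    await new Promise((resolve) => setTimeout(resolve, delayMs + Math.random() * delayMs));
    delayMs *= 2;
  }
  return lastResponse!; // give up and return the last failing response
}

// Example usage (placeholder URL):
// const res = await callWithRetry("https://us-central1-<project>.cloudfunctions.net/api");
```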

eliezerbs9 commented 3 years ago

Just now experiencing it for the first time in us-central1.

stevecode21 commented 3 years ago

Same issue here.

iocuydi commented 3 years ago

Same issue, just experienced in us-central1 on two different projects within the past hour.

antoniooi commented 3 years ago

Please fix your Cloud Functions instance scalability issue ASAP. We're not going to pay US$6/month/function for a workaround to your error. Please do not make users responsible for your failure to fulfill GCP's "high scalability" commitment. If causing a problem leads to more profit, will there be any necessity to solve it? Stop asking users to pay for workaround solutions and stop saying it is "worth the money". Just get it done and don't let people lose hope in Google. Thank you.

rajivsubra1981 commented 3 years ago

Same issue faced by us on us-central1. I wonder if the workaround of $6/month/function is actually making the problem worse, with folks buying up instances they wouldn't actually need if there was availability, now exacerbating the shortage of instances.

Hivemind9000 commented 3 years ago

Everyone, if you can please post that you are getting the errors on this thread here:

https://issuetracker.google.com/u/1/issues/199180393

More posts might hopefully get Google's attention...

samodadela commented 3 years ago

We are also seeing similar problems. Java Cloud Functions triggered by Pub/Sub are logging warnings, but in the end all messages were processed. BTW: the warnings don't stop even when max instances (in our case 10) is reached. (screenshot) What's the purpose of the warning anyway? What can I do to get rid of it (apart from setting minInstances, which is not desirable)?

The more serious problem is that CFs triggered by Cloud Scheduler just error and the function never runs. Our imports are triggered by a cron job every day. Before this we had zero problems (for more than a year) with the job being triggered; it never failed to start. This week the jobs failed on 2 consecutive days.

Please check your statement that this is not causing problems. The number of posts here is an indicator that something fishy is going on.

crrobinson14 commented 3 years ago

This thread doesn't seem like it needs more "venting," but the linked issue-tracker posts don't allow commenting so I guess I have no choice but to chime in here.

As I see it, from a developer's perspective the problem here is twofold:

  1. Yes, retrying requests is generally a Good Thing. But despite some side references above, this "need" is very lightly documented in the official Firebase docs. There are lots of claims about how it infinitely scales, but nothing I can find about how it might fail. Please tell me what I'm missing from this doc: https://firebase.google.com/docs/functions/callable. None of the docs include "exponential backoff / retry" wrapper code or any hint that this is going to be a regular problem, and it's certainly not a problem with other FaaS offerings. It's hard to avoid the impression that Google is shifting blame for a genuine problem.

  2. The whole "no change was made, all that's happened is we're logging something you never saw before" claim doesn't wash. As others have reported above, our app is now (last 48 hours, that's why I'm here) suddenly experiencing a dramatic increase in this type of failure. But it's not related to a workload spike. The function our app is calling is in development. It only gets called once an hour at most, by our development team. We're seeing about a 5% failure rate when we do, but it's certainly not workload related. 5% doesn't sound like a lot but we're lucky we caught it in dev, because if we deployed this now we'd be swimming in user complaints.

Honestly I think @samodadela's https://github.com/firebase/firebase-functions/issues/962#issuecomment-920677241 above is my biggest fear. Sure, we can go re-engineer our app to add exponential backoffs to all calls. But Firebase has no mechanism to do that for scheduled functions or firestore triggers. "Just add retry options" CAN'T be the final answer here.

dgobaud commented 3 years ago

Honestly I think @samodadela's #962 (comment) above is my biggest fear. Sure, we can go re-engineer our app to add exponential backoffs to all calls. But Firebase has no mechanism to do that for scheduled functions or firestore triggers. "Just add retry options" CAN'T be the final answer here.

Luckily, scheduled functions and triggers are guaranteed at-least-once delivery, it seems.
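
For the retry flag mentioned earlier in the thread, here is a minimal sketch of opting an event-driven (Pub/Sub) function into automatic retries, assuming the gen-1 firebase-functions RuntimeOptions API; the topic name and handler body are placeholders:

```ts
import * as functions from "firebase-functions";

// Opt a Pub/Sub-triggered function into automatic retries.
// If the handler throws (or returns a rejected promise), the event is
// redelivered and retried (for up to 7 days).
export const processMessage = functions
  .runWith({ failurePolicy: true })
  .pubsub.topic("jobs") // placeholder topic name
  .onPublish(async (message) => {
    const payload = message.json; // throws if the payload isn't valid JSON
    console.log("processing", payload);
    // ...do the real work here; throw on transient failure to trigger a retry
  });
```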

sceee commented 3 years ago

I also second that retrying HTTP requests is generally a good idea, but, as @crrobinson14 mentioned, based on the Firebase docs I did not realize this was something crucial for every app.

Until now, my feeling was: of course HTTP functions can fail to execute if something really goes wrong in GCP, so they might fail in 0.000x% of all executions because of such an error.

And for that very rare case, until now I simply did not handle it in the application code, since when it happens a user could retry the action that triggers the call to the HTTP function.

But if we're now talking about magnitudes of several percent (or even more) of executions suddenly failing because of this error, it's a whole other level than "usually the HTTP invocations work (except for some very rare outages)".

tolypash commented 3 years ago

I would also like to note that "retrying" HTTP requests from the client side is not really possible either, because this issue seems to affect all functions for a certain period of time (ranging from a few seconds to sometimes minutes).

So even when I retry on the client, another error will be thrown unless the retries are spaced minutes apart, which is not practical on the client.

amitrao17 commented 3 years ago

I'm seeing this on a scheduled function that runs once an hour. It gave an error once today and once yesterday. It does appear to be automatically retried, but the function should update the Realtime Database, which is not happening. So the function initially errors, then runs, but not correctly...

skizzo commented 3 years ago

Same thing happening to all of my Firebase projects. Might as well find another job.. :(

dgobaud commented 3 years ago

@amitrao17 > I'm seeing this on a scheduled function that runs once an hour. it gave an error once today and once yesterday. it does appear to be automatically retried but the function should update the realtime database which is not happening. so, the function initially errors, then runs but not correctly...

What should update the Realtime Database? We see the same thing with crons: they error, then they run correctly.

amitrao17 commented 3 years ago

The scheduled function runs every hour, does a 'fetch' to an API to get the weather, then console.logs the result. Finally, it calls '.set' to write the data into the Realtime Database. There are 8 writes to the Realtime Database, which return promises, and the function returns all of them.

I know the function gets called from the console logs, and I also know the database doesn't get updated. So it appears that the function sometimes never runs (with the "Error: The request..." message), and sometimes starts but doesn't let the promises run to completion.

One other thing that is happening: sometimes the function fails to write to the database the first time it is called as well; there are multiple occasions when the function clearly makes the API call but the database does not get the updated values.

I could try to 'await' all the promises before returning...

dgobaud commented 3 years ago

@amitrao17 I think you should definitely await all the promises before returning. I'm guessing part of what is happening is that whatever Google changed is killing functions much more quickly after they finish running, which per the spec is correct, or at least not unexpected. Maybe before, Google let functions hang around longer, so for example your unresolved promises had time to finish even though there was no guarantee of that.
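
A minimal sketch of that pattern for the hourly weather function described above, assuming the gen-1 firebase-functions scheduler API and node-fetch; the API URL and database paths are hypothetical placeholders:

```ts
import * as functions from "firebase-functions";
import * as admin from "firebase-admin";
import fetch from "node-fetch";

admin.initializeApp();

export const hourlyWeather = functions.pubsub
  .schedule("every 60 minutes")
  .onRun(async () => {
    const res = await fetch("https://example.com/weather"); // placeholder API
    const data = (await res.json()) as Record<string, unknown>;
    console.log("weather payload", data);

    // Await every database write before the function returns, so the runtime
    // doesn't terminate the instance while writes are still in flight.
    const db = admin.database();
    const writes = Object.entries(data).map(([key, value]) =>
      db.ref(`weather/${key}`).set(value) // placeholder paths
    );
    await Promise.all(writes);
    return null;
  });
```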

amitrao17 commented 3 years ago

@amitrao17 I think you should definitely await all the promises before returning. I'm guessing part of what is happening is that whatever Google changed is killing functions much more quickly after they finish running, which per the spec is correct, or at least not unexpected. Maybe before, Google let functions hang around longer, so for example your unresolved promises had time to finish even though there was no guarantee of that.

Interesting. I'm new to this (a hardware guy) so my understanding was that letting the promises get finished after returning was fine. Thanks!

rajivsubra1981 commented 3 years ago

Promises must be awaited or returned in any event to guarantee execution in cloud functions. This has been & is the expected behavior prior to this issue.

amitrao17 commented 3 years ago

Promises must be awaited or returned in any event to guarantee execution in cloud functions. This has been & is the expected behavior prior to this issue.

Yes, I was aware I should return them... but it turns out my function was not declared 'async', so that's not good...

dgobaud commented 3 years ago

We originally ran our client facing HTTP service on Firebase before minInstances was supported.

It was completely unusable because of cold starts. Random requests would hit a cold start and take 10+ seconds to return, basically killing our client performance. With Firebase there was no way to control this: the way it scales didn't support trying to keep an instance alive, so e.g. a pinger, which works with Heroku to keep an instance alive, doesn't work.

We changed to App Engine with min_instances and the problem was fixed.

It isn't clear, but it seems Google did something recently to perhaps kill functions more quickly and maybe not let HTTP requests wait for a cold start, so they now just error. I've looked at the Google Issue Trackers, and to me it seems they are just saying this is a logging change, but that seems wrong from what people are seeing:

  1. actual HTTP errors
  2. cloud functions seemingly being killed more quickly

I can't tell from the above, but setting minInstances to 1 should help; people seem to report that it doesn't, though, so it's unclear. If you want this fixed ASAP, since it seems Google isn't going to do anything (everything is probably technically working within spec), you should first set minInstances to 1, and if that doesn't work, switch to App Engine for HTTP with min_instances set to 1.

DzTheRage commented 3 years ago

Unable to comment on the linked issue tracker, so I might as well make my voice heard here as well.

As others have stated in this thread, this issue is not just a logging issue and is causing actual problems for our clients.

My best guess is that there are scalability issues on GCP's side that have not been truly resolved and are causing cloud functions to fail, such as this incident: https://status.cloud.google.com/incidents/16SSwVXrYSLjy8fEMvyZ

antoniooi commented 3 years ago

Don't tell me this is just another of Google's familiar moves: they started Cloud Functions for free (with a generous 2 million invocation quota) via the nice packaging of Firebase, then later used Cloud Build and Container Registry as a reason to force everyone onto the Blaze plan (activating billing plus a minor storage charge during GCF deployment), and now it's US$6/month/function to increase instances, or else your Cloud Functions no longer scale. So they'll say this is still "pay as you go" because your traffic increased and you need to pay for more instances for your app to scale well? If that is the case, then this issue is more a high-level Google management issue than just a "technical issue". I hope Google isn't just manufacturing the problem so its pricing structure can move one step closer to greater profitability.

mbleigh commented 3 years ago

Hi folks, we appreciate all the passionate feedback and we can tell how important this issue is to you all. The Cloud Functions team is well aware of the issue and several internal discussions and investigations are happening now. The best place to keep track of ongoing updates related to this error is here: https://issuetracker.google.com/u/1/issues/194948300

To reiterate what has been said before: there has been no intentional change in behavior to the scheduling of instances for Cloud Functions. What has changed recently is that a log line is being written when a request fails due to no instance being available. When that results in a 429 error, it's because all instances specified via maxInstances are saturated. When that log is a 500 error, it's because of internal unavailability of instances.

I see from many of your comments that this seems to be happening much more frequently than you expect, and on functions/instances with low traffic and no maxInstances set. If this is the case, please contact Google Cloud support as it will require digging into the specifics of what's happening in your project to diagnose.

I'm closing this thread to further comments, as there's nothing specific to the Firebase SDK for Cloud Functions that can be done to address problems related to this error message. The other members of the Firebase team and I will continue to follow up on this internally to make sure that all efforts are being made to get to the bottom of any regressions in instance scheduling behavior.

mbleigh commented 2 years ago

Hey folks, we have identified and resolved a potential cause of increased errors of this type. A sampling of customers leads us to believe that these errors may have been reduced significantly. You will still see errors of this kind where requests previously failed silently, but these errors should be rare.

If you are still experiencing this error frequently, please reach out to support as outlined above.