Nike-Inc / hal

hal provides an AWS Lambda Custom Runtime environment for your Haskell applications.
BSD 3-Clause "New" or "Revised" License

Error in RuntimeClient.hs 69:7, Runtime exited with Error 1 #54

Closed: devcoder007 closed this issue 4 years ago

devcoder007 commented 4 years ago

Lambda logs:

error, called at src/AWS/Lambda/RuntimeClient.hs:69:7 in hal-0.3.0-:AWS.Lambda.RuntimeClient

RequestId: Error: Runtime exited with error: exit status 1 Runtime.ExitError

IamfromSpace commented 4 years ago

Thanks for reporting the issue! Looks like at the very minimum we’re gonna need to add some better error messages, once we track it down.

Questions for more context:

Was this running within a lambda, are these logs from CloudWatch?

What was the type of the event expected, and what was the JSON invoked with?

devcoder007 commented 4 years ago

> Thanks for reporting the issue! Looks like at the very minimum we're gonna need to add some better error messages, once we track it down.
>
> Questions for more context:
>
> Was this running within a lambda, are these logs from CloudWatch?
>
> What was the type of the event expected, and what was the JSON invoked with?

Yes, this is running in Lambda, and these logs are from CloudWatch. The Lambda is triggered by SQS, which receives messages for all events created in an S3 bucket. The Lambda performs some operations and saves the results back to a new S3 bucket.

IamfromSpace commented 4 years ago

Very cool! We’ll try to get this sorted. And thanks for the details.

From digging in on this, I think the likely issue is that the JSON decode is failing. In the current version, that all gets rolled up into the unrecoverable HTTP errors that should really never occur. That's a bit silly, but it's an awkward problem: in 99% of Lambda use cases the payload is reliably decodable, and a decode failure is fatal. But not always, and the error needs to be helpful.

The answer we came up with has yet to merge, but I can revisit it. Essentially we add an un-coerced runtime that gives the user direct access to the JSON Value that came in. All other runtimes get built on top, the decode error is now only fatal if you want it to be, and the error message is better.

Can you try the value-runtime branch and see the results? I think the messaging is better out of the box, and if you use the Value runtime you can decode the payload by whatever means you like. I'll give it a once-over and see if we can push it through as well.
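
For anyone following along, here is a minimal sketch of what handling the raw JSON yourself could look like. The AWS.Lambda.Runtime.Value module and its mRuntime entry point are assumptions based on the branch description above (the merged API may differ); the aeson decoding itself is standard, and MyEvent is a made-up event shape.

{-# LANGUAGE DeriveGeneric #-}

module Main where

-- NOTE: the Value-based entry point is assumed from the description of the
-- value-runtime branch; verify the module/function name against the merged API.
import qualified AWS.Lambda.Runtime.Value as ValueRuntime
import           Data.Aeson               (FromJSON, Result (Error, Success),
                                           Value, fromJSON)
import           GHC.Generics             (Generic)

-- Hypothetical event shape, purely for illustration.
data MyEvent = MyEvent { bucket :: String, key :: String }
  deriving (Generic, Show)

instance FromJSON MyEvent

-- The handler receives the raw aeson Value, so a decode failure is ours to
-- handle (log it, degrade, or fail loudly) instead of the runtime exiting
-- with an opaque RuntimeClient error.
handler :: Value -> IO String
handler raw =
  case fromJSON raw of
    Error msg     -> pure ("could not decode event: " <> msg)
    Success event -> pure ("processing " <> bucket event <> "/" <> key event)

main :: IO ()
main = ValueRuntime.mRuntime handler  -- assumed signature: (Value -> IO result) -> IO ()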

devcoder007 commented 4 years ago

> Very cool! We'll try to get this sorted. And thanks for the details.
>
> From digging in on this, I think the likely issue is that the JSON decode is failing. In the current version, that all gets rolled up into the unrecoverable HTTP errors that should really never occur. That's a bit silly, but it's an awkward problem: in 99% of Lambda use cases the payload is reliably decodable, and a decode failure is fatal. But not always, and the error needs to be helpful.
>
> The answer we came up with has yet to merge, but I can revisit it. Essentially we add an un-coerced runtime that gives the user direct access to the JSON Value that came in. All other runtimes get built on top, the decode error is now only fatal if you want it to be, and the error message is better.
>
> Can you try the value-runtime branch and see the results? I think the messaging is better out of the box, and if you use the Value runtime you can decode the payload by whatever means you like. I'll give it a once-over and see if we can push it through as well.

Just to update, the current version I'm using is hal == 0.3.0 along with AWS_LAMBDA_RUNTIME_API == 2

IamfromSpace commented 4 years ago

Interesting, that could be the problem. The AWS_LAMBDA_RUNTIME_API env var is a private variable that injects the host and base path for the internal event server running as a sidecar in the lambda container.

That’s not a value you want to set, the provided runtime will do that for you. I’m not sure which one will take priority though.

It may make sense to add guards around that var in particular. We can give insight that they may not be running in a Lambda if it's absent, or that they may have overwritten it if it's not a URL. Not everything is detectable though.

Can I ask what steered you towards setting this variable? In theory it’s something a user of this lib shouldn’t even know about.

Edit: and it may not be the problem at all. There’s some chance that it’s ignoring what you gave it.
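
For context, here is a rough sketch of what a custom runtime does with that variable, per the Lambda Runtime API docs (illustrative only, not hal's actual code): the Lambda service injects a host:port, and the runtime long-polls the next-invocation endpoint on it.

import Network.HTTP.Client (defaultManagerSettings, httpLbs, newManager,
                            parseRequest, responseBody, responseStatus)
import System.Environment  (lookupEnv)

main :: IO ()
main = do
  maybeApi <- lookupEnv "AWS_LAMBDA_RUNTIME_API"  -- e.g. "127.0.0.1:9001", set by the Lambda service
  case maybeApi of
    Nothing  -> error "AWS_LAMBDA_RUNTIME_API unset: probably not running inside Lambda"
    Just api -> do
      manager  <- newManager defaultManagerSettings
      request  <- parseRequest ("http://" <> api <> "/2018-06-01/runtime/invocation/next")
      response <- httpLbs request manager
      -- The response body is the invocation event; headers carry the request id, deadline, etc.
      print (responseStatus response)
      print (responseBody response)

Overriding that variable by hand changes which event server the runtime polls, which is why setting it manually is risky.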

pwm commented 4 years ago

Just chiming in that we started getting the same error today, after running without issues for months:

START RequestId: 80cf646d-59e1-44e6-9654-930dc4a56a8c Version: $LATEST
bootstrap: Unexpected Runtime Error:  Could not retrieve next event.
CallStack (from HasCallStack):
error, called at src/AWS/Lambda/RuntimeClient.hs:71:7 in hal-0.4.1-LBsjuOlqSjk7XvfjzBNc8y:AWS.Lambda.RuntimeClient
END RequestId: 80cf646d-59e1-44e6-9654-930dc4a56a8c
REPORT RequestId: 80cf646d-59e1-44e6-9654-930dc4a56a8c  Duration: 55.69 ms  Billed Duration: 100 ms Memory Size: 512 MB Max Memory Used: 75 MB
XRAY TraceId: 1-5f1896ba-cdab2e10b019cf087c5e42d0   SegmentId: 1e4e94cf4870b780 Sampled: true
RequestId: 80cf646d-59e1-44e6-9654-930dc4a56a8c Error: Runtime exited with error: exit status 1
Runtime.ExitError

I have not yet had time to look into it, but I'm wondering if anything changed underneath us, i.e. with Lambda itself?

IamfromSpace commented 4 years ago

Thanks @pwm that’s definitely interesting. I’ve got a couple lambdas in use, and they all checked out just fine.

Possibly the SQS event definition changed? Guarantees around Lambda events are not so clear. @pwm are you using the SQS event too? Just trying to look for any commonality there. @dogonthehorizon do you still have SQS lambdas, and if so, are you seeing issues?

If that is it, it’s probably an easy patch to get out the door and unblock folks. And then we can try to get better error messages after.

devcoder007 commented 4 years ago

> Just chiming in that we started getting the same error today, after running without issues for months:
>
> START RequestId: 80cf646d-59e1-44e6-9654-930dc4a56a8c Version: $LATEST
> bootstrap: Unexpected Runtime Error:  Could not retrieve next event.
> CallStack (from HasCallStack):
> error, called at src/AWS/Lambda/RuntimeClient.hs:71:7 in hal-0.4.1-LBsjuOlqSjk7XvfjzBNc8y:AWS.Lambda.RuntimeClient
> END RequestId: 80cf646d-59e1-44e6-9654-930dc4a56a8c
> REPORT RequestId: 80cf646d-59e1-44e6-9654-930dc4a56a8c  Duration: 55.69 ms  Billed Duration: 100 ms Memory Size: 512 MB Max Memory Used: 75 MB
> XRAY TraceId: 1-5f1896ba-cdab2e10b019cf087c5e42d0   SegmentId: 1e4e94cf4870b780 Sampled: true
> RequestId: 80cf646d-59e1-44e6-9654-930dc4a56a8c Error: Runtime exited with error: exit status 1
> Runtime.ExitError
>
> I have not yet had time to look into it, but I'm wondering if anything changed underneath us, i.e. with Lambda itself?

Thanks for sharing, @pwm. The same happened with us: the services have been running for the last 8 months without error and started hitting this issue yesterday.

pwm commented 4 years ago

@IamfromSpace Ours I'm afraid has no SQS. It's a "fat" lambda sitting behind API Gateway, writing to DynamoDB + S3 and talking to a few other internal lambdas and external APIs. It's also talking to Cognito. Region is eu-west-2 (London) in case that matters.

@devcoder007 From what I can tell it started happening to us today and seems pretty random :/ Luckily our system is not yet live although it'll be very soon.

devcoder007 commented 4 years ago

In our case, out of the 500 files we're processing, it creates the issue for a random 5-10 files.

IamfromSpace commented 4 years ago

Oh, interesting. The randomness is even more surprising. @pwm did you manually define FromJSON instances or did you use the feature branch for the API Gateway events? Or are you doing a mapping in the APIG to a custom event?

@pwm @devcoder007 are either of you seeing long execution times for these? It should retry, so I’d expect at least 1-1.5s.

IamfromSpace commented 4 years ago

Okay, I just replicated this in my own lambdas—it’s definitely intermittent but appears widespread. Damn.

This gives me some ability to debug though, so that’s at least a positive. I’ll drill into this further and see what I can find.

I’m curious if other provided runtimes are facing this. I’ll maybe poke around.

mbj commented 4 years ago

@IamfromSpace I personally think it's an unexpected status code, when reading the lambda event, that did not show up before.

And that should "actually" be retried. I've made a PR to add more verbosity: https://github.com/Nike-Inc/hal/pull/55

Running that PR in production right now to get the full status code that fails the check.

IamfromSpace commented 4 years ago

Yep, you're correct. It appears to be a 403:

Response {responseStatus = Status {statusCode = 403, statusMessage = "Forbidden"}, responseVersion = HTTP/1.1, responseHeaders = [("Content-Type","application/json"),("Date","Thu, 23 Jul 2020 00:06:07 GMT"),("Content-Length","68")], responseBody = Object (fromList [("errorType",String "InvalidStateTransition"),("errorMessage",String "Not Allowed")]), responseCookieJar = CJ {expose = []}, responseClose' = ResponseClose}

Seems like maybe next is being called before it's willing to serve the new event, or before it's done handling the prior one? That's odd; I don't remember ever seeing this behavior before. It does seem like something changed, maybe stricter enforcement. I'd be very surprised if there was a race condition. It's a bit annoying that it's a 403, because that's not really a retry-able status in general. I'm not sure if there's a better answer than just retrying on all non-200s.

Edit: I'm gonna try to get a patch out for this tonight, will do a slight bit more digging to ensure the patch doesn't do anything silly.
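
A sketch of the "retry on all non-200s" idea under discussion (the shape only, using http-client/http-types types, not the actual patch):

import Network.HTTP.Client       (Response, responseStatus)
import Network.HTTP.Types.Status (statusIsSuccessful)

-- Re-run an HTTP action until it returns a 2xx or the attempt budget is
-- spent; the last response is returned either way. No backoff, and no
-- distinction between a 403 and any other failure status.
retryUntil2xx :: Int -> IO (Response body) -> IO (Response body)
retryUntil2xx attempts act = go attempts
  where
    go n = do
      resp <- act
      if statusIsSuccessful (responseStatus resp) || n <= 1
        then pure resp
        else go (n - 1)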

mbj commented 4 years ago

@IamfromSpace It would not be the first time that "Haskell is faster than expected" ends up exposing race conditions in established software.

Let's assume the Lambda engine is implemented this way:

  1. Event received
  2. Concurrently: a) start the container (or re-use it if running) so it picks up the new event; b) put the event into the place where the container can pick it up

Now let's assume AWS made an infrastructure change that makes 2a) faster, and the previously ever-present "race" with 2b) is now (by chance) "won" by Haskell. So Haskell requests the next event before 2b) has actually made it available.

And AWS devs typically test against JS / dynamic languages that reach the RPC to get the event more slowly. Hence it was not found.

Under this assumption the 403 should be retried, but the AWS team should also be notified that they've got a potential race condition.

To verify my hypothesis:

mbj commented 4 years ago

BTW: I get this error more often at lower activity levels in the regions I deploy the lambdas to, which would indicate it's relative to the load of the Lambda engine / physical hosts. On less-loaded hosts, I assume, there is a higher chance of triggering the race condition.

IamfromSpace commented 4 years ago

I've certainly noticed that typically the request for the next event is already being sent before the container sleeps. I would be surprised though if the Rust provided runtime wasn't also seeing something similar. In a cursory browse, I didn't see anything that would handle this.

My theory at this point is that the behavior between receipt of a successful response and availability of the next event changed. Prior it was a network error (retries were added after observing this), and now it is a 403.

Infrequency does seem to have some sort of effect. Giving 30 seconds or so between invocations seemed to increase the likelihood of triggering the error. Hard to tell though. It at first seems a bit backwards, because why would a delay impact the GET which was actually issued at the tail end of the last invocation? It's possible though that the next event payload is a bit slower to become available if the container isn't used for a bit. Again, guessing here.

I've created a branch that retries on 403 which backports to all prior minor versions. I'd like to merge and publish those ASAP, but frustratingly, my tests with this branch have yet to trigger a single retry or error. I'd like to confirm before this is merged.

In the interim, these commits can be used and hopefully we can confirm that this covers it:

0.4.2: 24a093d43986b38236eeb8bdde706e5da494dec9
0.3.1: a35f0132aff86607074f5cb90b29f526cf7e77ea
0.2.1: 61feab9038bb731bb9c5cdc29e66901d16760f1c
0.1.3: fbee59fca87de8a5aa54c34ad8b2d4412df62043

IamfromSpace commented 4 years ago

This issue is making my head spin. I seem to have definitive logs indicating that yesterday the GET next event was returning 403s, but it now seems to be back to HttpException behavior (I also see that retries in general could probably be avoided, but that's a different topic). As of late last evening I'm finding it essentially impossible to recreate the issue.

The init_error now does give 403s (seems like that call is redundant), but in theory it couldn't have been a 403 yesterday and resulted in the logs we saw.

@mbj @devcoder007 @pwm are you still seeing this error?

devcoder007 commented 4 years ago

@IamfromSpace I'm still getting this error

IamfromSpace commented 4 years ago

Shoot. Thanks @devcoder007. I’m still digging, but I don’t expect that PR to solve it at this point. I’ll try to keep updates coming as I have them.

IamfromSpace commented 4 years ago

Ah! Okay, this PR does work. I was finally able to observe it in action, and it backs up the original hypothesis. I've finally got a fully consistent picture here, so I'm confident that this can move forward and fix the issue. There are also some good follow-up issues that I'll make.

I was able to validate seeing from one GET next event:

There's a way to prevent timeouts (I'll make an issue), but then there would just be another Forbidden in its place. This behavior must be new, because we would have seen these issues before. With the retry, we now get a success and prevent the erroneous exit.

Thanks for the patience and help everyone, I'm going to get these versions published ASAP and hopefully that resolves everything.

mbj commented 4 years ago

@IamfromSpace What is the resource you used to implement /next? I was searching and found multiple. I know some people at AWS whom I'd like to point at this to get the docs fixed.

IamfromSpace commented 4 years ago

@mbj @pwm @devcoder007 0.4.2 and 0.3.1 are available on Hackage now. For the prior versions the automation is a bit off, so it will take me more time to publish them. I'm fairly certain that things are backwards compatible (or have a very simple upgrade path) if you're on 0.2.x or 0.1.x.

@mbj In creating this I think I read every resource out there, partly because there's not a lot, haha. This is as close as it gets to a spec: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-api.html#runtimes-api-next and there are other sibling docs that have most of the details. Ideally it would have more information on status codes.

mbj commented 4 years ago

@IamfromSpace OK, I was hoping you'd know "more" than these. I'm compiling my report for my AWS contacts, who are unfortunately not on the Lambda team.

Thanks for your patches!

IamfromSpace commented 4 years ago

Alrighty, for silly reasons we're jumping straight to 0.1.4 and 0.2.2 (ultimately this makes future backports easier and gets these out the door). They're now up on Hackage. Let me know if there are any issues! Hopefully these cover it and we can close this :)

pwm commented 4 years ago

@IamfromSpace many thanks for the super quick turnaround time on this issue! I've just deployed our lambdas with 0.4.2, will report back if we see any issues.

mbj commented 4 years ago

@IamfromSpace I got one new instance so far on 0.4.2.

error, called at src/AWS/Lambda/RuntimeClient.hs:71:7 in hal-0.4.2-2SU1Dr4yHmkE4Ndt3Z9Eaw:AWS.Lambda.RuntimeClient

mbj commented 4 years ago

But the relative frequency is far lower now.

IamfromSpace commented 4 years ago

Gotcha, thanks for the upgrades and report. I think I can fix that too; the lower frequency was important to see.

Before, we retried up to twice, and I'm fairly certain that the second try always succeeded. Since 403s are eating into the count now (and for an indeterminate amount of time), I'd considered upping the count; I suppose I should have gone with my gut.

The real answer, though, I think is to add exponential backoff, and that's not hard, so I'll try to roll that out soon. The retry windows are so tight right now that they may actually be a contributor.
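
For illustration, a minimal sketch of exponential backoff between retries (an assumed shape, not the change that actually landed):

import Control.Concurrent (threadDelay)

-- Retry an action that may fail, sleeping 2^k * baseMicros microseconds
-- before attempt k+1, and giving up after maxAttempts attempts.
retryWithBackoff :: Int -> Int -> IO (Either e a) -> IO (Either e a)
retryWithBackoff maxAttempts baseMicros act = go 0
  where
    go k = do
      result <- act
      case result of
        Right a -> pure (Right a)
        Left e
          | k + 1 >= maxAttempts -> pure (Left e)
          | otherwise -> do
              threadDelay (baseMicros * 2 ^ k)  -- e.g. 50 ms, 100 ms, 200 ms, ...
              go (k + 1)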

mbj commented 4 years ago

@IamfromSpace I added a bunch of debug prints in my local version and will report back with more info.

IamfromSpace commented 4 years ago

Alrighty, I feel like I've got a bit more insight into this. I think the core issue is that I should not have let any requests have timeouts, which is mentioned in the docs. At this point, my theory is that while the runtime used to answer multiple concurrent requests, it now will give a 403 to a second request while it's handling one. It doesn't have any issue handling sequential repetitions. I could probably test this further.

Tackling this is just a bit more invasive, because it means explicit management of an HTTP manager, but it is the plan! My expectation is that retries will become basically a moot point.

In the meantime, I was also able to spot scenarios that required 3 retries and confirm that the current pull request fixes them. It looks like the worst-case scenario is hitting an old container after the 30s timeout window; very roughly, about 4% of these invocations needed more than 3 attempts.
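
For reference, disabling the response timeout with an explicitly managed connection looks roughly like this (the fields are real http-client API; how hal threads the Manager through RuntimeClient is not shown and may differ):

import Network.HTTP.Client (Manager, defaultManagerSettings, managerResponseTimeout,
                            newManager, responseTimeoutNone)

-- One shared Manager whose requests never time out, so a frozen container can
-- sit on the long-poll GET /next indefinitely instead of timing out after 30s.
mkRuntimeManager :: IO Manager
mkRuntimeManager =
  newManager defaultManagerSettings { managerResponseTimeout = responseTimeoutNone }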

IamfromSpace commented 4 years ago

Kicked off the build for 0.4.3, should be available soon. Not backporting this one for now given that the timeout fix is coming too and figured I'd reduce noise. However, if anyone has a specific minor version they need supported, let me know, it's easy to do.

devcoder007 commented 4 years ago

@IamfromSpace I'm working with 0.3.1, so I need the fix for this version also if possible.

IamfromSpace commented 4 years ago

@devcoder007 done! Both 0.4.3 and 0.3.2 are available now. You may consider 0.4.3 as well; there are no breaking changes between them. That said, my philosophy is to backport whenever possible, so I'm always happy to add more as needed.

mbj commented 4 years ago

@IamfromSpace My extended debug prints revealed the source of the problem: Timeouts.

The root fix is: Remove all timeouts on GET /next, and remove retries, no joke.

Let me elaborate:

"We think" that a Lambda container only gets booted when there is an event to process. But this is actually not the case. It can happen that Lambda boots a container but does NOT dispatch an event to it, then freezes the container for a longer time. This typically happens when the Lambda management engine predicts it needs more concurrency but was wrong, or when Lambda simply wants to have more "ready to go" containers stockpiled.

Should such a "stockpiled" container then get a request, it gets unfrozen and the pending GET /next immediately times out if the freeze lasted longer than the 30s default timeout. In this case the Lambda engine actually tried to write the event to the unfrozen container, but hal, having been frozen for longer than 30s, refuses to read it and instead retries.

The lambda engine eventually detects that the runtime did not fully read the event, and re-dispatches it. If our retry goes in before such a re-dispatch we get the 403.

So you are correct: Remove the timeout, use an explicit connection manager. And remove the retries!

IamfromSpace commented 4 years ago

@mbj already working on it :) My recent tests repeatedly called a lambda every 32s, which reliably forced the issue, and I observed the exact same pattern: Timeout, n * Forbidden, Success. I think concurrent requests were tolerated before and have recently become forbidden.

I'm hoping to tune the connection manager here as well. I'll leave the timeouts for the time being (can't hurt), but my expectation is that they'll basically never happen. Since we've at least got it under control, I'll take a bit more time with this.

I'm glad we're both thinking the same thing, that certainly makes me more confident in going forward. Thanks for diving into this as well!

mbj commented 4 years ago

> I'm hoping to tune the connection manager here as well. I'll leave the timeouts for the time being (can't hurt), but my expectation is that they'll basically never happen. Since we've at least got it under control, I'll take a bit more time with this.

My entire point is about the timeouts. If you remove them you can remove all retries. I'm running such a version already in production with 0% problem rate.

IamfromSpace commented 4 years ago

@mbj Understood! Just to clarify when I say "tune the connection manager," I'm talking about its connection reaping behaviors, not its timeouts.

I completely agree that removal of timeout likely makes the retries redundant. All the same, my plan is to keep them around for now, simply out of caution.

mbj commented 4 years ago

> I completely agree that removal of timeout likely makes the retries redundant. All the same, my plan is to keep them around for now, simply out of caution.

I think this caution can backfire: since they should not be required at all, they can hide other problems (and apparently did before).

IamfromSpace commented 4 years ago

It's a reasonable concern. My thoughts (and sort of a retrospective on this issue) are:

  1. I knew of the root cause, but did not fix it correctly in the first place. I knew about these timeouts (broadly, as network errors), so they weren't really "hidden." However, I failed to fully investigate and understand the problem. That's fully my responsibility. I took a lazy approach and said: "it's HTTP, of course it needs retries," and looked no further (but was quite rigorous otherwise, at least).
  2. I found my original download of the OpenAPI 3.0 doc, which did include a 403 on the GET at that time. I never observed it and assumed that it would only occur if the runtime was doing something odd (half-right). This probably should have been an explicitly handled status in the first place (and retry seems reasonable).
  3. I believe this is ultimately a behavioral contract break. The AWS documentation that specified no timeouts was added many months after this library was released. I know some would argue that this was always in the realm of undefined behavior and subject to change, so I won't push the point, but this falls under my personal definition of a breaking change.
  4. I'm relying a bit on memory here, but retries only being required after the timeout had expired does not match my recollection of when this was released. Many of my initial tests involved 1Hz short executions served by a single container. I may be misremembering, but I feel positive that errors were observed in those scenarios. Maybe retries aren't needed now, but this is inconsistent with my recollection of my testing and validations.
  5. I want this lib to be robust. I want to do everything I can to ensure that things simply don't go wrong. Not all will agree, but with a library like this, I would rather hide an error from users and allow the application to function--assuming that it can truly have no impact. I believe retries fit that bill. Even local HTTP can fail once in a while; I'd rather the user never see that. It may make sense for my personal applications to omit these safe-guards and be more instrumented (with a downside of not matching production code), so they are at least not hidden from me.

All these things steer me towards keeping the retries. I hope this is at least a convincing case that it's not unreasonable to keep them, even if it's not the route you'd choose.

mbj commented 4 years ago

@IamfromSpace Thank you for your verbose elaboration. I'll keep my debug prints for now to see how often retries trigger.