Nike-Inc / hal

hal provides an AWS Lambda Custom Runtime environment for your Haskell applications.
BSD 3-Clause "New" or "Revised" License
240 stars 13 forks source link

Fix error handling in pure runtimes #130

Closed kokobd closed 11 months ago

kokobd commented 1 year ago

There were two problems with pure runtimes:

  1. When returning Left from fallibleRuntime, Left appears literally in the response, and the lambda function exit successfully
  2. If the input can't be parsed from json, the error won't be caught properly. The error message will only appear in CloudWatch logs, but not reported back to AWS Lambda. Thus, it's hard for caller to determine what error happened. Fixes #108

Screenshot for the two problems, respectively:

image

Reasons for these problems:

  1. We didn't handle Either in falliableRuntimeWithContext
    
    fallibleRuntimeWithContext :: ToJSON result =>
    (LambdaContext -> Value -> Either String result) -> IO ()
    fallibleRuntimeWithContext = mRuntimeWithContext . fmap (fmap pure)

pureRuntimeWithContext :: ToJSON result => (LambdaContext -> Value -> result) -> IO () pureRuntimeWithContext = mRuntimeWithContext . fmap (fmap pure)


`Either String result` wasn't handled, and it was treated as a `ToJSON` value as a whole.

2. In `withInfallibleParse fn = either error fn . parseEither parseJSON`, the `Left` side of `Either` was thrown by `error` in a pure context. When we later call `try`, the value is still lazy and possibly not evaluated.

This PR addresses both problems.
kokobd commented 1 year ago

I don't quite see how this PR addresses the laziness fix you described in point №2. can you please spell that out for me? I would've expected to see some kind of force-ish function call in runtimeLoop.

There was no NFData constraint, so adding that would break backward-compatibility. The key is this line fallibleRuntimeWithContext fn = mRuntimeWithContext (\lc -> either error pure . fn lc). Here the type of error is String -> IO result. So the error is thrown whenever the IO action runs. I really want to create a custom exception type UserError (given that we are reporing the type of error as User to AWS), and use throwIO to replace error everywhere.

Can you please also add a changelog entry?

I'll.

IamfromSpace commented 1 year ago

Sorry for a slow response to this one, I was traveling and then came back to quite a bit.

This does look pretty appealing! You're right that it seems quite odd to include Left in the logs literally. And it is better to crash the runtime so that AWS recycles the container, since an error indicates that we may now be in an undefined state (though, this is dramatically less likely for pure runtimes, since IO likely isn't happening, maybe even not one?). I was just espousing the benefits of this approach, so I feel a bit silly for having missed it here.

Before the final go-ahead, I'm curious to discuss if this behavioral change would be noticeable to any possible consumers. If anyone was attempting to parse the error from their logs, their parsers would break. If there was anyone that consistently saw errors in a pure runtime, they might now see a performance hit as their containers get trashed and restarted vs reused. And I suppose, it is maybe worth considering that for pure runtimes, re-use may be preferable, just by virtue of being pure, so the container should still be in a valid state to keep processing new requests. Handling invalid JSON is definitely seems like a net improvement.

If any of these did seem like it could affect consumers, we could consider a breaking change, which could released right away for a case like this.

Curious to hear everyone's perspectives here.

JackKelly-Bellroy commented 1 year ago

We've only ever used mRuntimeWithContext in our stuff, and this PR came about because we tried to use pureRuntime and it behaved in surprising ways. Do you have a way to survey other hal-users?

kokobd commented 1 year ago

if this behavioral change would be noticeable to any possible consumers

This is a breaking change. The output of the lambda function was wrapped in {"Right": xxx} or {"Left": xxx} before for falliableRuntime, but the extra layer of object is removed in this PR.

they might now see a performance hit as their containers get trashed and restarted vs reused.

If they prefer not reporting errors to AWS, they can use pureRuntime and return Either e a.

JackKelly-Bellroy commented 11 months ago

Hi @IamfromSpace, is there anything we can do to unstick this?

IamfromSpace commented 11 months ago

Hey, absolutely, sorry I’ve been swamped as of late. But I can dedicate tomorrow to getting this out, and it’s well over due!

My plan is to flip the Aeson flag to false, and release another 1.0 in a way that’s seamless for the current stack LTS, but lets cabal and other users opt in easily. Then we’ll put this on top as a 1.1, with that flag set to true, so we’re in a good spot for the next LTS release.

Let me know if there’s any concerns with this plan, otherwise I’ll stick to it!

JackKelly-Bellroy commented 11 months ago

You might not need the flag any more. There's now an empty release of the compat package https://hackage.haskell.org/package/attoparsec-aeson-2.1.0.0 which depends on aeson >=1.4.1.- && <2.2. If that's in the stackage snapshots you might be able to get away with an unconditional dependency.

JackKelly-Bellroy commented 11 months ago

And thank you for coming back to look at this, especially so close to the end of the year.

IamfromSpace commented 11 months ago

You might not need the flag any more. There's now an empty release of the compat package https://hackage.haskell.org/package/attoparsec-aeson-2.1.0.0 which depends on aeson >=1.4.1.- && <2.2. If that's in the stackage snapshots you might be able to get away with an unconditional dependency.

Yeah, this absolutely works as an alternate approach, and it's what was implemented first. Ultimately, in trying the flag based approach, I just liked the result better than using the compatibility package. It over all seemed like more things "just worked" via this route. Long term, I also think flags may be a good way to handle both back porting changes while allowing trunk-based development. At some point I'll experiment more with this, where the current code base actually serves all current major releases, but flags set the correct interfaces or behaviors across them.

Ultimately, I think flags just seemed like the more flexible and sustainable pattern. Next time something like this happens, there's no need to wait on compatibility package release!

IamfromSpace commented 11 months ago

Okay, hey all, I've got the main blocker out of the way, and I've spent a good while making sure I really understand what's going with this change. I still want to tinker a bit more, and I'm going to prioritize getting this closed out over the next couple days.

So far my thinking on this has two angles:

  1. I think Runtime.Value.fallibleRuntime is just totally broken. It type checks because ToJson a => Either String a is also a valid ToJson instance, where instead of just a being passed into `Runitime.Value.mRuntimeit's the wholeEither. So not only would{ "Left": ... }be appearing in the error, but{ "Right": ... }would appear in the success case. It type checked, but just ultimately doesn't make sense. There may be some users out there who went, "hey, that's odd, it's all wrapped in aRight`" and then just unwrapped it. So we would want to release a 1.1 for this change. Overall, this fix is a definite improvement. I'm a bit embarrassed to have let this out!
  2. I'm less convinced on how to handle bad parsing. It is something that I've always felt is a bit awkward, but that there's no great way to do it regardless. The Value runtime was introduced to help here, where if parsing could go wrong, you would use that, and the other runtime was meant for cases where it just wasn't possible. I do think it's a bit silly to crash the container over a parser error, and it may make the most sense to introduce a Parse and Execution data type, as mentioned. I'm hesitant to have the different runtimes (within the same module) have different meta-behaviors. If mRuntime crashes on a bad parse, it seems like pureRuntime should too. But likely, neither of them should.

My feeling is that these two things should be separated. Since they both introduce a break, it might be worthwhile to try to release them at the same time though.

tl;dr: I think fallible runtime shouldn't be wrapped in an Either, but that it should crash--unless none of the runtimes crash.

Any additional thoughts or consideration folks want to throw out here?

kokobd commented 11 months ago

Re 2:

mRuntime doesn't crash the runtime on parsing in most cases.

image

I think it's probably because the input value is consumed in an effectful way, thus, try (fn (either error id eCtx) event) is able to catch that error even though it was thrown in a pure function.

tl;dr: I think fallible runtime shouldn't be wrapped in an Either, but that it should crash--unless none of the runtimes crash.

Having the behavior (crashing vs reporting user error) depends on lazy evaluation is a bug IMO. We should probably use withFallibleParse :: FromJSON a => (a -> b) -> Value -> Either String b everywhere we used withInfallibleParse :: FromJSON a => (a -> b) -> Value -> b.

Even if we do want the runtime to crash, we should explicitly do that (by throwing an exception that we don't catch intentially) - instead of relying on lazy evaluation.

kokobd commented 11 months ago

My feeling is that these two things should be separated. Since they both introduce a break, it might be worthwhile to try to release them at the same time though.

I'll create another PR.

IamfromSpace commented 11 months ago

Okay awesome. I think we can get the fallible changes through right away if you just want to update this one to just do that.

IamfromSpace commented 11 months ago

mRuntime doesn't crash the runtime on parsing in most cases.

Okay, I think I finally get it; sorry, I feel like I've been a bit dense on this one! I was mixing up errors, crashing the runtime, and container discarding. If an error is reported without exit, the process is reused. If the process exits, the process is restarted. In both cases the container is reused, and there's simply no way to force it to be discarded (to prevent use of an undefined or corrupted environment).

Personally, I think safer options (discarding the container and/or restarting the process) are better, but that's simply not how lambda containers work. They prioritize speed, and know that most applications simply won't corrupt the environment--and if you make your lambdas stateful, then you just have to be careful.

So, I totally agree then, fallible and pure runtimes should not crash the runtime environment--which is consistent with the other runtimes that parse. They don't crash, they simply report the error and keep on running. Which, for parsing, is definitely preferred anyway, we know the environment is fine to re-use.

Also agreed that we want to eliminate the possibility of laziness here for the parsing error all together. Good catch on how that interplay works.

I'm going to play with this PR as is, but I'm leaning towards now it just being the right approach. Sorry for the back and forth, I just really want to make sure I've understood the complete picture!

IamfromSpace commented 11 months ago

Appreciate the patience on this one. At this point, once the merge conflict is resolved, this one is good to go for me.

kokobd commented 11 months ago

At this point, once the merge conflict is resolved, this one is good to go for me.

Thanks, if there is no need to split this into two PRs.

~I'll test it in our private repo and if it works, we can merge this.~ The latest commit still works.

IamfromSpace commented 11 months ago

Thanks, if there is no need to split this into two PRs.

Yeah, no need. I originally just wanted to make sure that we were getting resolved stuff through while giving more time for discussion to open topics. But since these are ready and reasonably related, I think we're better off just going for it.

IamfromSpace commented 11 months ago

This went out yesterday as 1.1 🎉 Thanks much for this, this is a definite improvement.

I’d also like to explore longer term approaches for wringing out laziness so that these kinds of things can’t manifest in user code either. We don’t want things to be lazy across executions. I’ll open an issue on the topic.