Convert all blocking activation requests to non-blocking after 60 seconds

rabbah commented 8 years ago

Currently invoking an action with a blocking request will hold the HTTP connection open between the client and controller up to the action timeout (which may be up to 5 minutes). Usually a blocking activation is used in a request/response context where a much short latency is expected.

Hence, change the controller handling of blocking activations to convert such requests to non blocking after a predefined duration (will make it 60 secs now, can reduce further in the future).

Additional blocking behavior can be deferred to the client; with additional enhancements to wsk activation poll in the case of the CLI for example.

mbehrendt commented 8 years ago

what's the motivation for us to do the conversion?

if i as a user start a blocking invocation, i expect it to have blocking behavior with the timeout i specified. it would feel like whisk is doing some unexpected tweaking with my request, if a 2min blocking request returns after 60 secs, but is still running in the background asynchronously.

rabbah commented 8 years ago

You get back an activation id and an accepted 202 code instead of the customary 200. You can check the activation later by its id. As noted in the description the blocking behavior can be recovered on the client side. This is no different than say our system is over loaded and has to queue your request anyway. In some deployments we can't hold the request open up to action timeout anyway without keep-alive since there may be overriding settings out of our control. Lastly blocking is for request/response. It should be fast - clients should not be doing blocking invokes on 5 minute actions.

mbehrendt commented 8 years ago

i agree with the behavior described here, but i think from a user perspective we should describe it differently:

there are action execution timeouts (what you can specify as a parameter for an action invocation).
for a blocking invocation, there is a timeout for the http connection blocking and waiting for the invocation to finish
if (action execution time + time waiting in the execution queue) is longer than the system-defined http connection timeout for blocking calls, then a 202 is returned.

besides that, we need to give guidance in the docs for how to implement a bullet-proof http client - which needs to highlight that it's not sufficient to wait for a 200, but also how to react (ideally with sample code) when a 202 is returned.

@csantanapr do you know whether we have a docs item open for this?

This is no different than say our system is over loaded and gas to queue your request anyway.

not sure what this would mean -- imho that isn't any different than an activation queued up for >60 secs and then the timeout kicks in.

csantanapr commented 8 years ago

@mbehrendt We don't document http client logic with samples. We currently only document using the CLI to invoke actions. I would not go down to document http client logic, then I will be documenting every possible client out there. The other client we document is iOS SDK to invoke an action, we can enhance the iOS SDK with the logic to handle the 202 and fall back to polling, give SDK consume the appearance of blocking call and returning the desired results regardless if (200 fast result) or (202 delayed result).

From an user perspective I see 3 main use cases interacting directly with OW API:

1 I want my function to run and I don't really care about correlating the call to the results, or I want to call multiple functions in parallel

Fire a trigger (fire and forget)

2 I want my function to run and I don't care when the work get's done, but I do care about retrieving the results for this specific run in the future.

Invoke non-blocking action, response is 200 with activation id, to go back in a later time and get the results or check that work is done and when it was done.

3 I want my function to run and I want to get the results as fast as possible, I want the results for this specific run as soon they are available You have a way for your function to notify your client async? 3a Yes Have function push the results to client somehow (i.e. websocket, db entry (puchdb), mobile notification, etc..) 3b No Have client get results thru OW http API Invoke a blocking action, then: Response is 200, results are included Response is 202, results are not included, but activation id for action is included, start polling for results using activation id Note: Many factors will influence for invoke to return results (200) or not(202), code your client logic to no assume that you will get a 200 with results all the time, have strategy to handle the case when 202 with activation id and fall back to polling.

(Don't think we should doc the 60 seconds when we describe this, it might be just in reference/systemlimits, but not something you mentioned up front. This way user codes his client logic regardless if its 30 secs, 45 secs, 65 secs)

@rabbah Did I capture accurately the new implementation? @mbehrendt Did I capture how do we want to articulate the use cases to user?

mbehrendt commented 8 years ago

@csantanapr the docs for 3b look right at first glance, while i wouldn't make ""You have a way for your function to notify your client async?" the top-level question. to me, 3b describes the 'complete' way to implement a blocking call.

the one concern i have with the 202 part is that the client-side code becomes much more complex -- instead of being able to rely on getting either a 200 or 50x, i now have to implement polling, which kind of defeats the purpose of having the convenience of a blocking call.......

rabbah commented 8 years ago

I want to clarify a point that may have been missed: we already have 202 responses. The client/caller already has to deal with it for either blocking and non-blocking requests. Blocking without timeout also doesn't make sense. I support the point made above to clarify that the system imposes a timeout on how long it will wait for the result in a request/response context independent of the action timeout.

mbehrendt commented 8 years ago

@rabbah thx for the clarification.

in which scenario is a non-blocking call returning a 202 today (is that maybe the default) ?

when are we returning a 202 today for blocking calls?

as we're going through this, it occurs to me that we should document all return codes in the swagger docs (couldn't find the return codes there, after a brief scan).

rabbah commented 8 years ago

a non-blocking call only returns 202: activation accepted. (in fact this is what the swagger documents: http://petstore.swagger.io/?url=https://raw.githubusercontent.com/openwhisk/openwhisk/master/core/controller/src/main/resources/whiskswagger.json#!/Actions/invokeAction 200 + activation record or 202 + activation id).

only a blocking call returns 200 if there is a result ready before the connection times out. so in these terms, this issue was started simply to shorten the system timeout from up to 5 minutes down to 60 seconds. all the other bits (enhancements to the documentation, to the swagger, etc. are needed with or without the reduction in system timeout on request/response).

our swagger needs a lot of TLC, there's some rot and no test coverage (this is the topic of epic #597).

mbehrendt commented 8 years ago

thx -- must have looked at an outdated version of the swagger docs.

re blocking calls: any thoughts on how we can make the client-side impl less cumbersome? with a 202 potentially returned by a blocking call, we're forcing clients to implement polling, which defeats the purpose of having a convenient blocking call? maybe provide a client-side lib?

rabbah commented 8 years ago

i don't understand the extra complexity being brought to bare here: are you willing to wait forever for a response? all our tests deal with this case: invoke an action, we need to check the result: so we poll until the result is ready and give up after some time and fail the test. it doesn't matter if you invoke the action request/response style or not. you have to deal with the eventuality that the response may not come back when you expect it. it's the caller's decision what to do.

we have many mechanisms to deal with this: wsk activation poll for example polls repeatedly for new activations. while (not fond) wsk activation get is another. we can add wsk activation poll id at some point. this isn't that complicated.

rabbah commented 8 years ago

some other thoughts we've had are to bootstrap whisk itself to trigger an event when an activation is ready.

mbehrendt commented 8 years ago

re 'extra complexity': in a traditional http env, you make a request, and wait for the response. the main cases are...it either succeeds (200), fails (50x) or times out. in any case, i can assume nothing keeps running in the background which might provide an update to the status i received as part of the response.

with the 202 behavior around blocking activations, i can't make that assumption. i have to use the activation id, go back to the server side and keep on polling to see if there is a status change (i might have to poll as 'forever' as waiting for a response for a blocking call). i as a developer implementing a client could equally decide to not use a blocking call at all -- instead, i could use non-blocking calls from the beginning and only do polling...which would be less code than having to handle both the response of a blocking call and doing polling.

the fact that we're doing polling today in code we wrote is imho not such a good argument, since that's coding complexity we took on our shoulders vs putting it on the shoulders of devs using openwhisk.

so yes...polling is not that hard, but it is additional coding effort for the dev, defeating to some degree the value prop of a blocking call, due to the increased amount of client-side logic.

i also don't have a perfect solution for how to solve this, so trying to lay out the problem stmt here so we can brainstorm on options to address this in the best possible way.

happy to be corrected. am i missing sth?

rabbah commented 8 years ago

202 accepted is a well known http status code. It means accepted but not finished processing. Are you saying we should never respond with 202? I'm having trouble following your line of argument or what your desired end goal is.

rabbah commented 8 years ago

Saying there is added complexity because of 202 while accepting that an http request may timeout or fail doesn't make sense because you can equally treat 202 as one of the error cases. Suppose we didn't return 202 and instead return 50x. Then what?

mbehrendt commented 8 years ago

it the 50x terminated the request (like in a traditional web runtime), i don't have to care about stuff continuing to run in the background, i.e. no additional client-side code for polling et al required (see my previous comment for details)

mbehrendt commented 8 years ago

i'm not trying to say we shouldn't respond with a 202 -- i'm saying 'there is a downside from a client-side dev perspective -- can we find an innovative solution to address it?'

apache / openwhisk

Convert all blocking activation requests to non-blocking after 60 seconds #1140