apache / openwhisk

Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0
6.54k stars 1.17k forks source link

Action errors when invocation duration is longer than 2 minutes #1714

Closed eweiter closed 7 years ago

eweiter commented 7 years ago

Steps to reproduce the issue:

When trying to recreate a user issue the other day I stumbled upon what seems to be an error with the OW backend. When I created a NodeJS action that had a long execution duration (120000ms) and then created a trigger and rule for that action it produced an error once the action finished executing. If I changed the duration of the action to be shorter (2000ms) it worked. I created the action with a timeout of 300000ms (5min)

The error produced is The action did not produce a valid response and exited unexpectedly.

I used the following as my sample code.

function main(params) {
  var time = 120000;
  return new Promise((resolve) => setTimeout(resolve, time));
}
  1. wsk action create longAction ~/<pathToFile>.js -t 300000
  2. I then went to the OW UI and created a Periodic trigger at 4 minute intervals and then created a rule with that trigger and the above action.
  3. After the 4 minutes the trigger fired, the rule fired, and the action was invoked.
  4. After 2 minutes the action finished but it produced the error mentioned above.
  5. I updated the action to finish after 2000ms and then ran `wsk action update longAction ~/.js
  6. After 4 minutes, the trigger->rule->action fired. The action completed successfully after the 2s.
  7. Updated the action again to have a 2 minute duration again and the action failed after it was invoked by the rule again.

Here is the output of one of the failing invocations of my action

cordovaios:~/Dev/whiskStuff/ui$ wsk action invoke long
ok: invoked /_/long with id 51abb976f3934795b4a172ab76cdb032
cordovaios:~/Dev/whiskStuff/ui$ wsk activation get 51abb976f3934795b4a172ab76cdb032
ok: got activation 51abb976f3934795b4a172ab76cdb032
{
    "namespace": "eweiter@us.ibm.com_emptynew",
    "name": "long",
    "version": "0.0.6",
    "publish": false,
    "subject": "eweiter@us.ibm.com",
    "activationId": "51abb976f3934795b4a172ab76cdb032",
    "start": 1484256390062,
    "end": 1484256510167,
    "duration": 120105,
    "response": {
        "status": "action developer error",
        "statusCode": 0,
        "success": false,
        "result": {
            "error": "The action did not produce a valid response and exited unexpectedly."
        }
    },
    "logs": [
        "2017-01-12T21:26:30.125708738Z stdout: {}",
        "2017-01-12T21:26:30.125874727Z stdout: starting to wait"
    ],
    "annotations": [
        {
            "key": "limits",
            "value": {
                "logs": 10,
                "memory": 256,
                "timeout": 300000
            }
        },
        {
            "key": "path",
            "value": "eweiter@us.ibm.com_emptynew/long"
        }
    ]
}
eweiter commented 7 years ago

It appears that on further inspection it seems that this issue isn't related to triggers or rules. When I invoked solely the action with wsk action invoke longAction it waited the 2 minutes but then produced the same error as above.

So the steps to reproduce might simply be

  1. Create a wsk action with the code in the description above, set timeout to be 300000ms (5min).
  2. wsk action invoke that action you just created
  3. Inspect activation result and it should have failed.
rabbah commented 7 years ago

If your action is in fact the one above, then isn't this user error since your action is not actually returning a JSON object? setTimeout(resolve, time)) should be setTimeout(function() { resolve({}) }, time) no? (or does resolve() == resolve({})?)

That said, I'm able to reproduce the timeout after 2 minutes.

FYI @csantanapr

csantanapr commented 7 years ago

@rabbah doing resolve() it's valid result would be {} its true resolve() == resolve({}) Ref: https://github.com/openwhisk/openwhisk/blob/master/core/nodejsActionBase/runner.js#L111

if (typeof resolvedResult === "undefined") {
   resolvedResult = {};
}
resolve(resolvedResult);
csantanapr commented 7 years ago

I'm able to reproduce this problem, with similar code

If I set it to be done at 1.5 mins time = 90000; then it works fine. But if I set it to 2 mins time = 120000; then same problem as stated above.

rabbah commented 7 years ago

Thanks - we must have a recent regression; since we don't have a test that runs for the full 5 minutes :( The action is in fact getting the right deadline, so something is wrong around the boundary of the HTTP call to the container. I don't recall any changes in that part of the code recently... will need to inspect more.

When I closed this issue https://github.com/openwhisk/openwhisk/issues/1246 I ran a 5 minute action... or so I thought. Should've added a test then.

rabbah commented 7 years ago

I've confirmed the problem is either in the HttpUtils class or the container runtime with a standalone app running natively and communication with the container. Adding a keep alive strategy doesn't help. Focusing on the container next.

csantanapr commented 7 years ago

That was my thinking, a regression. We have a unit test but not one that test this long up to two minutes or even 5 minutes. I double check and was able to recreate local deployment on mac

rabbah commented 7 years ago

Shameless plug - found another use for this: https://medium.com/openwhisk/advanced-debugging-of-openwhisk-actions-518414636932#.vw69jm4db. 🔨

rabbah commented 7 years ago

tag teaming with @csantanapr, we found that server.timeout = 0 is missing from the node proxy runtime. Testing a fix - will add a test for jenkins but not travis this time: we'll need three instances, one for node, one for python (covers swift and native/docker), and one for java :(

Here is the proposed patch:

diff --git a/core/nodejsActionBase/src/service.js b/core/nodejsActionBase/src/service.js
index a505569..f92da47 100644
--- a/core/nodejsActionBase/src/service.js
+++ b/core/nodejsActionBase/src/service.js
@@ -63,6 +63,7 @@ function NodeActionService(config, logger) {
             var host = server.address().address;
             var port = server.address().port;
         });
+        server.timeout = 0; // disable automatic timeouts
     }

I suggest repurposing this test: https://github.com/openwhisk/openwhisk/blob/master/tests/src/whisk/core/cli/test/WskBasicUsageTests.scala#L514 to run for the max allowed duration instead of 3 seconds. I verified Python actions are OK - can run for 5 minutes unchanged. Need to add a test regardless. That leaves Java (@psuter might you know).

Verified to work for this action:

> cat t.js
function main(params) {
  var time = 300000 - 1000;
  return new Promise((resolve) => setTimeout(resolve, time));
}
> wsk action create t.js t.js -t 300000
ok: created action t.js
> wsk action invoke t.js
ok: invoked /_/t.js with id 9809d2e03c564bd38e9edb608926e598
# wait 5 minutes
> wsk activation get 9809d2e03c564bd38e9edb608926e598 duration
ok: got activation 9809d2e03c564bd38e9edb608926e598, displaying field duration
299744

@loganherr FYI.

csantanapr commented 7 years ago

We need to also look at how to improve the logging in invoker for this case where there is an exception from /run

Here are the logs I got, but was not able to tell what happened that the proxy code timeout

👻  $ cat /Users/Shared/wsklogs/invoker0/invoker0_logs.log | grep tid_395
[2017-01-13T04:08:37.233Z] [INFO] [#tid_395] [Invoker] [marker:invoker_activation_start:121]
[2017-01-13T04:08:37.236Z] [INFO] [#tid_395] [Invoker] guest/long guest 39b2b10313614790ab24bccc60bebab8
[2017-01-13T04:08:37.241Z] [INFO] [#tid_395] [WhiskAction] [GET] serving from datastore: id: guest/long, rev: 3-10841ba57f740e4972f502258efb8cc0 [marker:database_cacheMiss_count:129]
[2017-01-13T04:08:37.242Z] [INFO] [#tid_395] [CouchDbRestStore] [GET] 'csantana23_carlosmacbookpro_whisks' finding document: 'id: guest/long, rev: 3-10841ba57f740e4972f502258efb8cc0' [marker:database_getDocument_start:130]
[2017-01-13T04:08:37.285Z] [INFO] [#tid_395] [CouchDbRestStore] [GET] 'csantana23_carlosmacbookpro_whisks' completed: found document 'id: guest/long, rev: 3-10841ba57f740e4972f502258efb8cc0' [marker:database_getDocument_finish:173:41]
[2017-01-13T04:08:37.292Z] [INFO] [#tid_395] [ContainerPool] Getting container for guest/long@0.0.3 of kind nodejs:6 with 23bc46b1-71f6-4ed5-8c54-816aa4f8c502: myPos = 4 completed = 3 slack = 4 activeCount = 0 toBeRemoved = 0 startingCounter = 0
[2017-01-13T04:08:37.296Z] [INFO] [#tid_395] [ContainerPool] Obtained a pre-warmed container
[2017-01-13T04:08:37.301Z] [INFO] [#tid_395] [ContainerPool] Obtained cold container d6423ab2982d913f287959aa26d4ca7cd30ee39749eb6350a58d7feca16fa77e - about to initialize
[2017-01-13T04:08:37.302Z] [INFO] [#tid_395] [WhiskContainer] sending initialization to container [wsk0_22_warmJsContainer_20170113T040036434Z] [d6423ab2982d913f287959aa26d4ca7cd30ee39749eb6350a58d7feca16fa77e] [172.17.0.9:8080]
[2017-01-13T04:08:37.591Z] [INFO] [#tid_395] [WhiskContainer] initialization result: RunResult(Interval(2017-01-13T04:08:37.303Z,2017-01-13T04:08:37.591Z),Some((200,{"OK":true})))
[2017-01-13T04:08:37.593Z] [INFO] [#tid_395] [Invoker] sending arguments to guest/long@0.0.3 container [wsk0_22_warmJsContainer_20170113T040036434Z] [d6423ab2982d913f287959aa26d4ca7cd30ee39749eb6350a58d7feca16fa77e] [172.17.0.9:8080] [marker:invoker_activationRun_start:481]
[2017-01-13T04:10:37.624Z] [INFO] [#tid_395] [Invoker] finished running activation id: 39b2b10313614790ab24bccc60bebab8 [marker:invoker_activationRun_finish:120512:120030]
[2017-01-13T04:10:37.639Z] [INFO] [#tid_395] [Invoker] log cursor has not advanced, trying 15 more times
[2017-01-13T04:10:37.642Z] [INFO] [#tid_395] [KafkaProducerConnector] sent message: completed[0][3]
[2017-01-13T04:10:37.649Z] [INFO] [#tid_395] [Invoker] posted completion of activation 39b2b10313614790ab24bccc60bebab8
[2017-01-13T04:10:37.743Z] [INFO] [#tid_395] [Invoker] recording the activation result to the data store
[2017-01-13T04:10:37.744Z] [INFO] [#tid_395] [ContainerPool] putBack returning container 19 delete = true completed = 4 slack = 3 maxActive = 4 activeCount = 1 startingCounter = 0
[2017-01-13T04:10:37.749Z] [INFO] [#tid_395] [WhiskActivation] write initiated on new cache entry
[2017-01-13T04:10:37.758Z] [INFO] [#tid_395] [CouchDbRestStore] [PUT] 'csantana23_carlosmacbookpro_whisks' saving document: 'id: guest/39b2b10313614790ab24bccc60bebab8, rev: null' [marker:database_saveDocument_start:120646]
[2017-01-13T04:10:37.785Z] [INFO] [#tid_395] [CouchDbRestStore] [PUT] 'csantana23_carlosmacbookpro_whisks' completed document: 'id: guest/39b2b10313614790ab24bccc60bebab8, rev: null', response: '{"ok":true,"id":"guest/39b2b10313614790ab24bccc60bebab8","rev":"1-06ec20b56cd4dfc1d18b516dd535d48a"}' [marker:database_saveDocument_finish:120673:23]
[2017-01-13T04:10:37.803Z] [INFO] [#tid_395] [WhiskActivation] write all done, caching id: guest/39b2b10313614790ab24bccc60bebab8, rev: null Cached
[2017-01-13T04:10:37.807Z] [INFO] [#tid_395] [Invoker] recorded activation
rabbah commented 7 years ago

@jasonpet want to take this one?