apache / openwhisk-deploy-kube

The Apache OpenWhisk Kubernetes Deployment repository supports deploying the Apache OpenWhisk system on Kubernetes and OpenShift clusters.
https://openwhisk.apache.org/
Apache License 2.0
300 stars 231 forks source link

Invoker fails to provision resources #546

Closed howespt closed 4 years ago

howespt commented 4 years ago

Might be related to #545

Managed to install openwhisk and all pods/services/etc running as expected.

Trying various options for running an action results in the same error. e.g. invoking a whisk.system sample action wsk -i action invoke /whisk.system/samples/greeting --result results in error: Unable to invoke action 'samples/greeting': The connection failed, or timed out. (HTTP status code 500)

Looking at the activation I see

wsk -i activation get 9b2f121039c74fe5af121039c7afe55c

ok: got activation 9b2f121039c74fe5af121039c7afe55c
{
    "namespace": "guest",
    "name": "greeting",
    "version": "0.0.1",
    "subject": "guest",
    "activationId": "9b2f121039c74fe5af121039c7afe55c",
    "start": 1573765402159,
    "end": 1573765402159,
    "duration": 0,
    "statusCode": 0,
    "response": {
        "status": "whisk internal error",
        "statusCode": 0,
        "success": false,
        "result": {
            "error": "Failed to provision resources to run the action."
        }
    },
    "logs": [],
    "annotations": [
        {
            "key": "path",
            "value": "whisk.system/samples/greeting"
        },
        {
            "key": "waitTime",
            "value": 997
        },
        {
            "key": "kind",
            "value": "nodejs:6"
        },
        {
            "key": "timeout",
            "value": false
        },
        {
            "key": "limits",
            "value": {
                "concurrency": 1,
                "logs": 10,
                "memory": 256,
                "timeout": 60000
            }
        }
    ],
    "publish": false
}

Digging into the logs everything seems fine except I see the following line from the controller pod:

[2019-11-14T21:03:22.200Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [ShardingContainerPoolBalancer] received completion ack for '9b2f121039c74fe5af121039c7afe55c', system error=true

The invoker pod logs show:

[2019-11-14T21:03:21.287Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [InvokerReactive]  [marker:invoker_activation_start:122]
[2019-11-14T21:03:21.332Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [WhiskAction] [GET] serving from datastore: CacheKey(whisk.system/samples/greeting) [marker:database_cacheMiss_counter:169]
[2019-11-14T21:03:21.333Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [CouchDbRestStore] [GET] 'test_whisks' finding document: 'id: whisk.system/samples/greeting, rev: 1-81d623acf60afe97d1b3c312eccacd66' [marker:database_getDocument_start:169]
[2019-11-14T21:03:21.367Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [CouchDbRestStore]  [marker:database_getDocument_finish:204:35]
[2019-11-14T21:03:21.399Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [WhiskAction] write initiated on existing cache entry, invalidating CacheKey(whisk.system/samples/greeting), tid HviouueynrGRdZZUNNR61Ng1Niw4ps0W, state WriteInProgress
[2019-11-14T21:03:21.401Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [WhiskAction] write all done, caching CacheKey(whisk.system/samples/greeting) Cached
[2019-11-14T21:03:21.403Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [ContainerPool] containerStart containerState: cold container: None activations: 1 of max 1 action: greeting namespace: guest activationId: 9b2f121039c74fe5af121039c7afe55c [marker:invoker_containerStart.cold_counter:240]
[2019-11-14T21:03:21.404Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [DockerClientWithFileAccess] running /usr/bin/docker run -d --cpu-shares 128 --memory 256m --memory-swap 256m --network bridge -e __OW_API_HOST=https://owdev-nginx.openwhisk.svc.cluster.local -e __OW_ALLOW_CONCURRENT=false --dns 10.100.0.10 --dns-search openwhisk.svc.cluster.local --dns-search svc.cluster.local --dns-search cluster.local --dns-search us-west-2.compute.internal --dns-option options --dns-option ndots:5 --name wskip-192-168-76-55.us-west-2.compute.internal2_4_guest_greeting --cap-drop NET_RAW --cap-drop NET_ADMIN --ulimit nofile=1024:1024 --pids-limit 1024 --log-driver json-file openwhisk/nodejs6action:1.14.0-incubating (timeout: 1 minute) [marker:invoker_docker.run_start:241]
[2019-11-14T21:03:22.045Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [DockerClientWithFileAccess]  [marker:invoker_docker.run_finish:883:641]
[2019-11-14T21:03:22.049Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [DockerClientWithFileAccess] running /usr/bin/docker inspect --format {{.NetworkSettings.Networks.bridge.IPAddress}} 5e7823f84c8154585bce8a92e1c74d01db1269c3d694a437e2d431ba7e2673c8 (timeout: 1 minute) [marker:invoker_docker.inspect_start:885]
[2019-11-14T21:03:22.158Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [DockerClientWithFileAccess]  [marker:invoker_docker.inspect_finish:996:111]
[2019-11-14T21:03:22.158Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [DockerClientWithFileAccess] running /usr/bin/docker rm -f 5e7823f84c8154585bce8a92e1c74d01db1269c3d694a437e2d431ba7e2673c8 (timeout: 1 minute) [marker:invoker_docker.rm_start:996]
[2019-11-14T21:03:22.161Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [CouchDbRestStore] [PUT] 'test_activations' saving document: 'id: guest/9b2f121039c74fe5af121039c7afe55c, rev: null' [marker:database_saveDocument_start:999]
[2019-11-14T21:03:22.182Z] [INFO] [#tid_sid_dbBatcher] [CouchDbRestStore] 'test_activations' saving 1 documents [marker:database_saveDocumentBulk_start:4997446]
[2019-11-14T21:03:22.184Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [MessagingActiveAck] posted combined of activation 9b2f121039c74fe5af121039c7afe55c
[2019-11-14T21:03:22.211Z] [INFO] [#tid_sid_dbBatcher] [CouchDbRestStore]  [marker:database_saveDocumentBulk_finish:4997475:27]
[2019-11-14T21:03:22.214Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [CouchDbRestStore]  [marker:database_saveDocument_finish:1050:51]
[2019-11-14T21:03:22.462Z] [INFO] [#tid_HviouueynrGRdZZUNNR61Ng1Niw4ps0W] [DockerClientWithFileAccess]  [marker:invoker_docker.rm_finish:1300:304]

From here I'm having difficulty tracing the error. Any ideas on where to proceed from here?

Additional context which may relate to #545 If I continue to try to invoke the actions then I put all of the invokers into an unhealthy state and get an unable to invoke error. Looking at the logs the health checks seem to work but the invokers remain in "unhealthy" status.

dgrove-oss commented 4 years ago

it might be worth re-deploying with the KubernetesContainerFactory (see https://github.com/apache/openwhisk-deploy-kube/blob/master/docs/configurationChoices.md#invoker-container-factory) and see if that helps. The fact that the invoker is doing a docker rm right after the inspect suggests that it thinks the container is unhealthy in some way.

howespt commented 4 years ago

Can confirm adding

invoker:
  containerFactory:
    impl: "kubernetes"

to the cluster config yaml fixed the issue. Thanks!

liusy58 commented 3 years ago

I wonder where to add the commands shown above