Cron triggers not firing; EPROTO error in alarmprovider pod

mluds commented 2 years ago

I have a cron trigger deployed in openwhisk:

$ wsk trigger get /whisk.system/getvolumes-trigger
{
    "activationId": "98bbc6046c7c4014bbc6046c7c00148c",
    "annotations": [
        {
            "key": "path",
            "value": "whisk.system/alarms/alarm"
        },
        {
            "key": "waitTime",
            "value": 366
        },
        {
            "key": "kind",
            "value": "nodejs:10"
        },
        {
            "key": "timeout",
            "value": false
        },
        {
            "key": "limits",
            "value": {
                "concurrency": 1,
                "logs": 10,
                "memory": 256,
                "timeout": 60000
            }
        },
        {
            "key": "initTime",
            "value": 117
        }
    ],
    "duration": 1448,
    "end": 1635966096634,
    "logs": [],
    "name": "alarm",
    "namespace": "whisk.system",
    "publish": false,
    "response": {
        "result": {
            "config": {
                "cron": "0 * * * *",
                "name": "getvolumes-trigger",
                "namespace": "whisk.system",
                "payload": {
                    "payload": ""
                },
                "strict": false
            },
            "status": {
                "active": true,
                "dateChanged": 1635951756721,
                "dateChangedISO": "2021-11-03T15:02:36Z"
            }
        },
        "size": 219,
        "status": "success",
        "success": true
    },
    "start": 1635966095186,
    "subject": "whisk.system",
    "version": "0.0.1"
}

However I get the following error in the alarmprovider pod when it tries to fire:

[2021-11-01T15:00:09.137Z] [ERROR] [??] [alarmsTrigger] [postTrigger] there was an error invoking xxxxxxxx/whisk.system/getvolumes-trigger {"message":"write EPROTO 140332583458624:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../deps/openssl/openssl/ssl/record/ssl3_record.c:332:\n","stack":"Error: write EPROTO 140332583458624:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../deps/openssl/openssl/ssl/record/ssl3_record.c:332:\n\n at WriteWrap.afterWrite [as oncomplete] (net.js:789:14)","errno":"EPROTO","code":"EPROTO","syscall":"write"}

This seems to indicate the wrong protocol is being used (http vs. https). The problem is I'm not sure which URL it's using for this request.
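For anyone triaging similar logs, here is a minimal sketch of a check for this error signature. It is a heuristic, not part of the alarm provider: the function name `looksLikeTlsToPlainHttp` is made up for illustration, and it assumes the error object has the same `code` and `message` fields as the one logged above. An `EPROTO` with "wrong version number" usually means a TLS client received a non-TLS reply, which is typical when an `https://` request reaches a plain HTTP port.

```javascript
// Hypothetical helper (not from openwhisk-package-alarms).
// Returns true when an error looks like a TLS handshake that hit a non-TLS endpoint.
function looksLikeTlsToPlainHttp(err) {
  if (!err || err.code !== 'EPROTO') {
    return false;
  }
  // OpenSSL reports "wrong version number" when the peer's reply is not a TLS record.
  return /wrong version number/i.test(String(err.message || ''));
}

// Error fields copied from the log line above.
const sample = {
  code: 'EPROTO',
  syscall: 'write',
  message:
    'write EPROTO 140332583458624:error:1408F10B:SSL routines:' +
    'ssl3_get_record:wrong version number:' +
    '../deps/openssl/openssl/ssl/record/ssl3_record.c:332:\n',
};

console.log(looksLikeTlsToPlainHttp(sample));
```

A match only suggests a scheme/port mismatch; it does not say which URL the provider actually built.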

If I exec into the pod and try curling the internal URL it seems to work fine:

root@openwhisk-alarmprovider-f7ccbcc69-vgwg8:/# curl openwhisk-nginx.openwhisk-blue.svc.cluster.local   
{"api_paths":["/api/v1"],"description":"OpenWhisk","limits":{"actions_per_minute":200,"concurrent_actions":200,"max_action_duration":3600000,"max_action_logs":10485760,"max_action_memory":8589934592,"min_action_duration":100,"min_action_logs":0,"min_action_memory":134217728,"sequence_length":50,"triggers_per_minute":200},"runtimes":{"rust":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-rust-v1.34:1.2.0","kind":"rust:1.34","requireMain":false}],"ballerina":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-ballerina-v0.990.2:nightly","kind":"ballerina:0.990","requireMain":false}],"nodejs":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-nodejs-v10:1.18.0","kind":"nodejs:10","requireMain":false},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-nodejs-v12:1.18.0","kind":"nodejs:12","requireMain":false},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-nodejs-v14:1.18.0","kind":"nodejs:14","requireMain":false}],"java":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/java8action:1.16.0","kind":"java:8","requireMain":true}],"go":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-golang-v1.15:1.18.0","kind":"go:1.15","requireMain":false}],"php":[{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-php-v7.3:1.16.0","kind":"php:7.3","requireMain":false},{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-php-v7.4:1.16.0","kind":"php:7.4","requireMain":false},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-php-v7.4:1.16.0","kind":"php:8.0","requireMain":false}],"python":[{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/python2action:1.13.0-incubating","kind":"python:2","requireMain":false},{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/actionloop-python-v3.7:1.16.
0","kind":"python:3","requireMain":false}],"dotnet":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-dotnet-v2.2:1.15.0","kind":"dotnet:2.2","requireMain":true},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-dotnet-v3.1:1.15.0","kind":"dotnet:3.1","requireMain":true}],"ruby":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-ruby-v2.5:1.16.0","kind":"ruby:2.5","requireMain":false}],"swift":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-swift-v4.2:1.16.0","kind":"swift:4.2","requireMain":false},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-swift-v5.1:1.16.0","kind":"swift:5.1","requireMain":false},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-swift-v5.3:1.16.0","kind":"swift:5.3","requireMain":false}]},"support":{"github":"https://github.com/apache/openwhisk/issues","slack":"http://slack.openwhisk.org"}}

If I try the external URL, curl can't verify the certificate. However, I double-checked that the certificate is valid, and the EPROTO error does not seem to be caused by certificate validation anyway.

root@openwhisk-alarmprovider-f7ccbcc69-vgwg8:/# curl https://openwhisk.sandbox.c2il.org
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
 of Certificate Authority (CA) public keys (CA certs). If the default
 bundle file isn't adequate, you can specify an alternate file
 using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
 the bundle, the certificate verification probably failed due to a
 problem with the certificate (it might be expired, or the name might
 not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
 the -k (or --insecure) option.

subject= /CN=*.sandbox.c2il.org
notBefore=Oct 14 12:27:52 2021 GMT
notAfter=Jan 12 12:27:51 2022 GMT
serial=031F8FB5DC7F5D3991506B666B413762A02B

subject= /C=US/O=Let's Encrypt/CN=R3
notBefore=Sep  4 00:00:00 2020 GMT
notAfter=Sep 15 16:00:00 2025 GMT
serial=912B084ACF0C18A753F6D62E25A75F5A

subject= /C=US/O=Internet Security Research Group/CN=ISRG Root X1
notBefore=Jan 20 19:14:03 2021 GMT
notAfter=Sep 30 18:14:03 2024 GMT
serial=4001772137D4E942B8EE76AA3C640AB7

root@openwhisk-alarmprovider-f7ccbcc69-vgwg8:/# openssl x509 -enddate -noout -in /etc/ssl/certs/ISRG_Root_X1.pem 
notAfter=Jun  4 11:04:38 2035 GMT

I also tried checking the environment inside the pod to see if I could find which URL it's using.

root@openwhisk-alarmprovider-f7ccbcc69-rn2rd:/# env | grep OPENWHISK_CONTROLLER
OPENWHISK_CONTROLLER_SERVICE_HOST=10.43.133.144
OPENWHISK_CONTROLLER_PORT_8080_TCP_ADDR=10.43.133.144
OPENWHISK_CONTROLLER_PORT_8080_TCP_PROTO=tcp
OPENWHISK_CONTROLLER_SERVICE_PORT_HTTP=8080
OPENWHISK_CONTROLLER_PORT_8080_TCP_PORT=8080
OPENWHISK_CONTROLLER_PORT=tcp://10.43.133.144:8080
OPENWHISK_CONTROLLER_SERVICE_PORT=8080
OPENWHISK_CONTROLLER_PORT_8080_TCP=tcp://10.43.133.144:8080

I tried the 10.43.133.144 address and that also seems to work from the CLI:

root@openwhisk-alarmprovider-f7ccbcc69-rn2rd:/# curl 10.43.133.144:8080
{"api_paths":["/api/v1"],"description":"OpenWhisk","limits":{"actions_per_minute":200,"concurrent_actions":200,"max_action_duration":3600000,"max_action_logs":10485760,"max_action_memory":8589934592,"min_action_duration":100,"min_action_logs":0,"min_action_memory":134217728,"sequence_length":50,"triggers_per_minute":200},"runtimes":{"rust":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-rust-v1.34:1.2.0","kind":"rust:1.34","requireMain":false}],"ballerina":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-ballerina-v0.990.2:nightly","kind":"ballerina:0.990","requireMain":false}],"nodejs":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-nodejs-v10:1.18.0","kind":"nodejs:10","requireMain":false},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-nodejs-v12:1.18.0","kind":"nodejs:12","requireMain":false},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-nodejs-v14:1.18.0","kind":"nodejs:14","requireMain":false}],"java":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/java8action:1.16.0","kind":"java:8","requireMain":true}],"go":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-golang-v1.15:1.18.0","kind":"go:1.15","requireMain":false}],"php":[{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-php-v7.3:1.16.0","kind":"php:7.3","requireMain":false},{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-php-v7.4:1.16.0","kind":"php:7.4","requireMain":false},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-php-v7.4:1.16.0","kind":"php:8.0","requireMain":false}],"python":[{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/python2action:1.13.0-incubating","kind":"python:2","requireMain":false},{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/actionloop-python-v3.7:1.16.
0","kind":"python:3","requireMain":false}],"dotnet":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-dotnet-v2.2:1.15.0","kind":"dotnet:2.2","requireMain":true},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-dotnet-v3.1:1.15.0","kind":"dotnet:3.1","requireMain":true}],"ruby":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-ruby-v2.5:1.16.0","kind":"ruby:2.5","requireMain":false}],"swift":[{"attached":true,"default":true,"deprecated":false,"image":"openwhisk/action-swift-v4.2:1.16.0","kind":"swift:4.2","requireMain":false},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-swift-v5.1:1.16.0","kind":"swift:5.1","requireMain":false},{"attached":true,"default":false,"deprecated":false,"image":"openwhisk/action-swift-v5.3:1.16.0","kind":"swift:5.3","requireMain":false}]},"support":{"github":"https://github.com/apache/openwhisk/issues","slack":"http://slack.openwhisk.org"}}

Any idea what might be going on, or how I can figure out which URL it's using?

dgrove-oss commented 2 years ago

It's possible that some of the changes in #698 would correct this. As part of that PR, I updated to use the latest openwhisk-package-alarms release and did test locally that the alarms provider was working.

I've just merged the PR. Maybe try again and see if it helped?

mluds commented 2 years ago

Thanks, I updated to the latest code. However, I'm still seeing this EPROTO error. It seems like it's trying to initiate an HTTPS connection to an endpoint that only speaks HTTP.

Here are some logs from the alarmprovider pod:

[2021-11-05T16:44:35.882Z] [INFO] [??] [alarmsTrigger] [createDatabase] creating the trigger database
[2021-11-05T16:44:35.907Z] [INFO] [??] [alarmsTrigger] [server.listen] Express server listening on port 8080
[2021-11-05T16:44:36.118Z] [INFO] [??] [alarmsTrigger] [createDatabase] created trigger database: almalarmservice
[2021-11-05T16:44:36.376Z] [INFO] [??] [alarmsTrigger] [initAllTriggers] resetting system from last state
[2021-11-05T16:46:43.804Z] [INFO] [??] [alarmsTrigger] [setupFollow] got change for trigger xxxxxxxx/whisk.system/getvolumes-trigger
[2021-11-05T16:46:43.806Z] [INFO] [??] [alarmsTrigger] [scheduleCronAlarm] xxxxxxxx/whisk.system/getvolumes-trigger starting cron job
[2021-11-05T16:46:43.811Z] [INFO] [??] [alarmsTrigger] [setupFollow] xxxxxxxx/whisk.system/getvolumes-trigger created successfully
[2021-11-05T17:00:00.013Z] [INFO] [??] [alarmsTrigger] [fireTrigger] Alarm fired for xxxxxxxx/whisk.system/getvolumes-trigger attempting to fire trigger
(node:1) Warning: Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
(Use `node --trace-warnings ...` to show where the warning was created)
[2021-11-05T17:00:00.043Z] [INFO] [??] [alarmsTrigger] [postTrigger] xxxxxxxx/whisk.system/getvolumes-trigger http post request, STATUS:
[2021-11-05T17:00:00.044Z] [ERROR] [??] [alarmsTrigger] [postTrigger] there was an error invoking xxxxxxxx/whisk.system/getvolumes-trigger {"message":"write EPROTO 139717772076864:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../deps/openssl/openssl/ssl/record/ssl3_record.c:332:\n","stack":"Error: write EPROTO 139717772076864:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../deps/openssl/openssl/ssl/record/ssl3_record.c:332:\n\n at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:94:16)","errno":-71,"code":"EPROTO","syscall":"write"}
[2021-11-05T17:00:00.044Z] [INFO] [??] [alarmsTrigger] [postTrigger] attempting to fire trigger again xxxxxxxx/whisk.system/getvolumes-trigger Retry Count: 1
[2021-11-05T17:00:01.050Z] [INFO] [??] [alarmsTrigger] [postTrigger] xxxxxxxx/whisk.system/getvolumes-trigger http post request, STATUS:
[2021-11-05T17:00:01.050Z] [ERROR] [??] [alarmsTrigger] [postTrigger] there was an error invoking xxxxxxxx/whisk.system/getvolumes-trigger {"message":"write EPROTO 139717772076864:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../deps/openssl/openssl/ssl/record/ssl3_record.c:332:\n","stack":"Error: write EPROTO 139717772076864:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../deps/openssl/openssl/ssl/record/ssl3_record.c:332:\n\n at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:94:16)","errno":-71,"code":"EPROTO","syscall":"write"}
[2021-11-05T17:00:01.050Z] [INFO] [??] [alarmsTrigger] [postTrigger] attempting to fire trigger again xxxxxxxx/whisk.system/getvolumes-trigger Retry Count: 2
[2021-11-05T17:00:02.081Z] [INFO] [??] [alarmsTrigger] [postTrigger] xxxxxxxx/whisk.system/getvolumes-trigger http post request, STATUS:
[2021-11-05T17:00:02.082Z] [INFO] [??] [alarmsTrigger] [postTrigger] attempting to fire trigger again xxxxxxxx/whisk.system/getvolumes-trigger Retry Count: 3
[2021-11-05T17:00:02.082Z] [ERROR] [??] [alarmsTrigger] [postTrigger] there was an error invoking xxxxxxxx/whisk.system/getvolumes-trigger {"message":"write EPROTO 139717772076864:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../deps/openssl/openssl/ssl/record/ssl3_record.c:332:\n","stack":"Error: write EPROTO 139717772076864:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../deps/openssl/openssl/ssl/record/ssl3_record.c:332:\n\n at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:94:16)","errno":-71,"code":"EPROTO","syscall":"write"}

I've verified the alarmprovider pod is running 2.3.0 (mirrored to our internal registry) and using node 14:

$ kubectl -n openwhisk-blue describe pod openwhisk-alarmprovider-bdd74df5f-5qm2j | grep "Image:"
    Image:         docker-registry.somedomain/busybox:latest
    Image:          docker-registry.somedomain/openwhisk/alarmprovider:2.3.0

root@openwhisk-alarmprovider-bdd74df5f-5qm2j:/# node --version
v14.17.2

Also here's the env if that helps at all:

root@openwhisk-alarmprovider-bdd74df5f-5qm2j:/# env
OPENWHISK_CONTROLLER_SERVICE_HOST=10.43.45.12
ENDPOINT_AUTH=openwhisk-nginx.openwhisk-blue.svc.cluster.local:80
YARN_VERSION=1.22.5
OPENWHISK_APIGATEWAY_SERVICE_HOST=10.43.158.177
OPENWHISK_CONTROLLER_PORT_8080_TCP_ADDR=10.43.45.12
DB_HOST=couchdb-svc-couchdb.couchdb-blue.svc.cluster.local:5984
OPENWHISK_APIGATEWAY_PORT_9000_TCP_PROTO=tcp
OPENWHISK_REDIS_PORT_6379_TCP_PORT=6379
OPENWHISK_NGINX_PORT_443_TCP_ADDR=10.43.108.19
OPENWHISK_APIGATEWAY_SERVICE_PORT=8080
OPENWHISK_APIGATEWAY_PORT_9000_TCP=tcp://10.43.158.177:9000
OPENWHISK_NGINX_SERVICE_HOST=10.43.108.19
OPENWHISK_APIGATEWAY_PORT_9000_TCP_PORT=9000
HOSTNAME=openwhisk-alarmprovider-bdd74df5f-5qm2j
OPENWHISK_CONTROLLER_PORT_8080_TCP_PROTO=tcp
OPENWHISK_REDIS_SERVICE_PORT=6379
OPENWHISK_APIGATEWAY_PORT_8080_TCP_PROTO=tcp
OPENWHISK_APIGATEWAY_SERVICE_PORT_API=9000
KUBERNETES_PORT_443_TCP_PROTO=tcp
OPENWHISK_REDIS_PORT_6379_TCP=tcp://10.43.235.11:6379
KUBERNETES_PORT_443_TCP_ADDR=10.43.0.1
OPENWHISK_REDIS_SERVICE_HOST=10.43.235.11
OPENWHISK_CONTROLLER_SERVICE_PORT_HTTP=8080
OPENWHISK_APIGATEWAY_PORT_8080_TCP_ADDR=10.43.158.177
KUBERNETES_PORT=tcp://10.43.0.1:443
OPENWHISK_NGINX_PORT=tcp://10.43.108.19:80
OPENWHISK_NGINX_PORT_80_TCP_PORT=80
PWD=/
OPENWHISK_CONTROLLER_PORT_8080_TCP_PORT=8080
HOME=/root
OPENWHISK_REDIS_PORT_6379_TCP_ADDR=10.43.235.11
OPENWHISK_CONTROLLER_PORT=tcp://10.43.45.12:8080
DB_PASSWORD=8qgqiFKazAZH9AWz
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT_443_TCP_PORT=443
OPENWHISK_APIGATEWAY_PORT_8080_TCP_PORT=8080
ROUTER_HOST=openwhisk-nginx.openwhisk-blue.svc.cluster.local:80
NODE_VERSION=14.17.2
OPENWHISK_NGINX_PORT_80_TCP_ADDR=10.43.108.19
OPENWHISK_APIGATEWAY_SERVICE_PORT_MGMT=8080
KUBERNETES_PORT_443_TCP=tcp://10.43.0.1:443
OPENWHISK_NGINX_PORT_80_TCP=tcp://10.43.108.19:80
OPENWHISK_NGINX_PORT_443_TCP_PROTO=tcp
OPENWHISK_APIGATEWAY_PORT_9000_TCP_ADDR=10.43.158.177
OPENWHISK_NGINX_PORT_443_TCP=tcp://10.43.108.19:443
OPENWHISK_APIGATEWAY_PORT=tcp://10.43.158.177:8080
OPENWHISK_CONTROLLER_SERVICE_PORT=8080
TERM=xterm
DB_USERNAME=admin
OPENWHISK_REDIS_PORT=tcp://10.43.235.11:6379
OPENWHISK_NGINX_PORT_80_TCP_PROTO=tcp
OPENWHISK_NGINX_SERVICE_PORT_HTTP=80
OPENWHISK_CONTROLLER_PORT_8080_TCP=tcp://10.43.45.12:8080
OPENWHISK_NGINX_SERVICE_PORT=80
SHLVL=1
OPENWHISK_NGINX_SERVICE_PORT_HTTPS=443
KUBERNETES_SERVICE_PORT=443
OPENWHISK_REDIS_SERVICE_PORT_REDIS=6379
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
OPENWHISK_REDIS_PORT_6379_TCP_PROTO=tcp
KUBERNETES_SERVICE_HOST=10.43.0.1
DB_PREFIX=alm
OPENWHISK_APIGATEWAY_PORT_8080_TCP=tcp://10.43.158.177:8080
OPENWHISK_NGINX_PORT_443_TCP_PORT=443
DB_PROTOCOL=http
_=/usr/bin/env

dgrove-oss commented 2 years ago

OK. I will try to find time this weekend to verify that the alarm provider still works for me locally. I had tested it while working on #698, but I also ended up doing and undoing several things in that commit to try to avoid needing #713, and it's possible I ended up backing out something that fixed a problem with http/https in the alarm provider. I was experimenting with different options for how to configure that.

kostas-meladakis commented 2 years ago

I have the same EPROTO error. Has anyone found a solution? I can create cron triggers, but when they are invoked I get this error. It seems to be a misconfigured IP, the https protocol being used, the -insecure flag not working, or even an --auth error.

s294547 commented 1 year ago

I recently faced this problem and solved it. In my case, the problem was that utils.js in the provider folder builds the URI of the apiHost this way: this.uriHost ='https://' + this.routerHost;. I deployed openwhisk using this helm chart, and if you look at the yaml of the alarm provider deployment, the ROUTER_HOST and ENDPOINT_AUTH environment variables use the INTERNAL api host name and port. This is a problem because the internal port is 80, which serves plain HTTP, while the URI always uses https. The code should be patched to check whether the provided port is a secure one; if it is not, the protocol should be http.
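A minimal sketch of the kind of patch described above, not the actual fix in openwhisk-package-alarms. The helper name `buildUriHost` and the `SECURE_PORTS` list are assumptions made for illustration. It keeps https as the default when no port is given, so existing behaviour is unchanged for bare host names, and it does not try to handle unbracketed IPv6 literals.

```javascript
// Hypothetical helper: picks the scheme from the port in ROUTER_HOST.
// The set of "secure" ports is an assumption; adjust it for your deployment.
const SECURE_PORTS = new Set(['443', '8443']);

function buildUriHost(routerHost) {
  // If the value already carries a scheme, leave it untouched.
  if (/^https?:\/\//i.test(routerHost)) {
    return routerHost;
  }
  // Extract a trailing ":port", if present.
  const match = /:(\d+)$/.exec(routerHost);
  const port = match ? match[1] : null;
  // No port: keep the existing https default. Known TLS port: https. Otherwise: http.
  const protocol = port === null || SECURE_PORTS.has(port) ? 'https' : 'http';
  return protocol + '://' + routerHost;
}

// ROUTER_HOST value taken from the env dump earlier in this thread.
console.log(buildUriHost('openwhisk-nginx.openwhisk-blue.svc.cluster.local:80'));
```

Guessing the scheme from the port is only a heuristic. An explicit protocol setting would be more robust, but that would need a new configuration option in the provider.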

mretallack commented 11 months ago

I have raised:

https://github.com/apache/openwhisk-package-alarms/issues/240

To connect the issues.

apache / openwhisk-deploy-kube

Cron triggers not firing; EPROTO error in alarmprovider pod #712