apache / trafficcontrol

Apache Traffic Control is an Open Source implementation of a Content Delivery Network
https://trafficcontrol.apache.org/
Apache License 2.0
1.08k stars 344 forks source link

CIAB certificate addition issues causes infinite loop on startup #3769

Closed rawlinp closed 5 years ago

rawlinp commented 5 years ago

I'm submitting a ...

Traffic Control components affected ...

Current behavior:

When starting up CIAB, sometimes it gets stuck in a loop where it continually calls the TO API with empty cert input:

trafficops_1       | ++ to-post api/1.4/deliveryservices/sslkeys/add '{
trafficops_1       |   "cdn": "CDN-in-a-Box",
trafficops_1       |   "certificate": {
trafficops_1       |     "crt": "",
trafficops_1       |     "csr": "",
trafficops_1       |     "key": ""
trafficops_1       |   },
trafficops_1       |   "deliveryservice": "demo1",
trafficops_1       |   "hostname": "*.demo1.mycdn.ciab.test",
trafficops_1       |   "key": "demo1",
trafficops_1       |   "version": 1
trafficops_1       | }'
trafficops_1       | ++ local t
trafficops_1       | ++ local data
trafficops_1       | ++ [[ -z {
trafficops_1       |   "cdn": "CDN-in-a-Box",
trafficops_1       |   "certificate": {
trafficops_1       |     "crt": "",
trafficops_1       |     "csr": "",
trafficops_1       |     "key": ""
trafficops_1       |   },
trafficops_1       |   "deliveryservice": "demo1",
trafficops_1       |   "hostname": "*.demo1.mycdn.ciab.test",
trafficops_1       |   "key": "demo1",
trafficops_1       |   "version": 1
trafficops_1       | } ]]
trafficops_1       | ++ [[ -f {
trafficops_1       |   "cdn": "CDN-in-a-Box",
trafficops_1       |   "certificate": {
trafficops_1       |     "crt": "",
trafficops_1       |     "csr": "",
trafficops_1       |     "key": ""
trafficops_1       |   },
trafficops_1       |   "deliveryservice": "demo1",
trafficops_1       |   "hostname": "*.demo1.mycdn.ciab.test",
trafficops_1       |   "key": "demo1",
trafficops_1       |   "version": 1
trafficops_1       | } ]]
trafficops_1       | +++ mktemp
trafficops_1       | ++ t=/tmp/tmp.g9weCBLU2B
trafficops_1       | ++ echo '{' '"cdn":' '"CDN-in-a-Box",' '"certificate":' '{' '"crt":' '"",' '"csr":' '"",' '"key":' '""' '},' '"deliveryservice":' '"demo1",' '"hostname":' '"*.demo1.mycdn.ciab.test",' '"key":' '"demo1",' '"version":' 1 '}'
trafficops_1       | ++ data='--data @/tmp/tmp.g9weCBLU2B'
trafficops_1       | ++ to-auth
trafficops_1       | ++ [[ -z https://trafficops.infra.ciab.test:443 ]]
trafficops_1       | ++ [[ -z admin ]]
trafficops_1       | ++ [[ -z twelve ]]
trafficops_1       | ++ cookie_current /tmp/tmp.aO15EzejCT
trafficops_1       | ++ local cookiefile=/tmp/tmp.aO15EzejCT
trafficops_1       | ++ [[ -s /tmp/tmp.aO15EzejCT ]]
trafficops_1       | +++ awk '/mojolicious/ {print $5}' /tmp/tmp.aO15EzejCT
trafficops_1       | +++ tail -n 1
trafficops_1       | ++ exp=1564721547
trafficops_1       | +++ date +%s
trafficops_1       | ++ cur=1564700655
trafficops_1       | ++ ((  1564721547 > 1564700655  ))
trafficops_1       | ++ return
trafficops_1       | ++ curl -k -LfsS -H 'Content-Type: application/json;charset=UTF-8' --cookie /tmp/tmp.aO15EzejCT -X POST --data @/tmp/tmp.g9weCBLU2B https://trafficops.infra.ciab.test:443/api/1.4/deliveryservices/sslkeys/add
edge_1             | + to-get api/1.3/cdns/name/CDN-in-a-Box/sslkeys
edge_1             | + to-auth
edge_1             | + [[ -z https://trafficops.infra.ciab.test:443 ]]
edge_1             | + [[ -z admin ]]
edge_1             | + [[ -z twelve ]]
edge_1             | + cookie_current /tmp/tmp.3SLWasQmrd
edge_1             | + local cookiefile=/tmp/tmp.3SLWasQmrd
edge_1             | + [[ -s /tmp/tmp.3SLWasQmrd ]]
trafficops_1       | INFO: routing.go:196: 2019-08-01T23:04:16.05390824Z: POST /api/1.4/deliveryservices/sslkeys/add handling (reqid 2440)
edge_1             | ++ awk '/mojolicious/ {print $5}' /tmp/tmp.3SLWasQmrd
trafficops_1       | DEBUG: plugin.go:178: 2019-08-01T23:04:16.05414433Z: DEBUG plugins.OnRequest calling 0 plugins
edge_1             | ++ tail -n 1
trafficops_1       | DEBUG: api.go:134: 2019-08-01T23:04:16.056513174Z: parsing request: validating: missing fields: certificate.key required; certificate.crt required; certificate.csr required

In the past, it's been thought that simply running docker-compose down -v to remove the created volumes and rm -r traffic_ops/ca to remove that directory should be enough to prevent this loop from occurring, but that doesn't seem to work for me every time.

Expected / new behavior:

When starting up CIAB, it should not get stuck in an endless loop of attempting to add empty certs via the TO API.

Minimal reproduction of the problem with instructions:

docker-compose -f docker-compose.yml up --build then if it does not get stuck in that loop above, ctrl-c the terminal, then run docker-compose -f docker-compose.yml down followed by docker-compose -f docker-compose.yml up --build again. It's likely to get stuck in the same loop.

Anything else:

Though it might not be the cause, https://github.com/apache/trafficcontrol/pull/3489 fixed some other related things, so it might be a good place to dive into for fixing this issue.

ocket8888 commented 5 years ago

I've actually found that, more often than not, removing the infrastructure/cdn-in-a-box/traffic_ops/ca directory is directly correlated with this issue. Typically I don't do that unless some cert problem prevents startup, but otherwise leave it alone and merely destroy volumes. Race condition between the write and a read? Would explain why generating them, taking it down, and starting again fixes the problem.

rawlinp commented 5 years ago

The issue seems to be that X509_GENERATION_COMPLETE is getting set before X509_DEMO1*. Setting X509_GENERATION_COMPLETE releases a lot of different processes to continue, one of them being the process that adds the certs into TO. So, that process will try to read undefined env vars in order to add certs into TO and get stuck in an infinite failure loop. It also seems that sometimes the X509_DEMO1* vars get set, but the actual files they point to on disk haven't been written yet. So, the "add certs to TO" process tries to read nonexistent files then gets stuck in an infinite failure loop trying to post empty data to the TO sslkeys endpoint.

It's interesting that you mention it seems to happen mainly when deleting the ca dir between restarts. That has been the standard advice for a while now, but I could see how that would have something to do with it. Either way, I think I'm going to add some better safeguards into the "add sslkeys to TO" process.

ocket8888 commented 5 years ago

I think the reason that became standard advice was because there was a time when the certs would expire, and you needed to delete the directory to fix that. But I think that Jeff fixed that by extending the cert lifetime by like 35 years or something, so it shouldn't be necessary anymore (I think)

rawlinp commented 5 years ago

Interesting, that is good to know. Would you mind reviewing the fix #3772 when you get a chance?