Closed rawlinp closed 5 years ago
I've actually found that, more often than not, removing the `infrastructure/cdn-in-a-box/traffic_ops/ca` directory is directly correlated with this issue. Typically I don't do that unless some cert problem prevents startup; otherwise I leave it alone and merely destroy volumes. Could it be a race condition between the write and a read? That would explain why generating them, taking it down, and starting again fixes the problem.
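If the cause really is a reader seeing a half-written file, one common mitigation is to write each file to a temporary name in the same directory and rename it into place. The sketch below only illustrates that idea; `write_atomically` is a hypothetical helper, not something that exists in the CIAB scripts, and it assumes `mktemp` is available in the container.

```sh
#!/bin/sh
# Hypothetical helper (not part of CIAB): write stdin to "$1" so that a
# concurrent reader sees either the previous file or the complete new one.
write_atomically() {
  local dest tmp
  dest="$1"
  # Create the temp file next to the destination so that mv is a rename
  # on the same filesystem rather than a copy.
  tmp="$(mktemp "${dest}.XXXXXX")" || return 1
  if cat > "$tmp" && mv -f "$tmp" "$dest"; then
    return 0
  fi
  # Writing or renaming failed: clean up the partial temp file.
  rm -f "$tmp"
  return 1
}
```

A generator would pipe its output through it, for example `printf '%s\n' "$pem" | write_atomically "$dir/demo1.crt"`, where `$pem` and `$dir` are placeholders. Note that `mktemp` creates the file with owner-only permissions, which the rename keeps.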
The issue seems to be that `X509_GENERATION_COMPLETE` is getting set before `X509_DEMO1*`. Setting `X509_GENERATION_COMPLETE` releases a lot of different processes to continue, one of them being the process that adds the certs into TO. So, that process will try to read undefined env vars in order to add certs into TO and get stuck in an infinite failure loop. It also seems that sometimes the `X509_DEMO1*` vars get set, but the actual files they point to on disk haven't been written yet. So, the "add certs to TO" process tries to read nonexistent files, then gets stuck in an infinite failure loop trying to post empty data to the TO sslkeys endpoint.
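A guard against both failure modes could look roughly like the sketch below: before posting anything, check that each variable is defined and that the file it names exists and is non-empty, and give up after a bounded number of attempts instead of looping forever. This is not the actual fix; `certs_ready` and `wait_for_certs` are hypothetical helpers, the env file argument is an assumption about how the variables are shared between containers, and the concrete `X509_DEMO1*` variable names are left to the caller because the thread does not spell them out.

```sh
#!/bin/sh
# Hypothetical helpers (not part of CIAB).

# certs_ready VAR...
# Succeeds only if every named variable is set and points to a non-empty file.
certs_ready() {
  local name path
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    # The names come from the calling script, not from untrusted input.
    eval "path=\${$name:-}"
    [ -n "$path" ] && [ -s "$path" ] || return 1
  done
  return 0
}

# wait_for_certs ENV_FILE ATTEMPTS DELAY VAR...
# Re-reads ENV_FILE (assumed to hold the variable assignments) before each
# check, and fails after ATTEMPTS tries instead of retrying forever.
wait_for_certs() {
  local env_file attempts delay i
  env_file="$1"; attempts="$2"; delay="$3"
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Pick up variables written since the last check.
    [ -r "$env_file" ] && . "$env_file"
    certs_ready "$@" && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for: $*" >&2
  return 1
}
```

The process that adds certs to TO would call `wait_for_certs` first and only build the sslkeys request once it returns success, so an empty payload is never posted.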
It's interesting that you mention it seems to happen mainly when deleting the `ca` dir between restarts. That has been the standard advice for a while now, but I could see how that would have something to do with it. Either way, I think I'm going to add some better safeguards into the "add sslkeys to TO" process.
I think the reason that became standard advice was because there was a time when the certs would expire, and you needed to delete the directory to fix that. But I think that Jeff fixed that by extending the cert lifetime by like 35 years or something, so it shouldn't be necessary anymore (I think)
Interesting, that is good to know. Would you mind reviewing the fix #3772 when you get a chance?
I'm submitting a ...
Traffic Control components affected ...
Current behavior:
When starting up CIAB, sometimes it gets stuck in a loop where it continually calls the TO API with empty cert input:
In the past, it's been thought that simply running `docker-compose down -v` to remove the created volumes and `rm -r traffic_ops/ca` to remove that directory should be enough to prevent this loop from occurring, but that doesn't seem to work for me every time.
Expected / new behavior:
When starting up CIAB, it should not get stuck in an endless loop of attempting to add empty certs via the TO API.
Minimal reproduction of the problem with instructions:
Run `docker-compose -f docker-compose.yml up --build`. Then, if it does not get stuck in the loop above, press ctrl-c in the terminal, run `docker-compose -f docker-compose.yml down`, and then run `docker-compose -f docker-compose.yml up --build` again. It's likely to get stuck in the same loop.
Anything else:
Though it might not be the cause, https://github.com/apache/trafficcontrol/pull/3489 fixed some other related things, so it might be a good place to dive into for fixing this issue.