Open bengland2 opened 5 years ago
I haven't seen that particular error before.
A consequence of this is that the ceph-ansible retry never completes! ceph-linode retries ceph-ansible if it fails, but for some reason ceph-ansible never even attempts the subsequent roles. See this Jenkins log and search for the string "fatal": you'll see the MDS role fail, but what's weird is that the client-00N hosts aren't even touched in the subsequent ceph-ansible retry. Of course this breaks everything that follows. I can work around it, but it's really annoying. I can't see that ceph-linode is doing anything wrong, can you? Not sure what's going on here.
Oh, it's this message: "Cannot retrieve metalink for repository: epel/x86_64."
Ya, I saw that before. I never figured out why that happens :(
@ktdreyer @leseb have you ever seen this message or how to work around it?
@bengland2 in the meantime, you might just have the "retry" do all roles again, especially since the cluster is small. That way the clients won't be skipped.
@batrick I expect the retry to do all roles again. That's the problem: it didn't on the second ceph-ansible run:
PLAY [clients] *****************************************************************
skipping: no hosts matched
Clearly they were in the inventory file, because the delegate facts task saw them. Well, maybe it was a full moon, or I didn't sacrifice a small animal or something. I'll work around it for now.
Another problem I ran into involves the MDS keyring. When I re-ran launch.sh, the resulting cluster had no MDS. The log showed the MDS couldn't boot because its keyring didn't work; I had to remove the keyring and re-run ceph-ansible before it worked. Either ceph-ansible should not change the keyring expected by the monitor, or it should forcibly delete whatever keyring was already there in /var/lib/ceph/mds/*/keyring.
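The keyring workaround above can be sketched as a small script. This is a hypothetical cleanup sketch, not anything ceph-ansible itself does: it removes stale MDS keyrings so the next ceph-ansible run regenerates keys that match the monitor. The demo directory and MDS name are placeholders; on a real node the base path would be /var/lib/ceph.

```shell
# CEPH_LIB defaults to a demo directory for illustration; on a real
# cluster node you would set it to /var/lib/ceph (as root).
CEPH_LIB="${CEPH_LIB:-./demo-ceph}"

# Demo setup only: fake an MDS data dir with a stale keyring in it.
mkdir -p "$CEPH_LIB/mds/ceph-mds0"
touch "$CEPH_LIB/mds/ceph-mds0/keyring"

# Remove every stale MDS keyring so ceph-ansible can recreate them.
for keyring in "$CEPH_LIB"/mds/*/keyring; do
    [ -e "$keyring" ] || continue   # glob matched nothing
    echo "removing stale keyring: $keyring"
    rm -f "$keyring"
done
```

After this, re-running ceph-ansible should hand each MDS a fresh keyring that the monitor accepts.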
The HTTP 503 error is a problem with the Fedora Infrastructure's MirrorManager software. It is designed to dynamically determine a list of "up to date" EPEL mirrors, and unfortunately that web app can sometimes return this error.
In the Sepia lab we've worked around this by editing Yum's configuration and statically defining a list of mirrors that are US-based and seem to be reliable.
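The static-mirror workaround might look something like the sketch below: drop a repo file that replaces EPEL's metalink with a fixed baseurl, so yum never asks MirrorManager at all. The mirror URL here is the main Fedora download host as an example, not necessarily the one Sepia uses; substitute whichever US-based mirrors have been reliable for you.

```shell
# Where to write the repo file; on a real host this would be
# /etc/yum.repos.d (writable only by root). Using the current
# directory here so the sketch is runnable anywhere.
REPO_DIR="${REPO_DIR:-.}"

# Static baseurl instead of a metalink= line, so the MirrorManager
# web app is never consulted and its 503s can't break the install.
cat > "$REPO_DIR/epel-static.repo" <<'EOF'
[epel]
name=Extra Packages for Enterprise Linux 7 (static mirror)
baseurl=https://dl.fedoraproject.org/pub/epel/7/$basearch/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
EOF
```

You can list several baseurl entries (one per line) so yum fails over between them.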
I'm not sure exactly why MirrorManager is unreliable there. The latest bug I've seen is https://bugzilla.redhat.com/show_bug.cgi?id=1593033 so I've added a comment there today.
@batrick from time to time we see similar messages on our CI too, e.g. "Failure talking to yum: Cannot find a valid baseurl for repo: base/7/x86_64".
Also see: https://github.com/ceph/ceph-ansible/commit/98cb6ed8f602d9c54b63c5381a17dbca75df6bc2
I get this strange, non-reproducible yum failure: if I go back and re-run the command on the same cluster, it succeeds. Has anyone else seen that? I'm guessing the mirror site used by the yum repo was busy. Is there a way to make yum more robust in the face of this by retrying? I'm going to try the Ansible yum module and see if that's more resilient. I could also try "yum -t --randomwait=1", since having all these Linodes hit the yum repo server at the same time may contribute to the problem.
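A sketch of the retry idea: yum.conf has real "retries" and "timeout" options in its [main] section, so bumping them makes yum tolerate a flaky mirror longer. The values below are guesses to tune, and the config is written to a demo file here rather than the real /etc/yum.conf.

```shell
# On a real host this would be /etc/yum.conf; using a demo file so
# the sketch is safe to run anywhere.
YUM_CONF="${YUM_CONF:-./yum.conf.demo}"

# retries: how many times yum retries a failed download (default 10).
# timeout: seconds before giving up on an unresponsive mirror.
cat > "$YUM_CONF" <<'EOF'
[main]
retries=20
timeout=60
EOF

# The staggered-start idea from above: -t (tolerant) keeps going past
# errors, and --randomwait=1 sleeps a random interval up to 1 minute
# so all the Linodes don't hit the mirror at the same instant.
echo "example invocation: yum -t --randomwait=1 install -y <package>"
```

The Ansible yum module can be combined with the task-level retries/until keywords for the same effect at the playbook layer.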