batrick / ceph-linode

Launch Ceph using the Linode VPS provider
GNU General Public License v3.0

strange intermittent failure of yum on linodes #38

Open bengland2 opened 5 years ago

bengland2 commented 5 years ago

I get this strange failure of yum that is not reproducible: if I go back and re-run the command on the same cluster, it succeeds. Has anyone else seen that? I'm guessing the mirror site used by the yum repo was busy. Is there a way to make yum more robust in the face of this by retrying? I'm going to try the ansible yum module and see if that's more resilient. I could also try "yum -t --randomwait=1", because maybe having all these linodes hit the yum repo server at the same time contributes to the problem.

```
$ ansible -m shell -a 'yum install -y wget yum-utils' all
 [WARNING]: Consider using yum module rather than running yum
mgr-000 | FAILED | rc=1 >>

 One of the configured repositories failed (Unknown),
 and yum doesn't have enough cached data to continue.
```
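
For what it's worth, a minimal, untested sketch of the yum-module-plus-retries idea described above might look like this; the retry count and delay are just guesses:

```yaml
# Hypothetical play: install via the yum module and retry on transient
# mirror failures instead of failing the run on the first error.
- hosts: all
  become: true
  tasks:
    - name: install wget and yum-utils, retrying if the mirror is flaky
      yum:
        name:
          - wget
          - yum-utils
        state: present
      register: yum_result
      retries: 5        # assumed value, tune as needed
      delay: 30         # seconds between attempts, also assumed
      until: yum_result is succeeded
```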
batrick commented 5 years ago

I haven't seen that particular error before.

bengland2 commented 5 years ago

A consequence of this is that the ceph-ansible retry never completes! ceph-linode retries ceph-ansible if it fails, but for some reason ceph-ansible never even attempts the subsequent roles. See this jenkins log and look for the "fatal" string in it: you'll see the MDS role fail, but what's weird is that the client-00N hosts are not even touched in the subsequent ceph-ansible retry. Of course this screws up everything that follows. I can work around it, but it's really annoying. I can't see that ceph-linode is doing anything wrong, can you? Not sure what's going on here.

batrick commented 5 years ago

OH, THIS message "Cannot retrieve metalink for repository: epel/x86_64."

Ya, I saw that before. I never figured out why that happens :(

@ktdreyer @leseb have you ever seen this message or how to work around it?

@bengland2 in the meantime, you might just have the "retry" do all roles again, especially since the cluster is small. That way the clients won't be skipped.

bengland2 commented 5 years ago

@batrick I expect the retry to do all roles again. That's the problem: it didn't on the 2nd ceph-ansible run:

```
PLAY [clients] *****************************************************************
skipping: no hosts matched
```

Clearly they were in the inventory file, because the delegate facts task saw them. Well, maybe it was a full moon or I didn't sacrifice a small animal or something; I'll work around it for now.

Another problem I ran into: the MDS keyring. When I re-ran launch.sh, the resulting cluster had no MDS. When I looked at the log, the MDS couldn't boot because the keyring didn't work, so I had to remove the keyring and rerun ceph-ansible, and then it worked. Either ceph-ansible should not change the keyring expected by the monitor, or it should forcibly delete whatever keyring was there before in /var/lib/ceph/mds/*/keyring.
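
For anyone hitting the same thing, here is an untested sketch of that keyring cleanup as a small play against the ceph-ansible mdss group (the group name and keyring path are assumed from a default ceph-ansible layout); rerun ceph-ansible afterwards as usual:

```yaml
# Hypothetical cleanup: remove stale MDS keyrings so the next ceph-ansible
# run can regenerate ones that match what the monitor expects.
- hosts: mdss
  become: true
  tasks:
    - name: find existing MDS keyrings
      find:
        paths: /var/lib/ceph/mds
        patterns: keyring
        recurse: yes
      register: mds_keyrings

    - name: delete the stale keyrings
      file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ mds_keyrings.files }}"
```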

ktdreyer commented 5 years ago

The HTTP 503 error is a problem with the Fedora Infrastructure's MirrorManager software. It is designed to dynamically determine a list of "up to date" EPEL mirrors, and unfortunately that web app can sometimes return this error.

In the Sepia lab we've worked around this by editing Yum's configuration and statically defining a list of mirrors that are US-based and seem to be reliable.
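
If it helps others, a rough sketch of that kind of workaround using the yum_repository module, pinning EPEL to one static baseurl instead of the metalink (the URL below is just the main Fedora download server; substitute whichever mirrors have been reliable for you):

```yaml
# Hypothetical workaround: replace the metalink-based epel definition with a
# static baseurl so MirrorManager is no longer in the critical path.
- hosts: all
  become: true
  tasks:
    - name: point the epel repo at a fixed mirror
      yum_repository:
        name: epel
        file: epel
        description: EPEL 7 (static mirror)
        baseurl: https://dl.fedoraproject.org/pub/epel/7/$basearch/
        gpgcheck: yes
        gpgkey: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
```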

I'm not sure of the exact cause of MirrorManager's unreliability there. The latest bug I've seen is https://bugzilla.redhat.com/show_bug.cgi?id=1593033, so I've added a comment there today.

leseb commented 5 years ago

@batrick from time to time we see similar messages on our CI too, e.g. Failure talking to yum: Cannot find a valid baseurl for repo: base/7/x86_64.

Also see: https://github.com/ceph/ceph-ansible/commit/98cb6ed8f602d9c54b63c5381a17dbca75df6bc2