Open alfredodeza opened 7 years ago
+1
I ran into an issue on Friday where firewalld was getting in my way. Total user error, but it would have saved me some time if ceph-medic had pointed this out.
@djgalloway was your issue a problem with nodes that were already part of the cluster and couldn't talk to each other? This is not an easy problem to solve (unfortunately). The other side of this problem is ensuring a pre-check. Quoting @jcsp here:
I'm talking about opening a TCP connection between two remote nodes to
verify that the network connectivity is working, and probably doing
this across a large set of pairs e.g. doing an all-to-all ping pong
between OSD nodes. Obviously, there is just the standard `ping`, but
I'm expecting that we'll want to test using actual TCP traffic in the
port ranges that the OSDs would use.
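A minimal sketch of the all-to-all idea in the quote above, assuming plain Python sockets. The names `can_connect` and `node_pairs` are hypothetical and are not part of ceph-medic. `can_connect` only tests from the machine it runs on, so a real check would have to execute it on each source node.

```python
import itertools
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def node_pairs(nodes):
    """Return every ordered (source, destination) pair of distinct nodes.

    Ordered pairs are used because a firewall can block one direction only.
    """
    return list(itertools.permutations(nodes, 2))
```

For three nodes this yields six ordered pairs. Each pair would then be checked against the ports the daemons actually listen on.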
@alfredodeza I was adding additional nodes to an existing cluster, MONs specifically. I'd added two and removed two using ceph-ansible. Because two of the nodes I added had firewalld running, the MONs fell out of quorum. There was no easy or quick way for me to find that out: I had to dump mon_status and debug until I realized firewalld was running and not configured to allow Ceph traffic.
The network collector should include connectivity between nodes where daemons are running.
Actual TCP traffic should be sent, and maybe verify that the `CEPH_BANNER` is being sent: http://docs.ceph.com/docs/master/dev/network-protocol/
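A rough sketch of what a banner check could look like, assuming the daemon sends its banner as soon as the connection is accepted and that the banner begins with the bytes `ceph v`. Both assumptions should be verified against the protocol document linked above. `read_banner` and `has_ceph_banner` are hypothetical names, not existing ceph-medic functions.

```python
import socket

# Assumption: the banner starts with these bytes; the exact banner string
# depends on the messenger protocol version in use.
BANNER_PREFIX = b"ceph v"


def read_banner(host, port, timeout=3.0, size=9):
    """Connect to host:port and return up to `size` bytes sent by the peer."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = b""
        while len(data) < size:
            chunk = sock.recv(size - len(data))
            if not chunk:
                # Peer closed the connection before sending `size` bytes.
                break
            data += chunk
        return data


def has_ceph_banner(host, port, timeout=3.0):
    """Return True if the peer's first bytes look like a Ceph banner."""
    try:
        return read_banner(host, port, timeout).startswith(BANNER_PREFIX)
    except OSError:
        # Connection refused, timed out, or otherwise failed.
        return False
```

Run from one node against another node's daemon port, a `False` result separates "something answered but it is not Ceph, or nothing answered" from a healthy listener, which is the kind of hint that would point at a firewall.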