ceph / ceph-medic

find common issues in ceph clusters
MIT License

collector: network connectivity between nodes #18

Open alfredodeza opened 7 years ago

alfredodeza commented 7 years ago

The network collector should check connectivity between the nodes where daemons are running.

Actual TCP traffic should be sent, and the check could also verify that the CEPH_BANNER is being sent:

http://docs.ceph.com/docs/master/dev/network-protocol/
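
A rough sketch of what such a check might look like, using only the Python standard library. The function names are made up for illustration, and the `b"ceph v"` prefix is an assumption about how the banner begins; the exact string should be taken from the protocol document linked above.

```python
import socket

# Assumption: every Ceph messenger banner begins with this prefix
# (check the network-protocol document for the exact string).
BANNER_PREFIX = b"ceph v"


def read_banner(host, port, timeout=3.0, size=32):
    """Open a TCP connection and return the first bytes the peer sends.

    Returns None if the connection fails or nothing arrives in time.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return sock.recv(size)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None


def looks_like_ceph(banner):
    """True if the received bytes start with the assumed banner prefix."""
    return bool(banner) and banner.startswith(BANNER_PREFIX)
```

Something like `looks_like_ceph(read_banner("mon1.example.com", 6789))` would then separate "port reachable and a Ceph daemon answered" from "nothing there or blocked". The hostname is a placeholder, and 6789 is the default legacy MON port.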

djgalloway commented 6 years ago

+1

I ran into an issue on Friday where firewalld was getting in my way. Total user error, but it would have saved me some time if ceph-medic had pointed this out.

alfredodeza commented 6 years ago

@djgalloway was your issue a problem with nodes that were already part of the cluster and couldn't talk to each other? This is not an easy problem to solve (unfortunately). The other side of this problem is providing a pre-check. Quoting @jcsp here:

> I'm talking about opening a TCP connection between two remote nodes to
> verify that the network connectivity is working, and probably doing
> this across a large set of pairs e.g. doing an all-to-all ping pong
> between OSD nodes. Obviously, there is just the standard `ping`, but
> I'm expecting that we'll want to test using actual TCP traffic in the
> port ranges that the OSDs would use.

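
To make the all-to-all idea concrete, here is a minimal sketch that only builds the list of checks to run. `connectivity_plan` is a hypothetical helper, not part of ceph-medic, and 6800-7300 is assumed to be the default OSD port range; a real check would probably use the ports each daemon actually bound.

```python
from itertools import permutations

# Assumption: 6800-7300 is the default OSD port range.
OSD_PORTS = range(6800, 7301)


def connectivity_plan(nodes, ports=OSD_PORTS):
    """Yield (source, target, port) for every ordered pair of distinct nodes."""
    for source, target in permutations(nodes, 2):
        for port in ports:
            yield source, target, port
```

Each tuple describes one TCP connection that would be opened from `source` to `target`; how the connection is launched on the remote node is left out here. With three nodes and one port this yields six checks, and the count grows quadratically with the number of nodes, so sampling pairs or ports may be needed on large clusters.
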
djgalloway commented 6 years ago

@alfredodeza I was adding additional nodes to an existing cluster, MONs specifically. I'd added two and removed two using ceph-ansible. Because two of the nodes I added had firewalld running, the MONs fell out of quorum. There was no easy or quick way for me to find that out other than dumping mon_status and debugging until I realized that firewalld was running and not configured to allow Ceph traffic.
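
For this firewalld case specifically, one possible check is to look at the output of `firewall-cmd --list-ports` on each node. The sketch below only parses that output; `port_allowed` is a hypothetical helper, and it assumes the usual space-separated `port/proto` or `low-high/proto` entries.

```python
def port_allowed(list_ports_output, port, proto="tcp"):
    """Return True if `port` is covered by an entry in the given output.

    `list_ports_output` is the text printed by `firewall-cmd --list-ports`,
    for example "6789/tcp 6800-7300/tcp".
    """
    for entry in list_ports_output.split():
        spec, _, entry_proto = entry.partition("/")
        if entry_proto != proto:
            continue
        # A single port has no "-", so `high` is empty and falls back to `low`.
        low, _, high = spec.partition("-")
        if int(low) <= port <= int(high or low):
            return True
    return False
```

A False result is not conclusive on its own: firewalld can also open ports through named services, which this does not inspect, so a complete check would look at those too.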