ganglia / ganglia-web

Ganglia Web Frontend

When one host in a cluster is dead, gweb cannot show data for any host #265

Open CrystalCat opened 9 years ago

CrystalCat commented 9 years ago

Dear vvuksan, I recently tried to deploy Ganglia in my production environment, but something went wrong.

env:
    selinux: disabled
    firewall: disabled
    os: redhat6.6 x64
    transfer method: unicast
    ganglia pkgs (from epel):
        ganglia-gmetad-3.7.1-1.el6.x86_64
        ganglia-web-3.6.2-1.el6.x86_64
        ganglia-gmond-3.7.1-1.el6.x86_64
        ganglia-3.7.1-1.el6.x86_64
    hosts:
        X86-Manage1:
            ip: 172.20.31.131
            components: gmetad, gmond, gweb
            gridname: DP_Ganglia
            ds:
                data_source "DP_Ganglia" 10 X86-Manage2   # the lower-level cluster's gmetad

        X86-Manage2:
            ip: 172.20.31.132
            components: gmetad, gmond
            gridname: DP_Ganglia
            cluster: NewBilling
            ds:
                data_source "NewBilling" 10 X86-Manage2   # self

        PMC_WEB_SRV3/4:
            ip: 172.20.31.35/36
            components: gmond
            cluster: NewBilling
    topology:
        X86-Manage1: the topmost gmetad, with gweb; collects data from X86-Manage2's gmetad.
        X86-Manage2: the gmetad node for cluster 'NewBilling'; collects all gmond data for that cluster.
        PMC_WEB_SRV3/4: the hosts to be monitored.

Configuration sample (gmond.conf on PMC_WEB_SRV3/4):

cluster {
  name = "NewBilling"
}
udp_send_channel {
  host = X86-Manage2
  port = 8649
  ttl = 1
}

udp_recv_channel {
  port = 8649
} 

tcp_accept_channel {
  port = 8649
  gzip_output = no 
} 

Every gmetad.conf is configured as shown in the host definitions above.

Problem: everything works well while all gmond/gmetad daemons are running. If I stop the gmond daemon on host PMC_WEB_SRV3 (service gmond stop), gstat on X86-Manage2 shows:

CLUSTER INFORMATION
       Name: NewBilling
      Hosts: 2
Gexec Hosts: 0
 Dead Hosts: 1
  Localtime: Fri Jun 26 16:40:12 2015

There are no hosts running gexec at this time

So gmetad detected the lost connection to the gmond I just stopped, which is fine. But when I open the gweb GUI, none of the hosts shows any graphs, no matter which host I select.

When PMC_WEB_SRV3's gmond recovers, gweb returns to normal.

Note: while PMC_WEB_SRV3's gmond is stopped, all metrics from the other hosts in cluster NewBilling are still received by X86-Manage2.
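To double-check which hosts the collector itself considers dead, independently of gweb, the XML that gmond serves on its tcp_accept_channel can be inspected directly. The following is only a rough sketch: it assumes each HOST element carries TN (seconds since last report) and TMAX attributes, and that a host counts as dead once TN exceeds 4 x TMAX. Both are my understanding of Ganglia's behaviour, not something confirmed in this thread, and the function names are made up for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump that gmond writes to its tcp_accept_channel."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def host_status(xml_data, dead_factor=4):
    """Map each HOST name to 'alive' or 'dead'.

    Assumption: a host is dead when TN > TMAX * dead_factor.
    """
    root = ET.fromstring(xml_data)
    status = {}
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        dead = tmax > 0 and tn > tmax * dead_factor
        status[host.get("NAME")] = "dead" if dead else "alive"
    return status
```

For the setup above this would be called as `host_status(fetch_gmond_xml("X86-Manage2"))`. If only PMC_WEB_SRV3 is reported dead, the collection side is healthy and the fault is more likely in how gweb looks up the hosts.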

Please help me analyze what causes this symptom. Is it a bug, or did I misconfigure something? Many thanks!

CrystalCat commented 9 years ago

Update: I believe the case of the hostnames causes this problem. If every host is referenced by IP address or by a lowercase hostname, the problem does not occur.
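For anyone hitting the same symptom, a small helper like the one below can flag the hostnames worth changing. It is just a sketch: the function names are hypothetical, and the idea that some lookup compares names case-sensitively is my guess at the mechanism, not something verified in the gweb source.

```python
def mixed_case_hostnames(hostnames):
    """Return the hostnames that contain at least one uppercase character."""
    return [name for name in hostnames if name != name.lower()]


def case_collisions(reported, looked_up):
    """Find looked-up names that match a reported name only when case is ignored.

    Returns a dict mapping each such looked-up name to the reported spelling.
    """
    by_lower = {name.lower(): name for name in reported}
    return {
        name: by_lower[name.lower()]
        for name in looked_up
        if name not in reported and name.lower() in by_lower
    }
```

Running `mixed_case_hostnames` over the names used in gmond.conf and the gmetad.conf data_source lines shows which ones to replace with lowercase names or IP addresses. IP addresses are never flagged because they contain no letters.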