acassen / keepalived

Keepalived
https://www.keepalived.org
GNU General Public License v2.0

SNMP doesn't properly register mibs #186

Closed dalehamel closed 9 years ago

dalehamel commented 9 years ago

I've been trying to set up SNMP support for Keepalived to create a highly available pair of NAT nodes. Below is my configuration:

vrrp_script check_nat {
  script "/etc/keepalived/check-nat.sh"
  interval 2
}
vrrp_instance nat_instance {
  debug 2
  interface eth0
  state BACKUP
  virtual_router_id 53
  priority 10
  unicast_src_ip 172.29.22.255
  unicast_peer {
    172.29.24.50
  }
  track_script {
    check_nat
  }
  notify_master /etc/keepalived/route-failover.rb
}

Keepalived does seem to register itself with the SNMP daemon, according to syslog:

Aug 26 00:08:22 nat0 Keepalived_vrrp[20494]: Starting SNMP subagent
Aug 26 00:08:22 nat0 Keepalived_healthcheckers[20493]: Starting SNMP subagent
Aug 26 00:08:22 nat0 Keepalived_vrrp[20494]: NET-SNMP version 5.7.2 AgentX subagent connected
Aug 26 00:08:22 nat0 Keepalived_healthcheckers[20493]: NET-SNMP version 5.7.2 AgentX subagent connected
Aug 26 00:08:22 nat0 snmpd[20173]: duplicate registration: MIB modules AgentX subagent 11, session 0x172e6b0, subsession 0x171d340 and AgentX subagent 13, session 0x1717600, subsession 0x1719130 (oid .1.3.6.1.4.1.9586.100.5.1.1).

Running keepalived in the foreground shows the same:

Starting Healthcheck child process, pid=9940
Initializing ipvs 2.6
Starting VRRP child process, pid=9941
Registering Kernel netlink reflector
Registering Kernel netlink command channel
Starting SNMP subagent
NET-SNMP version 5.7.2 AgentX subagent connected
Registering Kernel netlink reflector
Registering Kernel netlink command channel
Registering gratuitous ARP shared channel
Starting SNMP subagent
NET-SNMP version 5.7.2 AgentX subagent connected
registering pdu failed: 263!
registering pdu failed: 263!
registering pdu failed: 263!
registering pdu failed: 263!
registering pdu failed: 263!
registering pdu failed: 263!
registering pdu failed: 263!
registering pdu failed: 263!
registering pdu failed: 263!
Opening file '/etc/keepalived/keepalived.conf'.
Configuration is using : 5397 Bytes
Using LinkWatch kernel netlink reflector...
Opening file '/etc/keepalived/keepalived.conf'.
Configuration is using : 62247 Bytes
Using LinkWatch kernel netlink reflector...
VRRP_Instance(nat_instance) Entering BACKUP STATE
VRRP_Instance(nat_instance) Now in FAULT state

And I can even see it in the SNMPv2-MIB::sysORDescr table:

SNMPv2-MIB::sysORDescr.21 = STRING: The MIB module for Keepalived
SNMPv2-MIB::sysORDescr.22 = STRING: The MIB module for Keepalived

But I get nothing when I try to walk it:

snmpwalk -v2c -cpublic localhost KEEPALIVED-MIB::vrrpInstanceState.1
KEEPALIVED-MIB::vrrpInstanceState.1 = No more variables left in this MIB View (It is past the end of the MIB tree)

Even if I try the root OID, I still get nothing:

snmpwalk -v2c -cpublic localhost  .1.3.6.1.4.1.9586.100.5
SNMPv2-SMI::enterprises.9586.100.5 = No more variables left in this MIB View (It is past the end of the MIB tree)

I'm running v1.2.13, but I have the same issues when I try with 1.2.19 (except that it doesn't complain about duplicate registration, since that bug was fixed). I am on Ubuntu 14.04, with AppArmor disabled while debugging this. I have tried numerous other versions (v1.2.6 through v1.2.13; I can't get 1.2.5 or earlier to compile), with the same problem.

Here is my snmpd.conf and my snmp daemon config.

I've started reading through the keepalived source code. The SNMP support can send traps on state transitions, as in vrrp_state_become_master, but I'm not (much) interested in traps. I'm more interested in polling the current state of keepalived, which looks like it is supposed to be registered in snmp_agent_init by calling snmp_register_mib.

The call to register the MIBs seems to succeed, but it never actually returns any values.

I built the latest master with some debug prints. Curiously, it seems that 'vrrp_snmp_instance' is never called when I try to snmpget.

From my understanding of the source code, it looks like 'vrrp_vars' in vrrp_snmp.c contains a list of the different OIDs as 'variable8' structures, each of which holds a pointer to the function that should be called.

KEEPALIVED-VRRP registers vrrp_vars with snmpd by calling 'register_mib' in 'vrrp_snmp_agent_init'.

If my understanding of how this works is correct, I would assume that when snmpd receives the request, it would call the function which it has received a reference to in this struct.

However, the functions are never called. I've placed both breakpoints and prints in them to check.
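
The mental model described above can be sketched roughly as follows. This is a toy Python model, not Net-SNMP or keepalived code: the class names, the OID suffix, and the returned value are all hypothetical. It only illustrates that a handler is reached through two lookups (the master agent picks the subagent that registered a subtree, then the subagent picks the handler), so a request that the master agent does not forward never reaches the handler at all.

```python
def parse_oid(text):
    """Turn '.1.3.6' into (1, 3, 6) so prefixes compare per component."""
    return tuple(int(part) for part in text.strip(".").split("."))


class Subagent:
    """Hypothetical stand-in for a process holding a handler table."""

    def __init__(self, base):
        self.base = parse_oid(base)
        self.handlers = {}  # OID suffix -> handler function

    def register(self, suffix, handler):
        self.handlers[parse_oid(suffix)] = handler

    def handle(self, oid):
        suffix = oid[len(self.base):]
        handler = self.handlers.get(suffix)
        return handler() if handler else None


class MasterAgent:
    """Hypothetical stand-in for the master agent's routing step."""

    def __init__(self, forward=True):
        self.forward = forward  # False models a request dropped before dispatch
        self.subtrees = {}      # registered base OID -> subagent

    def register(self, subagent):
        self.subtrees[subagent.base] = subagent

    def get(self, oid_text):
        oid = parse_oid(oid_text)
        if not self.forward:
            return None
        for base, subagent in self.subtrees.items():
            if oid[:len(base)] == base:
                return subagent.handle(oid)
        return None


calls = []

def instance_state():
    calls.append("instance_state")
    return "backup"  # made-up value for illustration

vrrp = Subagent(".1.3.6.1.4.1.9586.100.5")  # base OID taken from the walk above
vrrp.register(".2.3.1.4.1", instance_state)  # hypothetical suffix

OID = ".1.3.6.1.4.1.9586.100.5.2.3.1.4.1"

open_master = MasterAgent(forward=True)
open_master.register(vrrp)
print(open_master.get(OID), calls)

calls.clear()
closed_master = MasterAgent(forward=False)
closed_master.register(vrrp)
print(closed_master.get(OID), calls)
```

In this model a successful registration says nothing about whether requests are later forwarded, which would be consistent with seeing the module listed while the handler prints stay silent.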

I'm at my wits' end as to how to debug this further. The underlying issue is that the MIB doesn't seem to register properly: it is empty, and the functions that it is supposed to call are never called.

Any help would be appreciated!

dalehamel commented 9 years ago

ping @vincentbernat, since I've had some contact with you and read your blog post

vincentbernat commented 9 years ago

Oh, I thought you were posting on the mailing list.

The problem is quite odd. Could you check with strace whether you see anything when you try to walk the MIB? The goal is to see whether the master agent is sending requests. You should have 3 keepalived processes. The first one can be ignored. To find the VRRP process, use lsof -n -p PID: the VRRP process has some raw sockets. You should also check which file descriptor is the AgentX socket (this is the Unix socket). Then, with strace, see if there is any activity on this socket.

If not, there is a configuration problem with the master agent.

Tell me if I need to expand on this information.

dalehamel commented 9 years ago

Oh, I thought you were posting on the mailing list.

I thought I submitted it, I must have screwed that up - my apologies.

In my case, it looks like it's this process, as it has the raw sockets you mentioned:

keepalive 12378 root    0u      CHR                1,3      0t0    1041 /dev/null
keepalive 12378 root    1u      CHR                1,3      0t0    1041 /dev/null
keepalive 12378 root    2u      CHR                1,3      0t0    1041 /dev/null
keepalive 12378 root    3u     unix 0xffff880096d8e580      0t0 1368064 socket
keepalive 12378 root    4r     FIFO                0,8      0t0 1368572 pipe
keepalive 12378 root    5w     FIFO                0,8      0t0 1368572 pipe
keepalive 12378 root    6u  netlink                         0t0 1368575 ROUTE
keepalive 12378 root    7u  netlink                         0t0 1368576 ROUTE
keepalive 12378 root    8u     pack            1368577      0t0    RARP type=SOCK_RAW
keepalive 12378 root    9u     pack            1368578      0t0    IPV6 type=SOCK_RAW
keepalive 12378 root   10r     FIFO                0,8      0t0 1368579 pipe
keepalive 12378 root   11w     FIFO                0,8      0t0 1368579 pipe
keepalive 12378 root   12r     FIFO                0,8      0t0 1368580 pipe
keepalive 12378 root   13w     FIFO                0,8      0t0 1368580 pipe
keepalive 12378 root   14u     unix 0xffff880099613100      0t0 1368581 socket
keepalive 12378 root   15u      raw                         0t0 1368598 00000000:0070->00000000:0000 st=07
keepalive 12378 root   16u      raw                         0t0 1368599 00000000:0070->00000000:0000 st=07
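
A listing like this can also be scanned mechanically. The following is a minimal Python sketch, assuming the usual lsof column order (COMMAND, PID, USER, FD, TYPE, ...); the function name fds_by_type is made up for illustration, and the sample rows are a subset copied from the listing above. It groups numeric file descriptors by the TYPE column so the Unix sockets (AgentX candidates) and raw sockets stand out.

```python
import re
from collections import defaultdict

# Rows copied from the lsof listing in this thread (subset, no header row).
LSOF_SAMPLE = """\
keepalive 12378 root    3u     unix 0xffff880096d8e580      0t0 1368064 socket
keepalive 12378 root    6u  netlink                         0t0 1368575 ROUTE
keepalive 12378 root   14u     unix 0xffff880099613100      0t0 1368581 socket
keepalive 12378 root   15u      raw                         0t0 1368598 00000000:0070->00000000:0000 st=07
keepalive 12378 root   16u      raw                         0t0 1368599 00000000:0070->00000000:0000 st=07
"""


def fds_by_type(lsof_text):
    """Group numeric file descriptors by lsof's TYPE column."""
    groups = defaultdict(list)
    for line in lsof_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # FD looks like '14u' or '4r'; entries such as 'cwd' or 'txt' are skipped.
        match = re.fullmatch(r"(\d+)[a-zA-Z]*", fields[3])
        if match is None:
            continue
        groups[fields[4]].append(int(match.group(1)))
    return dict(groups)


groups = fds_by_type(LSOF_SAMPLE)
print("unix sockets (AgentX candidates):", groups.get("unix", []))
print("raw sockets:", groups.get("raw", []))
```

The descriptors reported for the unix type are the ones worth watching in strace while a walk is running.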

Running strace (filtering out gettimeofday and select since they were spamming):

strace -p 12378 2>&1 | grep -v 'gettime\|select'
Process 12378 attached
sendmsg(16, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.29.24.50")}, msg_iov(1)=[{"E\300\0$\0K\0\0\377p2\363\254\35\26\377\254\35\0302!5\n\0\0\1\324\311\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, 0) = 36
sendmsg(16, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.29.24.50")}, msg_iov(1)=[{"E\300\0$\0L\0\0\377p2\362\254\35\26\377\254\35\0302!5\n\0\0\1\324\311\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, 0) = 36
sendmsg(16, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.29.24.50")}, msg_iov(1)=[{"E\300\0$\0M\0\0\377p2\361\254\35\26\377\254\35\0302!5\n\0\0\1\324\311\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, 0) = 36
sendmsg(16, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.29.24.50")}, msg_iov(1)=[{"E\300\0$\0N\0\0\377p2\360\254\35\26\377\254\35\0302!5\n\0\0\1\324\311\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, 0) = 36
sendmsg(16, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.29.24.50")}, msg_iov(1)=[{"E\300\0$\0O\0\0\377p2\357\254\35\26\377\254\35\0302!5\n\0\0\1\324\311\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, 0) = 36
sendmsg(16, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.29.24.50")}, msg_iov(1)=[{"E\300\0$\0P\0\0\377p2\356\254\35\26\377\254\35\0302!5\n\0\0\1\324\311\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, 0) = 36
sendmsg(16, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.29.24.50")}, msg_iov(1)=[{"E\300\0$\0Q\0\0\377p2\355\254\35\26\377\254\35\0302!5\n\0\0\1\324\311\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, 0) = 36
sendmsg(16, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.29.24.50")}, msg_iov(1)=[{"E\300\0$\0R\0\0\377p2\354\254\35\26\377\254\35\0302!5\n\0\0\1\324\311\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, 0) = 36
sendmsg(16, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.29.24.50")}, msg_iov(1)=[{"E\300\0$\0S\0\0\377p2\353\254\35\26\377\254\35\0302!5\n\0\0\1\324\311\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, 0) = 36
sendmsg(16, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.29.24.50")}, msg_iov(1)=[{"E\300\0$\0T\0\0\377p2\352\254\35\26\377\254\35\0302!5\n\0\0\1\324\311\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, 0) = 36

It doesn't seem like anything is happening here. I ran snmpwalk numerous times, as below:

snmpwalk -v2c -cpublic localhost  .1.3.6.1.4.1.9586.100.5

The strace traffic seems to be just keepalived pinging its peer; it seems the SNMP calls never actually make it there... : /

To make the strace less noisy, I disabled my check scripts temporarily:

vrrp_instance nat_instance {
  debug 2
  interface eth0
  state BACKUP
  virtual_router_id 53
  priority 10
  unicast_src_ip 172.29.22.255
  unicast_peer {
    172.29.24.50
  }
}

dalehamel commented 9 years ago

Something worth noting though, select seems to be continuously timing out:

select(1024, [4 6 10 12 14 15], [], [], {0, 20160}) = 0 (Timeout)
select(1024, [4 6 10 12 14 15], [], [], {0, 999922}) = 0 (Timeout)
select(1024, [4 6 10 12 14 15], [], [], {0, 630799}) = 0 (Timeout)
select(1024, [4 6 10 12 14 15], [], [], {0, 368098}) = 0 (Timeout)
select(1024, [4 6 10 12 14 15], [], [], {0, 999965}) = 0 (Timeout)
select(1024, [4 6 10 12 14 15], [], [], {0, 999962}) = 0 (Timeout)
select(1024, [4 6 10 12 14 15], [], [], {0, 999909}) = 0 (Timeout)
select(1024, [4 6 10 12 14 15], [], [], {0, 999941}) = 0 (Timeout)
select(1024, [4 6 10 12 14 15], [], [], {0, 999960}) = 0 (Timeout)

Though I don't think any of those fds are the raw socket?
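
For reference, a timeout return from select() only means that none of the watched descriptors became readable within the interval. A minimal Python sketch of that behavior, using a local socket pair as a stand-in (the 0.05 second timeout and the payload are arbitrary choices, unrelated to keepalived's own loop):

```python
import select
import socket

# A connected pair of sockets stands in for the descriptors being watched.
reader, writer = socket.socketpair()

# Nothing has been written yet, so select() returns empty lists after the timeout.
idle_ready, _, _ = select.select([reader], [], [], 0.05)
print("idle:", idle_ready)

# Once the peer writes, the same call reports the descriptor as readable.
writer.send(b"ping")
busy_ready, _, _ = select.select([reader], [], [], 0.05)
print("readable after write:", busy_ready == [reader])

reader.close()
writer.close()
```

Repeated timeouts on their own therefore say only that no data arrived on any watched descriptor during those intervals.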

vincentbernat commented 9 years ago

select() timing out is expected if nothing is received (it times out every second to be able to send VRRP packets). You should get something on file descriptor 14. Could you try with this very simple configuration for the master agent?

rocommunity public
master agentx

This way, no access control or filtering will be done.

dalehamel commented 9 years ago

Yup, that did it.

Thanks for your help! :heart:

I've literally been banging my head against this problem all day. @vincentbernat

vincentbernat commented 9 years ago

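I missed your original configuration. The public community is limited to the systemonly view, which does not cover the keepalived OIDs. Instead of removing the restriction, you could also add the appropriate OID to that view with a 'view systemonly included' line.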
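
A minimal Python sketch of why a restricted view gives an empty walk. The original snmpd.conf is not reproduced in this thread, so the two subtrees below are an assumption based on the stock Debian/Ubuntu file; the helper names are made up, and the model covers only included subtrees (no masks, no excluded entries).

```python
def parse_oid(text):
    """Turn '.1.3.6' into (1, 3, 6) so prefixes compare per component."""
    return tuple(int(part) for part in text.strip(".").split("."))


def in_view(oid, included_subtrees):
    """True if the OID falls under at least one included subtree."""
    target = parse_oid(oid)
    return any(
        target[:len(prefix)] == prefix
        for prefix in map(parse_oid, included_subtrees)
    )


# Assumed contents of the systemonly view (stock Debian/Ubuntu snmpd.conf).
SYSTEMONLY = [".1.3.6.1.2.1.1", ".1.3.6.1.2.1.25.1"]

SYS_OR_DESCR_21 = ".1.3.6.1.2.1.1.9.1.3.21"   # SNMPv2-MIB::sysORDescr.21
KEEPALIVED_ROOT = ".1.3.6.1.4.1.9586.100.5"   # root OID walked in this thread

print("sysORDescr visible:", in_view(SYS_OR_DESCR_21, SYSTEMONLY))
print("keepalived visible:", in_view(KEEPALIVED_ROOT, SYSTEMONLY))

# Widening the view with the enterprise subtree seen in the logs.
widened = SYSTEMONLY + [".1.3.6.1.4.1.9586"]
print("keepalived visible after widening:", in_view(KEEPALIVED_ROOT, widened))
```

Under these assumptions the sysORDescr rows are inside the view while everything under .1.3.6.1.4.1.9586 is outside it, which matches the symptoms reported earlier. In snmpd.conf terms the widened view would correspond to an extra line of the form 'view systemonly included .1.3.6.1.4.1.9586'.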

dalehamel commented 9 years ago

Makes sense! Thanks for the assistance!

On Wednesday, August 26, 2015, Vincent Bernat notifications@github.com wrote:

I did miss your original configuration. The public community has a limited view systemonly. You could also add the appropriate OID to view systemonly included.
