FransUrbo / snmp-modules

Miscellaneous SNMP modules and drivers

SNMP counter overflows #1

Open waddles opened 9 years ago

waddles commented 9 years ago

Great work on developing these modules, but I seem to be overflowing the 32-bit counters for my zpool info:

root@rubicon:~# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data    29T  3.59T  25.4T         -     3%    12%  1.00x  ONLINE  -

root@rubicon:~# snmptable -u chameleon6287188769 -c chameleon6287188769 -v 2c rubicon zfsPoolStatusTable
SNMP table: BAYOUR-COM-MIB::zfsPoolStatusTable

 zfsPoolName zfsPoolSize zfsPoolAlloc zfsPoolFree zfsPoolCap zfsPoolDedup zfsPoolHealth zfsPoolAltRoot zfsPoolUsedBySnaps zfsPoolUsed
        data           0    171798691  1717986918         12         1.00        online              -                  0   680399994
BAYOUR-COM-MIB::zfsPoolStatusTable: WARNING: More columns on agent than in MIB

Seems ok coming out of the perl script:

OID_BASE.5.1.2.1
string
data
OID_BASE.5.1.3.1
integer
31885837205504
OID_BASE.5.1.4.1
integer
3947246743715.84
OID_BASE.5.1.5.1
integer
27927595345510.4
OID_BASE.5.1.6.1
integer
12
OID_BASE.5.1.7.1
string
1.00
OID_BASE.5.1.8.1
integer
4
OID_BASE.5.1.9.1
string
-

Any suggestions?

FransUrbo commented 9 years ago

The MIB was 'thrown together' without much regard (but 'some') to what the values actually were, so I'm not overly surprised by this. I haven't been running it myself in a while, because I have stability issues caused by 'load sensitivity' on my primary (bad SAS/SATA card/driver).

The Integer32 type on some/all of these needs to be updated to match the actual size of the value. This requires going through the code in ZFS/ZoL.

I'll see what I can do, but if you have concrete changes, feel free to open a pull request.

waddles commented 9 years ago

Ok, so I changed the MIB to use Integer64 for the values in zfsPoolStatusTable, but Net-SNMP still does not return them properly. Then I found this patch https://sourceforge.net/p/net-snmp/patches/737/ but it does not appear to have been applied. I am running Ubuntu Vivid (15.04) with Net-SNMP 5.7.2, but even the latest upstream doesn't look like it handles it properly.

FransUrbo commented 9 years ago

Then I don't know off-hand what to do :(

waddles commented 9 years ago

On a side note, I love the clean code in https://github.com/calmh/solaris-extra-snmp/blob/master/zfs-snmp although it depends on kstat and doesn't appear to keep persistency, but that could be fixed fairly easily.

I think a better way of getting the zpool usage (instead of using zpool iostat then converting it to a somewhat rough estimate by multiplying by powers of 1024) is to use zfs list -p <pool>

# zpool iostat
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        3.74T  25.3T     41    129  4.46M  3.79M
# zfs list -p
NAME                      USED           AVAIL          REFER  MOUNTPOINT
data             3577557345060  23702699187420          30260  /data
data/atlassian      2227441270  23702699187420     2227441270  /data/atlassian
data/backup      3570098684040  23702699187420  3570098684040  /data/backup
data/bamboo          234328990  23702699187420      234328990  /data/bamboo
data/confluence     1980329210  23702699187420     1980329210  /data/confluence
data/crowd            47012470  23702699187420       47012470  /data/crowd
data/jira           2189892170  23702699187420     2189892170  /data/jira
data/postgresql      434886040  23702699187420      434886040  /data/postgresql
data/stash           268084910  23702699187420      268084910  /data/stash
# zfs list -p data
NAME           USED           AVAIL  REFER  MOUNTPOINT
data  3577557345060  23702699187420  30260  /data

Total capacity is obviously the sum of all 3 values
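
To make the comparison concrete, here is a minimal sketch (in Python rather than the module's Perl) that reads the exact byte counts from the `zfs list -p data` output above and sets them against the rough "multiply by a power of 1024" conversion. The helper names `parse_zfs_list_p` and `human_to_bytes` are made up for illustration and are not part of the module. The rough conversion of the 3.59T ALLOC figure yields a fractional byte count, which looks like the source of the 3947246743715.84 value in the script output earlier in the thread.

```python
# Sample rows copied from the `zfs list -p data` output in this thread.
ZFS_LIST_P = """\
NAME           USED           AVAIL  REFER  MOUNTPOINT
data  3577557345060  23702699187420  30260  /data
"""

# Binary unit multipliers, as used when scaling human-readable sizes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}


def parse_zfs_list_p(text):
    """Return {dataset: {"used", "avail", "refer"}} with exact integer bytes."""
    result = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        name, used, avail, refer = fields[:4]
        result[name] = {
            "used": int(used),
            "avail": int(avail),
            "refer": int(refer),
        }
    return result


def human_to_bytes(value):
    """Rough conversion of a rounded value such as '3.59T' to bytes."""
    return float(value[:-1]) * UNITS[value[-1]]


exact = parse_zfs_list_p(ZFS_LIST_P)["data"]
rough = human_to_bytes("3.59T")  # ALLOC as printed by `zpool list`

print("exact used bytes :", exact["used"])
print("rough alloc bytes:", rough)
```

The exact values are already integers, so no rounding or unit scaling is involved before they are handed to the agent.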

FransUrbo commented 9 years ago

That still doesn't help, unfortunately. 3577557345060 + 23702699187420 + 30260 = 27280256562740, which is still much, much higher than the maximum value of an (unsigned) 32-bit int (which is 4,294,967,295). The signed int is half that...

The maximum value of an (unsigned) 64-bit int is 18,446,744,073,709,551,615 (which would allow for roughly 18,446 petabytes :), which is plenty high. A signed 64-bit int is half that. I don't know whether Integer64 would be signed or unsigned, but either way, that would do it. But if the snmpd doesn't support it, there's not much I can do :(
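
The arithmetic above can be checked with a short sketch. The function name `smallest_fit` is hypothetical and only illustrates which integer width a given byte count needs; it says nothing about what the SNMP agent actually supports.

```python
# Limits of the integer widths discussed above.
INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1

WIDTHS = (
    ("signed 32-bit", INT32_MAX),
    ("unsigned 32-bit", UINT32_MAX),
    ("signed 64-bit", INT64_MAX),
    ("unsigned 64-bit", UINT64_MAX),
)


def smallest_fit(value):
    """Name the narrowest width (of those listed) that holds a non-negative value."""
    for name, limit in WIDTHS:
        if 0 <= value <= limit:
            return name
    return "wider than 64 bits"


# Sum of the three values quoted from `zfs list -p data`.
total = 3577557345060 + 23702699187420 + 30260

print(total)
print(smallest_fit(total))
```

Either 64-bit variant has plenty of headroom for this pool; the 32-bit types do not come close.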

This discussion actually jogs some distant memories though. It feels like I've had this discussion with myself before but couldn't solve it…

https://en.wikipedia.org/wiki/Integer_(computer_science)#Common_integral_data_types

FransUrbo commented 9 years ago

I've been trying to do something about this in https://github.com/FransUrbo/snmp-modules/tree/int64_size-free, but it didn't work as I expected.

FransUrbo commented 9 years ago

With this patch on the 5.7.3+dfsg-1 version, I got it to work. I'm currently trying to figure out how to implement this in the MIB.

http://sourceforge.net/p/net-snmp/mailman/message/34285720/

FransUrbo commented 9 years ago

I took your recommendation and now use zfs get to get the exact sizes, instead of taking the "human readable" values one gets from zpool list and "translating" them into bytes. There was a slight mismatch there. On my system, there was a 29MB discrepancy.

I'm still trying to figure out how to fix the MIB. BUT, the code in the int64_size-free branch will now correctly return an Integer64 instead of an Integer32:

$ snmpget localhost zfsPoolSize zfsPoolSize
BAYOUR-COM-MIB::zfsPoolSize.1 = Opaque: Int64: 8256506880
BAYOUR-COM-MIB::zfsPoolSize.1 = Opaque: Int64: 8256506880

The fact that it returns an Opaque: Int64 and not an Integer64 is the current problem. I'm not quite sure how to fix that just yet. I have some test MIB entries in that branch, but they don't seem to be working. I think I'm roughly on the right track here. There's something about https://tools.ietf.org/html/draft-perkins-bigint-00 I need to figure out.

waddles commented 9 years ago

https://tools.ietf.org/html/draft-perkins-opaque-01 might help you understand more.

Looking at that patch and the file it applies to, that section of code is all about unsigned longs which means it should be returning a type of ASN_OPAQUE_U64 and have a definition of 'Unsigned64'. That then leaves no 'integer64' (signed) in which case the #ifdef probably also needs another clause added to handle signed 64-bit integers. The implementation would be the same for all 3 if I'm not wrong.

The difference between Counter64, Integer64 and Unsigned64 is that Counters don't decrease and of course the interpretation of +/-. For our purposes we really want Unsigned64.

See also https://sourceforge.net/p/net-snmp/code/ci/1b4ca14972d39d61a93bb0e3e4eea76795bedb89/tree/include/net-snmp/library/asn1.h line 80 and onwards.
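
As a small illustration of the signed versus unsigned point, the sketch below interprets the same eight bytes both ways. It uses a fixed-width big-endian layout via Python's `struct` module purely for demonstration; this is an assumption made for the example and is not the variable-length encoding Net-SNMP puts on the wire. The helper names are made up.

```python
import struct


def as_unsigned64(raw):
    """Interpret 8 big-endian bytes as an unsigned 64-bit integer."""
    return struct.unpack(">Q", raw)[0]


def as_signed64(raw):
    """Interpret the same 8 bytes as a signed (two's complement) integer."""
    return struct.unpack(">q", raw)[0]


# A pool size from this thread: well below 2**63, so both readings agree.
pool_size = struct.pack(">Q", 8256506880)
print(as_unsigned64(pool_size), as_signed64(pool_size))

# A value above 2**63 - 1: the two readings now disagree.
huge = struct.pack(">Q", 18446744073709551614)
print(as_unsigned64(huge), as_signed64(huge))
```

For values below 2**63 the choice makes no difference to the number that is returned; it only starts to matter for the top half of the unsigned range.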

FransUrbo commented 9 years ago

Triple-checking and actually LOOKING at the code more closely this time, you're probably right. The code should use an unsigned instead of a signed type, because we don't need negative values.

In practice though, it shouldn't really matter right now. We can return a 9EB value (instead of an 18EB value with unsigned). That still isn't enough to account for the total size of a ZFS pool :). But it should be enough for almost everyone. For now. To be able to return the value of the maximum size of a ZFS pool (256ZB), we need a 128-bit value!

However, although you're right about that, the current problem is how to incorporate it into the MIB. I have added both an I64 and a U64, but neither works as expected.

But I'm starting to wonder if it matters whether it returns an Integer64 instead of Opaque: Int64. The value is what we need, not the type…

Could you try the int64_size-free branch and see if it works for you?

FransUrbo commented 9 years ago

I've taken your suggestions for net-snmp and walked (not ran :) with it - http://sourceforge.net/p/net-snmp/mailman/message/34291537/.

However, my two patches aren't included in the web archive for some reason.

https://gist.github.com/FransUrbo/a2bfee606ffda0b7b81e
https://gist.github.com/FransUrbo/b891f94b1100f2a3b251

This gives me:

# for i in {1..6}; do snmpget localhost .1.3.6.1.4.1.22222.42.$i.0; done
SNMPv2-SMI::enterprises.22222.42.1.0 = INTEGER: 123456
SNMPv2-SMI::enterprises.22222.42.2.0 = Opaque: Int64: 9223372036854775806
SNMPv2-SMI::enterprises.22222.42.3.0 = Counter32: 123456
SNMPv2-SMI::enterprises.22222.42.4.0 = Counter64: 9223372036854775806
SNMPv2-SMI::enterprises.22222.42.5.0 = Gauge32: 4294967294
SNMPv2-SMI::enterprises.22222.42.6.0 = Opaque: UInt64: 18446744073709551614

which seems to work just fine (except that instead of a UInt32 (or whatever it should have been), I get a Gauge32). No biggie, but it looks strange...
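
On the Gauge32 oddity: in SMIv2 (RFC 2578), Unsigned32 and Gauge32 share the same application tag, so a client cannot tell them apart from the encoded value alone and has to pick one name to display when it has no MIB information for the OID. The sketch below lists the SMIv2 application tags to show this; the dictionary and function names are made up for illustration and do not come from Net-SNMP.

```python
# SMIv2 application-class tags (RFC 2578). Unsigned32 and Gauge32
# are both [APPLICATION 2], i.e. the same tag byte 0x42.
SMIV2_APPLICATION_TAGS = {
    0x40: ("IpAddress",),
    0x41: ("Counter32",),
    0x42: ("Gauge32", "Unsigned32"),
    0x43: ("TimeTicks",),
    0x44: ("Opaque",),
    0x46: ("Counter64",),
}


def names_for_tag(tag):
    """Return every SMIv2 type name that maps to the given tag byte."""
    return SMIV2_APPLICATION_TAGS.get(tag, ())


for tag in sorted(SMIV2_APPLICATION_TAGS):
    print(hex(tag), "/".join(names_for_tag(tag)))
```

Note that there is no standard 64-bit signed or unsigned gauge-style tag in this table, which is why the 64-bit values in the output above arrive wrapped in Opaque.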

FransUrbo commented 9 years ago

It doesn't seem to need any special stuff in the MIB. I just made zfsPoolSize and zfsPoolFree an Integer64 (although smilint complains about this) and return an unsigned64 value from the agent, and this all seems to be working just fine!

# snmpget localhost zfsPoolSize zfsPoolSize
BAYOUR-COM-MIB::zfsPoolSize.1 = Opaque: UInt64: 8256506880
BAYOUR-COM-MIB::zfsPoolSize.1 = Opaque: UInt64: 8256506880
# snmpwalk localhost zfsPoolStatusTable
BAYOUR-COM-MIB::zfsPoolStatusIndex.1 = INTEGER: 1
BAYOUR-COM-MIB::zfsPoolStatusIndex.2 = INTEGER: 2
BAYOUR-COM-MIB::zfsPoolName.1 = STRING: rpool
BAYOUR-COM-MIB::zfsPoolName.2 = STRING: rpool 2
BAYOUR-COM-MIB::zfsPoolGUID.1 = STRING: 4977845871582736322
BAYOUR-COM-MIB::zfsPoolGUID.2 = STRING: 3787144349319647945
BAYOUR-COM-MIB::zfsPoolSize.1 = Opaque: UInt64: 8256506880
BAYOUR-COM-MIB::zfsPoolSize.2 = Opaque: UInt64: 8256506880
BAYOUR-COM-MIB::zfsPoolAlloc.1 = INTEGER: 132096
BAYOUR-COM-MIB::zfsPoolAlloc.2 = INTEGER: 111616
BAYOUR-COM-MIB::zfsPoolFree.1 = Opaque: UInt64: 8256374784
BAYOUR-COM-MIB::zfsPoolFree.2 = Opaque: UInt64: 8256395264
BAYOUR-COM-MIB::zfsPoolCap.1 = INTEGER: 0
BAYOUR-COM-MIB::zfsPoolCap.2 = INTEGER: 0
BAYOUR-COM-MIB::zfsPoolDedup.1 = STRING: 1.00
BAYOUR-COM-MIB::zfsPoolDedup.2 = STRING: 1.00
BAYOUR-COM-MIB::zfsPoolHealth.1 = INTEGER: online(4)
BAYOUR-COM-MIB::zfsPoolHealth.2 = INTEGER: online(4)
BAYOUR-COM-MIB::zfsPoolAltRoot.1 = STRING: -
BAYOUR-COM-MIB::zfsPoolAltRoot.2 = STRING: -
BAYOUR-COM-MIB::zfsPoolUsedBySnaps.1 = INTEGER: 0
BAYOUR-COM-MIB::zfsPoolUsedBySnaps.2 = INTEGER: 0
BAYOUR-COM-MIB::zfsPoolUsed.1 = INTEGER: 282624
BAYOUR-COM-MIB::zfsPoolUsed.2 = INTEGER: 111616
# snmptable -CB localhost zfsPoolStatusTable         
SNMP table: BAYOUR-COM-MIB::zfsPoolStatusTable

 zfsPoolName         zfsPoolGUID zfsPoolSize zfsPoolAlloc zfsPoolFree zfsPoolCap zfsPoolDedup zfsPoolHealth zfsPoolAltRoot zfsPoolUsedBySnaps zfsPoolUsed
       rpool 4977845871582736322  8256506880       132096  8256374784          0         1.00        online              -                  0      282624
     rpool 2 3787144349319647945  8256506880       111616  8256395264          0         1.00        online              -                  0      111616
#

FransUrbo commented 9 years ago

zfsPoolAlloc also needs to be a UInt64, just-in-case...

FransUrbo commented 9 years ago

Same code on a host that doesn't have a patched Net-SNMP:

# snmpget localhost zfsPoolSize zfsPoolSize zfsPoolAlloc
BAYOUR-COM-MIB::zfsPoolSize.1 = Gauge32: 3961545728
BAYOUR-COM-MIB::zfsPoolSize.1 = Gauge32: 3961545728
BAYOUR-COM-MIB::zfsPoolAlloc.1 = Gauge32: 51384320
# snmpwalk localhost zfsPoolStatusTable
BAYOUR-COM-MIB::zfsPoolStatusIndex.1 = INTEGER: 1
BAYOUR-COM-MIB::zfsPoolName.1 = STRING: rpool
BAYOUR-COM-MIB::zfsPoolGUID.1 = STRING: 11847949639043149139
BAYOUR-COM-MIB::zfsPoolSize.1 = Gauge32: 3961545728
BAYOUR-COM-MIB::zfsPoolAlloc.1 = Gauge32: 51384320
BAYOUR-COM-MIB::zfsPoolFree.1 = Gauge32: 3910161408
BAYOUR-COM-MIB::zfsPoolCap.1 = INTEGER: 0
BAYOUR-COM-MIB::zfsPoolDedup.1 = STRING: 1.00
BAYOUR-COM-MIB::zfsPoolHealth.1 = INTEGER: online(4)
BAYOUR-COM-MIB::zfsPoolAltRoot.1 = STRING: -
BAYOUR-COM-MIB::zfsPoolUsedBySnaps.1 = INTEGER: 0
BAYOUR-COM-MIB::zfsPoolUsed.1 = INTEGER: 153585254
# snmptable -CB localhost zfsPoolStatusTable
SNMP table: BAYOUR-COM-MIB::zfsPoolStatusTable

 zfsPoolName          zfsPoolGUID zfsPoolSize zfsPoolAlloc zfsPoolFree zfsPoolCap zfsPoolDedup zfsPoolHealth zfsPoolAltRoot zfsPoolUsedBySnaps zfsPoolUsed
       rpool 11847949639043149139  3961545728     51384320  3910161408          0         1.00        online              -                  0   153585254
# zfs get -H -oproperty,value -p used,available,referenced rpool
used    51384320
available       8205103104
referenced      25600
# expr 51384320 + 8205103104 + 25600 ; echo 3961545728
8256513024
3961545728
# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  7.94G  49.1M  7.89G         -      -     0%  1.00x  ONLINE  -
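
The two numbers printed by the expr/echo line above differ by exactly 2**32, which is what one would expect if the 64-bit size were cut down to its low 32 bits somewhere on the way out. The sketch below only checks that arithmetic; it makes no claim about where in the unpatched Net-SNMP the truncation happens, and `truncate_to_uint32` is a made-up helper name.

```python
def truncate_to_uint32(value):
    """Keep only the low 32 bits of a non-negative integer."""
    return value & 0xFFFFFFFF


# used + available + referenced, as reported by `zfs get -p` above.
total = 51384320 + 8205103104 + 25600

# Value the unpatched agent returned for zfsPoolSize.
reported = 3961545728

print(total)
print(truncate_to_uint32(total))
print(total - reported == 2**32)
```

So the Gauge32 figure is the real size modulo 2**32, not a random or zeroed value.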

FransUrbo commented 9 years ago

Querying an unpatched server from an OS X Lion machine:

$ snmptable -CB unpatched-server zfsPoolStatusTable
SNMP table: BAYOUR-COM-MIB::zfsPoolStatusTable

 zfsPoolName          zfsPoolGUID zfsPoolSize zfsPoolAlloc zfsPoolFree zfsPoolCap zfsPoolDedup zfsPoolHealth zfsPoolAltRoot zfsPoolUsedBySnaps zfsPoolUsed
       rpool 11847949639043149139  3961545728     51362816  3910182912          0         1.00        online              -                  0   153585254

And to the patched server:

$ snmptable -CB patched-server zfsPoolStatusTable 
SNMP table: BAYOUR-COM-MIB::zfsPoolStatusTable

 zfsPoolName         zfsPoolGUID                                      zfsPoolSize                                     zfsPoolAlloc                                      zfsPoolFree zfsPoolCap zfsPoolDedup zfsPoolHealth zfsPoolAltRoot zfsPoolUsedBySnaps zfsPoolUsed
       rpool 4977845871582736322 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00           0         1.00        online              -                  0      613376
     rpool 2 3787144349319647945 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  2D 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00           0         1.00        online              -                  0      437248

So I guess the patch still needs some work. Or possibly the MIB.