ganglia / monitor-core

Ganglia Monitoring core
BSD 3-Clause "New" or "Revised" License

Crash if override_hostname Contains Certain Numeric Values #49

Closed: smith closed this issue 12 years ago

smith commented 12 years ago

On a node using override_hostname = web02.example.com (not the real domain; I can give that to you if needed), gmond crashes after running for a short time with a message like this:

        sent message 'proc_run' of length 60 with 0 errors
*** glibc detected *** /usr/sbin/gmond: malloc(): memory corruption (fast): 0x0000000000ec6b30 ***
======= Backtrace: =========                                                 
/lib/libc.so.6(+0x77806)[0x7f932d096806]                                     
/lib/libc.so.6(+0x7bb39)[0x7f932d09ab39]                                     
/lib/libc.so.6(__libc_malloc+0x6e)[0x7f932d09b7de]                           
/usr/sbin/gmond(Ganglia_collection_group_send+0x145)[0x407335]               
/usr/sbin/gmond(process_collection_groups+0x9b)[0x4077fb]                    
/usr/sbin/gmond(main+0x3c6)[0x4094a6]                                        
/lib/libc.so.6(__libc_start_main+0xfd)[0x7f932d03dc4d]                       
/usr/sbin/gmond[0x4047b9]                                                    
======= Memory map: ========                                                 
00400000-0041b000 r-xp 00000000 08:01 380073                             /usr/sbin/gmond
0061a000-0061b000 r--p 0001a000 08:01 380073                             /usr/sbin/gmond
0061b000-0061c000 rw-p 0001b000 08:01 380073                             /usr/sbin/gmond
0061c000-0061d000 rw-p 00000000 00:00 0                                      
00e85000-00ecd000 rw-p 00000000 00:00 0                                  [heap]
7f9324000000-7f9324021000 rw-p 00000000 00:00 0                              
7f9324021000-7f9328000000 ---p 00000000 00:00 0                              
7f932b365000-7f932b37b000 r-xp 00000000 08:01 344136                     /lib/libgcc_s.so.1
7f932b37b000-7f932b57a000 ---p 00016000 08:01 344136                     /lib/libgcc_s.so.1
7f932b57a000-7f932b57b000 r--p 00015000 08:01 344136                     /lib/libgcc_s.so.1
7f932b57b000-7f932b57c000 rw-p 00016000 08:01 344136                     /lib/libgcc_s.so.1
7f932b57c000-7f932b588000 r-xp 00000000 08:01 346038                     /lib/libnss_files-2.11.1.so
7f932b588000-7f932b787000 ---p 0000c000 08:01 346038                     /lib/libnss_files-2.11.1.so
7f932b787000-7f932b788000 r--p 0000b000 08:01 346038                     /lib/libnss_files-2.11.1.so
7f932b788000-7f932b789000 rw-p 0000c000 08:01 346038                     /lib/libnss_files-2.11.1.so
7f932b789000-7f932b793000 r-xp 00000000 08:01 346028                     /lib/libnss_nis-2.11.1.so
7f932b793000-7f932b992000 ---p 0000a000 08:01 346028                     /lib/libnss_nis-2.11.1.so
7f932b992000-7f932b993000 r--p 00009000 08:01 346028                     /lib/libnss_nis-2.11.1.so
7f932b993000-7f932b994000 rw-p 0000a000 08:01 346028                     /lib/libnss_nis-2.11.1.so
7f932b994000-7f932b99c000 r-xp 00000000 08:01 346027                     /lib/libnss_compat-2.11.1.so
7f932b99c000-7f932bb9b000 ---p 00008000 08:01 346027                     /lib/libnss_compat-2.11.1.so
7f932bb9b000-7f932bb9c000 r--p 00007000 08:01 346027                     /lib/libnss_compat-2.11.1.so
7f932bb9c000-7f932bb9d000 rw-p 00008000 08:01 346027                     /lib/libnss_compat-2.11.1.so
7f932bb9d000-7f932bba4000 r-xp 00000000 08:01 466978                     /usr/lib/ganglia/modsys.so
7f932bba4000-7f932bda3000 ---p 00007000 08:01 466978                     /usr/lib/ganglia/modsys.so
7f932bda3000-7f932bda4000 r--p 00006000 08:01 466978                     /usr/lib/ganglia/modsys.so
7f932bda4000-7f932bda5000 rw-p 00007000 08:01 466978                     /usr/lib/ganglia/modsys.so
7f932bda5000-7f932bda6000 rw-p 00000000 00:00 0                              
7f932bda6000-7f932bdad000 r-xp 00000000 08:01 466974                     /usr/lib/ganglia/modproc.so
7f932bdad000-7f932bfac000 ---p 00007000 08:01 466974                     /usr/lib/ganglia/modproc.so
7f932bfac000-7f932bfad000 r--p 00006000 08:01 466974                     /usr/lib/ganglia/modproc.so
7f932bfad000-7f932bfae000 rw-p 00007000 08:01 466974                     /usr/lib/ganglia/modproc.so
7f932bfae000-7f932bfb5000 r-xp 00000000 08:01 466979                     /usr/lib/ganglia/modnet.so
7f932bfb5000-7f932c1b4000 ---p 00007000 08:01 466979                     /usr/lib/ganglia/modnet.so
7f932c1b4000-7f932c1b5000 r--p 00006000 08:01 466979                     /usr/lib/ganglia/modnet.so
7f932c1b5000-7f932c1b6000 rw-p 00007000 08:01 466979                     /usr/lib/ganglia/modnet.so
7f932c1b6000-7f932c1bd000 r-xp 00000000 08:01 466976                     /usr/lib/ganglia/modmem.so
7f932c1bd000-7f932c3bc000 ---p 00007000 08:01 466976                     /usr/lib/ganglia/modmem.so
7f932c3bc000-7f932c3bd000 r--p 00006000 08:01 466976                     /usr/lib/ganglia/modmem.so
7f932c3bd000-7f932c3be000 rw-p 00007000 08:01 466976                     /usr/lib/ganglia/modmem.so
7f932c3be000-7f932c3bf000 rw-p 00000000 00:00 0                              
7f932c3bf000-7f932c3c6000 r-xp 00000000 08:01 466977                     /usr/lib/ganglia/modload.so
7f932c3c6000-7f932c5c5000 ---p 00007000 08:01 466977                     /usr/lib/ganglia/modload.so
7f932c5c5000-7f932c5c6000 r--p 00006000 08:01 466977                     /usr/lib/ganglia/modload.so
7f932c5c6000-7f932c5c7000 rw-p 00007000 08:01 466977                     /usr/lib/ganglia/modload.so
Aborted

There are instances with names manager00.example.com, ops00.example.com, worker00.example.com, worker01.example.com, web01.example.com, web02.example.com, web03.example.com.

It only fails on web02 and web03, which are identical to web00 and web01 in every way except the name.

(It crashes immediately if I include configuration for python modules btw)

Running on an EC2 m1.large instance on Ubuntu 10.04.4 LTS, gmond 3.3.8 (from the package at https://launchpad.net/~rufustfirefly/+archive/ganglia)

jbuchbinder commented 12 years ago

Are you also defining override_ip? If not, try defining override_ip to be the IP address of the spoofed name.

vvuksan commented 12 years ago

Nathan,

Perhaps you can give me the actual names on the IRC channels. I should be able to replicate this.

smith commented 12 years ago

@jbuchbinder override_ip is defined. Here's the globals section:

globals {                                                                    
  daemonize = yes                                                            
  setuid = yes                                                                                                                                                                
  user = ganglia                                                             
  debug_level = 10                                                           
  max_udp_msg_len = 1472                                                     
  mute = no                                                                  
  deaf = no                                                                  
  host_dmax = 0 /*secs */                                                    
  cleanup_threshold = 300 /*secs */                                          
  gexec = no                                                                 
  send_metadata_interval = 60                                                
  override_hostname = web02.example.com                                        
  override_ip = 10.4.85.168                                                  
}                        

@vvuksan I sent you the domain name through the email on your website.

smith commented 12 years ago

@vvuksan That email didn't go through. The domains it was failing on were web02.who.is and web03.who.is

jbuchbinder commented 12 years ago

We recently committed a change to use an APR method rather than the manual string concatenation we had been doing before, so that should fix this. Could someone confirm?

smith commented 12 years ago

I've got this installed on a server, and FYI when using the passenger module from gmond-python-modules it fails on startup with:

{'status': 'sudo /usr/bin/passenger-status', 'memory_stats': 'sudo /usr/bin/passenger-memory-stats', 'metrix_prefix': 'passenger'}
apr_pollset_create failed: Invalid argument

That could be a bug in the python module, so I disabled it, and then I get:

apr_pollset_create failed: Invalid argument  

after loading the python module. This could be a problem with some module or configuration (I ran it with sudo /usr/local/sbin/gmond -d 2 --conf=/etc/ganglia/gmond.conf and installed the new version with --prefix=/usr/local).

I'll let you know if I can try it on a non-production machine with no extra modules. Let me know if there is anything else I can do to help you confirm.

jbuchbinder commented 12 years ago

Are you sure you have valid UDP and/or TCP listeners defined on that instance?

http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05559.html
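
For reference, a minimal sketch of what such listener sections could look like in gmond.conf. The section names are gmond's own; the port value 8649 is an assumption (the conventional Ganglia default), not something taken from this thread.

```
/* Assumed values: adjust the port to match the cluster's send channels. */
udp_recv_channel {
  port = 8649
}

tcp_accept_channel {
  port = 8649
}
```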

smith commented 12 years ago

Thanks for that. I set it to deaf. I'm running the commit for #49 shown above and it seems to fix the problem! Thanks
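
A sketch of that change, assuming every other setting in the globals section stays as posted earlier in this thread:

```
globals {
  /* Assumption: this node only sends metrics and never needs to receive them. */
  deaf = yes
  /* ...remaining globals settings unchanged... */
}
```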