ganglia / monitor-core

Ganglia Monitoring core
BSD 3-Clause "New" or "Revised" License

Problem with gmond/gmetad parsing data #251

Open lukasz-aksamit opened 8 years ago

lukasz-aksamit commented 8 years ago

I run Ganglia 3.7.2 on a number of servers. All of them send their data to a single aggregating gmond, and gmetad collects it from there. Each server sets a different override_hostname in gmond.conf. We use custom Python monitoring plugins.

Sometimes, after some period of time gmetad reports: Process XML (PDC1): XML_ParseBuffer() error at line 55:#012not well-formed (invalid token)#012

Problematic line 55 in XML:

<HOST NAME="^_^O" IP="" TAGS="" REPORTED="1151315121" TN="508" TMAX="90" DMAX="86400" LOCATION="unspecified" GMOND_STARTED="0">
</HOST>
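
For illustration: the caret notation in the excerpt above (^_ and ^O) stands for raw control characters, which XML 1.0 does not allow even inside attribute values. The minimal Python sketch below uses the standard library's expat-based parser, not gmetad's own code, to reproduce the same class of error. The document, the CLUSTER wrapper, and the helper name `first_parse_error` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for gmond's XML dump. The HOST name
# holds the raw control bytes 0x1F and 0x0F (shown as ^_ and ^O in logs).
BAD_XML = (
    '<CLUSTER NAME="PDC1">\n'
    '<HOST NAME="\x1f\x0f" IP="" REPORTED="1151315121">\n'
    '</HOST>\n'
    '</CLUSTER>\n'
)


def first_parse_error(xml_text):
    """Return (line, message) for the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, _column = err.position
        return line, str(err)
    return None


print(first_parse_error(BAD_XML))
```

Because the parser stops at the first bad token, every host after the malformed entry is lost as well, which matches the behaviour described in this report.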

gmond -d reports only:

Incorrect format for spoof argument. exiting.
spoofIP: :^B
buff: :^B

gstat also cannot process XML data.

Then gmond requires restart...

Probably one of the servers occasionally sends malformed data. BUT the problem is that the rest of the (good) data from the other servers is not processed by gmetad either! The other servers go unmonitored until gmond is restarted, and no other metrics are processed in the meantime.

It would also be nice if gmond's debug output included the source IP address responsible for "Incorrect format for spoof argument. exiting.", because we can't rely on spoofIP...

vvuksan commented 8 years ago

Hi Lukasz,

gmetad bails because XML is malformed :-(. We need to add some guards to make sure spoofed hosts are using valid characters. I will look into it.
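
As a rough idea of what such a guard could look like, here is a minimal Python sketch. It is not the Ganglia fix. The allowed character set, the length limit, and the function name `is_safe_host_name` are all assumptions chosen for the example.

```python
import re

# Assumed policy for this sketch: letters, digits, '.', '-' and '_' only,
# between 1 and 255 characters. Ganglia itself may enforce something else.
_HOST_NAME_RE = re.compile(r"[A-Za-z0-9._-]{1,255}")


def is_safe_host_name(name):
    """Return True if the name contains only the characters allowed above."""
    return _HOST_NAME_RE.fullmatch(name) is not None


print(is_safe_host_name("web2.example.com"))
print(is_safe_host_name("\x1f\x0f"))
```

Rejecting a name at the point where it is received keeps a single bad packet from ending up in the XML that gmetad later has to parse.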

lukasz-aksamit commented 8 years ago

Hi, thanks for the information. I've checked, and on every server override_hostname is set to the correct FQDN. If you need more info, please let me know.

NoodlesNZ commented 8 years ago

I'm starting to see this on my hosts as well, running gmond -d 2 it spits out something like:

saving metadata for metric: tx_drops_bond0 host: web2.example.com
 spoofName: web2.example.com    spoofIP: web2.example.com

Processing a metric value message from web2.example.com
***Allocating value packet for host--web2.example.com:web2.example.com-- and metric --tx_drops_bond0-- ****

Got a spoof message tx_drops_bond0 from web2.example.com:web2.example.com

Incorrect format for spoof argument. exiting.

spoofIP: �0

buff: �0

I'm only setting override_hostname = "web2.example.com". I probably need to set override_ip as well

NoodlesNZ commented 8 years ago

Ok, setting override_ip helps a little bit, but it still falls over. I wonder if python 2.6 (on CentOS 6) has updated a library or something that is breaking things here.

clwillard commented 7 years ago

Since upgrading from 3.5.0 to 3.7.2, I've been seeing this problem, too. I already had override_hostname set, so I added the override_ip value for the node to gmond.conf, and that didn't help: I could still see these messages in the debug output when running "gmond -d 2".

I then looked at the 3.7.2 gmond.c code to see where this message was coming from. The Ganglia_host_get function logs this message when there's no ':' in the metric_id->host value. This function is called by process_udp_recv_channel for the gmetadata_full case.

In my case, gmond runs on a single node and collects metrics locally, so we shouldn't need to send the metric metadata periodically. I therefore changed send_metadata_interval from 60 to -1, which made the problem go away. For the cluster case we use multicasting, so I think we can still use -1 for send_metadata_interval.

gmond_debug_output.docx
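
The check described above (a spoofed host value must contain a ':') can be sketched in Python as follows. This is an illustration of the idea, not a translation of gmond.c. The function name `split_spoof` is hypothetical, and rejecting empty parts is an extra assumption beyond the colon check mentioned in the comment.

```python
def split_spoof(value):
    """Split a spoof string of the form 'first:second' on the first colon.

    Raises ValueError when the colon is missing. Rejecting an empty part is
    a stricter rule assumed for this sketch only.
    """
    head, sep, tail = value.partition(":")
    if not sep or not head or not tail:
        raise ValueError("Incorrect format for spoof argument: %r" % (value,))
    return head, tail


print(split_spoof("web2.example.com:web2.example.com"))
```

A value such as the two garbage bytes seen in the debug output has no colon, so it would be rejected here rather than treated as a host.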

stephensje commented 7 years ago

I have the same issue. If I use override_hostname on any node, gmond gives the "Incorrect format for spoof argument" error. It does not seem to matter what character string I use. Sometimes the renamed node will appear briefly in ganglia-web but then will show as "down".

There is a crude workaround, where you can go into the ganglia-web php files and set the $title variables for graphs to convert the default hostname to your preferred name.

xzhub commented 6 years ago

It's been more than 2 years since the issue was reported, but I am still encountering the same problem. Does anyone have a solution? I am using CentOS 7 with the ganglia 3.7.2 rpm.

beneschtech commented 5 years ago

Same issue here with CentOS 7, upgraded to latest (7.6 as of this writing) monitoring about 12 servers in our dev area. I finally gave up trying to figure it out, and just added a systemctl restart gmond to /etc/cron.daily . As a fellow software engineer, and as widely used as this software is, and the fact that this is a reproducible bug, not fixing it or even trying just shows apathy. If I had the time I'd fork and fix myself.
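
A slightly more targeted variant of this workaround would restart gmond only when its XML dump no longer parses. The Python sketch below assumes gmond serves its XML on the default tcp_accept_channel port 8649 on localhost and that the systemd unit is named gmond, as in the command above. The function names are made up for the example.

```python
import socket
import subprocess
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="127.0.0.1", port=8649, timeout=10.0):
    """Read everything gmond writes to a TCP client until it closes the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def is_well_formed(xml_bytes):
    """Return True if the bytes parse as XML, False otherwise (including empty input)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


def main():
    # Restart only when the dump is broken; needs privileges to run systemctl.
    if not is_well_formed(fetch_gmond_xml()):
        subprocess.run(["systemctl", "restart", "gmond"], check=True)
```

Calling `main()` from a frequently scheduled cron job would shorten the window during which the other hosts go unmonitored, although it treats the symptom rather than the cause.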

vvuksan commented 5 years ago

I get that this is frustrating. However, I find it a bit rich that someone who is most affected by this bug but doesn't "have time" to fix it chides others as apathetic. This is a volunteer open source project. It's thankless work, and we could use more empathy and help.

Vladimir

beneschtech commented 5 years ago

We've been working 12-14 hour days seven days a week for a month now. I was up until 11pm tracking down a bug last night alone. I really don't have the time; I wasn't just saying that. I know open source work is thankless and grueling, but then again it has been almost 3 years. You know what? I'll call your bluff. We have a demo on Monday and code is frozen until then, so I'll see what I can figure out this weekend.

beneschtech commented 5 years ago

Ok, I looked into it, and there was a patch about 9 months ago that takes care of most of the issues. I don't know if it's been moved to the CentOS repo yet. I made a few more changes, mostly cosmetic, plus one more condition to help exclude memory corruption. My guess is that override_hostname, which is both a config setting and, in this case, a global variable, is getting stepped on somewhere. Without running it in a debugger for 24 hours I don't know how to pin it down better. Anyway, I built it on CentOS 7 on the main dev aggregator host, and if it stays up and sane for 24 hours, I'll submit a patch. Two key things to keep in mind:

1 - Make sure your config files have no non-ASCII characters in them.
2 - Download the latest libconfuse-devel from the repos and build this yourself.

The latest patch from 9 months ago should fix it for 3 out of 4 people in this bug report.

mohamedmshokry commented 4 years ago

@beneschtech Sorry for asking about this issue after such a long time, but I encountered a similar issue, with gmond flooding /var/log/messages with

/usr/sbin/gmond[6514]: Incorrect format for spoof argument. exiting.

I just need to make sure that I understand your recommendations the right way:

  1. Are you referring to gmond.conf?
  2. Can I use the latest libconfuse-devel released for CentOS (https://centos.pkgs.org/7/epel-x86_64/libconfuse-devel-2.7-7.el7.x86_64.rpm.html), or should I get the source and rebuild?