Closed Coservit closed 7 years ago
I found a culprit of the mixed-up messages.
The problem is the call to ftok function inside get_queue_id. Unfortunately the ftok with the current parameters (".", 9504 + pid) returns the same id for all pid(s) with the same least significant byte.
The ftok takes only the last byte into account from the id parameter (9504 + pid) in our case. We had a case of unfortunate coincidence that two active processes (23635 and 17747) tried to use the same IPC queue with the same id (1929423995).
I wrote the small program which only calls 'ftok' to generate the id for pid = 23635 and pid = 17747 and the result is:
key for 23635: 1929423995 key for 17747: 1929423995
so as you can see we have a collision for two numbers because of the equality of their last byte:
hex(9504+23635) = 0x8173 hex(9504+17747) = 0x6a73
Only the 0x73 is taken in account and as the first parameter is simple "." we have a collision.
I changed the code to generate a file in /tmp folder with pid in the name:
/tmp/ndo2db.24655.ipc.id for example and thus increasing the entropy.
Here is the new code of get_queue_id:
static char *queue_idfile_path = NULL;
int get_queue_id(int id) {
char fname[128];
int idfile;
if (queue_idfile_path == NULL)
{
asprintf(&queue_idfile_path, "/tmp/ndo2db.%d.ipc.id", id);
idfile=open(queue_idfile_path,O_RDWR | O_CREAT, S_IWUSR | S_IRUSR);
if (idfile < 0) {
free(queue_idfile_path);
queue_idfile_path = NULL;
} else {
close(idfile);
}
}
if (queue_id_file_path == NULL)
return -1;
key_t key = ftok(queue_idfile_path, NDO_QUEUE_ID+id);
if (key == -1 || (queue_id = msgget(key, IPC_CREAT | 0600)) < 0) {
syslog(LOG_ERR,"Error: queue init error.\n");
}
return queue_id;
}
Removal of the ipc.id file is done in del_queue:
void del_queue() {
.
.
.
if (queue_idfile_path != NULL)
{
unlink(queue_idfile_path);
free(queue_idfile_path);
queue_idfile_path = NULL;
}
}
The added pid file should assure good generation of IPC key id.
Or we should maybe consider to add support for mq_open, etc. POSIX queues ? But I am not entirely sure that we can use POSIX in the code for compatibility reason. Possibly using compilation macro through ./configure ???
Good find! Can you do a pull request? That way you get better credit.
Sorry for the delay - I created the PR #29. It contains also a fix for compilation error when --enable-ssl is passed to ./configure.
I'm closing the comment since the real action is in the PR. Thanks!
I'm going to say this, and only this:
NDO has been approved for a major overhaul (can you say multithreaded neb module with no weird abstraction, anyone?). I'll be getting started on this pretty soon.
I merged the PR #29 so that it is in there for reference for the rewrite. I didn't test it, personally. So if someone can confirm it works on say gentoo or cent or rhel, that'd be cool.
Hello,
we've updated our Nagios from 3.5 (with ndoutils 1.4b7) to Nagios 4.2.1 with the last ndoutils code (2.1.1). We use it in disruption-critical environment.
We encountered very strange problem where ndo2db mix-up suddenly messages comming from ndomod.o running on our Nagios box. We started to investigate the problem after we've noticed ndo2db: mysql_error messages in syslog.
For example, one of the errors:
As you can see it fails because 147825274178860 is 'invalid' timestamp with 178860 inadvertedly added.
We had complete logs for ndo2db for this problem and we can see this there (extract around the problematic line only):
The problematic line is:
[1478252888.697698] [001.2] [pid=30720] Handling: 61=147825274178860
Nearly all packets auround this line are mixed up.
I investigated the file in whole and everything is fine until following exchange:
after that everything is mixed-up. The HELLO sequence what you can see in the list above comes from 'service nagios reload' on the box where ndomod.o is running.
I tried to follow the code, tried to debug the code, wrote special 'nagios simulator' to attach directly to ndomod.o (small app in C++) but found no explanation for this message mix-up.
I am out of ideas for now, but maybe you can have any idea, or experience with similar problem ?? Or maybe somebody else have seen the same and found a solution ??
My last suspicion is that after the Nagios (4.2.1) reload there was more than one 'nagios with ndomod.o' sub-process running on the Nagios box - one failed to be killed during reload, one created after reload - and for some reason they both write to the same port and the same ndo2db process. But unfortunately the box was restarted in the meantime and I can't confirm that hypotesis right now.
Right now I am waiting for another example of the mysql_error to see if there is a multiplication of Nagios with ndomod.o processes to see if that can be the cause.
Can you please tell me if you have another idea how to investigate the problem ??
Just for reference, we've never seen anything like it with ndoutils 1.4b7, Nagios 3.5.
I am willing to assist you with any information you can need, any debugging I can do or logs to give you (this small excerise took 1GB of logs)
Thanks in advance,
Daniel