Vagabond / erlang-syslog

Erlang port driver for interacting with syslog via syslog(3)
Other
87 stars 56 forks source link

wedges system #9

Closed davidw closed 9 years ago

davidw commented 11 years ago

I'm pretty sure I must be doing something wrong, but this code makes my system hang when I try and run it:

syslog:start(). 
{ok,<0.195.0>}
(xyz@10.0.0.44)2> 
(xyz@10.0.0.44)2> {ok,Log} = syslog:open("erlsyslog", [cons, perror, pid], user).  

At this point it just hangs. I tracked things down in the C code to where it does this:

        nd->port = driver_create_port(port, pid, DRV_NAME, data);

But I have no idea what it locks up there. I suspect I'm doing something wrong, but it'd be nice if I got a gentle message rather than a hung system.

jlouis commented 11 years ago

For the record, I am seeing the same behaviour, R16B OSX Mountain Lion.

lemenkov commented 11 years ago

Me too. R16B, Linux x86_64.

jlouis commented 11 years ago

I am willing to bet that the problem has to do with locking and how R16B handles ports differently (asynchronously) compared to R15B. I'll probably take a deeper look into the code tomorrow.

jlouis commented 11 years ago

Confirmed that it works in OTP_R15B02-17-g1a4e8f6 so it is something new which has made the problem appear. There are many port changes in R16B, so perhaps thats it.

jlouis commented 11 years ago

A debug version of the Erlang VM seems to have caught something:

3> {ok,Log} = syslog:open("erlsyslog", [cons, perror, pid], user).
Locking 'driver_lock:syslog_drv[mutex]' lock which is already locked by thread!
Currently these locks are locked by the scheduler 2 thread:
  'driver_lock:syslog_drv[mutex]'

Program received signal SIGABRT, Aborted.
[Switching to process 30647 thread 0x2103]
0x00007fff9226c212 in __pthread_kill ()
Vagabond commented 11 years ago

Paging doctor @vinoski.

vinoski commented 11 years ago

Looking into it.

jlouis commented 11 years ago

Ok, a backtrace shows the problem occurs in io.c:827 where we call create_port. Now this call will try to take the driver_lock in io.c:344, but this is trouble! The erl_driver already runs in the DRIVER LOCK mode (opposed to PORT LOCK mode). This means that then our syslogdrv_control executes, we are holding that driver lock already. This leads to a double-lock which hangs the VM.

The inet_drv uses the same style where a port driver is created from its control callback, but that driver uses PORT LEVEL LOCKING and thus it seems to avoid it since the port has the lock instead and that is not a problem to grab. The question is if it is safe to enable PORT LOCKING for this driver as well.

So the question is really if it is safe to have multiple callers executing inside the syslog_drv.c code.

jlouis commented 11 years ago

I have been looking tinto this some more. There are concerns with the fix @lemenkov proposed I think. The concern is that openlog() and closelog() can be called with a race I think. They are not re-entrant, and I am not sure we have access to the reentrant versions, openlog_r() and closelog_r(). So I think we may open up a potential race between multiple schedulers trying to execute in the code section.

One possible solution is to protect the necessary syslog calls by a mutex, so only one scheduler is able to operate on the call itself at a time. This would mend the race window, but the change is a bit more involved then to fix.

vinoski commented 11 years ago

I have a completely different fix that refactors the code to move the port creation out of the C code up into Erlang instead. I think it simplifies everything.

jlouis commented 11 years ago

I like your fix better than mine. It is a move in a direction I like more than introducing more mutexes.

davidw commented 11 years ago

Just thought I'd mention here as well that @vinoski 's code appears to work for me, too, although now I'm a bit gunshy of using a compiled in port...

jlouis commented 11 years ago

I am running it in production as well and it looks like it works fine.

davidw commented 9 years ago

Thanks to @vinoski this has been fixed, it seems.