Closed davidw closed 9 years ago
For the record, I am seeing the same behaviour, R16B
OSX Mountain Lion.
Me too. R16B, Linux x86_64.
I am willing to bet that the problem has to do with locking and how R16B handles ports differently (asynchronously) compared to R15B. I'll probably take a deeper look into the code tomorrow.
Confirmed that it works in OTP_R15B02-17-g1a4e8f6
so it is something new which has made the problem appear. There are many port changes in R16B, so perhaps thats it.
A debug version of the Erlang VM seems to have caught something:
3> {ok,Log} = syslog:open("erlsyslog", [cons, perror, pid], user).
Locking 'driver_lock:syslog_drv[mutex]' lock which is already locked by thread!
Currently these locks are locked by the scheduler 2 thread:
'driver_lock:syslog_drv[mutex]'
Program received signal SIGABRT, Aborted.
[Switching to process 30647 thread 0x2103]
0x00007fff9226c212 in __pthread_kill ()
Paging doctor @vinoski.
Looking into it.
Ok, a backtrace shows the problem occurs in io.c:827
where we call create_port
. Now this call will try to take the driver_lock in io.c:344
, but this is trouble! The erl_driver already runs in the DRIVER LOCK mode (opposed to PORT LOCK mode). This means that then our syslogdrv_control
executes, we are holding that driver lock already. This leads to a double-lock which hangs the VM.
The inet_drv
uses the same style where a port driver is created from its control
callback, but that driver uses PORT LEVEL LOCKING
and thus it seems to avoid it since the port has the lock instead and that is not a problem to grab. The question is if it is safe to enable PORT LOCKING for this driver as well.
So the question is really if it is safe to have multiple callers executing inside the syslog_drv.c code.
I have been looking tinto this some more. There are concerns with the fix @lemenkov proposed I think. The concern is that openlog()
and closelog()
can be called with a race I think. They are not re-entrant, and I am not sure we have access to the reentrant versions, openlog_r()
and closelog_r()
. So I think we may open up a potential race between multiple schedulers trying to execute in the code section.
One possible solution is to protect the necessary syslog calls by a mutex, so only one scheduler is able to operate on the call itself at a time. This would mend the race window, but the change is a bit more involved then to fix.
I have a completely different fix that refactors the code to move the port creation out of the C code up into Erlang instead. I think it simplifies everything.
I like your fix better than mine. It is a move in a direction I like more than introducing more mutexes.
Just thought I'd mention here as well that @vinoski 's code appears to work for me, too, although now I'm a bit gunshy of using a compiled in port...
I am running it in production as well and it looks like it works fine.
Thanks to @vinoski this has been fixed, it seems.
I'm pretty sure I must be doing something wrong, but this code makes my system hang when I try and run it:
At this point it just hangs. I tracked things down in the C code to where it does this:
But I have no idea what it locks up there. I suspect I'm doing something wrong, but it'd be nice if I got a gentle message rather than a hung system.