Closed hesselink closed 10 years ago
My hunch is that it's related to: https://github.com/bos/pool/commit/73b14736dfe78621e6299517d6477ad08bb89354. @kim Do you have an idea what is going wrong?
@hesselink is it possible you can debug your program with gdb so we can get a stack trace?
We tried that briefly, but there seemed to be no useful info in the stack trace, no symbols. I just did it again, here's what I got (manuel
is the name of the executable):
manuel(27727,0x106a81000) malloc: *** error for object 0x2000000000000000: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x180b of process 27727]
0x00007fff93103866 in ?? ()
(gdb) bt
#0 0x00007fff93103866 in ?? ()
#1 0x00007fff9205335c in ?? ()
#2 0x0000000000000000 in ?? ()
Using lldb
I got some more useful stuff:
manuel(28308,0x107981000) malloc: *** error for object 0xa000000000000000: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Process 28308 stopped
* thread #9: tid = 0x492879, 0x00007fff93103866 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
frame #0: 0x00007fff93103866 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill + 10:
-> 0x7fff93103866: jae 0x7fff93103870 ; __pthread_kill + 20
0x7fff93103868: movq %rax, %rdi
0x7fff9310386b: jmpq 0x7fff93100175 ; cerror_nocancel
0x7fff93103870: ret
(lldb) bt
* thread #9: tid = 0x492879, 0x00007fff93103866 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
* frame #0: 0x00007fff93103866 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff9205335c libsystem_pthread.dylib`pthread_kill + 92
frame #2: 0x00007fff8f5cab1a libsystem_c.dylib`abort + 125
frame #3: 0x00007fff970b207f libsystem_malloc.dylib`free + 411
frame #4: 0x000000010410ed94 libpq.dylib`freePGconn + 164
frame #5: 0x0000000101da2517 manuel`PQfinish_app + 23
frame #6: 0x0000000101d76032 manuel`___lldb_unnamed_function480241$$manuel + 18
Ok, so it looks like the PostgreSQL connection is destroyed twice.
@hesselink when does the bug occur? I would suspect when the Pool gets garbage collected. Or does it happen sooner?
Another question: do you call destroyAllResources
from your program?
If it happens on garbage-collection time it might be some strange interaction between purging the local pools and the reaper thread. Although I can't explain why since both purgeLocalPool
and reaper
use STM stransactions to manage the list of idle resources.
@basvandijk you think newTVarIO
is actually unsafe? that'd be pretty bad news. @hesselink maybe you could try replacing the LocalPool
s initialisation code with something like:
-- ...
-- line 160:
localPools <- V.replicateM numStripes $ do
fin <- newIORef ()
lp <- atomically $ liftM2 LocalPool (newTVar 0) (newTVar [])
return $ lp fin
-- ...
At least we could then see if Bas' suspicion of some kind of race holds.
@basvandijk The bug occurs when our application starts up, around the point it makes a connection to fake-s3 using the aws library. If I don't make this connection, the crash doesn't occur. Before this, we make an (pooled) connection to the database, using a new pool that's garbage afterwards. If I remove this call, the crash goes away as well.
We don't call destroyAllResources
, although I can't speak for all our transitive dependencies.
@kim I tried that code, it still crashes.
I just did a git bisect
, and it is indeed 73b14736dfe78621e6299517d6477ad08bb89354 that is the culprit.
We don't call destroyAllResources, although I can't speak for all our transitive dependencies.
Ok, I don't think your transitive dependencies use it either since it's a very new addition (I added it less than a week ago).
I just did a git bisect, and it is indeed 73b1473 that is the culprit.
Ok that really helps!
Could you trace the calls to your destroy function (i.e. the function you pass to createPool)? And see if a double-free is really going on.
@hesselink are you using HDBC-postgresql?
They seem to do some strange things with finalizers. Particularly, I think this won't ever return 0
: https://github.com/hdbc/hdbc-postgresql/blob/master/Database/HDBC/PostgreSQL/Connection.hsc#L202 -- which is assumed here: https://github.com/hdbc/hdbc-postgresql/blob/master/Database/HDBC/PostgreSQL/Connection.hsc#L59
The thing is, the purpose of my patch was to ensure that the resources actually get destroyed. We had some issues with leaking file descriptors and it turned out that the reaper thread would get killed immediately and resources would never be finalized (cf. #15). Could it be that in your case actually closing the postgres connection would always crash the program?
Please disregard the pointers to the hdbc-postgresql source, that's unrelated (just quickly skimming the source).
After digging a little more into the HDBC source I found even more acrobatics involving weak refs. It looks quite likely to me that finalizers from both resource-pool and HDBC are trying to free the connection concurrently. I'm afraid there's not so much we can do from Data.Pool
, though. Maybe just pass a no-op destroy action and let HDBC deal with the cleanup?
@kim Yes, we're using HDBC-postgresql.
@basvandijk Our destroy function is just disconnect
from HDBC.
@kim Will that actually work? I'd rather not run out of DB connections at some point because they are not being closed properly.
So what is actually going on here? Both HDBC and resource-pool are trying to close the connection from different threads? Why are they both doing it? HDBC should only do that on garbage collection, right? Does your patch make resource-pool do the same?
@basvandijk I traced the destroy call passed to createPool
, and it's only being called once. So it's very probably that the other call is being made from HDBC.
I haven't been able to reproduce the failure yet, but here is what I think is going on:
foreign import unsafe
(used in HDBC-postgresql's C bits) may not hold. Thus, their reference counting may go wrong and you're actually able to double-free (which this whole kerfuffle is meant to prevent, ironically)Since we can't know from the pool if the connection was already freed, but the pool is discarded anyways, it seems safe to just pass a no-op destroy action. Alternatively, destroyAllResources
should also be safe.
EDIT:
The above doesn't apply if you hold on to the pool -- using disconnect
as the destroy action should be safe then
Alternatively, destroyAllResources should also be safe.
+1
I already added this as recommended practice to the documentation of createPool last week.
We don't want to rely on the garbage collector to close scarce resources. Finalizers should only be used as a last resort.
@hesselink I think we agree that HDBC is to blame here right?
Then I think it's best to close this issue.
@kim thanks for your help!
@basvandijk It's not quite clear to me. Two finalizers try to close the same connection, but I don't think it's obvious which one is wrong. And it feels very bad to me that code that looks 'obviously correct' (createPool connect disconnect
) leads to a runtime crash. How do you think this should be fixed in HDBC?
How do you think this should be fixed in HDBC?
Well, they should consider reimplementing the refcounting of HDBC-postgresql (which appears to be very ancient code btw) in Haskell rather than C. Or at least guard the counter with a lock. As I wrote above, they seem to be relying on unsafe
foreign calls usually blocking the whole RTS - which is rather meaningless in a finalizer.
Two finalizers try to close the same connection, but I don't think it's obvious which one is wrong. And it feels very bad to me that code that looks 'obviously correct' (createPool connect disconnect) leads to a runtime crash. How do you think this should be fixed in HDBC?
I think resource-pool does the right thing here. It destroys the resource when the pool goes out of scope. That's what you expect from a pooling abstraction.
Since HDBC is providing a disconnect function, with which users can manually close connections, they should take care that their automatic finalization doesn't interfere with their manual finalization.
We're getting a malloc error when using resource-pool-0.2.2.0. With 0.2.1.1 we don't get the error. The error is:
This is on Mac OS X 10.9, using GHC 7.6.3. All other dependencies stay the same, so I'm pretty sure it's resource-pool.