malloc error with resource-pool 0.2.2.0

hesselink commented 10 years ago

We're getting a malloc error when using resource-pool-0.2.2.0. With 0.2.1.1 we don't get the error. The error is:

manuel(61832,0x10c181000) malloc: *** error for object 0x7000000000000000: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug

This is on Mac OS X 10.9, using GHC 7.6.3. All other dependencies stay the same, so I'm pretty sure it's resource-pool.

basvandijk commented 10 years ago

My hunch is that it's related to: https://github.com/bos/pool/commit/73b14736dfe78621e6299517d6477ad08bb89354. @kim Do you have an idea what is going wrong?

basvandijk commented 10 years ago

@hesselink is it possible you can debug your program with gdb so we can get a stack trace?

hesselink commented 10 years ago

We tried that briefly, but there seemed to be no useful info in the stack trace, no symbols. I just did it again, here's what I got (manuel is the name of the executable):

manuel(27727,0x106a81000) malloc: *** error for object 0x2000000000000000: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x180b of process 27727]
0x00007fff93103866 in ?? ()
(gdb) bt
#0  0x00007fff93103866 in ?? ()
#1  0x00007fff9205335c in ?? ()
#2  0x0000000000000000 in ?? ()

hesselink commented 10 years ago

Using lldb I got some more useful stuff:

manuel(28308,0x107981000) malloc: *** error for object 0xa000000000000000: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Process 28308 stopped
* thread #9: tid = 0x492879, 0x00007fff93103866 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
    frame #0: 0x00007fff93103866 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill + 10:
-> 0x7fff93103866:  jae    0x7fff93103870            ; __pthread_kill + 20
   0x7fff93103868:  movq   %rax, %rdi
   0x7fff9310386b:  jmpq   0x7fff93100175            ; cerror_nocancel
   0x7fff93103870:  ret
(lldb) bt
* thread #9: tid = 0x492879, 0x00007fff93103866 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
  * frame #0: 0x00007fff93103866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff9205335c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x00007fff8f5cab1a libsystem_c.dylib`abort + 125
    frame #3: 0x00007fff970b207f libsystem_malloc.dylib`free + 411
    frame #4: 0x000000010410ed94 libpq.dylib`freePGconn + 164
    frame #5: 0x0000000101da2517 manuel`PQfinish_app + 23
    frame #6: 0x0000000101d76032 manuel`___lldb_unnamed_function480241$$manuel + 18

basvandijk commented 10 years ago

Ok, so it looks like the PostgreSQL connection is destroyed twice.

@hesselink when does the bug occur? I would suspect when the Pool gets garbage collected. Or does it happen sooner?

Another question: do you call destroyAllResources from your program?

If it happens on garbage-collection time it might be some strange interaction between purging the local pools and the reaper thread. Although I can't explain why since both purgeLocalPool and reaper use STM stransactions to manage the list of idle resources.

kim commented 10 years ago

@basvandijk you think newTVarIO is actually unsafe? that'd be pretty bad news. @hesselink maybe you could try replacing the LocalPools initialisation code with something like:

-- ...
-- line 160:
localPools <- V.replicateM numStripes $ do
  fin <- newIORef ()
  lp  <- atomically $ liftM2 LocalPool (newTVar 0) (newTVar [])
  return $ lp fin
-- ...

At least we could then see if Bas' suspicion of some kind of race holds.

hesselink commented 10 years ago

@basvandijk The bug occurs when our application starts up, around the point it makes a connection to fake-s3 using the aws library. If I don't make this connection, the crash doesn't occur. Before this, we make an (pooled) connection to the database, using a new pool that's garbage afterwards. If I remove this call, the crash goes away as well.

We don't call destroyAllResources, although I can't speak for all our transitive dependencies.

@kim I tried that code, it still crashes.

hesselink commented 10 years ago

I just did a git bisect, and it is indeed 73b14736dfe78621e6299517d6477ad08bb89354 that is the culprit.

basvandijk commented 10 years ago

We don't call destroyAllResources, although I can't speak for all our transitive dependencies.

Ok, I don't think your transitive dependencies use it either since it's a very new addition (I added it less than a week ago).

I just did a git bisect, and it is indeed 73b1473 that is the culprit.

Ok that really helps!

Could you trace the calls to your destroy function (i.e. the function you pass to createPool)? And see if a double-free is really going on.

kim commented 10 years ago

@hesselink are you using HDBC-postgresql?

They seem to do some strange things with finalizers. Particularly, I think this won't ever return 0: https://github.com/hdbc/hdbc-postgresql/blob/master/Database/HDBC/PostgreSQL/Connection.hsc#L202 -- which is assumed here: https://github.com/hdbc/hdbc-postgresql/blob/master/Database/HDBC/PostgreSQL/Connection.hsc#L59

The thing is, the purpose of my patch was to ensure that the resources actually get destroyed. We had some issues with leaking file descriptors and it turned out that the reaper thread would get killed immediately and resources would never be finalized (cf. #15). Could it be that in your case actually closing the postgres connection would always crash the program?

kim commented 10 years ago

Please disregard the pointers to the hdbc-postgresql source, that's unrelated (just quickly skimming the source).

After digging a little more into the HDBC source I found even more acrobatics involving weak refs. It looks quite likely to me that finalizers from both resource-pool and HDBC are trying to free the connection concurrently. I'm afraid there's not so much we can do from Data.Pool, though. Maybe just pass a no-op destroy action and let HDBC deal with the cleanup?

hesselink commented 10 years ago

@kim Yes, we're using HDBC-postgresql.

@basvandijk Our destroy function is just disconnect from HDBC.

hesselink commented 10 years ago

@kim Will that actually work? I'd rather not run out of DB connections at some point because they are not being closed properly.

So what is actually going on here? Both HDBC and resource-pool are trying to close the connection from different threads? Why are they both doing it? HDBC should only do that on garbage collection, right? Does your patch make resource-pool do the same?

hesselink commented 10 years ago

@basvandijk I traced the destroy call passed to createPool, and it's only being called once. So it's very probably that the other call is being made from HDBC.

kim commented 10 years ago

I haven't been able to reproduce the failure yet, but here is what I think is going on:

my patch (73b1473) ensures that the destroy action is actually run on GC (that wasn't necessarily the case before)
HDBC tries hard to free connections using finalizers as well. HDBC-postgresql even goes as far as implementing some kind of custom reference counting scheme in C, which I haven't understood the need for yet.
you wrote that you're making some one-off database calls, after which both the pool and anything in HDBC goes out of scope
my guess is that, if the GC determines it should run both the finalizers from resource-pool and HDBC, it does so concurrently (and in unspecified order) and the "stop the world" semantics of foreign import unsafe (used in HDBC-postgresql's C bits) may not hold. Thus, their reference counting may go wrong and you're actually able to double-free (which this whole kerfuffle is meant to prevent, ironically)

Since we can't know from the pool if the connection was already freed, but the pool is discarded anyways, it seems safe to just pass a no-op destroy action. Alternatively, destroyAllResources should also be safe.

EDIT: The above doesn't apply if you hold on to the pool -- using disconnect as the destroy action should be safe then

basvandijk commented 10 years ago

Alternatively, destroyAllResources should also be safe.

+1

I already added this as recommended practice to the documentation of createPool last week.

We don't want to rely on the garbage collector to close scarce resources. Finalizers should only be used as a last resort.

basvandijk commented 10 years ago

@hesselink I think we agree that HDBC is to blame here right?

Then I think it's best to close this issue.

@kim thanks for your help!

hesselink commented 10 years ago

@basvandijk It's not quite clear to me. Two finalizers try to close the same connection, but I don't think it's obvious which one is wrong. And it feels very bad to me that code that looks 'obviously correct' (createPool connect disconnect) leads to a runtime crash. How do you think this should be fixed in HDBC?

kim commented 10 years ago

How do you think this should be fixed in HDBC?

Well, they should consider reimplementing the refcounting of HDBC-postgresql (which appears to be very ancient code btw) in Haskell rather than C. Or at least guard the counter with a lock. As I wrote above, they seem to be relying on unsafe foreign calls usually blocking the whole RTS - which is rather meaningless in a finalizer.

basvandijk commented 10 years ago

Two finalizers try to close the same connection, but I don't think it's obvious which one is wrong. And it feels very bad to me that code that looks 'obviously correct' (createPool connect disconnect) leads to a runtime crash. How do you think this should be fixed in HDBC?

I think resource-pool does the right thing here. It destroys the resource when the pool goes out of scope. That's what you expect from a pooling abstraction.

Since HDBC is providing a disconnect function, with which users can manually close connections, they should take care that their automatic finalization doesn't interfere with their manual finalization.

bos / pool

malloc error with resource-pool 0.2.2.0 #18