Fix non-deterministic segfaults caused by concurrent garbage collector

maurobringolf commented 3 years ago

I created a reproduction for #27 which works reliably on my machine and GitHub actions so far. It crashes with different errors though, here are some examples:

double free or corruption (fasttop)

malloc_consolidate(): invalid chunk size

terminated by signal SIGSEGV (Address boundary error)

ASSERTION VIOLATION
File: ../src/ast/ast.cpp
Line: 450
UNEXPECTED CODE WAS REACHED.

free(): corrupted unsorted chunks

maurobringolf commented 3 years ago

Observation: Removing all free operations (like https://github.com/IagoAbal/haskell-z3/issues/27#issuecomment-793055118) removes the segfaults.
Observation: Just freeing ref counted objects (mkC2hRefCount) but not contexts is enough to trigger segfaults.

IagoAbal commented 3 years ago

Nice work! I think your observations are all in line with my expectations. Decreasing the ref count for an object may be modifying the context itself, so one cannot do these operations concurrently. I still believe that introducing a context lock and using it to serialize finalizers may fix the issue. We already take care of not freeing the context before its objects.

Do you want to give this a try or should I?

maurobringolf commented 3 years ago

I thought the same and tried a few variations of locking Context. I could not get it to work by locking only inside the finalizers, so I ended up locking inside withContext as well. To do that I needed to change to the re-entrant version RLock, otherwise it deadlocks instantly.

The current version passes the test but is clearly undesirable because it forces complete serialization and has way too many locking operations. Do you have any ideas?

I think https://github.com/IagoAbal/haskell-z3/issues/27#issuecomment-792654862 is not applicable, but maybe I am wrong.

maurobringolf commented 3 years ago

Does it fix the segfaults?

It does, for this test case at least. I am a bit concerned about locking the context on every API call, don't you think this is a problem?

If we decide to use this solution I would include some documentation in the readme about thread safety and locking, because otherwise someone writing concurrent code might end up getting serialized by our library without realizing it (If I am understanding it correctly).

IagoAbal commented 3 years ago

As far as I understood Z3 contexts are not thread-safe so if someone is trying to manipulate the same context concurrently is doing it wrong anyways. In our case the GC is doing exactly that and hence people get segfaults (c.f., #27). But we don't need to go this far, it's good enough to serialize the finalizers and document that Z3.Base.Context is not thread safe.

maurobringolf commented 3 years ago

t's good enough to serialize the finalizers

Based on my tests it is not: the testcase still segfaults. According to my understanding this depends whether GC/finalizers ever run concurrently with the program itself. I found some references to concurrent garbage collection in the GHC docs, so I think we need to control for that.

IagoAbal commented 3 years ago

Based on my tests it is not: the testcase still segfaults. According to my understanding this depends whether GC/finalizers ever run concurrently with the program itself.

That is weird! The GC "stops the world" by default, meaning the program doesn't run while the GC runs. I now see the --nonmoving-gc option, but it's not the default and I don't see you enabling it.

maurobringolf commented 3 years ago

It turns out that -N implies --nonmoving-gc from the docs:

Setting -N also has the effect of enabling the parallel garbage collector

and also:

Under this collection strategy oldest-generation garbage collection can proceed concurrently with mutation.

So the compiler options in this PR and bug reports result in concurrent garbage collection, which results in concurrent operations on the not-thread-safe context objects. When adding --copying-gc to this PR's configuration it runs fine without any locks. As far as I understand the docs this is not guaranteed, because finalizers might still run in concurrently. But if we lock just the finalizers, then we should be safe in this GC.

So have three ideas where we can go from here:

Lock every context access unconditionally. Does not change any API, but incurs locking overhead even when not needed. This is what is currently implemented in this PR.
Offer a Context option without garbage collection, basically an API for something like https://github.com/IagoAbal/haskell-z3/issues/27#issuecomment-793055118. We could then document that this should be used when using concurrent GC.
Automate option 2, somehow conditionally compile the locks by inspecting which garbage collector is used in current compilation. I have not found the necessary APIs though and don't even know if that is possible.
Document that we do not support the --nonmoving-gc and only lock finalizers.

I am trying to figure out whether option 3 is feasible, because I think it makes the most sense.

IagoAbal commented 3 years ago

I believe that parallel GC may refer to both the copying GC and the "non moving" GC. I am a bit surprised about this behavior because copying is supposed to be the default. But anyhow, let's assume the worst case scenario here.

I am more fond of option 1, I would just like to avoid the dependency with concurrent-extra. (I would be happy to depend on it if it were actively maintained, but it doesn't seem to be.) We only need a minimal implementation of RLock on top of MVar, one can basically copy the way it's done in concurrent-extra. I can help with that if you want.

One could also rely on a single MVar, but then we would have to be very careful not introducing potential deadlocks. Perhaps some code would have to be slightly restructured. So overall it may be more work than implementing an RLock and long-term a potential maintenance headache.

maurobringolf commented 3 years ago

Okay, I think opting for the safest version is a good idea. So I adopted Lock and RLock and kept all synchronization in place. It is probably not the best solution yet, but worth merging to get rid of the segfaults.

Maybe I will try to create a multi-threaded example for further testing and documentation.

maurobringolf commented 3 years ago

Fixed the copy paste mistakes, thanks for catching!

Is the new Concurrency section in the readme understandable?

A new release is a good idea, in the readme I referred to it as 408.3 is that correct?

IagoAbal commented 3 years ago

Thanks for the great work!

IagoAbal commented 3 years ago

BTW, I'm planning to release 408.3 this coming weekend, perhaps on Sunday.

maurobringolf commented 3 years ago

This was great, I learned a lot doing this!

A release with this fix is a good idea. The next thing I planned to tackle was some improvements in compiler compatibility such that there are no warnings with any GHC version (or just less, I currently get quite a number of them) and also testing it with the new GHC version 9. I think I will have time for this before Sunday, but if not it is not critical at all and can go in a later release.

IagoAbal commented 3 years ago

Sorry I've been quite busy, hope to release this by next weekend...

maurobringolf commented 3 years ago

No worries, I have been working on more testing and further API extensions in the meantime.

IagoAbal / haskell-z3

Fix non-deterministic segfaults caused by concurrent garbage collector #61