Raku / problem-solving

🦋 Problem Solving, a repo for handling problems that require review, deliberation and possibly debate

Metamodel concurrency problems #418

Open vrurg opened 4 months ago

vrurg commented 4 months ago

Rakudo's Metamodel implementation is currently a high-risk zone when it comes to concurrency. One example is my recent fix, submitted with PR rakudo/rakudo#5519.

Another can be found in Perl6/Metamodel/MethodContainer.nqp, where hash- and array-typed attributes are used as-is all across the code, with no protection whatsoever. For example, the cache method reads directly from %!cache while a parallel thread might be updating it.
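As a minimal Raku sketch of that pattern (the class and method names here are hypothetical, not the actual MethodContainer code): a hash attribute read by one method while another mutates it, with nothing serializing the two.

# Hypothetical sketch of the unprotected pattern described above;
# not the real Metamodel code.
class MethodContainerLike {
    has %!cache;

    # Unguarded read: copying or iterating %!cache can observe a
    # half-updated hash if a writer runs in parallel.
    method cache() { %!cache }

    # Unguarded write: a multi-step hash insertion with no lock.
    method add_method($name, $code) {
        %!cache{$name} = $code;
    }
}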

There are other Metamodel type objects where hash and array attributes are used. They're not always marked with the corresponding sigils, as oftentimes a scalar attribute gets initialized with a hash or a list.

vrurg commented 4 months ago

I don't have a solution to this. Only a few thoughts to share.

Diagnosis

The MRO case fix became possible with https://github.com/MoarVM/MoarVM/pull/1783 when, instead of ending up with memory corrupted by parallel hash operations, MoarVM started throwing legitimate exceptions with backtrace pointing exactly at the source of the bug.

Unfortunately, since I'm not an expert in MoarVM, I had to implement the PR in a way that disables speshing of hash allocations. This is, perhaps, the biggest contributor to the ~20-30% speed reduction I observe.

Another point is that arrays are likely to require the same protection mechanism, and I'm barely able to implement that part either.

Atomic attribute access

Many problems could be solved with atomic ops (cas, primarily), if only they were available to NQP code. Unfortunately, this is not currently the case. I'm not even sure it is technically possible to have them on representations other than native types.

NB: I was considering atomizing attribute read/write operations, but this approach could only help in a handful of simple scenarios. Many cases of multi-step code, especially those involving cloning, couldn't be solved by simply re-arranging reads and assignments.
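For comparison, Raku itself (since 6.d) exposes a cas routine that retries a code block against a Scalar container; something comparable at the NQP attribute level is what's being wished for here. A hedged Raku-level sketch, with a made-up cache-add helper:

# Raku-level illustration of a cas-style multi-step update: copy,
# modify, and atomically swap in the new structure; `cas` retries the
# block if another thread raced us.
my $cache = {};                 # Scalar container holding a hash

sub cache-add($key, $value) {
    cas $cache, -> %old {
        my %new = %old;         # work on a private copy
        %new{$key} = $value;
        %new                    # published atomically on success
    };
}

cache-add('a', 1);
say $cache<a>;                  # 1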

Locks and object construction protocol in NQP

The above-mentioned fix for MRO involves the use of a lock (an NQPLock, in particular). Unfortunately, NQP is a rather primitive language lacking support for attribute defaults; i.e., has $!lock = NQPLock.new; isn't an option. To get around this limitation I introduced a method initializer that is invoked from constructors.
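A minimal NQP sketch of that workaround (names are illustrative, not the actual metamodel code): the lock is created in an initializer method that every constructor must remember to call, which is exactly the fragility described next.

# Illustrative NQP sketch, not the actual metamodel code.
class Guarded {
    has $!lock;

    # `has $!lock = NQPLock.new;` is not valid NQP, hence an explicit
    # initializer that constructors must call:
    method init_lock() {
        $!lock := NQPLock.new;
        self
    }

    method new() {
        nqp::create(self).init_lock()
    }

    method do_protected($code) {
        $!lock.protect($code)
    }
}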

So far so good, but now I've started observing "no method 'protect' on NQPMu" errors. They're less frequent than the crashes related to the %!mro attribute used to be, but they happen. The confusing part is that the code where the failure takes place always belongs to a Metamodel class which does call the method initializer. This means that somehow the constructor method wasn't invoked when the type object was created. Unfortunately, I can't currently tell how that is even possible, as my first suspect, mixins, doesn't bypass the standard protocol of type object creation.

Either way, it is important to keep in mind that the metamodel has many NQP roles. Whereas a class can rely on new to do early initializations (BUILD is to be left alone, as it is the executor of the build plan), a role can only rely on the good will of its consuming class.

From this point of view, having something like TWEAK would be beneficial for NQP, except that Raku's TWEAK is a submethod, which is what allows it to be implemented on roles (6.e only). There are no submethods in NQP, and implementing them would be another step toward Raku's complexity. I'm not sure we need this, but neither do I see a better solution.

Locks???

Let's get back to the MethodContainer role. Three %! attributes, two @! ones. Their use is scattered across the role's code. How many locks do we need? What would be the performance cost?

This is where it would be very helpful to have a guarantee that attribute binding is atomic. I.e., my %h = %!hash wouldn't crash, nor would %h end up bound to a broken hash, simply because %!hash := nqp::hash() is being executed simultaneously in a parallel thread.

NQP Metamodel

As a matter of fact, the NQP metamodel is not much different from Raku's. @lizmat spotted a similar exception thrown by NQPClassHOW.

raiph commented 4 months ago

Is it reasonable to discuss switching the metamodel to be/use actors, not classes?

lizmat commented 4 months ago

Is it reasonable to discuss switching the metamodel to be/use actors, not classes?

The metamodel is still in NQP land, which limits such HLL concepts :-(

lizmat commented 4 months ago

I'm trying to set up some scripts to try to force these issues out. My first attempt is to check whether you can safely clone an nqp::hash that is being written to by another thread:

use nqp;

my $a := nqp::hash;
# writer: keeps adding keys forever
start {
    my int $i;
    while ++$i {
        nqp::bindkey($a, $i, 1);
    }
}

# reader: repeatedly clones the hash while it is being written to
await start {
    my int $i;
    while ++$i < 10000 {
        my $b := nqp::clone($a);
    }
}

say nqp::elems($a);

It appears not :-(

MoarVM oops: MVM_str_hash_iterator_next_nocheck called with a stale hashtable pointer

lizmat commented 4 months ago

Similarly, you cannot clone an NQP list that is being written to by another thread:

use nqp;

my $a := nqp::list;
start {
    my int $i;
    while ++$i {
        nqp::bindpos($a, $i, 1);
    }
}

await start {
    my int $i;
    while ++$i < 10000 {
        my $b := nqp::clone($a);
    }
}

say nqp::elems($a);
zsh: segmentation fault

lizmat commented 4 months ago

Looking at the MetaModel code, specifically ClassHOW, I wonder whether a short-term solution could be a MetaModel Interpreter Lock (akin to Python's GIL).
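For concreteness, a hedged sketch of what such a lock might look like at the HLL level (purely illustrative; nothing like this exists today): a single global lock that every metamodel entry point funnels through, trading all parallelism in those paths for safety.

# Purely illustrative GIL-style sketch: one global lock serializing
# all metamodel operations.
my $MMIL := Lock.new;

sub mm-protect(&op) {
    $MMIL.protect(&op)
}

# A metamodel-ish method would wrap its body, e.g.:
my %methods;
mm-protect({ %methods<foo> := -> { 42 } });
say mm-protect({ %methods<foo>() });   # 42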

lizmat commented 4 months ago

Alternately, in a lot of the places that use caching, we could maybe get by if nqp::clone were made thread-safe.

niner commented 4 months ago

It's not cloning that's not thread safe, it's modification. Modifying a hash or even an array is an involved process containing multiple steps. The safe times are only before you execute the first step or after executing the last step. In between the data structure is internally inconsistent and trying to read from it can result in all kinds of errors from just wrong results to crashes because we try to dereference an invalid pointer.

With complex data structures, in general you can have either fast or thread-safe. If you need thread safety, there are different ways to get it, and which one is best depends on your exact use case. E.g. you can simply require every operation on the data structure to acquire a lock, i.e. serializing access. That's not a bad start. But when, for example, the data structure is written to only very rarely and read a lot, it may be better to create a full copy of the whole thing, modify that, and then atomically replace the pointer to the actual data when it's ready for reading. This way the read code does not have to take a costly lock. Note however that the write code still requires a lock; otherwise you would end up with a missing-writes problem.

This is the clone + replace pattern that we already use in the meta model. Of course this requires that the data structure is never ever modified directly and every modification only happens on a private clone before it gets published (i.e. the pointer replaced). It also depends on pointer replacing (or attribute assignment in our case) being an atomic operation.

lizmat commented 4 months ago

It's not cloning that's not thread safe, it's modification.

In my examples, the data structure is only cloned, not written to or read from in the "other" threads.

niner commented 4 months ago

But cloning is reading. You have to read the full data structure to be able to make a clone of it. So you have one thread writing and another thread reading with no protection. This will lead to exactly what I described: a reader stumbling over an internally inconsistent data structure because only some of the steps required to e.g. add a new key have been done yet and other required ones haven't.

But yes, my initial sentence is not great in hindsight, because it's really the combination of concurrent read+write that's not thread safe. My point was that while there are solutions that do not require any modifications of the reading code, all solutions require adaptations in the writing code.

The classic solution that also modifies readers would be protecting the data structure with a lock:

use nqp;

my $a := nqp::list;
my $l = Lock.new;
start {
    my int $i;
    while ++$i {
        $l.protect: {
            nqp::bindpos($a, $i, 1);
        }
    }
}

await start {
    my int $i;
    while ++$i < 10000 {
        $l.protect: {
            my $b := nqp::clone($a);
        }
    }
}

say nqp::elems($a);

The solution that does not require locking in the reader uses clone+atomically replace:

use nqp;

my $a := nqp::list;
start {
    my int $i;
    while ++$i {
        my $c := nqp::clone($a);   # modify a private copy...
        nqp::bindpos($c, $i, 1);
        $a := $c;                   # ...then publish it by rebinding
    }
}

await start {
    my int $i;
    while ++$i < 10000 {
        my $b := nqp::clone($a);   # readers take no lock
    }
}

say nqp::elems($a);

niner commented 4 months ago

Note that my second example still only allows for a single concurrent writer. If you have multiple writers, you need a lock in the write path:

use nqp;

my $a := nqp::list;
my $l := Lock.new;
sub modifier {
    my int $i;
    while ++$i {
        $l.protect: {
            my $c := nqp::clone($a);
            nqp::bindpos($c, $i, 1);
            $a := $c;
        }
    }
}
# ten concurrent writers, all serialized by the lock in the write path
start modifier for ^10;

await start {
    my int $i;
    while ++$i < 10000 {
        my $b := nqp::clone($a);
    }
}

say nqp::elems($a);

vrurg commented 4 months ago

Note that my second example still only allows for a single concurrent writer. If you have multiple writers, you need a lock in the write path

And even this example isn't totally safe, because simple binding itself remains non-atomic. I'm not sure about lexicals, but I was observing crashes related to non-atomic updates of an attribute, which led me to using nqp::atomicbindattr in the Stash code. Although I have doubts about how effective the binding alone is without atomic fetching.
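For reference, a hedged sketch of the nqp::atomicbindattr pattern mentioned above (the Holder class is made up; the real usage lives in Rakudo's Stash code): the replacement structure is built privately, then bound into the attribute in a single atomic step.

use nqp;

# Hypothetical sketch; not the actual Stash code.
class Holder { has $!data }

my $h := Holder.new;

# Build the replacement off to the side...
my $new := nqp::hash('key', 1);
# ...then publish it atomically, so no reader observes a torn binding.
nqp::atomicbindattr($h, Holder, '$!data', $new);

# The read side is still a plain getattr -- hence the doubt above about
# atomic binding without an atomic fetch.
say nqp::existskey(nqp::getattr($h, Holder, '$!data'), 'key');  # 1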

I wonder whether a short-time solution could be a MetaModel Interpreter Lock (akin to Python's GIL)

There are two questions about this approach. First of all, locks are not cheap. How would this impact the overall performance of Rakudo?

Second, and the more complicated one: how would you impose such a lock? Don't forget that some parts of the metamodel are invoked by the VM itself. It makes no distinction between Raku and NQP, because all it deals with are code objects.

lizmat commented 4 months ago

Don't forget that some parts of the metamodel are invoked by the VM itself too.

That's news to me. I assume all is done through methods in the metamodel, is it not?

vrurg commented 4 months ago

Don't forget that some parts of the metamodel are invoked by the VM itself too.

That's news to me. I assume all is done through methods in the metamodel, is it not?

Not methods only, but mostly. What comes to mind:

With the last two cases, methods are invoked by NQP code of the registered callbacks.

patrickbkr commented 4 months ago

Currently I don't see an alternative to just finding all problematic cases and making each thread safe using locks.

As a rough estimate, how many attributes are we talking about that can currently be subject to unwanted concurrent access? Or do we simply not know, and the biggest issue is finding the problematic cases?

vrurg commented 4 months ago

Currently I don't see an alternative to just finding all problematic cases and making each thread safe using locks.

The matter is very much about having the necessary tools to locate the cases.

Or do we simply not know and the biggest issue is finding the problematic cases?

Pretty much the way you say it. Some scalar attributes are initialized with collection objects, which makes it harder to collect statistics in a straightforward manner.

Aside from that, fixes might be more complicated than anticipated at first glance. Also, some potentially hot paths had better get solutions that don't affect performance or, at least, degrade it only within acceptable limits.

lizmat commented 4 months ago

With all of the work on NQP I did over the last week, I haven't had a REA harvester crash in the past 3 days. Before that work, it would crash about once every 1.5 days.