KWARC / rust-libxml

Rust wrapper for libxml2
https://crates.io/crates/libxml
MIT License

Thread-safe via `Arc<Mutex<...>>` #47

Closed dginev closed 5 years ago

dginev commented 5 years ago

First stab at thread safety ( #46 ).

I would be quite lucky if you have some time to review and think about the general problem, @triptec, or if you know of other crates that may have solved a similar issue nicely, so that we can borrow an approach from them.

In this PR I added the "best practice" from the official Rust book, which is to hide the libxml node pointer behind even more indirection + guarantees:

type NodeRef = Arc<Mutex<_Node>>;

And you could still pretend this is a regular Rust data structure and call node.set_attribute(...) or similar on a mutable node, without ever realizing there is an atomic reference counter and a mutex behind the Node struct. In theory this enables parallel traversals / mutations such as the mock test added here, to the tune of:

use rayon::prelude::*;
// ...
nodes.into_par_iter().for_each(|mut node| {
    node.set_attribute("key", "val");
});

and get the parallel processing for free via the rayon iterator, while still remaining safe for mutation thanks to the Arc<Mutex<...>>.
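For illustration, the wrapper side of that would look roughly like the sketch below; `_Node` and the attribute map are simplified stand-ins for the struct that actually owns the libxml2 pointer, so treat this as a sketch rather than the crate's code:

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Simplified stand-in for the struct that owns the raw libxml2 pointer;
// attributes are modeled as a plain map just to make the sketch runnable.
#[derive(Debug)]
struct _Node {
    attributes: HashMap<String, String>,
}

type NodeRef = Arc<Mutex<_Node>>;

#[derive(Clone, Debug)]
pub struct Node(NodeRef);

impl Node {
    pub fn set_attribute(&mut self, key: &str, value: &str) {
        // The lock is taken inside the method, so the caller never sees
        // the Arc<Mutex<...>> plumbing.
        let mut inner = self.0.lock().unwrap();
        inner.attributes.insert(key.to_owned(), value.to_owned());
    }
}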

The two main bits that are worrying me are:

  1. This loses performance for single-threaded applications, and I assume most crate users will focus on exactly that classic application type. Maybe it makes sense to have a separate set of parallel structs, such as NodeAtomic and DocumentAtomic, which implement Send+Sync, instead of overriding the basic structs, so that single-threaded users are not impacted (a sketch of that split follows below)... but the boilerplate will become huge if I try to maintain two implementations...
  2. Sanity. I am quite troubled that I had to manually "claim" the final structs are Sync+Send via:

     #[derive(Clone, Debug)]
     pub struct Node(NodeRef);

     unsafe impl Sync for Node {}
     unsafe impl Send for Node {}



I would sleep much better if Rust could auto-derive those, based on the `Arc<Mutex<...>>` that now wraps the inner node data, but I am unsure whether I am missing a piece needed to achieve that, or whether that is simply not possible for custom structs... Feedback welcome.
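To make the NodeAtomic idea from point 1 concrete, here is roughly the shape such a split could take; `_Node` is a stand-in for the struct holding the raw libxml2 pointer, so this is a sketch rather than anything in the crate today:

use std::cell::RefCell;
use std::rc::Rc;
use std::sync::{Arc, Mutex};

struct _Node(*mut std::ffi::c_void); // stand-in for the raw libxml2 pointer

// Single-threaded wrapper: no atomics, no lock, same cost profile as master.
#[derive(Clone)]
pub struct Node(Rc<RefCell<_Node>>);

// Opt-in parallel wrapper: pays for Arc + Mutex only when explicitly chosen.
#[derive(Clone)]
pub struct NodeAtomic(Arc<Mutex<_Node>>);

unsafe impl Send for NodeAtomic {}
unsafe impl Sync for NodeAtomic {}

The downside is exactly the boilerplate worry above: every method would have to exist twice, or be generated by a macro, to keep the two wrappers in sync.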

P.S. The rayon test, and the rayon dependency, are here just for example purposes; I would remove both before merging any version of this PR, as I don't think rayon should be a dependency of the generic libxml crate.
dginev commented 5 years ago

If I understand the Rust book correctly: https://doc.rust-lang.org/nightly/book/ch19-01-unsafe-rust.html#implementing-an-unsafe-trait

If we have a raw pointer, then no matter how many layers of guarantees we wrap around it, the compiler can never verify that the type is Send+Sync the way it can for purely native types. Thus the onus is on the module author to make the claim manually via `unsafe impl Send`/`unsafe impl Sync for X {}`, as I have done here.
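Spelled out as a self-contained example (the assert_send_sync helper is made up for this illustration, and `_Node` stands in for the pointer-holding struct):

use std::ffi::c_void;
use std::sync::{Arc, Mutex};

// Stand-in for the libxml2 node pointer.
struct _Node(*mut c_void);

// Mutex<T> is Send + Sync only when T: Send, and raw pointers are neither
// Send nor Sync, so the compiler will not auto-derive either trait here.
pub struct Node(Arc<Mutex<_Node>>);

// The claim has to be made by hand; the compiler cannot know whether
// libxml2 tolerates this particular pointer being touched from another thread.
unsafe impl Send for Node {}
unsafe impl Sync for Node {}

// Compile-time check (a helper made up for this example) that the claim holds.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    assert_send_sync::<Node>();
}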

In which case the main consideration becomes: how do we stay lean on performance?

dginev commented 5 years ago

Got some encouragement (on the Rust Discord server) that atomics aren't felt that strongly in single-threaded contexts, so maybe this is a sensible approach after all... Will run a benchmark on some large XML snippet to see if any visible slowdown occurs ...
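For reference, the kind of Criterion harness I have in mind looks roughly like the following; the file path is a placeholder and the traversal is a bare-bones DFS node count, not the exact benchmark that ends up in the repo:

use criterion::{criterion_group, criterion_main, Criterion};
use libxml::parser::Parser;

// Placeholder path; the real benchmark uses a large (~2 million line) XML file.
const BENCH_FILE: &str = "tests/resources/big.xml";

fn single_thread_dfs(c: &mut Criterion) {
    let parser = Parser::default();
    let doc = parser.parse_file(BENCH_FILE).unwrap();

    c.bench_function("single thread", |b| {
        b.iter(|| {
            // Iterative depth-first traversal, counting every node we visit.
            let mut stack = vec![doc.get_root_element().unwrap()];
            let mut count = 0usize;
            while let Some(node) = stack.pop() {
                count += 1;
                stack.extend(node.get_child_nodes());
            }
            count
        })
    });
}

criterion_group!(benches, single_thread_dfs);
criterion_main!(benches);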

dginev commented 5 years ago

I had a sneaking suspicion this would be the case - a naive benchmark shows a parallel iterator DFS to be much slower than the single-threaded version. I pushed the example at b2fff0d2428c55b81ea53bddd6939894e45ef3f9

and the first benchmark results with Criterion were:

     Running target/release/deps/parsing_benchmarks-e718803c0af36e8a
single thread           time:   [1.4070 s 1.4151 s 1.4187 s]                           

multi thread            time:   [4.6913 s 4.6964 s 4.7012 s]                          

This is particularly shocking since I have 32 logical threads available on my Threadripper 1950X CPU, so there may be a significant mistake in the way I've implemented the benchmark... Will dig in and report back. Also, this is not yet benchmarking the new "atomic" single-threaded performance against the old Rc performance on master; I am gearing up to that next. I am running the benchmark over the rather elaborate 2 million lines of XML I linked to in the previous comment.

dginev commented 5 years ago

Interestingly enough, I believe the slower performance comes from waiting for a Mutex lock to become available. The place where it seems most visible (the "hottest path") is Node::wrap, where we need a mutable lock over the document in order to record the currently wrapped node pointer. Safety getting in the way of performance...

The other aspect is that I am benchmarking a simple sum, which is a near-instant operation, and there is little load to split between the threads. If we make the assumption that instead each operation takes a more significant chunk of time, the multi-threaded code will appear a lot more useful.
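Schematically, the hot path looks something like the sketch below (names and bookkeeping are simplified, not the actual crate code): every node the traversal touches goes through wrap, and wrap serializes all threads on the one document mutex.

use std::collections::HashSet;
use std::ffi::c_void;
use std::sync::{Arc, Mutex};

type XmlNodePtr = *mut c_void; // stand-in for libxml2's xmlNodePtr

struct _Document {
    // Bookkeeping of every pointer handed out, needed so the C tree
    // can be deallocated exactly once on Drop.
    wrapped: HashSet<usize>,
}

#[derive(Clone)]
pub struct Document(Arc<Mutex<_Document>>);

pub struct Node {
    node_ptr: XmlNodePtr,
    document: Document,
}

impl Node {
    fn wrap(node_ptr: XmlNodePtr, document: &Document) -> Node {
        // Every traversal step pays for this exclusive lock, which is where
        // the parallel threads end up waiting on each other.
        let mut doc = document.0.lock().unwrap();
        doc.wrapped.insert(node_ptr as usize);
        Node { node_ptr, document: document.clone() }
    }
}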

triptec commented 5 years ago

I might find some time this weekend

dginev commented 5 years ago

Thanks! I'll do some more iterations on the PR until then. Out of curiosity I reran my rayon benchmark with 32, 16, 8, 4, 2 and 1 threads, and recorded the results in my last commit. It's a very explicit way to confirm that adding more threads adds more overhead, and my best estimate continues to be that the document `.lock()` in Node::wrap is the main culprit for the waiting overhead.

One may wonder if we really need a mutex for threading within the same Document object, but omitting the mutex leads to an immediate runtime panic as the RefCell holding the document gets mutably borrowed multiple times, once by each of the node threads.
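The failure mode without the mutex is the usual RefCell one; a minimal reproduction of the panic, independent of libxml:

use std::cell::RefCell;

fn main() {
    let document = RefCell::new(Vec::<usize>::new());

    // The first mutable borrow is still alive...
    let _first = document.borrow_mut();

    // ...so the second one panics at runtime with
    // "already borrowed: BorrowMutError" -- the same class of failure
    // several node threads trigger when they all hit the document at once.
    let _second = document.borrow_mut();
}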

So, we either need a redesign that makes Node::wrap lock-free (w.r.t document), or this PR will be useful only for jobs which have very significant thread-internal tasks, for which paying the overhead is still worth the thread speedup.

dginev commented 5 years ago

Commenting out the two uses of document in Node::wrap (and thus losing memory safety!), I can confirm we're back to sane multi-threading territory, where more threads = faster performance:

single thread DFS count time:   [766.91 ms 768.93 ms 771.42 ms]                                  
                        change: [-46.972% -46.786% -46.604%] (p = 0.00 < 0.05)
                        Performance has improved.

single thread DFS count+length                                                                          
                        time:   [1.1354 s 1.1364 s 1.1374 s]
                        change: [-34.547% -34.357% -34.194%] (p = 0.00 < 0.05)
                        Performance has improved.

multi thread DFS count  time:   [692.04 ms 693.98 ms 695.69 ms]                                 
                        change: [-55.673% -55.559% -55.446%] (p = 0.00 < 0.05)
                        Performance has improved.

multi thread DFS count+length                                                                          
                        time:   [687.61 ms 688.83 ms 690.10 ms]
                        change: [-61.991% -61.896% -61.801%] (p = 0.00 < 0.05)
                        Performance has improved.

But even for classic single-threaded use of the wrapper, with no rayon participation at all, one can spot a 30-40% speedup on this particular benchmark. Food for thought under the "price of safety" heading.

dginev commented 5 years ago

Some thoughts while I continue to think on the data gathered here. The main reason for me starting this PR is that I have a collection of over 1 million HTML files that I need to very quickly iterate over and perform various tasks on. The task I am running this week is in fact very similar to the benchmark in this PR -- I do a traversal of the nodes, and record statistics on which node/attribute/value triples are in use, which then inform external tool development.

So my initial question was: can I speed a traversal up, knowing that I already have all of the node data eagerly loaded in memory (in C's libxml2), and knowing I have a read-only use case where mutability was never a need? And it seems like the answer is nuanced, since we pay in runtime performance for certain aspects of safety... Then again, the bookkeeping scheme was designed in a single-threaded context, and may admit a logical refactor that is both safe and fast in multi-threaded use... Hm...

Edit 1:

dginev commented 5 years ago

So, I pushed a new branch with an entirely new experiment that implements my last comment.

A good place to read into the branch is here

The two ideas are that 1) each mutable use of a Node pointer enforces a document lock, and 2) Nodes just wrap over their data, with the only Arc<Node> living in the Document bookkeeping, which is still vital for proper memory deallocation on Drop.
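One way to read that design, heavily simplified and with made-up signatures (the real branch keeps the bookkeeping richer than a Vec):

use std::ffi::c_void;
use std::sync::{Arc, Mutex};

type XmlNodePtr = *mut c_void; // stand-in for libxml2's xmlNodePtr

// Idea 2: a Node is just its pointer again; no Rc, no RefCell, no per-node lock.
#[derive(Clone)]
pub struct Node(XmlNodePtr);

// The only Arc lives in the document bookkeeping, which still has to know
// about every wrapped node so the C tree can be freed exactly once on Drop.
pub struct Document {
    bookkeeping: Mutex<Vec<Arc<Node>>>,
}

impl Node {
    // Idea 1: mutating calls take the document lock for their duration,
    // while read-only accessors never touch it.
    pub fn set_attribute(&mut self, doc: &Document, _key: &str, _value: &str) {
        let _guard = doc.bookkeeping.lock().unwrap();
        // ...call into libxml2 through self.0 while holding the guard...
    }
}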

What I lost as a feature in that branch: if you create two Rust Node objects that point to the same underlying C node and call mutating methods on both of them, Rust will no longer recognize this as a mutability conflict at runtime. This is because I had to go back (for now) to the simple wrappers we had before the current Rc<RefCell<...>>, in order to drop all the locking overhead and compare runtimes.

Ok, so benchmarking this lighter approach of bookkeeping/locking-on-mut, on the same examples as above, yields:

single thread DFS count time:   [459.57 ms 459.82 ms 460.33 ms]                                  

single thread DFS count+length                                                                          
                        time:   [710.99 ms 713.95 ms 717.42 ms]

multi thread DFS count  time:   [636.26 ms 637.30 ms 638.78 ms]                                 

multi thread DFS count+length                                                                          
                        time:   [619.51 ms 620.84 ms 622.04 ms]

Here the multi-threaded version already overtakes the single-threaded one for the count+length payload, which is near-trivial and suggests the gains will be significantly more visible in real-world workloads. More importantly, times are nearly 2x faster than the first naive implementation in this PR, which is quite considerable...

I think I will sleep on the current state, and see if I can recover the mutability_guards test without big runtime penalties... Interesting!

triptec commented 5 years ago

I haven't had the time to look at those PRs yet, but I've been thinking about it a little and wonder: a) do we optimize for reading or writing, and considering that, b) are there any crates with other primitives than RwLock and Mutex that might be a better fit for the problem? I'm thinking of parking_lot, crossbeam and others I haven't heard of. I've never used any multithreaded things other than crates that might use them under the covers, so yeah.

dginev commented 5 years ago

Certainly a deep and open-ended rabbit hole, since this is taking us beyond the simple wrapper over libxml2, and we can take the crate in several different directions (and make several different kinds of mistakes...).

The ideal goal would be that you only pay for safety when you are doing something unsafe: clean reads and orderly mutation work with no associated runtime cost until you make a mistake, or, in the next best formulation, until you are at a mistake-prone piece of logic.

To make the above point more specific: any changes to the tree have to be guarded, but if I were designing a library from scratch, I would probably make it possible to add guards at different levels of granularity. Examples of what I would have in mind (in a perfect world):

Then a tree with N subtrees could have N threads do simultaneous localized changes in essentially constant time. And since I have already thought of that, any solution I arrive at for our current, very humble wrapper crate feels like sour grapes: we either end up paying various unnecessary overheads due to the "retroactive patching" design, or we lose safety in unpleasant ways (ways we even already test for).

So back to humble goals, I would summarize as:

I am starting to feel somewhat comfortable with the compromise solution on the new branch I made, but will still need at least a week of thinking and tinkering to feel "solid" about it ...

As to using different synchronization primitives: maybe that helps us get closer to the "good design" goals? I will read a little into the alternative crates out there; I also haven't spent serious time studying the ecosystem.
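As one data point, parking_lot's RwLock would fit the read-heavy traversal profile discussed above: readers don't block each other, writers still serialize, and the locks aren't poisoned, so there is no unwrap() on every access. A minimal sketch (not a design commitment; the Bookkeeping struct is made up for the example):

use parking_lot::RwLock;
use std::sync::Arc;

// Stand-in for whatever per-document state actually needs guarding.
struct Bookkeeping {
    wrapped_nodes: Vec<usize>,
}

fn main() {
    let doc = Arc::new(RwLock::new(Bookkeeping { wrapped_nodes: Vec::new() }));

    // Any number of readers can hold the lock simultaneously...
    {
        let read_guard = doc.read();
        println!("{} nodes wrapped so far", read_guard.wrapped_nodes.len());
    }

    // ...while a mutation takes the exclusive write lock. No .unwrap():
    // parking_lot locks do not use poisoning.
    doc.write().wrapped_nodes.push(42);
}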

dginev commented 5 years ago

Closing this experiment in favor of #53; I will postpone thinking about parallel processing with mutability for a while longer (until someone comes along with a great solution!).