KWARC / rust-libxml

Rust wrapper for libxml2
https://crates.io/crates/libxml
MIT License

Thread-safe via `Arc<Mutex<...>>` #47

Closed dginev closed 5 years ago

dginev commented 5 years ago

First stab at thread safety ( #46 ).

I would be quite lucky if you have some time to review and think about the general problem, @triptec, or if you know of other crates that may have solved a similar issue nicely, so that we can borrow an approach from them.

In this PR I added the "best practice" from the official Rust book, which is to hide the libxml node pointer behind even more indirection + guarantees:

type NodeRef = Arc<Mutex<_Node>>;

And you could still pretend this is a regular Rust data structure and call node.set_attribute(...) or similar on a mutable node, without ever realizing there is an atomic reference counter and a mutex behind the Node struct. In theory this enables parallel traversals / mutations such as the mock test added here, to the tune of:

use rayon::prelude::*;
// ...
nodes.into_par_iter().for_each(|mut node| {
    node.set_attribute("key", "val");
});

and get the parallel processing for free via the rayon iterator, while still remaining safe for mutation thanks to the Arc<Mutex<...>>.
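For illustration, the wrapper side of that would look roughly like the sketch below; `_Node` and the attribute map are simplified stand-ins for the struct that actually owns the libxml2 pointer, so treat this as a sketch rather than the crate's code:

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Simplified stand-in for the struct that owns the raw libxml2 pointer;
// attributes are modeled as a plain map just to make the sketch runnable.
#[derive(Debug)]
struct _Node {
    attributes: HashMap<String, String>,
}

type NodeRef = Arc<Mutex<_Node>>;

#[derive(Clone, Debug)]
pub struct Node(NodeRef);

impl Node {
    pub fn set_attribute(&mut self, key: &str, value: &str) {
        // The lock is taken inside the method, so the caller never sees
        // the Arc<Mutex<...>> plumbing.
        let mut inner = self.0.lock().unwrap();
        inner.attributes.insert(key.to_owned(), value.to_owned());
    }
}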

The two main bits that are worrying me are:

  1. This loses performance for single-threaded applications, and I assume most crate users will focus on exactly that classic application type. Maybe it makes sense to have a separate set of parallel structs, such as NodeAtomic and DocumentAtomic, which implement Send+Sync, instead of overriding the basic structs, so that single-threaded users are not impacted (a sketch of that split follows below)... but the boilerplate will become huge if I try to maintain two implementations...
  2. Sanity. I am quite troubled that I had to manually "claim" the final structs are Sync+Send via:

     #[derive(Clone, Debug)]
     pub struct Node(NodeRef);

     unsafe impl Sync for Node {}
     unsafe impl Send for Node {}



I would sleep much better if Rust could auto-derive those, based on the `Arc<Mutex<...>>` that now wraps the inner node data, but I am unsure whether I am missing a piece needed to achieve that, or whether that is simply not possible for custom structs... Feedback welcome.
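To make the NodeAtomic idea from point 1 concrete, here is roughly the shape such a split could take; `_Node` is a stand-in for the struct holding the raw libxml2 pointer, so this is a sketch rather than anything in the crate today:

use std::cell::RefCell;
use std::rc::Rc;
use std::sync::{Arc, Mutex};

struct _Node(*mut std::ffi::c_void); // stand-in for the raw libxml2 pointer

// Single-threaded wrapper: no atomics, no lock, same cost profile as master.
#[derive(Clone)]
pub struct Node(Rc<RefCell<_Node>>);

// Opt-in parallel wrapper: pays for Arc + Mutex only when explicitly chosen.
#[derive(Clone)]
pub struct NodeAtomic(Arc<Mutex<_Node>>);

unsafe impl Send for NodeAtomic {}
unsafe impl Sync for NodeAtomic {}

The downside is exactly the boilerplate worry above: every method would have to exist twice, or be generated by a macro, to keep the two wrappers in sync.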

P.S. The rayon test, and the rayon dependency, are here just for example purposes; I would remove both before merging any version of this PR, as I don't think rayon should be a dependency of the generic libxml crate.
dginev commented 5 years ago

If I understand the Rust book correctly: https://doc.rust-lang.org/nightly/book/ch19-01-unsafe-rust.html#implementing-an-unsafe-trait

If we have a raw pointer, then no matter how many layers of guarantees we wrap around it, the compiler can never verify that the type is Send+Sync the way it can for purely native types. Thus the onus is on the module author to make the claim manually via `unsafe impl Send`/`unsafe impl Sync for X {}`, as I have done here.
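Spelled out as a self-contained example (the assert_send_sync helper is made up for this illustration, and `_Node` stands in for the pointer-holding struct):

use std::ffi::c_void;
use std::sync::{Arc, Mutex};

// Stand-in for the libxml2 node pointer.
struct _Node(*mut c_void);

// Mutex<T> is Send + Sync only when T: Send, and raw pointers are neither
// Send nor Sync, so the compiler will not auto-derive either trait here.
pub struct Node(Arc<Mutex<_Node>>);

// The claim has to be made by hand; the compiler cannot know whether
// libxml2 tolerates this particular pointer being touched from another thread.
unsafe impl Send for Node {}
unsafe impl Sync for Node {}

// Compile-time check (a helper made up for this example) that the claim holds.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    assert_send_sync::<Node>();
}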

In which case the main consideration becomes: how do we stay lean on performance?

dginev commented 5 years ago

Got some encouragement (on the Rust Discord server) that atomics aren't felt that strongly in single-threaded contexts, so maybe this is a sensible approach after all... Will run a benchmark on some large XML snippet to see if any visible slowdown occurs ...
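For reference, the kind of Criterion harness I have in mind looks roughly like the following; the file path is a placeholder and the traversal is a bare-bones DFS node count, not the exact benchmark that ends up in the repo:

use criterion::{criterion_group, criterion_main, Criterion};
use libxml::parser::Parser;

// Placeholder path; the real benchmark uses a large (~2 million line) XML file.
const BENCH_FILE: &str = "tests/resources/big.xml";

fn single_thread_dfs(c: &mut Criterion) {
    let parser = Parser::default();
    let doc = parser.parse_file(BENCH_FILE).unwrap();

    c.bench_function("single thread", |b| {
        b.iter(|| {
            // Iterative depth-first traversal, counting every node we visit.
            let mut stack = vec![doc.get_root_element().unwrap()];
            let mut count = 0usize;
            while let Some(node) = stack.pop() {
                count += 1;
                stack.extend(node.get_child_nodes());
            }
            count
        })
    });
}

criterion_group!(benches, single_thread_dfs);
criterion_main!(benches);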

dginev commented 5 years ago

I had a sneaking suspicion this would be the case - a naive benchmark shows a parallel iterator DFS to be much slower than the single-threaded version. I pushed the example at b2fff0d2428c55b81ea53bddd6939894e45ef3f9

and the first benchmark results with Criterion were:

     Running target/release/deps/parsing_benchmarks-e718803c0af36e8a
single thread           time:   [1.4070 s 1.4151 s 1.4187 s]                           

multi thread            time:   [4.6913 s 4.6964 s 4.7012 s]                          

This is particularly shocking since I have 32 logical threads available on my Threadripper 1950X CPU, so there may be a significant mistake in the way I've implemented the benchmark... Will dig in and report back. Also, this is not yet benchmarking the new "atomic" single-threaded performance against the old Rc performance on master; I am gearing up to that next. I am running the benchmark over the rather elaborate 2 million lines of XML I linked to in the previous comment.

dginev commented 5 years ago

Interestingly enough, I believe the slower performance comes from waiting for a Mutex lock to become available. The place where it seems most visible (the "hottest path") is Node::wrap, where we need a mutable lock over the document in order to record the currently wrapped node pointer. Safety getting in the way of performance...

The other aspect is that I am benchmarking a simple sum, which is a near-instant operation, and there is little load to split between the threads. If we make the assumption that instead each operation takes a more significant chunk of time, the multi-threaded code will appear a lot more useful.
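Schematically, the hot path looks something like the sketch below (names and bookkeeping are simplified, not the actual crate code): every node the traversal touches goes through wrap, and wrap serializes all threads on the one document mutex.

use std::collections::HashSet;
use std::ffi::c_void;
use std::sync::{Arc, Mutex};

type XmlNodePtr = *mut c_void; // stand-in for libxml2's xmlNodePtr

struct _Document {
    // Bookkeeping of every pointer handed out, needed so the C tree
    // can be deallocated exactly once on Drop.
    wrapped: HashSet<usize>,
}

#[derive(Clone)]
pub struct Document(Arc<Mutex<_Document>>);

pub struct Node {
    node_ptr: XmlNodePtr,
    document: Document,
}

impl Node {
    fn wrap(node_ptr: XmlNodePtr, document: &Document) -> Node {
        // Every traversal step pays for this exclusive lock, which is where
        // the parallel threads end up waiting on each other.
        let mut doc = document.0.lock().unwrap();
        doc.wrapped.insert(node_ptr as usize);
        Node { node_ptr, document: document.clone() }
    }
}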

triptec commented 5 years ago

I might find some time this weekend

dginev commented 5 years ago

Thanks! I'll do some more iterations on the PR until then. Out of curiosity I reran my rayon benchmark with 32, 16, 8, 4, 2 and 1 threads, and recorded the results in my last commit. It's a very explicit way to confirm that adding more threads adds more overhead, and my best estimate continues to be that the document `.lock()` in Node::wrap is the main culprit for the waiting overhead.

One may wonder if we really need a mutex for threading within the same Document object, but omitting the mutex leads to an immediate runtime panic as the RefCell holding the document gets mutably borrowed multiple times, once by each of the node threads.
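The failure mode without the mutex is the usual RefCell one; a minimal reproduction of the panic, independent of libxml:

use std::cell::RefCell;

fn main() {
    let document = RefCell::new(Vec::<usize>::new());

    // The first mutable borrow is still alive...
    let _first = document.borrow_mut();

    // ...so the second one panics at runtime with
    // "already borrowed: BorrowMutError" -- the same class of failure
    // several node threads trigger when they all hit the document at once.
    let _second = document.borrow_mut();
}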

So, we either need a redesign that makes Node::wrap lock-free (w.r.t document), or this PR will be useful only for jobs which have very significant thread-internal tasks, for which paying the overhead is still worth the thread speedup.

dginev commented 5 years ago

Commenting out the two uses of document in Node::wrap (and thus losing memory safety!), I can confirm we're back to sane multi-threading territory, where more threads = faster performance:

single thread DFS count time:   [766.91 ms 768.93 ms 771.42 ms]                                  
                        change: [-46.972% -46.786% -46.604%] (p = 0.00 < 0.05)
                        Performance has improved.

single thread DFS count+length                                                                          
                        time:   [1.1354 s 1.1364 s 1.1374 s]
                        change: [-34.547% -34.357% -34.194%] (p = 0.00 < 0.05)
                        Performance has improved.

multi thread DFS count  time:   [692.04 ms 693.98 ms 695.69 ms]                                 
                        change: [-55.673% -55.559% -55.446%] (p = 0.00 < 0.05)
                        Performance has improved.

multi thread DFS count+length                                                                          
                        time:   [687.61 ms 688.83 ms 690.10 ms]
                        change: [-61.991% -61.896% -61.801%] (p = 0.00 < 0.05)
                        Performance has improved.

But even for classic single-threaded use of the wrapper, with no rayon participation at all, one can spot a 30-40% speedup on this particular benchmark. Food for thought under the "price of safety" heading.

dginev commented 5 years ago

Some thoughts while I continue to think on the data gathered here. The main reason for me starting this PR is that I have a collection of over 1 million HTML files that I need to very quickly iterate over and perform various tasks on. The task I am running this week is in fact very similar to the benchmark in this PR -- I do a traversal of the nodes, and record statistics on which node/attribute/value triples are in use, which then inform external tool development.

So my initial question was: can I speed a traversal up, knowing that I already have all of the node data eagerly loaded in memory (in C's libxml2), and knowing I have a read-only use case where mutability was never a need? And it seems like the answer is nuanced, since we pay in runtime performance for certain aspects of safety... Then again, the bookkeeping scheme was designed in a single-threaded context, and may admit a logical refactor that is both safe and fast in multi-threaded use... Hm...

Edit 1:

dginev commented 5 years ago

So, I pushed a new branch with an entirely new experiment that implements my last comment.

A good place to read into the branch is here

The two ideas are that 1) each mutable use of a Node pointer enforces a document lock, and 2) Nodes just wrap over their data, with the only Arc<Node> living in the Document bookkeeping, which is still vital for proper memory deallocation on Drop.
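One way to read that design, heavily simplified and with made-up signatures (the real branch keeps the bookkeeping richer than a Vec):

use std::ffi::c_void;
use std::sync::{Arc, Mutex};

type XmlNodePtr = *mut c_void; // stand-in for libxml2's xmlNodePtr

// Idea 2: a Node is just its pointer again; no Rc, no RefCell, no per-node lock.
#[derive(Clone)]
pub struct Node(XmlNodePtr);

// The only Arc lives in the document bookkeeping, which still has to know
// about every wrapped node so the C tree can be freed exactly once on Drop.
pub struct Document {
    bookkeeping: Mutex<Vec<Arc<Node>>>,
}

impl Node {
    // Idea 1: mutating calls take the document lock for their duration,
    // while read-only accessors never touch it.
    pub fn set_attribute(&mut self, doc: &Document, _key: &str, _value: &str) {
        let _guard = doc.bookkeeping.lock().unwrap();
        // ...call into libxml2 through self.0 while holding the guard...
    }
}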

What I lost as a feature in that branch: if you create two Rust Node objects that point to the same underlying C node and call mutating methods on both of them, Rust will no longer recognize this as a mutability conflict at runtime. This is because I had to go back (for now) to the simple wrappers we had before the current Rc<RefCell<...>>, in order to drop all the locking overhead and compare runtimes.

Ok, so benchmarking this lighter approach of bookkeeping/locking-on-mut, on the same examples as above, yields:

single thread DFS count time:   [459.57 ms 459.82 ms 460.33 ms]                                  

single thread DFS count+length                                                                          
                        time:   [710.99 ms 713.95 ms 717.42 ms]

multi thread DFS count  time:   [636.26 ms 637.30 ms 638.78 ms]                                 

multi thread DFS count+length                                                                          
                        time:   [619.51 ms 620.84 ms 622.04 ms]

Here the multi-threaded version already overtakes the single-threaded one for the count+length payload, which is near-trivial and suggests the gains will be significantly more visible in real-world workloads. More importantly, times are nearly 2x faster than the first naive implementation in this PR, which is quite considerable...

I think I will sleep on the current state, and see if I can recover the mutability_guards test without big runtime penalties... Interesting!

triptec commented 5 years ago

I haven't had the time to look at those PRs yet, but I've been thinking about it a little and wonder: a) do we optimize for reading or writing, and considering that, b) are there any crates with other primitives than RwLock and Mutex that might be a better fit for the problem? I'm thinking of parking_lot, crossbeam and others I haven't heard of. I've never used any multithreaded things other than crates that might use them under the covers, so yeah.

dginev commented 5 years ago

Certainly a deep and open-ended rabbit hole, since this is taking us beyond the simple wrapper over libxml2, and we can take the crate in several different directions (and make several different kinds of mistakes...).

The ideal goal would be that you only pay for safety when you are doing something unsafe: clean reads and orderly mutation work with no associated runtime cost until you make a mistake, or, in the next best formulation, until you are at a mistake-prone piece of logic.

To make the above point more specific: any changes to the tree have to be guarded, but if I were designing a library from scratch, I would probably make it possible to add guards at different levels of granularity. Examples of what I would have in mind (in a perfect world):

Then a tree with N subtrees could have N threads do simultaneous localized changes in essentially constant time. And since I have already thought of that, any solution I arrive at for our current, very humble wrapper crate feels like sour grapes: we either end up paying various unnecessary overheads due to the "retroactive patching" design, or we lose safety in unpleasant ways (ways we even already test for).

So back to humble goals, I would summarize as:

I am starting to feel somewhat comfortable with the compromise solution on the new branch I made, but will still need at least a week of thinking and tinkering to feel "solid" about it ...

As to using different synchronization primitives: maybe that helps us get closer to the "good design" goals? I will read a little into the alternative crates out there; I also haven't spent serious time studying the ecosystem.
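As one data point, parking_lot's RwLock would fit the read-heavy traversal profile discussed above: readers don't block each other, writers still serialize, and the locks aren't poisoned, so there is no unwrap() on every access. A minimal sketch (not a design commitment; the Bookkeeping struct is made up for the example):

use parking_lot::RwLock;
use std::sync::Arc;

// Stand-in for whatever per-document state actually needs guarding.
struct Bookkeeping {
    wrapped_nodes: Vec<usize>,
}

fn main() {
    let doc = Arc::new(RwLock::new(Bookkeeping { wrapped_nodes: Vec::new() }));

    // Any number of readers can hold the lock simultaneously...
    {
        let read_guard = doc.read();
        println!("{} nodes wrapped so far", read_guard.wrapped_nodes.len());
    }

    // ...while a mutation takes the exclusive write lock. No .unwrap():
    // parking_lot locks do not use poisoning.
    doc.write().wrapped_nodes.push(42);
}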

dginev commented 5 years ago

Closing this experiment in favor of #53; I will postpone thinking about parallel processing with mutability for a while longer (until someone comes along with a great solution!).