If I understand the Rust book correctly (https://doc.rust-lang.org/nightly/book/ch19-01-unsafe-rust.html#implementing-an-unsafe-trait): if we have a raw pointer, then irrespective of how many layers of guarantees we wrap it with, the compiler can never verify that it is Send+Sync the way it can for types composed purely of native Rust types. Thus the onus is on the module author to manually make the claim via an `unsafe impl Send`/`Sync for X {}`, as I have done here.
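For reference, the pattern looks roughly like this (a minimal sketch with hypothetical names, not the crate's actual types):

```rust
use std::os::raw::c_void;

// Hypothetical wrapper over a raw pointer obtained from a C library;
// because of the raw pointer, Send and Sync are not derived automatically.
pub struct RawNode {
    node_ptr: *mut c_void,
}

// The compiler cannot verify these claims, so the crate author asserts them,
// promising that any cross-thread access is synchronized elsewhere
// (e.g. behind an Arc<Mutex<...>>).
unsafe impl Send for RawNode {}
unsafe impl Sync for RawNode {}
```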
In which case, the main consideration is how to keep the performance overhead low.
Got some encouragement (on the Rust Discord server) that the cost of atomics isn't felt that strongly in single-threaded contexts, so maybe this is a sensible approach after all... I will run a benchmark on some large XML snippet to see if any visible slowdown occurs.
I had a sneaking suspicion this would be the case - a naive benchmark shows a parallel iterator DFS to be much slower than the single-threaded version. I pushed the example at b2fff0d2428c55b81ea53bddd6939894e45ef3f9
and the first benchmark results with Criterion were:
```
Running target/release/deps/parsing_benchmarks-e718803c0af36e8a
single thread    time:   [1.4070 s 1.4151 s 1.4187 s]
multi thread     time:   [4.6913 s 4.6964 s 4.7012 s]
```
This is particularly shocking since I have 32 logical threads available on my 1950X Threadripper CPU, so there may be a significant mistake in the way I've implemented the benchmark... Will dig in and report back. Also, this is not yet benchmarking the new "atomic" single-threaded performance against the old `Rc` performance of master; gearing up to that next. I am running the benchmark over the rather elaborate 2 million lines of XML I linked to in the previous comment.
Interestingly enough, I believe the slower performance comes from waiting for a Mutex lock to become available. The place where it seems most visible (the "hottest path") is `Node::wrap`, where we need a mutable lock over the document, to record the currently wrapped node pointer. Safety getting in the way of performance...
The other aspect is that I am benchmarking a simple sum, which is a near-instant operation, so there is little load to split between the threads. If each operation instead took a more significant chunk of time, the multi-threaded code would look a lot more useful.
I might find some time this weekend
Thanks! I'll do some more iterations on the PR until then. Out of curiosity I reran my rayon benchmark with 32, 16, 8, 4, 2 and 1 threads, and recorded the results in my last commit. It's a very explicit way to confirm that adding more threads adds more overhead, and my best estimate continues to be that the document `.lock` in `Node::wrap` is the main culprit for the waiting overhead.
One may wonder if we really need a mutex for threading within the same `Document` object, but omitting the mutex leads to an immediate runtime panic, as the `RefCell` holding the document gets mutably borrowed multiple times, once by each of the node threads. So, we either need a redesign that makes `Node::wrap` lock-free (w.r.t. the document), or this PR will only be useful for jobs with very significant thread-internal tasks, for which paying the overhead is still worth the thread speedup.
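As a minimal illustration of that failure mode (a toy example, not the crate's code): `RefCell` enforces the borrow rules at runtime, so a second mutable borrow while the first is still alive panics immediately.

```rust
use std::cell::RefCell;

fn main() {
    let document = RefCell::new(vec![0u32]);

    // The first mutable borrow is fine.
    let _first = document.borrow_mut();

    // A second mutable borrow while `_first` is still alive panics at
    // runtime ("already borrowed: BorrowMutError") -- the same class of
    // failure the node threads trigger on the shared document RefCell.
    let _second = document.borrow_mut();
}
```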
Commenting out the two uses of `document` in `Node::wrap` (and thus losing memory safety!), I can confirm we're back to sane multi-threading territory, where more threads = faster performance:
```
single thread DFS count          time:   [766.91 ms 768.93 ms 771.42 ms]
                                 change: [-46.972% -46.786% -46.604%] (p = 0.00 < 0.05)
                                 Performance has improved.

single thread DFS count+length   time:   [1.1354 s 1.1364 s 1.1374 s]
                                 change: [-34.547% -34.357% -34.194%] (p = 0.00 < 0.05)
                                 Performance has improved.

multi thread DFS count           time:   [692.04 ms 693.98 ms 695.69 ms]
                                 change: [-55.673% -55.559% -55.446%] (p = 0.00 < 0.05)
                                 Performance has improved.

multi thread DFS count+length    time:   [687.61 ms 688.83 ms 690.10 ms]
                                 change: [-61.991% -61.896% -61.801%] (p = 0.00 < 0.05)
                                 Performance has improved.
```
But even in the case of single-threaded classic use of the wrapper, with no rayon participation at all, one can spot a 30-40% speedup on this particular benchmark. Food for thought under the "price of safety" heading.
Some thoughts while I continue to think on the data gathered here. The main reason for me starting this PR is that I have a collection of over 1 million HTML files that I need to very quickly iterate over and perform various tasks on. The task I am running this week is in fact very similar to the benchmark in this PR -- I do a traversal of the nodes, and record statistics on which node/attribute/value triples are in use, which then inform external tool development.
So my initial question was: can I speed a traversal up, knowing that I already have all of the node data eagerly loaded in memory (in C's libxml2), and knowing I have a read-only use case, where mutability was never a question/need? And it seems like the answer is nuanced, since we pay in runtime performance for certain aspects of safety... Then again, the bookkeeping scheme was designed in a single-threaded context, and may admit a refactor that is both safe and fast for multi-threaded use... Hm...
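For concreteness, the kind of traversal I have in mind looks roughly like the sketch below, using a stand-in node type (the fields and function names are placeholders, not the crate's API): a sequential DFS over (element, attribute, value) triples, plus a rayon variant that splits the work at the root's children.

```rust
use rayon::prelude::*;
use std::collections::HashMap;

// Stand-in node type; the real crate's Node has its own accessors,
// so treat these fields as placeholders.
struct Node {
    name: String,
    attributes: HashMap<String, String>,
    children: Vec<Node>,
}

// Sequential DFS: count (element, attribute, value) triples.
fn collect_triples(node: &Node, out: &mut HashMap<(String, String, String), u64>) {
    for (attr, value) in &node.attributes {
        *out.entry((node.name.clone(), attr.clone(), value.clone()))
            .or_insert(0) += 1;
    }
    for child in &node.children {
        collect_triples(child, out);
    }
}

// Parallel variant: split the work at the root's children and merge the
// per-thread maps. (The root's own attributes are skipped here for brevity.)
fn collect_triples_parallel(root: &Node) -> HashMap<(String, String, String), u64> {
    root.children
        .par_iter()
        .map(|child| {
            let mut local = HashMap::new();
            collect_triples(child, &mut local);
            local
        })
        .reduce(HashMap::new, |mut merged, local| {
            for (key, count) in local {
                *merged.entry(key).or_insert(0) += count;
            }
            merged
        })
}
```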
Edit 1:

- Mutating operations need to take `document.lock()` explicitly, to guarantee thread consistency. Mutating a `Node` or `Document` would thus require a lock on the entire document, as we don't have a good enough handle on the data model to be able to only lock the current subtree.
- Read-only operations on `Node`s shouldn't require the entire document to be locked, as they don't mutate it; they simply expose already existing underlying data.
- This would require `Node::node_ptr_mut` to change, in a way where it doesn't return the raw pointer, but instead accepts a closure, with the raw pointer as a single argument, and executes the mutable payload on a locked document.
- `Node::wrap` needs to be addressed separately, especially since it is the biggest source of slowdown. I think the only reasonable escape that addresses performance is to not record the node in the document when it is wrapped, but instead to record it when it is mutated or dropped. In effect, I am thinking to wrap an `Arc<>` around the raw pointer, or something to that degree... and do the bookkeeping of the Node memory exclusively on `Drop`, locking the Document when nodes drop and recording them, so that a single pass of libxml2-level `free()`-ing can occur when the document drops.

So, I pushed a new branch with an entire new experiment that implements my last comment.
A good place to read into the branch is here
The two ideas being that 1) each mutable use of a Node pointer will enforce a document lock, and 2) Nodes just wrap over their data, and the only `Arc<Node>` we need is in the Document bookkeeping, which is still vital for proper memory deallocation on Drop.
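To make idea 1) concrete, here is a rough sketch of the closure-based shape for mutable pointer access; the types and the method name `with_node_ptr_mut` are simplified stand-ins, not the branch's actual code:

```rust
use std::sync::{Arc, Mutex};

// Simplified stand-ins for the crate's types; `XmlNodePtr` plays the role
// of the bindgen-generated libxml2 pointer.
type XmlNodePtr = *mut std::ffi::c_void;

struct _Document {
    // raw document pointer, node bookkeeping, ... (elided)
}

#[derive(Clone)]
pub struct Document(Arc<Mutex<_Document>>);

pub struct Node {
    node_ptr: XmlNodePtr,
    document: Document,
}

impl Node {
    // Instead of handing out the raw pointer, every mutable use runs a
    // closure while the whole document is locked, serializing mutations
    // coming from other threads.
    fn with_node_ptr_mut<R>(&mut self, work: impl FnOnce(XmlNodePtr) -> R) -> R {
        let _guard = self.document.0.lock().expect("document lock poisoned");
        work(self.node_ptr)
    }

    pub fn remove_dummy_content(&mut self) {
        self.with_node_ptr_mut(|node_ptr| {
            // The mutating libxml2 FFI call on `node_ptr` would go here.
            let _ = node_ptr;
        });
    }
}
```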
What I lost as a feature in that branch: if you create two Rust `Node` objects that point to the same underlying C node, and call mutating methods on both of them, Rust will no longer catch this as a mutability conflict at runtime. This is the case since I had to go back (for now) to the simple wrappers we had before the current `Rc<RefCell<...>>`, in order to drop all the locking overhead and compare runtimes.
Ok, so benchmarking this lighter approach of bookkeeping/locking-on-mut, on the same examples as above, yields:
```
single thread DFS count          time:   [459.57 ms 459.82 ms 460.33 ms]
single thread DFS count+length   time:   [710.99 ms 713.95 ms 717.42 ms]
multi thread DFS count           time:   [636.26 ms 637.30 ms 638.78 ms]
multi thread DFS count+length    time:   [619.51 ms 620.84 ms 622.04 ms]
```
Here we already see the multi-threaded version overtake the single-threaded one for the `count+length` payload, which is near-trivial, indicating that the gains will be significantly more visible in real-world workloads. More importantly, the times are nearly 2x faster than the first naive implementation in this PR, which is quite considerable...
I think I will sleep on the current state, and see if I can recover the mutability_guards
test without big runtime penalties... Interesting!
I haven't had the time to look at those PRs yet, but I've been thinking about it a little and wonder a) do we optimize for reading or for writing, and considering that, b) are there any crates with other primitives than RwLock and Mutex that might be a better fit for the problem? I'm thinking of parking_lot, crossbeam and others I haven't heard of. I've never used any multithreaded things other than crates that might use them under the covers, so yeah.
Certainly a deep and open-ended rabbit hole, since this is taking us beyond the simple wrap over libxml2, and we can take the crate in several different directions (and make several different kinds of mistakes...).
The ideal goal would be that you only pay for safety when you are doing something unsafe, so that clean reads and orderly mutation work with no associated runtime cost until you make a mistake -- or, in the next best formulation, until you reach a mistake-prone piece of logic.
To make the above point more specific: any changes to the tree have to be guarded, but if I were designing a library from scratch, I would probably make it possible to add guards at different levels of granularity -- for example, per subtree rather than per document (see the sketch below). A tree with N subtrees could then have N threads doing simultaneous localized changes in essentially constant time. And since I have already thought of that, any solution I arrive at for our current very humble wrapper crate feels like sour grapes - we end up paying various unnecessary overheads due to the "retroactive patching" design, or we lose safety in unpleasant ways (ways we even already test for).
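As a purely hypothetical illustration of that granularity idea (a from-scratch design sketch, nothing like the current crate):

```rust
use std::sync::{Arc, RwLock};

// Purely hypothetical from-scratch design: the document owns N top-level
// subtrees, each behind its own lock, so N threads can mutate N different
// subtrees simultaneously without a whole-document lock.
struct Element {
    name: String,
    attributes: Vec<(String, String)>,
    children: Vec<Element>,
}

struct Document {
    subtrees: Vec<Arc<RwLock<Element>>>,
}

fn set_subtree_attribute(subtree: &Arc<RwLock<Element>>, key: &str, value: &str) {
    // Locks only this subtree; sibling subtrees remain available to other threads.
    let mut root = subtree.write().expect("subtree lock poisoned");
    root.attributes.push((key.to_string(), value.to_string()));
}
```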
So back to humble goals, I would summarize as:
I am starting to feel somewhat comfortable with the compromise solution on the new branch I made, but will still need at least a week of thinking and tinkering to feel "solid" about it ...
As to using different synchronization primitives - maybe that helps get closer to the "good design" goals? I will read a little into the alternative crates out there; I also haven't spent serious time studying that part of the ecosystem.
Closing this experiment in favor of #53, will postpone thinking of parallel processing with mutability for a while longer (until someone comes along with a great solution!)
First stab at thread safety ( #46 ).
I would be quite lucky if you have some time to review and think about the general problem, @triptec. Or if you know of other crates that may have solved a similar issue nicely, so that we can copy an approach from them.
In this PR I added the "best practice" from the official Rust book, which is to hide the libxml node pointer behind even more indirection + guarantees:
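(Reconstructed sketch of the shape of that change, with a stand-in pointer type, rather than the literal diff:)

```rust
use std::sync::{Arc, Mutex};

// Reconstructed sketch of the shape of the change, not the literal diff.
#[allow(non_camel_case_types)]
type xmlNodePtr = *mut std::ffi::c_void; // stand-in for the FFI pointer type

struct _Node {
    node_ptr: xmlNodePtr,
    // ...document back-reference and other fields elided...
}

// The raw pointer now sits behind an atomic reference counter and a mutex...
#[derive(Clone)]
pub struct Node(Arc<Mutex<_Node>>);

// ...and the Send/Sync claims are made manually, since the compiler cannot
// derive them for a type containing a raw pointer.
unsafe impl Send for Node {}
unsafe impl Sync for Node {}
```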
And you could still pretend this is a regular Rust data structure and call `node.set_attribute(...)` or similar on a mutable node, without ever realizing there is an atomic reference counter and a mutex behind the `Node` struct. In theory this enables parallel traversals / mutations such as the mock test added here (roughly of the shape sketched after the list below), where you get the parallel processing for free via the rayon iterator, while still remaining safe for mutation thanks to the `Arc<Mutex<...>>`.

The two main bits that are worrying me are:
- I introduced separate `NodeAtomic` and `DocumentAtomic` structs, which implement Send+Sync, instead of overriding the basic structs, so that single-threaded users are not impacted... but the boilerplate will become huge if I try to maintain two implementations...
- Whether the manual claims `unsafe impl Sync for Node {}` and `unsafe impl Send for Node {}` are actually justified.