Closed: dginev closed this issue 5 years ago
For posterity, here is my conclusion from #47 and #53.
I have become convinced that retrofitting parallel mutability onto libxml2 is very likely more trouble than it is worth. Since the entire document tree is eagerly loaded in memory, being truly safe in Rust would require recreating the entire tree as a set of lockable exclusion rules or, to keep it simple, locking the entire document on each mutable operation.
As I commented in #47, naively locking the document leaves threads blocked on the lock for long stretches (starvation) at any non-trivial thread count, and the overhead adds up to performance worse than the single-threaded approach on the master branch.
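To make the cost of the "lock the entire document" option concrete, here is a minimal sketch using only the standard library. `Doc` and `append_with_global_lock` are hypothetical stand-ins, not types or functions from the wrapper; the point is only that every mutation funnels through a single `Mutex`, so adding threads adds contention rather than throughput.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for a mutable document (NOT the wrapper's `Document`).
struct Doc {
    nodes: Vec<String>,
}

// Each thread must take the one document-wide lock before every mutation,
// so the appends execute one at a time regardless of the thread count.
fn append_with_global_lock(threads: usize, per_thread: usize) -> usize {
    let doc = Arc::new(Mutex::new(Doc { nodes: Vec::new() }));
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let doc = Arc::clone(&doc);
            thread::spawn(move || {
                for i in 0..per_thread {
                    // All other threads block here while one holds the lock.
                    let mut guard = doc.lock().unwrap();
                    guard.nodes.push(format!("t{}-n{}", t, i));
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Bind the length first so the lock guard is dropped before returning.
    let len = doc.lock().unwrap().nodes.len();
    len
}
```

Calling `append_with_global_lock(4, 100)` yields a document with 400 nodes, but the work inside the critical section is fully serialized, which is the behaviour described above.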
So, at least for the moment, I am happy to think of single-threaded mutable `Node` and `Document`, which can't be used in parallel code, together with the parallelization-friendly read-only primitive `RoNode`, which allows for multi-threaded scanning.
Note also that a visible performance boost shows up only when processing huge batches of documents or very large individual documents, so it won't matter for more casual applications.
On 32 logical threads (Threadripper 1950X), using `RoNode` has shown a 25-fold speedup on scanning a single 112 MB XML file, as well as a 20-fold speedup on scanning 1.2 million HTML files (see https://github.com/KWARC/llamapun/issues/28), down to only 2 hours from almost 2 days.
That said, since `0.2.10` the wrapper is thread-friendly for all read-only use cases, so basic thread safety has definitely been achieved.
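As a rough illustration of the batch-scanning case, the sketch below splits a batch of documents across scoped threads using only the standard library. It is not the llamapun code: `scan` is a hypothetical placeholder for real read-only work (it just counts `<` characters), and `scan_batch` is an assumed name.

```rust
use std::thread;

// Hypothetical per-document scan: counts '<' characters as a crude tag count.
fn scan(doc: &str) -> usize {
    doc.matches('<').count()
}

// Split the batch into roughly equal chunks and scan each chunk on its own
// scoped thread. The documents are only read, so no locking is needed.
fn scan_batch(docs: &[String], workers: usize) -> usize {
    let workers = workers.max(1);
    let chunk = ((docs.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = docs
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(|d| scan(d)).sum::<usize>()))
            .collect();
        // Combine the per-thread partial results.
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<usize>()
    })
}
```

With tiny inputs like these, thread startup dominates; consistent with the note above, a gain should only be expected on very large batches or very large documents.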
Follow-up to #20, which didn't get too far. It would be great if we could have `Sync` nodes, so that we could e.g. do a rayon parallel iteration over a node's children, dramatically speeding up various read-only traversals.
What I don't currently understand is whether we must pay the price of an `Arc<Mutex<...>>` wrapper, or whether we could satisfy the guarantees in a lighter fashion. It would be great if I could find a different Rust wrapper that already takes care of concurrency and learn from its implementation of `Send + Sync` guarantees for the wrapped pointers...
My current goal is to ensure thread safety within a single document, which should be substantially simpler than ensuring multi-document thread safety. And just getting the read-only guarantees would still be a decent win, since that's most of what one needs in a huge class of applications (e.g. reading in config files, parsing statistics...)
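One lighter-weight alternative to `Arc<Mutex<...>>` for the read-only case is a `Copy` newtype over the raw pointer with manual `Send` and `Sync` impls. The sketch below is only an illustration under a stated assumption: nothing mutates or frees the tree while the handles are in use. `RawNode`, `RoPtr`, `node`, `build_tree` and `parallel_count` are all hypothetical names, not the wrapper's actual implementation, and scoped threads from the standard library stand in for rayon.

```rust
use std::ptr;
use std::thread;

// Hypothetical C-style node, standing in for a tree reached via raw pointers.
struct RawNode {
    first_child: *const RawNode,
    next_sibling: *const RawNode,
}

// Read-only handle: a bare pointer copy, no reference counting, no lock.
#[derive(Clone, Copy)]
struct RoPtr(*const RawNode);

// SAFETY (assumption of this sketch): the tree is never mutated or freed
// while any RoPtr exists, so concurrent reads cannot race.
unsafe impl Send for RoPtr {}
unsafe impl Sync for RoPtr {}

impl RoPtr {
    // Collect the direct children by walking the sibling chain.
    fn children(self) -> Vec<RoPtr> {
        let mut out = Vec::new();
        let mut cur = unsafe { (*self.0).first_child };
        while !cur.is_null() {
            out.push(RoPtr(cur));
            cur = unsafe { (*cur).next_sibling };
        }
        out
    }

    // Sequential count of this node plus all of its descendants.
    fn count(self) -> usize {
        1 + self.children().into_iter().map(|c| c.count()).sum::<usize>()
    }
}

// Count the nodes of each child subtree on its own scoped thread.
fn parallel_count(root: RoPtr) -> usize {
    let kids = root.children();
    thread::scope(|s| {
        let handles: Vec<_> = kids
            .iter()
            .map(|&k| s.spawn(move || k.count()))
            .collect();
        1 + handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    })
}

// Allocate a node on the heap; intentionally leaked to keep the sketch short.
fn node(first_child: *const RawNode, next_sibling: *const RawNode) -> *const RawNode {
    Box::into_raw(Box::new(RawNode { first_child, next_sibling })) as *const RawNode
}

// Root with `width` children, each of which has `width` leaf children.
fn build_tree(width: usize) -> RoPtr {
    let mut child = ptr::null::<RawNode>();
    for _ in 0..width {
        let mut leaf = ptr::null::<RawNode>();
        for _ in 0..width {
            leaf = node(ptr::null(), leaf);
        }
        child = node(leaf, child);
    }
    RoPtr(node(child, ptr::null()))
}
```

The `unsafe impl` lines are where the burden of proof moves from the compiler to the author: the handle is sound only as long as the no-mutation assumption actually holds, which is why this approach fits read-only traversal and not mutable nodes.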