bokuweb / docx-rs

:memo: A .docx file writer with Rust/WebAssembly.
https://bokuweb.github.io/docx-rs/
MIT License
333 stars 57 forks source link

quick_xml has better performance(around 50 times faster) than xml-rs #598

Open sxhxliang opened 1 year ago

sxhxliang commented 1 year ago

Is your feature request related to a problem? Please describe.

On a particular file, quick-xml is around 50 times faster than xml-rs crate. https://github.com/tafia/quick-xml#performance-1

A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like

A clear and concise description of what you want to happen.

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Add any other context or screenshots about the feature request here.

git-noise commented 1 month ago

I had a quick look, and it does not seem to be a simple drop-in replacement as both crates use a slightly different approach. Also XML-related code quickly becomes interdependent, as by nature there is a lot of nested calls.

It seems to represent a significant rework on the crate, which potential side-effect difficult to isolate. To give you an example, migrating comment_extended and comments_extended to quick-xml gave me some failing tests due to some issues with comment and comments.

I had say it would probably require writing a lot more tests, and maybe introduce a gate feature.

@bokuweb let me know if you think it is worth it.

git-noise commented 1 month ago

Hello again, I had a closer look and it seems we could use the serde capabilities of quick-xml to simplify part of the code. It will however probably be a significant/big change that will impact the crate, as it means essentially overhauling the whole serialization/deserialization process.

@bokuweb please let me know if you are open to such a change.

bokuweb commented 1 month ago

@git-noise Hello. I am highly motivated to switch to quick-xml and to reduce the code, but I am concerned about breaking changes. If the changes can be minimized, I would definitely consider it.

git-noise commented 1 month ago

What I can do is showcase the migration gains/trade-offs on a few elements on the reader-side. Some of them seem relatively "isolated" enough that they could be good candidates. I'll put that into a branch on my fork and you can then see if it goes in the right direction or not. We can then maybe decide on /design a more comprehensive action plan?

Would that work?

git-noise commented 1 month ago

@bokuweb

Hello,

I migrated some of the code - here https://github.com/git-noise/docx-rs/tree/quick-xml - you can see some initial work done on:

They were picked up a bit randomly because they're fairly independent and could easily be isolated. For now the reader logic - so essentially deserialization - relies on one additional Trait FromXMLQuickXml which is the pendant of FromXML for quick-xml

==== TLDR:

==== In more details - please note that this is based on the current code changes, and it may not hold once we migrate more modules:

Pros:

  1. The main point, is that it significantly reduces the code base as there is almost no need for manual XML parsing:
    • parse_xml() now relies on quick-xml serde features
    • consequently, ElementReader trait would probably not be needed anymore.
    • some cases require some manual parsing adjustments, but they can be handled with simpler macro.
  2. Most of the public-API seems to remains the same, which should limit breaking changes. I added some more tests, but the existing ones still pass successfully, which hints that there should not be any regressions.
  3. Maybe performance, but until we actually migrate the code-base, it may be difficult to quantify.

Cons:

  1. It is a significant change/workload and more tests would need to be written to catch regressions.
  2. It may not be the biggest issue, but MSRV for quick-xml seems to be 1.61 - I see clippy currently runs with 1.56 on docx-rs.
  3. A bigger concern will be a discrepancy between the JSON-serialization and the XML-serialization patterns. While for now I concentrated on the reader-side, a benefit down the road would be to tackle the writer-side (so the XML-serialization). To give you an example, right now Extended Comments, serialize to extendedComments in JSON and to exComments in XML (as per OOXML convention). There are also differences on how internal/children fields are represented. -> I think this would represent a breaking change for people using the JSON representation for docx files
git-noise commented 2 weeks ago

Hello, @bokuweb, just wanted to check if you had time to review this one