ballsteve / xrust

XPath, XQuery, and XSLT for Rust
Apache License 2.0
85 stars 7 forks source link

Parser dev2 #44

Closed Devasta closed 1 year ago

Devasta commented 1 year ago

Hi Steve,

Big changes within:

Changes:

I'll be using that for profiling and trying to figure out where the bottlenecks are, I've been using flamegraph to do it. 45 megabytes isn't a big file for more peoples uses, but I think we'd get in trouble if I loaded a bigger one to github!

I intend to make smaller, more frequent changes going forward, but the replacing of the main parser was an all-or-nothing change so had to be a large chunk, and the longer I left this one go the further everything was diverging from your work. I'll get namespaces sorted next and sort the remaining conformance errors.

Let me know if you feel additional work is required to bring this PR up to scratch.

ballsteve commented 1 year ago

Hi Daniel,

That sounds like great progress!

I’ll merge my dev branch with your changes, and take a look.

Cheers, Steve

On 11 Dec 2022, at 10:33 am, Daniel Murphy @.***> wrote:

Hi Steve,

Big changes within:

Changes:

Big rewrite of the XML parser. It still uses the same strategy of parser combinators, however instead of using nom I've implemented a new set of combinators that work mostly the same; instead of a &str as input they accept a ParserInput Struct that can carry stuff like entities along with various iterators for the document it parses. The other parsers such as for xpath are still using nom, haven't looked at changing but it'd be straightforward I think. Added all remaining XML conformance tests. Lots of them still failing, will work on resolving (which will mean eventually needing to validate DTDs, which I have no idea how to implement). A new test "bigfile" included, parsing a document called "45M.xml". The contents of this file is a simple document which has the following:

bigtext

I'll be using that for profiling and trying to figure out where the bottlenecks are, I've been using flamegraph to do it. 45 megabytes isn't a big file for more peoples uses, but I think we'd get in trouble if I loaded a bigger one to github!

I intend to make smaller, more frequent changes going forward, but the replacing of the main parser was an all-or-nothing change so had to be a large chunk, and the longer I left this one go the further everything was diverging from your work. I'll get namespaces sorted next and sort the remaining conformance errors.

Let me know if you feel additional work is required to bring this PR up to scratch.

You can view, comment on, or merge this pull request online at:

https://github.com/ballsteve/xrust/pull/44

Commit Summary

0cd0f1d https://github.com/ballsteve/xrust/pull/44/commits/0cd0f1d6a32a11b65e9ce1a175580904fa769662 merging e751ef3 https://github.com/ballsteve/xrust/pull/44/commits/e751ef3dd0bcf4e1d8a68890f648cd06d17f5506 Cargo clippy fixes. File Changes (68 files https://github.com/ballsteve/xrust/pull/44/files) M src/intmuttree.rs https://github.com/ballsteve/xrust/pull/44/files#diff-d2703c6a0f2a43bbeaf5c12a1b037c5051782aa4ea82b13d2d0e26b5cc0ee46f (71) M src/lib.rs https://github.com/ballsteve/xrust/pull/44/files#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759 (4) A src/parser/combinators/alt.rs https://github.com/ballsteve/xrust/pull/44/files#diff-7c2c918fcc08bc2a4f4927eb709180d56be79950884a7e1761741721eb63628a (175) A src/parser/combinators/delimited.rs https://github.com/ballsteve/xrust/pull/44/files#diff-ba4c2b12a0b8ff5dd5a60b22851c6b0a4b707719ed304467d4abe51b2a37b0ac (40) A src/parser/combinators/expander.rs https://github.com/ballsteve/xrust/pull/44/files#diff-bb5f08d33c399071b5c0af91ee6bb76357ec2704ba462d37e3766911a2b7cdc4 (59) A src/parser/combinators/many.rs https://github.com/ballsteve/xrust/pull/44/files#diff-376217d6b8c95e5957a18a43ad23e9ccaef0cf4472694c00b544dd52d545230f (42) A src/parser/combinators/map.rs https://github.com/ballsteve/xrust/pull/44/files#diff-49a6d828b0426f9690c180ec8152228ba1d9496f080bc977f098df0ca8de8000 (13) A src/parser/combinators/mod.rs https://github.com/ballsteve/xrust/pull/44/files#diff-f5ea9c37370db5b7a93d9b3e10fb53c64a4d31f06825dedea8277967076d8f64 (14) A src/parser/combinators/none_of.rs https://github.com/ballsteve/xrust/pull/44/files#diff-ff18170a3c5d95058ddb1f8aa254996bdeaa993c655ea80d1e512d44ccc08501 (14) A src/parser/combinators/opt.rs https://github.com/ballsteve/xrust/pull/44/files#diff-ac98c42eaab971fb663092d0d6e891e20785d672e610f61d051dbab5c02ef5a2 (38) A src/parser/combinators/tag.rs https://github.com/ballsteve/xrust/pull/44/files#diff-46707d91c5951842e8181685a0db9591fd204d011762e082eea39dc1724ad183 (44) A src/parser/combinators/take.rs https://github.com/ballsteve/xrust/pull/44/files#diff-59b7427e724e23e7ad3bcd7bc0ced764891be46894b65ae8698b55ce23883783 (182) A src/parser/combinators/tuple.rs https://github.com/ballsteve/xrust/pull/44/files#diff-0f93fd53ff9e57443e4f4fcce5b976a5d5bef0dd6dd67e22ecd6386311c7e277 (417) A src/parser/combinators/validate.rs https://github.com/ballsteve/xrust/pull/44/files#diff-0751425f3af4ef473ee5dc75288bc12871ecf79c584e0d9cd8b5e9aff335e4fb (18) A src/parser/combinators/value.rs https://github.com/ballsteve/xrust/pull/44/files#diff-c0b91f7e0d153633505cd5c1e9492ec0071dbe52bcbcd50799d12eae51e130e9 (11) A src/parser/combinators/whitespace.rs https://github.com/ballsteve/xrust/pull/44/files#diff-3782636b4aff78fcbd7133e9e073c69ea6d664713cf636e9f7a05091cf28d0f1 (19) A src/parser/common.rs https://github.com/ballsteve/xrust/pull/44/files#diff-0e7c16789ef5951101f0c30cc2043b0cab47416cf5ce498b25a8961989081d94 (139) A src/parser/mod.rs https://github.com/ballsteve/xrust/pull/44/files#diff-fc6a8d66b8cb6bd48a119a70e489cbcef82e7f551e7cf08e8e972ef4e774ef49 (70) A src/parser/xml.rs https://github.com/ballsteve/xrust/pull/44/files#diff-e4f59376844bf3b3296284412d8928e2e897b8558dc281c9d0483605a913155d (902) M src/parsexml.rs https://github.com/ballsteve/xrust/pull/44/files#diff-1c1b1bf87f98907cf7542e04c8b312a87c71a315c96cf8e927f01edfc20a38d6 (49) A tests/conformance/xml/eduni_errata2e_error.rs https://github.com/ballsteve/xrust/pull/44/files#diff-aaacd693fde198bb2d9b73591030f42b0f94b4c058c894b04d4b5ab4f68b3a98 (56) A tests/conformance/xml/eduni_errata2e_invalid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-13d05f52a912d0907575118009a75819682c544d706fe8c3192c79ecffd0e61f (191) A tests/conformance/xml/eduni_errata2e_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-6e24502b3728ca97e855f0a79371071534c2f9302a56e59ab68757ae6a850090 (55) A tests/conformance/xml/eduni_errata2e_valid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-72b8e04dc1a574af93cd821cb5ef822ebc35e710b01fa248f2988f2487f957fc (283) A tests/conformance/xml/eduni_errata3e_invalid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-230c8bec57a6e3b76ac64e0365dda666150c11e550313ee20f5ea7c0757d77fc (154) A tests/conformance/xml/eduni_errata3e_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-0ad564fc8c3a4ed005184a66d5030e34ec317f30e513d5c477ef06aa6d85d054 (23) A tests/conformance/xml/eduni_errata3e_valid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-f38dee90db1e718c0379402cdb92e89d711c507cffd9f698ef0e1a5f2a1a3ec5 (55) A tests/conformance/xml/eduni_errata4e_error.rs https://github.com/ballsteve/xrust/pull/44/files#diff-6739687b7e727e81672f7dbdb63da257a72fcc425ccd66d437b4442f27bc8968 (75) A tests/conformance/xml/eduni_errata4e_invalid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-853e0c9d03245b72505861efc69b17a6d26e6ded9dc04419580575fc758d0720 (306) A tests/conformance/xml/eduni_errata4e_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-b136c7138dbdbe9ca35a5a4f39187e325503ee6faa260b422ff49e38ec94d353 (983) A tests/conformance/xml/eduni_errata4e_valid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-a7a9f599fb9cd401ec46d297e19e5f303e1b62fb96dc506538f7a6342e39b658 (4967) A tests/conformance/xml/eduni_misc_invalid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-f5af67f53904ecb63e4eb53a85b0db4864b690822d8af421641ff61b2c1e695c (39) A tests/conformance/xml/eduni_misc_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-90fe64ddd495a3f4dbef02def9d9d837c97586f59728a393bd1078c2df2db97d (119) A tests/conformance/xml/eduni_namespaces_10_error.rs https://github.com/ballsteve/xrust/pull/44/files#diff-28da3e9e3a08b12fad14f1e16516484dd161e2bfa3fe87880a577fa370abaf40 (56) A tests/conformance/xml/eduni_namespaces_10_invalid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-5e118f66a7c22a823d1604bfd9b3a3185ece17067f1f10dbaffccb3a42dd982b (280) A tests/conformance/xml/eduni_namespaces_10_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-70a7d8df513811a21100da56771b66ac65aafa9852ceacb1e701ab79ff3d7081 (343) A tests/conformance/xml/eduni_namespaces_10_valid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-cbca886afacdeeb509fcdcd809e828c77b137ceb28aaa1b33408ab06b5698090 (119) A tests/conformance/xml/eduni_namespaces_11_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-e05bd2aca6f42f760d74b752cad65da047a4ef3bb488f25b8ce2167a9de70efe (55) A tests/conformance/xml/eduni_namespaces_11_valid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-dcaebbdcebb6bdbc4f8097268082a75434cc0bdefb1133945e014d206e45e7a7 (88) A tests/conformance/xml/eduni_namespaces_errata1e_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-5a95a29121201342175481dcb3f6235df1bc053c3746272f2d7e08120ae983d6 (58) A tests/conformance/xml/eduni_xml11_error.rs https://github.com/ballsteve/xrust/pull/44/files#diff-08ca48142f5e4cc9f5e98bda5c614b697f0f8a00a248dbfbf31d83d6fce54ee9 (87) A tests/conformance/xml/eduni_xml11_invalid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-96e530177ef3a43ca7327d4389dcde57ba0f2878b395f9cc4dc0e18dd67b98b6 (183) A tests/conformance/xml/eduni_xml11_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-706f0d7f25d446b11bd7cf20b824d7702309d73dbed3a827692a5f5dbcc2e61f (263) A tests/conformance/xml/eduni_xml11_valid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-dc73add474116f8635dbb69d29ddfd3a7a4b5f6234012e0c3329717a2205280b (407) A tests/conformance/xml/ibm11_invalid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-b87d6e5a4e6be1b3545d0e2cb771d4426efed49b1583b0df8c04c0679b2952f7 (43) A tests/conformance/xml/ibm11_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-9e45ffd4d60825a5b1c35a3b671c8e9d9385f1235db1b1cb929c7e55cbb50fec (2610) A tests/conformance/xml/ibm11_valid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-edd2ef3d4eff092e0638aafa14f9c70a38ca5b1c2a3fc0836f44b7815075e381 (911) A tests/conformance/xml/ibm_invalid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-a7ce82b2267f0097c99a7e881834cff9d8ecfcc682c2c6762f10d2501ce4181f (777) A tests/conformance/xml/ibm_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-91e85983afa5478d45ce670604b5a7b27c299120ad4140d4a0d014994fc893d0 (11704) A tests/conformance/xml/ibm_valid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-47d136ffa7d95af70dfd4e951cf8269c231a8be5236f4abf683ce6c337b129d0 (3093) M tests/conformance/xml/mod.rs https://github.com/ballsteve/xrust/pull/44/files#diff-9368dd2eb1f4ca0163a582894340a1bbc580ea593d15840103c424c9f5018692 (36) M tests/conformance/xml/oasis_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-4547518bf5a8f30690f6d719f24c5e08130e7458cfcfa2a4ca286edb13ed6d89 (497) M tests/conformance/xml/oasis_valid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-b678fc86e03ac6079e016d91cccc439ef1ae7a922bb347e263eab3b73bdc5f0c (94) A tests/conformance/xml/sun_error.rs https://github.com/ballsteve/xrust/pull/44/files#diff-685b0a2106c6514df73dd01f1a7b46b21b8cb67259c5c38ea7de86c21aecd200 (25) A tests/conformance/xml/sun_invalid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-2ff0c3e922071b3b51cc1df519d58a824dd1510d1556bcd41bdd568bd08ab86e (1193) A tests/conformance/xml/sun_notwf.rs https://github.com/ballsteve/xrust/pull/44/files#diff-8dbebf88f4c44122e478359a1a78840c71742b4680978cf3a242f57efc9901b3 (904) A tests/conformance/xml/sun_valid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-9a54d50deb16491839b5bf7c4f617b19a176feb6170b2f44d1e1b30704e779da (601) M tests/conformance/xml/xmltest_invalid.rs https://github.com/ballsteve/xrust/pull/44/files#diff-d3794b57be6e4189ba7ca2d0c650a04fe19ed21bab5c67b77a6ab0ad94a7fd8e (10) M tests/conformance/xml/xmltest_notwf_ext_sa.rs https://github.com/ballsteve/xrust/pull/44/files#diff-87990800832b3a4664b51715410444d9b9bafa902145bbf0993f02c9f1a3f49a (8) M tests/conformance/xml/xmltest_notwf_not_sa.rs https://github.com/ballsteve/xrust/pull/44/files#diff-7f28d1defacdb5926e5f2c491fee640bc81e1384c1104cebbb35fbf73a8f245c (20) M tests/conformance/xml/xmltest_notwf_sa.rs https://github.com/ballsteve/xrust/pull/44/files#diff-62e829316c98339b6e3acdbbf8da6c212dbeee3d116a3275db5e77b29c9d565d (374) M tests/conformance/xml/xmltest_valid_ext_sa.rs https://github.com/ballsteve/xrust/pull/44/files#diff-c7e33c502b46f53bfb531fdeb7b160fc1dedea8bd5a72d306c08fa12d3af926e (54) M tests/conformance/xml/xmltest_valid_not_sa.rs https://github.com/ballsteve/xrust/pull/44/files#diff-a7788fc3cd58eb30049276e4842ae0b237661ec619701fa88ab6e9df03994ff1 (122) M tests/conformance/xml/xmltest_valid_sa.rs https://github.com/ballsteve/xrust/pull/44/files#diff-225a9bef8a202c37fbc93f8906f788376eb637bc96bcb391caa7aa37f22a58d9 (0) M tests/conformance/xml/xmltest_valid_sa_canonicalonly.rs https://github.com/ballsteve/xrust/pull/44/files#diff-de156dc3ab12a4e38d8037fd69f96d75ceabfa29d24144df9a96c03e03eedc0e (0) M tests/intmuttree.rs https://github.com/ballsteve/xrust/pull/44/files#diff-e6e6eb9fefc65f6d6f7adbd9b6fdacb05b10152480c9ded2f5a53cf6334261aa (0) M tests/mod.rs https://github.com/ballsteve/xrust/pull/44/files#diff-186198b8fadd47ef2416dc742b1d55b24dbadfd2b1599e0a22dc4c297afd8977 (0) A tests/xml/45M.xml https://github.com/ballsteve/xrust/pull/44/files#diff-9d08e0f8b53f07f591d6cb3935681fc83ec4579430f98a8445a3b90458bf3561 (0) Patch Links:

https://github.com/ballsteve/xrust/pull/44.patch https://github.com/ballsteve/xrust/pull/44.diff — Reply to this email directly, view it on GitHub https://github.com/ballsteve/xrust/pull/44, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUCP5QLGMGZ5ZGHVHGVQMXLWMUHNRANCNFSM6AAAAAAS2SEGU4. You are receiving this because you are subscribed to this thread.

ballsteve commented 1 year ago

Hi Daniel,

BTW, Romeo has asked me to give a talk on xrust to the Stuttgart Rust User Group. That will be today at 6pm CET (ie Thursday 15th, or Friday 16th at 4am AEDT :-( ). I'll be mentioning your work on the XML parser.

Devasta commented 1 year ago

Really glad to hear that! Let me know how it goes. 😄 On 15 Dec 2022, 04:36 +0000, Steve Ball @.***>, wrote:

Hi Daniel, BTW, Romeo has asked me to give a talk on xrust to the Stuttgart Rust User Group. That will be today at 6pm CET (ie Thursday 15th, or Friday 16th at 4am AEDT :-( ). I'll be mentioning your work on the XML parser. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

ballsteve commented 1 year ago

The talk for Stuttgart Rust User Group went well! There were only two problems: (1) I had to get up at 4am :-(, and (2) there was only enough time (0.5 hour talk) to give a high-level overview of Xrust, and some people wanted to get into the detail of trees and parsers. We may have to give more talks!

ballsteve commented 1 year ago

Looking at the PR, I have a few comments:

  1. There is a ParseInput and a Parserinput. These names are too close... what is the difference between them?
  2. In mod.rs, the implementation of Iterator defines an associated type; Item. The next function returns Option<char> - that should be Option<Self::Item>
Devasta commented 1 year ago
  1. Right, this one is my fault. Essentially, it started that the combinators would take a tuple with (document, offset, error) and then I used a type alias to clean up the code, but it continued to grow legs so I moved it all into the Parserinput struct with the alias remaining as (parserinput, error location). I could change the struct name to something like parserstate, or I could get rid of the type alias entirely. The usize that contains the location of the error I am hoping to extend out in future to provide details of what actually is going wrong.

  2. Good catch! Will fix that now!

ballsteve commented 1 year ago

After merging this branch (which you can now delete, btw), the XSLT tests are failing because the parser is not handling XML Namespaces.

I'll take a look and see why this is not working correctly, but if you can advise on the state of namespace support that would be great!

Merry Xmas!

Devasta commented 1 year ago

Will be working on namespaces next. Sorry about that. I think I should be able to carry those along on the parser.

Merry christmas to you too!

ballsteve commented 1 year ago

Hi Daniel,

No worries.

I’ll let you get on with that, and eagerly await the result ;-)

In the meantime, I’m going to take a look at writing a transformation combinator (or at least thinking about it ;-)

Cheers, and Merry Xmas and New Year, Steve

On 22 Dec 2022, at 9:54 am, Daniel Murphy @.***> wrote:

Will be working on namespaces next. Sorry about that. I think I should be able to carry those along on the parser.

Merry christmas to you too!

— Reply to this email directly, view it on GitHub https://github.com/ballsteve/xrust/pull/44#issuecomment-1362196952, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUCP5QK3Q4AFRMQOFIU4DZLWOODDHANCNFSM6AAAAAAS2SEGU4. You are receiving this because you modified the open/close state.