delb-xml / delb-py

A library that provides an ergonomic model for XML encoded text documents (e.g. with TEI-XML).
https://delb.readthedocs.io
GNU Affero General Public License v3.0
16 stars 0 forks source link

Adds TagNode.fetch_or_create_by_xpath #20

Closed funkyfuture closed 3 years ago

funkyfuture commented 3 years ago

It's surprisingly simple to implement an XPath parser. :grimacing:

funkyfuture commented 3 years ago

@JKatzwinkel that covers your cases, doesn't? is the method's docstring sufficiently comprehendible?

JKatzwinkel commented 3 years ago

Looks really great, method docstring makes sense, I will give it a try later, thanks!

funkyfuture commented 3 years ago

thanks for the additional tests.

my more "correct" (less hackish, really) implementation of the predicates now only fails with expressions like ./cit[@type="translation"][@lang="en"] that are first transformed to ./entry/sense/cit[(@type="translation") and (@lang="en")], but bracket parsing isn't implemented. currently i'd say that this doesn't need to be part of a future, delb-native implementation of a subset of the XPath-grammar. hence that initial normalization could yield a simpler expression. but then, there's the risk that someone really throws in something like cit[@foo or @bar][@peng="pow"]. and that then incorrectly transformed expression trickles through to lxml's xpath evaluator.

i already tend to a Verbot of overly complicated XPath expressions for now. and maybe one should aim for a first spec of a supported subset of XPath in the 0.4 release? @03b8, @JKatzwinkel, @zed-g, do you have any thoughts on that?

funkyfuture commented 3 years ago

while elaborating the issues with @JKatzwinkel we concluded:

oh, and btw, here's an attempt to implement XPath with a generic parser: https://github.com/emory-libraries/eulxml/tree/master/eulxml/xpath

funkyfuture commented 3 years ago

it certainly could use more tests, but parenthesized expressions should now be parsed properly.

but yes, functions now pose an unsolved problem. me currently thinks that only functions where an attribute makes sense as first argument and that return a boolean are simple enough to be supported. that are starts-with and contains. and not makes sense.

JKatzwinkel commented 3 years ago

While adding arbitrary assertions to test_quotes_in_css_selector, I realized that any specification of an xpath subset we end up supporting would require us to also specify what parts of css we do and don't support... For instance, at this point, even after "implementation" of boolean xpath functions "support", we do not support the css selector [attribute$=value], because the css library in use doesn't convert $= into xpath's ends-with() function, but rather into an expression involving substring(), the return type of which isn't boolean.

funkyfuture commented 3 years ago

there's no 'native' ends-with function in the XPath 1.0 specs, but it should be specified b/c of consistency w/ starts-with.

that won't help with the css selectors though. maybe we'd just vendorize an adapted version of the transforming lib. and then there's the long term question how it would be done in Rust. my guess is that seperate parsers and evaluators for CSS and XPath are the simplest thing to do.

funkyfuture commented 3 years ago

thanks a lot for the valuable input. i wanted to put an end to this PR before it grew endlessly.