FrankensteinVariorum / fv-data

TEI data for the Frankenstein Variorum project
The Unlicense

Namespace mixup in string-range #12

Closed zmbq closed 1 year ago

zmbq commented 5 years ago

The string-range pointers have XPaths that reference the tei namespace prefix. That prefix is not declared in either the Spine file or the chunk file. Both use the TEI namespace as their default namespace, but a parser cannot infer from that what tei: means.

I am going to automatically remove all tei: references from string-range expressions for now; I think you should do the same in these files.

This, actually, is an interesting problem. If an XPath is written in one file but references another, which namespaces are used in the XPath? The ones in the file where the XPath is written, or the ones in the file the XPath references? I would expect it to be the namespaces in the referenced file.

ebeshero commented 5 years ago

On investigating this interesting question, we took a look at the way we're declaring namespaces in the spine files and in the target files from S-GA. They're related because in both, the TEI is the default namespace, lacking a prefix declaration. We think that means we can and should remove the tei: prefix and are proceeding to do so now...

Variorum Spine Namespace Declaration:

<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:pitt="https://github.com/ebeshero/Pittsburgh_Frankenstein"
     xmlns:mith="http://mith.umd.edu/sc/ns1#"
     xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse">

S-GA Files Namespace Declaration:

<surface xmlns="http://www.tei-c.org/ns/1.0"  xmlns:mith="http://mith.umd.edu/sc/ns1#"
  xml:id="ox-ms_abinger_c56-0045" ulx="0" uly="0" lrx="3847" lry="5342"
  mith:shelfmark="MS. Abinger c. 56" mith:folio="21r" partOf="#ox-frankenstein_volume_i">

@zmbq @raffazizzi

ebeshero commented 5 years ago

@zmbq I have now output new spine files to remove the tei: namespace prefix from string-pointer xpaths to SGA files with this commit: https://github.com/PghFrankenstein/fv-data/commit/759e6d0a8fd27a133cc967f25de051c92634c84a

Let's hope this works.

ebeshero commented 5 years ago

I'm leaving this open until we're sure we've definitively resolved whether we're better off without the namespace prefix.

zmbq commented 5 years ago

OK, this is a bigger problem. I couldn't resolve any XPath queries against the TEI. It turns out you can't match elements in the default namespace with unprefixed names: you need to bind the same namespace to a prefix and query with that prefix. This is explained here https://docs.microsoft.com/en-us/dotnet/standard/data/xml/xpath-queries-and-namespaces (under the title The Default Namespace) and here http://www.edankert.com/defaultnamespaces.html.

So what we need to do is reintroduce the tei namespace to the XPath expressions, and also add it to the document as another namespace. So the TEI element should look like:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:tei="http://www.tei-c.org/ns/1.0" xml:id="fMS-C09">

Everything else stays the same.
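
The behaviour described above can be reproduced with a small sketch. It assumes Python's standard-library xml.etree.ElementTree, which supports only a subset of XPath and is not the processor used in this thread; the `<text><body><p>` content is made up for illustration. Note that ElementTree takes its prefix bindings from the caller rather than from the document.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# A TEI root with only the default namespace declared (illustrative content).
doc = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="fMS-C09">'
    "<text><body><p>foo</p></body></text></TEI>"
)

# An unprefixed step looks for elements in no namespace, so nothing matches.
unprefixed = doc.findall(".//p")

# A prefixed step matches once the prefix is bound to the TEI namespace URI.
prefixed = doc.findall(".//tei:p", {"tei": TEI_NS})

print(len(unprefixed), len(prefixed))  # prints "0 1"
```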

ebeshero commented 5 years ago

@zmbq Wow, okay! I’ll plug those into the pipeline and ping you here when it’s ready. Thanks for the readings, too. I wonder who was the first person to say, “It’s always a namespace issue”?

zmbq commented 5 years ago

I think it's always Lupus.

ebeshero commented 5 years ago

Lupus or not...I'm redoing the namespaces now. First the spine, then I'm writing an identity transform to pop the namespace into the variorum-chunks.

zmbq commented 5 years ago

I just need the xmlns:tei attribute on the root element, don't change anything else inside (or we may have issues with our working-with-no-styling-yet visualization code).
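
For readers who want to try this outside XSLT, here is a minimal sketch of adding only the xmlns:tei declaration to the root element. It assumes Python's standard-library xml.dom.minidom and is not the route taken in this thread; `add_tei_prefix` is a hypothetical helper name. Re-serialization may normalize details such as the XML declaration and attribute quoting, so it is worth diffing the output against the input.

```python
from xml.dom import minidom

TEI_NS = "http://www.tei-c.org/ns/1.0"


def add_tei_prefix(xml_text, prefix="tei", uri=TEI_NS):
    """Hypothetical helper: declare `prefix` on the root element only."""
    dom = minidom.parseString(xml_text)
    root = dom.documentElement
    name = "xmlns:" + prefix
    # Only add the declaration if the root does not already carry it.
    if not root.hasAttribute(name):
        root.setAttribute(name, uri)
    # Serialize the whole document so processing instructions are kept.
    return dom.toxml()


chunk = '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="fMS-C09"><text/></TEI>'
out = add_tei_prefix(chunk)
print(out)
```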

ebeshero commented 5 years ago

Understood!

raffazizzi commented 5 years ago

Hello @zmbq! @ebeshero and I have been talking about this and we'll go ahead and add the declaration as you requested. But if we want to make this tool as general as possible, in some (most?) cases, pointers will refer to documents we have no control over, meaning that it wouldn't be possible to add the prefix declaration to the source.

Here's a suggestion on how to fix this. Since the XML document needs to be consumed in order to resolve the XPath, it's possible to add a namespace prefix declaration for the default namespace in case it's not there already. The prefix would match the one in the XPath, or would be set via a parameter on instantiation if we don't want to trust the XPath directly.

zmbq commented 5 years ago

I don't think this is a bug in the code; I think it's a problem with the inputs. XPaths don't resolve default namespace elements. People should supply proper XPath expressions, so they shouldn't refer to default namespace elements but should use an explicit namespace prefix (which can be declared in addition to the default namespace).

We have no way of knowing which namespace declaration to add, and even if we did, updating the XPath expression to add that namespace to the XPath steps that have no namespace prefix seems like a daunting task (you'd need to parse the XPath properly, so you'd need some sort of grammar to do that).

The documentation should clearly state, however, that the XPath expressions are resolved in the context of the target document, so the namespaces defined there are used. This can be confusing, because the pointers are in their own XML document with its own namespaces.

raffazizzi commented 5 years ago

I'm not suggesting there's a bug, I'm offering a feature request :)

But I still don't understand. You're saying that 1. the XPath expressions must use namespace prefixes because they can't rely on default namespaces, but you're also saying that 2. the target document must have the namespace prefix declared there as well in order to be resolved.

1 is easily enforced and we can make sure people supply proper XPath expressions. But we cannot enforce 2 because we won't always have control of the documents we target. Most XML documents will use default XML namespaces and use prefixes only for external vocabularies if at all needed.

What I'm suggesting is this:

input XPath: foo.xml#string-range(//tei:TEI, 0, 2)

foo.xml: <TEI xmlns="www.tei-c.org/ns/1.0">foo</TEI>

Before resolving the XPath, add declaration to the root (this is possible because we already have a copy in memory of foo.xml before applying the XPath).

DOM: <TEI xmlns="www.tei-c.org/ns/1.0" xmlns:tei="www.tei-c.org/ns/1.0">foo</TEI>

Now the XPath should resolve correctly.
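
A rough sketch of this suggestion, assuming Python's standard-library xml.etree.ElementTree rather than the tool under discussion. Instead of editing the DOM, it binds the prefix to the root element's own namespace in the query context, which has the same effect for resolution. `resolve_with_default_ns` and its `prefix` parameter are hypothetical names; the parameter stands in for the "option at instantiation" idea. Because ElementTree does not accept absolute paths on an element, the example adds a `<p>` child and queries `.//tei:p` instead of `//tei:TEI`.

```python
import re
import xml.etree.ElementTree as ET


def resolve_with_default_ns(xml_text, xpath, prefix="tei"):
    """Hypothetical helper: bind `prefix` to the root's namespace, then query."""
    root = ET.fromstring(xml_text)
    # ElementTree stores a namespaced tag as "{uri}local"; pull the URI out.
    match = re.match(r"\{(.*)\}", root.tag)
    if match is None:
        raise ValueError("root element is not in a namespace")
    namespaces = {prefix: match.group(1)}
    # findall() is relative to the root, so ".//" reaches its descendants.
    return root.findall(xpath, namespaces)


foo_xml = '<TEI xmlns="www.tei-c.org/ns/1.0"><p>foo</p></TEI>'
hits = resolve_with_default_ns(foo_xml, ".//tei:p")
print([el.text for el in hits])  # prints "['foo']"
```

This only covers the case where every prefixed step is meant to be in the root's namespace; an XPath that mixes several prefixes would still need explicit bindings.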

ebeshero commented 5 years ago

The spine files are updated with the tei: prefix back in place on the string-range pointers in fv-data now!

One of the readings you posted showed that a parsing tool could be reaching for namespaced elements with its own idiosyncratic prefix (say, to reach for TEI elements with a twinkie: prefix), while the source XML defined a different prefix entirely (tei:). I wonder how arbitrary these are--it seems most important to assert the namespaced condition of elements in the pointing/parsing tool, but the prefix you decide to use doesn't seem to matter. I wonder how necessary it is to set a prefixed namespace in the XML being interpreted...(Meanwhile, that's my next step...)
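
The point about arbitrary prefixes can be checked with a short sketch, again assuming Python's standard-library xml.etree.ElementTree, where the caller supplies the prefix bindings. Other XPath processors may take their bindings from the document instead, so this shows one implementation's behaviour, not a general guarantee.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# The source declares tei: for the TEI namespace, alongside the default.
source = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:tei="http://www.tei-c.org/ns/1.0"><p>foo</p></TEI>'
)

# The querying side may bind any prefix it likes to the same URI.
with_tei = source.findall(".//tei:p", {"tei": TEI_NS})
with_twinkie = source.findall(".//twinkie:p", {"twinkie": TEI_NS})

# Both queries select the same element; only the URI matters.
print(with_tei == with_twinkie)  # prints "True"
```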

ebeshero commented 5 years ago

In Real Life, of course, we do whatever we must for XPath to work!

zmbq commented 5 years ago

@raffazizzi, what you suggest is problematic in two respects: a conceptual one and a real-world one. The conceptual one is that you allow people to give you inconsistent data, and you try to fix it for them. People should just give us consistent data.

The real-world one is more important in my opinion: we have no idea which namespaces to add to the target document. The XPath expressions can contain any number of namespace prefixes. If an expression has three, how can we decide which one corresponds to the default namespace? The XPath expressions contain only the prefixes, not their URIs. So we'll need to guess, which will lead to errors that are very hard to figure out, when unsuspecting users suddenly realize we're matching against the wrong elements.

zmbq commented 5 years ago

> One of the readings you posted showed that a parsing tool could be reaching for namespaced elements with its own idiosyncratic prefix (say, to reach for TEI elements with a twinkie: prefix), while the source XML defined a different prefix entirely (tei:). I wonder how arbitrary these are--it seems most important to assert the namespaced condition of elements in the pointing/parsing tool, but the prefix you decide to use doesn't seem to matter. I wonder how necessary it is to set a prefixed namespace in the XML being interpreted...(Meanwhile, that's my next step...)

I'm initializing the XPath processor with the namespaces in the target document (the original chunk, not the spine), so these are the namespaces that should appear in the XPath expressions.

Before this tool is available for general use, we will need to build a utility that validates all the data and tells the user what's wrong when something is wrong. This utility will need to go over all the pointers and make sure they are valid references. Otherwise, nobody is ever going to be able to use this.
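
One piece of such a validation utility could look like the sketch below: collect the namespace declarations from the target document, then report any prefix used in a pointer's XPath that the target never declares. This assumes Python's standard library and is not the project's actual code; `declared_namespaces` and `undeclared_prefixes` are hypothetical names, and the prefix pattern is deliberately rough (it would be fooled by colons inside string literals, for example).

```python
import io
import re
import xml.etree.ElementTree as ET


def declared_namespaces(xml_text):
    """Collect prefix -> URI for every namespace declaration in the document."""
    found = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        found[prefix] = uri  # the default namespace has the empty prefix ""
    return found


def undeclared_prefixes(xpath, namespaces):
    """Return prefixes used in `xpath` that the target document never declares."""
    # Rough pattern: a name followed by a single colon (not the "::" of an axis).
    used = set(re.findall(r"([A-Za-z_][\w.-]*):(?!:)", xpath))
    return sorted(used - set(namespaces))


chunk = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:mith="http://mith.umd.edu/sc/ns1#"><p>foo</p></TEI>'
)
ns = declared_namespaces(chunk)
print(undeclared_prefixes("//tei:p", ns))      # prints "['tei']"
print(undeclared_prefixes("//mith:zone", ns))  # prints "[]"
```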

raffazizzi commented 5 years ago

@zmbq, the idea is to create a tool that is able to target any XML on the web -- are we going to go around telling people to adjust their data so we (not them) can use it? Or are we going to work with what's out there? Furthermore: it's not bad practice to not add an xml prefix when you have a default one! The XPath implementation you're using should be able to work with the default namespace instead.

I take your point about the real world: it's dangerous to add a namespace prefix based solely on the XPath string. But this can be easily parametrized via an option at instantiation or what have you.

ebeshero commented 5 years ago

Sorry for my delay on implementing the prefixed namespace in the root element: I'm trying to implement this in an XSLT at the last stage where we output the variorum-chunks, and I'm running into interesting XSLT problems to do with declaring a prefixed namespace in addition to the default namespace. To see what I'm talking about (maybe this isn't a problem, just an oddity), take a look at this example root element from my output for a variorum-chunk file:

<TEI xmlns="http://www.tei-c.org/ns/1.0"
xmlns:pitt="https://github.com/ebeshero/Pittsburgh_Frankenstein"
xmlns:mith="http://mith.umd.edu/sc/ns1#" xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse"
 xmlns:ns0="http://www.tei-c.org/ns/1.0" ns0:tei="http://www.tei-c.org/ns/1.0">

See those last two at the bottom? I don't know how those will work in real life (as in, I know that's not what you asked for...) so I'm trying something else that @raffazizzi suggested!

zmbq commented 5 years ago

> @zmbq, the idea is to create a tool that is able to target any XML on the web -- are we going to go around telling people to adjust their data so we (not them) can use it? Or are we going to work with what's out there? Furthermore: it's not bad practice to not add an xml prefix when you have a default one! The XPath implementation you're using should be able to work with the default namespace instead.
>
> I take your point about the real world: it's dangerous to add a namespace prefix based solely on the XPath string. But this can be easily parametrized via an option at instantiation or what have you.

I'm not sure I understand your use-case - someone is doing variations of someone else's documents? In that case, as I suggested a couple of weeks ago, they should make copies of those documents (because they will need to update their variations whenever there's a change), and they can add the correct namespace to that copy.

We can do as you suggest, but I'm quite certain it will cause a lot more problems than it will solve. Anyway, this is not for next week obviously.

zmbq commented 5 years ago

> Sorry for my delay on implementing the prefixed namespace in the root element: I'm trying to implement this in an XSLT at the last stage where we output the variorum-chunks, and I'm running into interesting XSLT problems to do with declaring a prefixed namespace in addition to the default namespace. To see what I'm talking about (maybe this isn't a problem, just an oddity), take a look at this example root element from my output for a variorum-chunk file:
>
> <TEI xmlns="http://www.tei-c.org/ns/1.0"
> xmlns:pitt="https://github.com/ebeshero/Pittsburgh_Frankenstein"
> xmlns:mith="http://mith.umd.edu/sc/ns1#" xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse"
>  xmlns:ns0="http://www.tei-c.org/ns/1.0" ns0:tei="http://www.tei-c.org/ns/1.0">
>
> See those last two at the bottom? I don't know how those will work in real life (as in, I know that's not what you asked for...) so I'm trying something else that @raffazizzi suggested!

Now I have a mission for Utrecht! Convince you to stop using XSLT and use a proper programming language instead! I know TEI people love XSLT, but you really shouldn't. XSLT was probably designed by Satan.

ebeshero commented 5 years ago

@zmbq Actually our very own project is a use-case, as our early goal was to process SGA's XML unmodified from its original source. There are all kinds of research possibilities for processing XML that wasn't home-cooked and pre-processed in your own space, and if we did more research that way, we wouldn't be complaining so often about "silos" right?

ebeshero commented 5 years ago

(Trying Raff's strategy now for getting that blasted prefixed namespace in place...)

ebeshero commented 5 years ago

w00t! Okay we've got it working now--thanks @raffazizzi ! The root elements on variorum-chunk files will look like this now:

<TEI xmlns="http://www.tei-c.org/ns/1.0"
xmlns:pitt="https://github.com/ebeshero/Pittsburgh_Frankenstein"
xmlns:mith="http://mith.umd.edu/sc/ns1#" xmlns:th="http://www.blackmesatech.com/2017/nss/trojan-horse"
xmlns:tei="http://www.tei-c.org/ns/1.0">

ebeshero commented 5 years ago

It'll be just a few minutes more until I've got the variorum-chunks updated for fv-data.

ebeshero commented 5 years ago

@zmbq Okay! Here are the variorum-chunk files with the new tei: prefixed namespace in addition to the default TEI namespace defined on the root element, as requested, in this commit: https://github.com/PghFrankenstein/fv-data/commit/4df4b2a28f7f63a3cb518830d0905e7a29d8fd15 It's ready to pull in from fv-data now.

zmbq commented 5 years ago

@ebeshero, you have updated all the files except the MS chunks that the Spine files point to. Unfortunately I am now quite certain we will not be able to handle pointers to the MS edition by the end of this week. We are proceeding with the four other variants, and if we have time we'll get back to this. I suggest you plan your presentation as if MS is not there; at best, you'll be pleasantly surprised.

ebeshero commented 5 years ago

The MS files generated as a result of my pipeline went into their old locations in reseq-MS-chunks. They are not S-GA files, and I am afraid I concentrated on getting the pointers correct. As it is now Trivially Easy to correct the SGA files like the others, I ask that you give me a few minutes to correct the matter.

Elisa

On Jun 30, 2019, at 8:16 AM, Itay Zandbank notifications@github.com wrote:

> @ebeshero, you have updated all the files except the MS chunks that the Spine files point to. Unfortunately I am now quite certain we will not be able to handle pointers to the MS edition by the end of this week. We are proceeding with the four other variants, and if we have time we'll get back to this. I suggest you plan your presentation as if MS is not there; at best, you'll be pleasantly surprised.

ebeshero commented 5 years ago

@zmbq I have now corrected the namespace line on the MS chunks that the Spine files point to, that is, those in variorum-chunks. I apologize for the oversight, which came from my attempting to find a robust place to plant the prefix change in my post-collation processing pipeline, and ask that you please attempt to work with the SGA files now.