Blacksmoke16 / oq

A performant, and portable jq wrapper to facilitate the consumption and output of formats other than JSON; using jq filters to transform the data.
https://blacksmoke16.github.io/oq/
MIT License

XML namespaces stripped when referenced #85

Closed LoganBarnett closed 3 years ago

LoganBarnett commented 3 years ago

If I have a root element which declares an xmlns prefix, that prefix is stripped from the elements in the document. This is causing some trouble in some document transformations I'm doing, since the validators which consume the transforms care very much about these namespaces.

Additionally, the namespace declaration itself doesn't survive.

Here is an example in which the prefix is preserved:

$ oq -i xml -o xml <<< '<?xml version="1.0"?><a:foo>bar</a:foo>'                                              
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <a:foo>bar</a:foo>
</root>

Inspecting the JSON output reveals that it is similarly preserved:

$ oq -i xml <<< '<?xml version="1.0"?><a:foo>bar</a:foo>'       
{
  "a:foo": "bar"
}

If I add an xmlns declaration for that prefix, the prefix is stripped from the output and the xmlns attribute itself is also removed.

$ oq -i xml -o xml <<< '<?xml version="1.0"?><a:foo xmlns:a="http://www.w3.org/1999/xhtml">bar</a:foo>'
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <foo>bar</foo>
</root>

The JSON output mirrors the behavior:

$ oq -i xml <<< '<?xml version="1.0"?><a:foo xmlns:a="http://www.w3.org/1999/xhtml">bar</a:foo>'       
{
  "foo": "bar"
}

I admit my knowledge about deeper XML validation and transformation is limited - I might be missing something here. However I think there's a lot of value in oq making XML transformations without necessarily changing the rest of the document (within reason - bad XML is bad XML and it's not reasonable to support that). If this behavior is intentional, perhaps we could have some kind of flag in which we could disable it?

Thanks for all the work on oq! It's been an invaluable tool at my workplace :)

Blacksmoke16 commented 3 years ago

Unfortunately the spec used to inform the transformation doesn't handle namespaces, so we're on our own to figure out how to best handle this. I would think the structure of your last example there should be:

{
  "a:foo": {
    "@xmlns:a": "http://www.w3.org/1999/xhtml",
    "#text": "bar"
  }
}

However, this is technically a breaking change, as the structure of the transformed data would change if an element had a namespace, when previously it was a simple key/value pair.

> If this behavior is intentional, perhaps we could have some kind of flag in which we could disable it?

This might be the way forward as it would keep backwards compatibility, but still support this use case. I can also note that in 2.x of oq, this'll be made the default behavior.

I still need to update tests and add the flag, but I pushed up https://github.com/Blacksmoke16/oq/compare/xml-namespaces that includes the logic that will be behind the flag. With this code your examples now produce this, with the JSON version being what I included earlier:

./bin/oq -i xml -o xml <<< '<?xml version="1.0"?><a:foo xmlns:a="http://www.w3.org/1999/xhtml">bar</a:foo>'
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <a:foo xmlns:a="http://www.w3.org/1999/xhtml">bar</a:foo>
</root>

LoganBarnett commented 3 years ago

Thanks for the quick turnaround! :)

> I would think the structure of your last example there should be:

Perfect! So too is the output from your branch.

I'm using nix on my local, which has a bug building the latest Crystal. I will see if I can get it stood up and test out the branch against my more complex documents I'm currently working with. However, please don't take that as an ask to hold up work if wherewithal permits.

Blacksmoke16 commented 3 years ago

> I'm using nix on my local, which has a bug building the latest Crystal.

What's the issue?

LoganBarnett commented 3 years ago

During the build of crystal I see this output:

Failures:

  1) Crystal::Command::FormatCommand formats stdin (bug + show-backtrace)

I'm not sure if it matters, but I see loads of the repeated warning below as the tests are run, which seem fairly critical:

ld: warning: directory not found for option '-L/nix/store/wljzw4ad0lk3afqk7p3zkaqc4gn5hy32-crystal-0.32.1-lib/crystal'

I'm using Nix on macOS. Occasionally build differences in the macOS ecosystem make Nix packages fail and I don't think it has quite as many eyes on it as Linux (or first class Nix) users enjoy. I intend to file a bug at some point, but they might know about it already since the package is marked as "broken" - at least for macOS users.

Apologies if this derails this particular issue.

LoganBarnett commented 3 years ago

I was able to build a binary from the xml-namespaces branch using my jumpbox :) I have a refined use case:

$ oq -i xml <<< '<?xml version="1.0"?><foo xmlns="urn:oasis:names:tc:SAML:2.0:metadata" xmlns:a="http://www.w3.org/1999/xhtml">bar</foo>'
{
  ":foo": {
    "@xmlns:": "urn:oasis:names:tc:SAML:2.0:metadata",
    "#text": "bar"
  }
}

To get this to happen I had to include an unprefixed namespace. The result is that the prefixed namespace is still stripped, and now the element names are prefixed with a... blank namespace? In this case it is :foo. There's a certain logic to this - I'm not ready to claim such behavior is undesirable.

If I leave off the prefixed namespace, the output is the same:

$ oq -i xml <<< '<?xml version="1.0"?><foo xmlns="urn:oasis:names:tc:SAML:2.0:metadata">bar</foo>'
{
  ":foo": {
    "@xmlns:": "urn:oasis:names:tc:SAML:2.0:metadata",
    "#text": "bar"
  }
}

Blacksmoke16 commented 3 years ago

> I intend to file a bug at some point

Yea that's prob your best bet, I don't really know anything about nix.

> There's a certain logic to this - I'm not ready to claim such behavior is undesirable.

Ok, so I think these are easy fixes. I pushed up another commit that should resolve these if you want to try again.

LoganBarnett commented 3 years ago

I've been playing around with it today. I'm running into trouble with my transformation script and I haven't been able to create a simple reproduction - so it might just be my script at this point. Tomorrow I plan on figuring out what the intended behavior of XML namespaces should actually be so I can make an intelligent ask ;)

I really appreciate all of the effort and the quick turnaround!

LoganBarnett commented 3 years ago

Inspired by some libraries that handle XPath and by reading the XML spec itself, I have some suggestions to make for oq.

Here’s a fact list:

  1. XML namespaces are URIs (usually URLs, though URNs such as urn:oasis:names:tc:SAML:2.0:metadata are valid too). Cling strongly to this.
  2. Namespaces aren’t prefixes - they are always URIs.
  3. The URIs can be associated with a prefix.
  4. There is also a “default” namespace that can be applied, which has no prefix.
  5. Namespaces and namespace prefixes apply to the node in which they are declared, as well as any child nodes.
  6. Namespaces can apply to attributes and nodes alike.
  7. Default namespaces can be overridden by a child in a hierarchy that already has a default namespace.
  8. A given XML document can declare its namespaces in a variety of ways, so one cannot assume a prefix nor a default since that changes the meaning of the document.

XPath libraries handle this by having the consumer establish namespaces that are important to the consumer, and assign a prefix to it. The consumer expresses “Give me what is on this path” but the namespace is essential to that expression. For example, these are two different nodes:

<?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns:b="https://b">
  <a:foo>
    herp
  </a:foo>
  <b:foo>
    derp
  </b:foo>
</root>

Both of these foo elements should be thought of as entirely independent. An oq query to get the second foo would look like this:

oq -i xml --xmlns "b=https://b" '.["b:foo"] .["#text"]' <<< '
<?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns:b="https://b">
  <a:foo>
    herp
  </a:foo>
  <b:foo>
    derp
  </b:foo>
</root>
'

And the result should be:

"derp"

Additionally, the same oq expression should work for the following document:

<?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns="https://b">
  <a:foo>
    herp
  </a:foo>
  <foo>
    derp
  </foo>
</root>

These documents are semantically the same as far as the query is concerned, because the last foo in each document has the namespace https://b, regardless of whether or how it is prefixed.
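To make that concrete, here is a minimal Python sketch. It is not oq code; it assumes the standard library's xml.etree.ElementTree as a stand-in for any namespace-aware parser, and the variable and function names are made up for illustration. It shows that the foo children of both documents resolve to the same expanded names once prefixes are replaced by their namespaces.

```python
import xml.etree.ElementTree as ET

# The two documents from above, with whitespace trimmed for brevity.
PREFIXED = """<root xmlns:a="https://a" xmlns:b="https://b">
  <a:foo>herp</a:foo>
  <b:foo>derp</b:foo>
</root>"""

DEFAULTED = """<root xmlns:a="https://a" xmlns="https://b">
  <a:foo>herp</a:foo>
  <foo>derp</foo>
</root>"""


def child_names(document):
    """Return the namespace-qualified tag of each child of the root element."""
    return [child.tag for child in ET.fromstring(document)]


prefixed_names = child_names(PREFIXED)
defaulted_names = child_names(DEFAULTED)

# ElementTree writes expanded names as "{namespace}local-name".
print(prefixed_names)
print(defaulted_names)
```

Both lists come out as `{https://a}foo` followed by `{https://b}foo`, which is why the prefix spelling should not matter to a query.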

The contrived --xmlns argument here gives the path selection mechanism a means by which oq and its underlying XML library can traverse the document. In our example, we kept the prefixes the same, but we are under no obligation to do that:

oq -i xml --xmlns "c=https://b" '.["c:foo"] .["#text"]' <<< "$document"

Here the prefix is changed, but both documents above will satisfy the query and produce an output of “derp”. This works because the jq query understands the c prefix to be the same namespace that the document has established as the prefix b.
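This is the same mechanism XPath-style libraries expose. As a rough illustration (Python's xml.etree.ElementTree, not oq; the `query_namespaces` name is just for this sketch), the query declares its own prefix c for https://b and still finds the unprefixed foo:

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<root xmlns:a="https://a" xmlns="https://b">
  <a:foo>herp</a:foo>
  <foo>derp</foo>
</root>"""

root = ET.fromstring(DOCUMENT)

# The mapping belongs to the query, not to the document: "c" never
# appears in the XML, only the namespace it stands for does.
query_namespaces = {"c": "https://b"}

node = root.find("c:foo", query_namespaces)
result = node.text
print(result)
```

The lookup prints derp even though the document never uses the prefix c.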

When oq goes to render the namespaces, it can (legally) rearrange the xmlns declarations, so long as every node ends up in the same namespace it started in.

For example, it should be an acceptable transformation to start with this:

<?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns="https://b">
  <a:foo>
    herp
  </a:foo>
  <foo>
    <bar xmlns="https://c">
      <baz xmlns="https://d">
      </baz>
    </bar>
  </foo>
</root>

And wind up with this:

<?xml version="1.0" ?>
<b:root xmlns:a="https://a" xmlns:b="https://b" xmlns:c="https://c" xmlns:d="https://d">
  <a:foo>
    herp
  </a:foo>
  <b:foo>
    <c:bar>
      <d:baz>
      </d:baz>
    </c:bar>
  </b:foo>
</b:root>

Where all of the default namespaces have become explicit prefixes.

Clearing the default namespace is done by setting the default namespace to "". Were we to include a default namespace reset from the example above it would look like this, with focus on the qux node.

<?xml version="1.0" ?>
<root xmlns:a="https://a" xmlns="https://b">
  <a:foo>
    herp
  </a:foo>
  <foo>
    <bar xmlns="https://c">
      <baz xmlns="https://d">
        <qux xmlns="" />
      </baz>
    </bar>
  </foo>
</root>

And wind up with this:

<?xml version="1.0" ?>
<b:root xmlns:a="https://a" xmlns:b="https://b" xmlns:c="https://c" xmlns:d="https://d">
  <a:foo>
    herp
  </a:foo>
  <b:foo>
    <c:bar>
      <d:baz>
        <qux />
      </d:baz>
    </c:bar>
  </b:foo>
</b:root>
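A quick way to check that this kind of rewrite is namespace-preserving is to round-trip the document through a namespace-aware serializer. The sketch below assumes Python's xml.etree.ElementTree (not oq), which always writes explicit prefixes unless told otherwise; the prefixes a to d are registered only for this example. Note that the root element itself sits in https://b, so a faithful serializer has to prefix it as well.

```python
import xml.etree.ElementTree as ET

SOURCE = """<root xmlns:a="https://a" xmlns="https://b">
  <a:foo>herp</a:foo>
  <foo>
    <bar xmlns="https://c">
      <baz xmlns="https://d">
        <qux xmlns="" />
      </baz>
    </bar>
  </foo>
</root>"""

# Ask the serializer to use these prefixes instead of generated ones.
for prefix in "abcd":
    ET.register_namespace(prefix, f"https://{prefix}")


def expanded_names(element):
    """List the "{namespace}local" name of every element in document order."""
    return [node.tag for node in element.iter()]


tree = ET.fromstring(SOURCE)
rewritten = ET.tostring(tree, encoding="unicode")
print(rewritten)

# The declarations moved, but every element kept its namespace.
before = expanded_names(tree)
after = expanded_names(ET.fromstring(rewritten))
```

The serialized text carries no default xmlns declarations, and qux stays in no namespace at all, yet `before` and `after` are identical.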

The Uniqueness of Attributes section of the spec I’m less confident about, but after reading it a few times I can only ascertain that it means to show that attributes don’t inherit default namespaces.
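That reading matches how namespace-aware parsers behave. A small sketch, again assuming Python's xml.etree.ElementTree rather than anything in oq, with made-up attribute names plain and marked:

```python
import xml.etree.ElementTree as ET

element = ET.fromstring(
    '<foo xmlns="https://b" xmlns:a="https://a" plain="1" a:marked="2" />'
)

# The element picks up the default namespace...
element_name = element.tag

# ...but the unprefixed attribute does not; only the explicitly
# prefixed attribute is namespace-qualified.
attribute_names = sorted(element.attrib)

print(element_name)
print(attribute_names)
```

Here `element_name` is `{https://b}foo`, while the attributes come back as `plain` and `{https://a}marked`.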

I know this is a lot to digest! I appreciate the time spent so far. I’m happy to help sort things out if there’s some confusion here. It took me some time to really grok it - I didn’t realize XML could be so complicated.

Blacksmoke16 commented 3 years ago

@LoganBarnett I guess the one question I have is, are you suggesting oq support a form of xpath as provided by the --xmlns option?

LoganBarnett commented 3 years ago

@Blacksmoke16 Not xpath, no. I think how oq converts into something jq can consume, and then hands off the work to jq is masterful and it would be a pity to abandon. But I do think there is great utility in saying "when I have this prefix, it means this namespace. Now you can treat nodes of this namespace equally to the ones found in the document". That should be the only concept the suggested xmlns parameter would indicate. There may be other better or acceptable options available, but I wanted to show the thought put into it and I think this is a place where those xpath libraries got it right in terms of a lateral indication of a namespace and its prefix.

From my perspective, the implementation hoops oq potentially has to jump through are staggering. I am ignorant of how much oq's XML dependency will help or not in regards to this. I'm happy to help think up alternatives and even partial solutions. I'm getting up to speed with crystal so hopefully I can do more than offer suggestions.

Earlier you mentioned oq 2.x - and perhaps we might have to wait for this. I want to be sensitive to your time as well as my work's tolerance for how long it takes me to accomplish an automated XML transformation ;)

Some context for fun, feel free to ignore since I tend to write a lot: Prior to this, we were using sed with success - partially due to luck. As I'm sure we can agree, parsing XML and friends with regular expressions is perilous ;) That luck ran out when we found the XML spec indicated a sequence of sibling elements, which the validator enforced. oq handled that perfectly.

Blacksmoke16 commented 3 years ago

> Not xpath, no

Ok good :) I see what I did tho. I mistook the filter to be additional args to that option :see_no_evil:.

> Earlier you mentioned oq 2.x - and perhaps we might have to wait for this

> That should be the only concept the suggested xmlns parameter would indicate.

Yea I think ideally this behavior of producing XML that is semantically equal to what was input when going to/from XML without any transformation is where I want to land. This behavior would become the default when/if oq ever gets to 2.0. However for now we can introduce some option, such as --xmlns to indicate it should produce output with namespaces taken into consideration.

I did manage to fix an additional issue related to some of your latest examples if you want to rebuild and try again. Apparently the methods that are available to get a node's namespace also include the parent's namespaces as well. So your 2nd to last example with a, b, c, and d namespaces is transformed into:

{
    "root": {
        "@xmlns:a": "https://a",
        "@xmlns": "https://b",
        "a:foo": {
            "@xmlns:a": "https://a",
            "@xmlns": "https://b",
            "#text": "herp"
        },
        "foo": {
            "@xmlns:a": "https://a",
            "@xmlns": "https://b",
            "bar": {
                "@xmlns": "https://c",
                "@xmlns:a": "https://a",
                "baz": {
                    "@xmlns": "https://d",
                    "@xmlns:a": "https://a"
                }
            }
        }
    }
}

Which I suppose is semantically equal, just a bit more verbose? :shrug:. Going to look into whether I just need to bind a different function to get only the ones defined on a given node.

> I'm getting up to speed with crystal so hopefully I can do more than offer suggestions.

Glad to hear, it's a nice lang ;).

> I am ignorant of how much oq's XML dependency will help or not in regards to this.

The XML lib oq is using is from Crystal's standard library, which is basically just a binding around libxml2 (http://xmlsoft.org). So it should be fairly robust, assuming you know C and how libxml works :S.

Blacksmoke16 commented 3 years ago

Ok, I was able to monkeypatch this in:

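class ::XML::Node
  # Returns only the namespaces declared on this node itself,
  # excluding any inherited from its ancestors.
  def node_namespaces : Array(Namespace)
    namespaces = [] of Namespace

    # ns_def is libxml2's linked list of xmlns declarations on this node.
    return namespaces unless (ns = @node.value.ns_def)

    while ns
      namespaces << Namespace.new(document, ns)
      ns = ns.value.next
    end
    namespaces
  end
end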

Which makes it now represented like:

{
    "root": {
        "@xmlns:a": "https://a",
        "@xmlns": "https://b",
        "a:foo": {
            "#text": "herp"
        },
        "foo": {
            "bar": {
                "@xmlns": "https://c",
                "baz": {
                    "@xmlns": "https://d"
                }
            }
        }
    }
}

Which looks to be exactly what we'd expect yea?

Blacksmoke16 commented 3 years ago

@LoganBarnett I created https://github.com/Blacksmoke16/oq/pull/89 that implements the actual --xmlns option and adds specs. If you could check that out and confirm it looks good on your test cases that would be :100:.

I also extracted the bug related to element prefixes being dropped and added specs for the current behavior as part of #88 and #90. I think that part is fine to release as part of 1.3, while keeping the changes related to namespaces behind the option until 2.x.

LoganBarnett commented 3 years ago

Thanks so much for iterating with me on this! :)

Alright, I got a workable crystal environment going where I can more quickly iterate on this now.

Here's a quick test:

bin/oq -i xml -o xml --xml-root  '' --xmlns -- '.["a:foo"]' <<< \
  '<?xml version="1.0"?><a:foo xmlns:a="http://bar"><a:baz>qux</a:baz></a:foo>'

I would expect this or something semantically similar:

<?xml version="1.0"?>
<a:baz xmlns:a="http://bar">qux</a:baz>

I get:

jq: error: syntax error, unexpected ':', expecting $end (Unix shell quoting issues?) at <top-level>, line 1:
a=http://bar      
jq: 1 compile error

This is the help I see with xmlns:

    --xmlns                          If XML namespaces should be parsed.  NOTE: This will become the default in oq 2.x.

Which leaves me with the impression that this is a toggle flag and not necessarily a flag where I can provide a namespace. The error makes me think the argument I think I'm passing to the --xmlns flag is actually just the query being sent to jq, so I took it out and this is what I see:

bin/oq -i xml -o xml --xml-root  '' --xmlns '.["a:foo"]' <<< \
  '<?xml version="1.0"?><a:foo xmlns:a="http://bar"><a:baz>qux</a:baz></a:foo>'
oq error: Error in attribute

Here's a couple of other things I tried:

bin/oq -i xml -o xml --xml-root  '' --xmlns '.foo' <<<  \
 '<?xml version="1.0"?><a:foo xmlns:a="http://bar"><a:baz>qux</a:baz></a:foo>'
<?xml version="1.0" encoding="UTF-8"?>

bin/oq -i xml -o xml --xml-root  '' --xmlns '.["a:foo"]' <<< \
  '<?xml version="1.0"?><a:foo xmlns:a="http://bar"><a:baz>qux</a:baz></a:foo>'
oq error: Error in attribute

One of the tricky things I found in the XML namespace stuff is that the prefix (a in this document) is arbitrary, and might not even be there. It's a reference to an actual namespace (always a URL). Because of that I think the xmlns flag would be best served taking one or more prefix->namespace mappings. Perhaps multiple namespaces can be declared via multiple instances of an xmlns flag? For example --xmlns 'a=https://a' --xmlns 'b=https://b', but whatever works easier for you.

Based on your specs and examples above I think we're in a really good spot. I think the only thing I'm seeing now is the need to address into the document with a query that's namespace aware.

Blacksmoke16 commented 3 years ago

> Because of that I think the xmlns flag would be best served taking one or more prefix->namespace mappings

@LoganBarnett I think I don't fully follow what purpose that mapping would have. Like what does --xmlns 'a=https://a' --xmlns 'b=https://b' actually do/mean?

I also think you found a different bug. For example:

echo $'{"foo":"bar"}' | oq .foo
"bar"

echo $'{"foo":"bar"}' | oq .["foo"]
jq: error: foo/0 is not defined at <top-level>, line 1:
.[foo]
jq: 1 compile error

Can deff get that fixed. Seems the quotes aren't making it to jq invocation.

EDIT: NVM, this isn't a bug, just a case where you need to quote the filter, i.e. '.["foo"]'.

EDIT2: oq error: Error in attribute is a bug tho I think. Seems to be related to it trying to write an attribute to an element that doesn't exist or something.
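The quoting behaviour can be reproduced without oq or jq. A minimal Python sketch, assuming shlex.split as an approximation of POSIX shell word splitting and quote removal (it does not model globbing or other expansions):

```python
import shlex

# Unquoted filter: the shell treats the double quotes as its own
# syntax and removes them before the program ever sees the argument.
unquoted = shlex.split('oq .["foo"]')

# Single-quoted filter: everything inside reaches the program intact.
quoted = shlex.split("""oq '.["foo"]'""")

print(unquoted)
print(quoted)
```

The first call yields the argument `.[foo]`, which is the text jq complained about above; the second yields `.["foo"]`.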

LoganBarnett commented 3 years ago

Apologies if I sound repetitious here - this took me some time to grok and even more time to figure out how to convey it without sounding like a long winded spec.

The namespace is the URL, which has a very static semantic in XML. If I am using elements that are bound to the namespace https://hot, then all document producers are expected to be annotated with that namespace.

Here's some documents, which are all semantically similar:

<?xml version="1.0" ?>
<foo xmlns="https://foo-namespace">
  bar
</foo>

<?xml version="1.0" ?>
<f:foo xmlns:f="https://foo-namespace">
  bar
</f:foo>

<?xml version="1.0" ?>
<foo:foo xmlns:foo="https://foo-namespace">
  bar
</foo:foo>

However this one is not a node of the namespace https://foo-namespace and therefore is not semantically equivalent, even though it looks pretty similar:

<?xml version="1.0" ?>
<foo>
  bar
</foo>

Additionally, I could use the same prefix but a different namespace and semantically the documents are still different. The document below is not semantically the same as this post's first examples. Its prefix is the same, but the prefix is irrelevant and all that matters is the namespace.

<?xml version="1.0" ?>
<f:foo xmlns:f="https://bar-namespace">
  bar
</f:foo>

XPath query tools (like nokogiri) allow one or more namespaces to be declared. This namespacing isn't for the document but instead for the query.

Suppose we have this document, and we want to get the contents of the bar node. Let's call this one foo-bar.xml:

<?xml version="1.0" ?>
<f:foo xmlns:f="https://foo-namespace">
  <f:bar>
    baz
  </f:bar>
</f:foo>

We need some mechanism or notation to indicate that we're looking for a bar + namespace="https://foo-namespace" which is under foo + namespace="https://foo-namespace". The + namespace="https://foo-namespace" section is not some official notation but just a way of showing these nodes are decorated with this additional namespace information, which could distinguish them potentially from other foo and bar nodes. Plus tools like XPath don't have any other means in which to say "I want the node from this namespace". Instead they use a mirrored feature for their queries which allows them to configure and use their own prefixes.

Using our document directly above, we could use this oq query to get "baz":

oq -i xml -o xml --xml-root '' --xmlns 'a=https://foo-namespace' '.["a:foo"] .["a:bar"]' < foo-bar.xml

Now the curve-ball, which introduces the need to support multiple namespaces. Suppose bar is actually under the namespace https://bar-namespace, with the document now like this (let's call it foo-bar-independent.xml):

<?xml version="1.0" ?>
<f:foo xmlns:f="https://foo-namespace">
  <b:bar xmlns:b="https://bar-namespace">
    baz
  </b:bar>
</f:foo>

Now we need an additional namespace to annotate our query with. Here's where the suggestion for multiple --xmlns flags comes in:

oq -i xml -o xml --xml-root '' --xmlns 'a=https://foo-namespace' --xmlns 'b=https://bar-namespace' '.["a:foo"] .["b:bar"]' < foo-bar-independent.xml

This way our query has a way to align the namespaces (which are just URLs). Does that kind of make sense? Feelings of revulsion aside ;) IMO this is very convoluted. In the wild I'm seeing varied styles of annotating the namespace, so I have reason to believe using some random prefix is quite common, and life seems like it would be much easier if we just didn't have them :\

LoganBarnett commented 3 years ago

I'll also add that declaring a default namespace also eliminates the need for a prefix entirely. So not only is the prefix completely arbitrary but it might not even be there.

Blacksmoke16 commented 3 years ago

@LoganBarnett Thanks for the explanation. I see what you want now.

If you know the structure of the XML document (as you would need to in order to setup the mappings), why couldn't you just do like:

./bin/oq -i xml -o xml --no-prolog --xml-root '' --xmlns '.["f:foo"] | .["b:bar"] | .["#text"]' < foo-bar.xml

LoganBarnett commented 3 years ago

The problem is that semantically correct documents may not use prefixes f, b, or any prefix whatsoever. That said, if oq emits a document whose prefixes are omitted from the node names but the namespace attributes are still preserved, we could reasonably process all semantically correct documents.

Then we could really have a .foo.bar.["#text"] to query it and we wouldn't need a namespace mapping. It wouldn't be truly semantically correct in all cases because hypothetically there could be multiple foo nodes which have different namespaces. I don't know how much that comes up in the wild, but I guess that it's very infrequent - enough to justify pushing off to a future date.

LoganBarnett commented 3 years ago

I think I've been going in circles with my explanation re: prefixes and namespaces a bit. A prefix could be likened to a variable and the namespace the variable's value. The variable (prefix) could be named anything, and in some cases isn't present at all. The value (namespace) is all we truly care about, and the variable (prefix) is just internal machinations that help us make sense of the code.

The mapping we could provide oq via xmlns confuses things a bit in understanding this topic, since it approaches namespaces and prefixes from a different angle (it's from the perspective of the query, not the document itself). We don't truly need oq to support namespace mappings for most cases, but I think there's value in seeing the model in its most precise form and then working back.

EDIT: Specify "it" in regards to an ideal vs practical implementation.

Blacksmoke16 commented 3 years ago

@LoganBarnett Ohh ok now i understand where the mapping fits in. The gist of it is you need a way to agnostically navigate the tree based on the namespace href no matter what the prefix is, if any. I.e. normalizing semantically equivalent elements into a standardized one that's easier to query.
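Roughly what I have in mind, sketched in Python rather than Crystal so it is easy to try out. This is not the oq implementation: `aliased_key` and `to_aliased` are hypothetical helpers, attributes and repeated siblings are ignored, and text is stripped for brevity.

```python
import xml.etree.ElementTree as ET


def aliased_key(tag, aliases):
    """Rewrite an ElementTree "{uri}local" tag as "alias:local"."""
    if not tag.startswith("{"):
        return tag  # element is in no namespace
    uri, local = tag[1:].split("}", 1)
    # Namespaces without a configured alias keep their expanded form.
    return f"{aliases[uri]}:{local}" if uri in aliases else tag


def to_aliased(element, aliases):
    """Convert an element into nested dicts keyed by aliased names."""
    children = list(element)
    if not children:
        return {aliased_key(element.tag, aliases): (element.text or "").strip()}
    merged = {}
    for child in children:
        merged.update(to_aliased(child, aliases))
    return {aliased_key(element.tag, aliases): merged}


# Three spellings of the same element: default namespace, prefix f, prefix foo.
DOCUMENTS = [
    '<foo xmlns="https://foo-namespace">bar</foo>',
    '<f:foo xmlns:f="https://foo-namespace">bar</f:foo>',
    '<foo:foo xmlns:foo="https://foo-namespace">bar</foo:foo>',
]
ALIASES = {"https://foo-namespace": "a"}

results = [to_aliased(ET.fromstring(doc), ALIASES) for doc in DOCUMENTS]
print(results)
```

All three documents normalize to the same `"a:foo"` key, so one filter such as `'.["a:foo"]'` could address any of them.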

I'll have to play around with what that implementation would look like. Might actually be pretty easy.

LoganBarnett commented 3 years ago

I think we've achieved a mind meld! :D

> Might actually be pretty easy.

Sweet!

You've had loads of patience through this. Thanks so much :)

I'm about to wrap up for the day but I think I could contribute some additional specs at the very least.

Blacksmoke16 commented 3 years ago

@LoganBarnett I pushed up dba7436 (#89), give that a try. I ended up keeping the --xmlns and adding another option to specify the aliases. Still need to add specs and probably raise if aliases are provided and not --xmlns but this should be in a working state to test with.

LoganBarnett commented 3 years ago

@Blacksmoke16 I wasn't able to get to this today due to some frantic work needs. I should be able to try it out tomorrow. Thanks!

LoganBarnett commented 3 years ago

Apologies for my absence.

I managed to do a quick test with the build. The namespaces are preserved nicely! I think my query is correct on the second example with .["a:foo"]. I did try .foo and that didn't work, which I think is reasonable based on my understanding.

$ bin/oq -i xml -o xml --xml-root '' --xmlns --namespace-alias 'a=https://foo-namespace' --namespace-alias 'b=https://bar-namespace' '.' < foo-bar-independent.xml
<?xml version="1.0" encoding="UTF-8"?>
<a:foo xmlns:a="https://foo-namespace">
  <b:bar xmlns:b="https://bar-namespace">
    baz
  </b:bar>
</a:foo>

$ bin/oq -i xml -o xml --xml-root '' --xmlns --namespace-alias 'a=https://foo-namespace' --namespace-alias 'b=https://bar-namespace' '.["a:foo"]' < foo-bar-independent.xml
oq error: Error in attribute

Thoughts?

This is the document I'm working with. The output from the first example is perfect. Thank you!

<?xml version="1.0" ?>
<f:foo xmlns:f="https://foo-namespace">
  <b:bar xmlns:b="https://bar-namespace">
    baz
  </b:bar>
</f:foo>

Blacksmoke16 commented 3 years ago

@LoganBarnett Glad to hear!

Regarding oq error: Error in attribute, the reason is more clear when you look at the JSON representation of the document:

{
    "@xmlns:a": "https://foo-namespace",
    "b:bar": {
        "@xmlns:b": "https://bar-namespace",
        "#text": "\n    baz\n  "
    }
}

Notice there is the xmlns:a attribute, but because you have --xml-root '' there is no element that it could be added to. I'm struggling a bit on if there's anything I can do regarding that, or if it would even be worth it.

LoganBarnett commented 3 years ago

@Blacksmoke16 I did the --xml-root '' to avoid the <root></root> sammich, since that changes the structure of the document. Is there something else I should be doing? Apologies if I missed something obvious.

Blacksmoke16 commented 3 years ago

@LoganBarnett The problem is xmlns:a="https://foo-namespace" should go on that <root> element. But since you're omitting the root element there's nowhere to put that attribute, hence the error.

To solve this I'd either have to produce a better error when this happens, or like maybe keep track when you're in the root context to know if those attributes should be skipped. :shrug:

EDIT: XML requires there be a root element, so in this case by excluding the root element you're causing it to try and generate invalid XML. A clearer error would prob be the better solution.
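For illustration only, a clearer error could come from a check like the following. It is sketched in Python, not taken from oq; `check_has_owner` and its message are hypothetical, and the data is the JSON representation shown above.

```python
def top_level_attributes(data):
    """Return the '@'-prefixed keys that sit directly on the top-level object."""
    return sorted(key for key in data if key.startswith("@"))


def check_has_owner(data, root_name):
    """Raise a descriptive error when attributes have no element to live on."""
    orphans = top_level_attributes(data)
    if orphans and not root_name:
        raise ValueError(
            f"cannot write {orphans}: no root element to attach attributes to"
        )
    return root_name


QUERY_RESULT = {
    "@xmlns:a": "https://foo-namespace",
    "b:bar": {"@xmlns:b": "https://bar-namespace", "#text": "\n    baz\n  "},
}

# With a root element name the attribute has somewhere to go.
accepted = check_has_owner(QUERY_RESULT, "a:foo")

# With an empty root name (as with --xml-root '') it does not.
try:
    check_has_owner(QUERY_RESULT, "")
    message = None
except ValueError as error:
    message = str(error)

print(message)
```

The point is only that the failure can name the offending attribute instead of reporting a bare "Error in attribute".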

LoganBarnett commented 3 years ago

@Blacksmoke16 sorry, work got nuts there for a bit.

This makes sense to me, I think. I get that XML (and thus oq) requires a root element. oq can't assume that the jq query will always return a single node/element (as my query + document does here). oq addresses this by introducing the <root> node, and then you have --xml-root to override this. This only comes up when the query is not .

Using the same document as before, without --xml-root works dandy:

$ bin/oq -i xml -o xml --xmlns --namespace-alias 'a=https://foo-namespace' --namespace-alias 'b=https://bar-namespace' '.["a:foo"]' < foo-bar-independent.xml
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="https://foo-namespace">
  <b:bar xmlns:b="https://bar-namespace">
    baz
  </b:bar>
</root>

If I make --xml-root 'a:foo' then basically I'm back to the same document.

$ bin/oq -i xml -o xml --xml-root 'a:foo' --xmlns --namespace-alias 'a=https://foo-namespace' --namespace-alias 'b=https://bar-namespace' '.["a:foo"]' < foo-bar-independent.xml
<?xml version="1.0" encoding="UTF-8"?>
<a:foo xmlns:a="https://foo-namespace">
  <b:bar xmlns:b="https://bar-namespace">
    baz
  </b:bar>
</a:foo>

I think this works in a variety of cases - one of mine is that I'm sorting some sibling nodes in a document. I think I have it broken into two queries - one to do the sort and the other to assign it back. Doing both was problematic. I think in this case here I'd just sed out the root element, probably using a relatively safe unique identifier.

We should be good here! Thanks so much :)

Would you be okay with me contributing a separate documentation pull request for this new functionality?

Blacksmoke16 commented 3 years ago

> We should be good here! Thanks so much :)

@LoganBarnett No problem!

> Would you be okay with me contributing a separate documentation pull request for this new functionality?

If you want sure, probably wouldn't hurt to just add it in the README.