Doc element and embedded HTML/XML

smizell commented 9 years ago

The doc element allows for HTML, Markdown, or plain text. This means that its contents should always either be encoded or wrapped with CDATA when used within XML.

This may be implied with using XML, though some may think the spec allows for including HTML elements directly in the document. Just mentioning in case it should be explicitly mentioned.

My only issue here is a cosmetic one. For instance, CDATA preserves white space, so you have to do:

<alps version="1.0">
    <doc format="html"><![CDATA[<h1>Preserved White Space</h1>
<p>Preserved whitespace means my document is not as easy to read :)</p>
]]></doc>
    <descriptor id="foo" type="semantic" />
</alps>

This is even more important with Markdown (because Markdown can include HTML). Even if you didn't use CDATA, you would still need to format it the way shown below.

<alps version="1.0">
    <doc format="markdown"><![CDATA[# Heading

* Foo
* Bar
]]></doc>
    <descriptor id="foo" type="semantic" />
</alps>

Maybe ALPS is meant to be read primarily by machines, and if so, this is understandable.
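To make the whitespace point concrete, here is a minimal Python sketch of one way a consumer could tolerate an indented CDATA block. It assumes the consumer (not the spec) chooses to strip the common indentation with `textwrap.dedent`; the helper name `read_doc` and the `"text"` default for a missing format are illustrative assumptions.

```python
import textwrap
import xml.etree.ElementTree as ET

# The author indents the CDATA content so the profile stays readable.
ALPS_XML = """<alps version="1.0">
    <doc format="markdown"><![CDATA[
        # Heading

        * Foo
        * Bar
    ]]></doc>
    <descriptor id="foo" type="semantic" />
</alps>"""

def read_doc(xml_text):
    """Return (format, text) for the top-level doc element (hypothetical helper)."""
    doc = ET.fromstring(xml_text).find("doc")
    raw = doc.text or ""
    # CDATA keeps every space and newline, so the indentation is undone here.
    return doc.get("format", "text"), textwrap.dedent(raw).strip()

fmt, text = read_doc(ALPS_XML)
print(fmt)
print(text)
```

Relative indentation inside the block (nested lists, indented code) survives the dedent, which is why this may be gentler on Markdown than stripping each line.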

mamund commented 9 years ago

SM:

good stuff.

1) i totally agree that space-preservation is an issue worth noting (both for XML and HTML)
2) i don't want to require using CDATA (open for discussion on this)
3) i assume this would all "be the same" for JSON representations (e.g. when you want to smuggle XML docs in a JSON ALPS representation)
4) we proly need some language somewhere. either in the spec [couched in MAY... terms] and/or some other best-practice document [couched in "it is wise to keep in mind..." kind of terms].

also, feel free to work up a PR that fixes this (e.g. speaks to the value of using CDATA and couches it in spec terms -- "Note that when the format property of a doc element is set to "XML" or "HTML" the value of the doc element MAY be wrapped in a CDATA block" -- or something like that).

there may be other "BCP-style" (best common practice) stuff and that might mean we should put together an additional document (not IETF-style, just asciidoc) to capture that info.

cheers.

smizell commented 9 years ago

i don't want to require using CDATA (open for discussion on this)

I would guess that when you have HTML or XML as the value for a doc, you either must use CDATA or encode it (like &lt;foo&gt;bar&lt;/foo&gt;). How you do this may depend on if you want to have some XSD that describes the ALPS XML. If you do, you would end up defining the contents of doc as a string, which would restrict people from using actual XML as its value.

When it comes to Markdown, I believe the rule still applies, though white space is important.
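As a quick check of the "CDATA or encode it" choice, the following Python sketch parses the same HTML fragment written both ways with the standard library's `xml.etree.ElementTree`. The sample strings are made up for illustration; the point is only that a parser hands back the same plain string either way.

```python
import xml.etree.ElementTree as ET

# Same HTML fragment, once CDATA-wrapped and once entity-encoded.
CDATA_FORM = '<doc format="html"><![CDATA[<p>Hello & welcome</p>]]></doc>'
ESCAPED_FORM = '<doc format="html">&lt;p&gt;Hello &amp; welcome&lt;/p&gt;</doc>'

cdata_text = ET.fromstring(CDATA_FORM).text
escaped_text = ET.fromstring(ESCAPED_FORM).text

# Both forms decode to the identical string, with no child elements.
print(cdata_text == escaped_text)
print(cdata_text)
```

So the choice between the two is an authoring convenience; a consumer that treats doc as a string does not need to know which one was used.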

i assume this would all "be the same" for JSON representations

Smuggling XML only requires you to escape any double quotes in the string's value (I believe).
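For the JSON side, a small Python sketch using the standard `json` module shows what actually gets escaped. The `alps`/`doc`/`value` layout follows the doc.value naming used in this thread; the HTML sample is invented. Besides double quotes, backslashes and control characters such as newlines are escaped as well, while angle brackets pass through untouched.

```python
import json

# HTML containing double quotes, a newline, and a backslash.
html = '<p class="note">Line one</p>\n<p>C:\\temp</p>'
alps = {"alps": {"version": "1.0", "doc": {"format": "html", "value": html}}}

# json.dumps applies all the escaping a JSON string needs.
encoded = json.dumps(alps)
print(encoded)

# Decoding returns the original markup unchanged.
decoded = json.loads(encoded)["alps"]["doc"]["value"]
print(decoded == html)
```

In practice a serializer handles this, so hand-escaping is only a concern for profiles written by hand.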

we proly need some language somewhere. either in the spec [...] and/or some other best-practice document [..]

I think you could add a little clarity in the spec that the doc should be treated as a string (which an XSD would also help with) and then have a best practices writeup if confusion arises. I think for those who use XML a lot, this issue with string values and XML-specific characters will be a common occurrence. Even a simple example somewhere showing HTML included in a doc would go a long way.

With that said, I'm no XML expert, so I don't know the language to be used here. Maybe "string" is the wrong term for the contents of doc, though it is a data type in XML schema.

mamund commented 9 years ago

How you do this may depend on if you want to have some XSD

again, i am not sure setting a requirement (via XSD or any other validator format) is a good idea. I'd like to allow ppl to do what works best/easiest for the tooling at hand.

Smuggling XML only requires you to escape any double quotes in the string's value (I believe).

which is why i don't want to set any spec standards on how the value of DOC "looks" -- does that make sense?

what i'm saying here is i am not (yet) seeing a direct harm or reason to constrain authors' choices here.

essentially, i think that this is an "author choice" issue -- one that does not require a spec change, but SHOULD appear in a BCP doc.

i could be missing something, tho.

fosrias commented 9 years ago

I'd like to allow ppl to do what works best/easiest for the tooling at hand.

Well, it seems to me, in the case of XML in particular, that most tooling has a good chance of choking on html in an xml tag or html in markdown.

I think we need some statement about the XML parsing safety of the contents of the doc element, i.e. the contents of a doc element SHOULD (MUST would be better) be safe for standard XML parsing, or something to that effect, to give direction to authors to not screw everyone with a profile that chokes common tools.
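The "choking" case is easy to reproduce. This Python sketch, using only `xml.etree.ElementTree`, feeds a parser a profile with ordinary HTML placed directly in doc and the same profile with the HTML wrapped in CDATA. The two sample documents and the helper name `parses` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Ordinary HTML (<br> is never closed) placed directly inside doc.
RAW_HTML = '<alps version="1.0"><doc format="html"><p>One<br>Two</p></doc></alps>'
# The same HTML wrapped in CDATA.
WRAPPED = '<alps version="1.0"><doc format="html"><![CDATA[<p>One<br>Two</p>]]></doc></alps>'

def parses(xml_text):
    """Return True if a standard XML parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(parses(RAW_HTML))  # False: the unclosed <br> makes the document ill-formed
print(parses(WRAPPED))   # True
```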

mamund commented 9 years ago

ok, now we're getting somewhere.

this sounds to me as if we want to say that an ALPS document MUST NOT contain invalid XML -- right? that goes for both XML and JSON variants, i assume.

I can get behind that, just fine.

fosrias commented 9 years ago

Yep. That is the point. @smizell Like to propose some text/open a PR?

mamund commented 9 years ago

sounds good to me.

smizell commented 9 years ago

First, my question would be then, is the value of a doc a string, or is it up to the user? From the above it sounds like it should be up to the user when the user wants to include their own XML. The spec currently does not allow for XML as a format for the doc element, and specifies that any format not listed should be treated as plain text. Should there be an XML format? For me, I'm unsure if there should be, given the thoughts below.

Next, in regards to the XSD, a couple of thoughts. As for above, if you define the doc as a string and provide an XSD, you are simply defining the schema for the ALPS XML format. This shouldn't prevent a user from creating their own XSD that extends the base ALPS XSD. This allows the ALPS spec to be a standalone spec while allowing others to take ownership of it and create their own schemas (I think).

Additionally, I believe XML has namespacing that can help solve these kinds of things anyway, but again, I'm not an expert :) I believe, instead of leaving off the schema for the spec, users could extend what is already there and include their own namespace inline with the xmlns attribute. Is this correct? Basically, you would still allow for extendability, but doing so the XML way.

mamund commented 9 years ago

so, let's get down to basics. what are we trying to prevent/enable?

i think we're trying to prevent invalid XML, but maybe not.

maybe we're trying to make sure the contents of the DOC element are always valid according to the @format property of the DOC element.

maybe we're trying to prevent/enable something else?

before we talk about "how" to do this (or what validator, etc. to use), let's be sure we know what we're trying to accomplish here.

fosrias commented 9 years ago

IMO, the key issue here is that standard tooling does not choke in trying to parse/process an ALPS document.

That includes:

Preventing invalid documents that don't parse, be they XML or JSON. Presumably this includes preventing invalid embedded XML in JSON and vice versa.
Preventing invalid content in the doc elements per @format property (this to a lesser extent IMO) so that it can be rendered directly with tools that can handle the format.

I think giving direction here is appropriate as authors may not recognize these gotchas directly.

mamund commented 9 years ago

yep - that all makes sense.

Preventing invalid documents that don't parse - if we use "valid document" that seems to make sense. to take a Postel POV[1]: so, maybe something like "ALPS document authors (including generators) MUST NOT produce invalid documents (e.g. invalid JSON, invalid XML). Document consumers SHOULD reject any document that is invalid (XML/JSON)."
Preventing invalid content in the doc elements per @format property - I think the above covers much of the danger inherent in the value property of the doc element. We may still want to give some BCP-style guidance on how best to insert anything other than text/plain within the value property of the doc element. the existing fallback in the docs (when the consumer does not recognize/support the format property value, the content SHOULD be treated as text/plain) might be extended to include cases where the content of the doc.value property cannot be parsed when using the value of the doc.format property. IOW, in case of misunderstanding or error, treat the content in doc.value as text/plain.

is this heading in the same direction you two are thinking?

[1] "Be strict in what you send and liberal in what you accept" (paraphrasing here)
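The extended fallback described above could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the `CHECKERS` registry, the function names, and especially `check_html`, which uses XML well-formedness as a crude stand-in for "this content can be parsed as HTML" (real HTML such as an unclosed `<br>` would fail it).

```python
import xml.etree.ElementTree as ET

def accept_any(value):
    """Plain text and Markdown: every string is acceptable."""

def check_html(value):
    """Crude stand-in: accept only fragments that are well-formed XML."""
    ET.fromstring(f"<div>{value}</div>")  # raises ET.ParseError otherwise

# Hypothetical registry: format name -> check that raises on bad content.
CHECKERS = {"text": accept_any, "markdown": accept_any, "html": check_html}

def effective_format(fmt, value):
    """Return the format a consumer should actually render `value` with."""
    name = (fmt or "text").lower()
    checker = CHECKERS.get(name)
    if checker is None:
        return "text"  # unrecognized format: the existing fallback
    try:
        checker(value)
    except ET.ParseError:
        return "text"  # proposed extension: content fails its own format
    return name

print(effective_format("html", "<p>ok</p>"))     # html
print(effective_format("html", "<p>broken"))     # text
print(effective_format("foobar", "anything"))    # text
```

The design choice is that a consumer never raises on doc content; the worst outcome is that markup is shown as plain text.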

smizell commented 9 years ago

This goes back to my question above, about how you would like to give direction in XML. Do you want to say that the value of doc is a string and the format attribute gives a hint on how to parse it? Or do you want to allow actual XML as the content along with string (making this some kind of enumeration of string and XML)? This may be important, because someone writing an ALPS XML library would have to accommodate both types otherwise things break.

Also, is this the XML way of allowing for extendability? For instance, from above @mamund said:

might be extended to include cases where the content of the doc.value property cannot be parsed

I think parsing a string based on the format and parsing actual XML are two different scenarios. The format may be foobar, but if the content is a string I don't care. But if the content is XML, my XML parser will parse it and the result will be a different type. Hope that makes sense.

Lastly, is this same kind of extendability there for JSON? In other words, can someone put something other than a string in the value of a doc object? if so, parsers should be aware that this can happen I would say.

fosrias commented 9 years ago

how you would like to give direction in XML

Well, per Mike's prior comments and his wording here, it covers both. It has to parse, so either it includes XML directly or it includes CDATA. We are un-opinionated. Just make sure the whole thing is valid for its base media type.

Am I missing something? Seems Mike's wording leaves it up to the author however he decides to ensure it is valid. You as a client are going to have to introspect the doc element to figure out how to handle them regardless.

My concern is not the post-introspection chain. It is that you can't even get to introspection because your parsing library fails, or misinterprets a document that is "valid" but too malformed to be unambiguous (e.g. HTML in doc).

is this same kind of extendability there for JSON

I think whatever is in the doc element in JSON has to be a string (i.e. in quotes regardless). The @format property needs to tell you how to handle the string if you want to parse it. That is, HTML or Markdown will be present as a string.

Again, the generic wording samples Mike proposes address this as well (with maybe a little expansion in the same spirit).

mamund commented 9 years ago

I'll add that Stephen has a good point. do we want to REQUIRE that -- in the XML variant of ALPS, the content of doc.value be a string? doing this means that the content of doc.value is NOT part of the document -- just a string pile to be processed per the format property.

at first pass, this seems like the right approach. i haven't worked out the pro/con details, tho.

for the XML variant, this would be important for all values of doc.format BTW. e.g. "The content of doc.value MUST be treated as a simple string which SHOULD be parsed based on the value of doc.format." or something like that. this would be the same in the JSON variant (as we would not [i assume] want folks to smuggle JSON objects inside the doc.value element, either).

@smizell : am i finally getting up to speed on your POV?

smizell commented 9 years ago

@mamund Yes, you got it :) Keeping doc as string allows you to put anything you want in there and rely on parsing instructions from the given format. This applies to XML as well, as you could have a format="xml" and then wrap the XML in CDATA or encode it. That's very simple, and it allows for anything a user may want, really.

But as mentioned, allowing for literal XML means that my parser has to treat it as a string OR any simple/complex XML type. This is a step up from relying on format for parsing, because any XML parser will go ahead and parse the contents of doc. That is a lot more checking to do.

this would be the same in the JSON variant

Yes. You could put anything you want into the value there so long as it's valid JSON, even stringified JSON. But like with allowing for XML in doc, if you allow for actual JSON as the value, you run into an entirely new world of parsing. That would mean a parser would have to check for every JSON type and handle it accordingly, which would mean you want to give some instructions/guidance for when you get, say, an array for a value. Making it always a string simplifies those instructions to "parse if you can."

The advantage here I see of requiring a string is that you can now do a schema for both XML and JSON, because after a quick glance I don't see other gray areas. That can be very powerful, as a parser could check not only the validity of the XML, but also the validity against the spec. A parser could then say, "Error: doc on line such and such should be a string, but it is a complex type." This is much nicer than, "Error: Object does not have method split" :)

It would seem that the solution here is:

The doc element's content MUST be a string.
The ALPS document must be valid XML, which may go without saying?

Schema files would be great, but are not required to solve this IMO.
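Even without schema files, the "doc MUST be a string" rule is cheap to check. The Python sketch below is one possible shape for such a check, not anything defined by the spec: the function names and error wording are hypothetical, the XML check flags any doc element that has child elements, and the JSON check flags any doc object whose value is present but not a string.

```python
import json
import xml.etree.ElementTree as ET

def xml_doc_errors(xml_text):
    """List doc elements whose content is markup rather than a string."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on invalid XML
    return [
        f"doc #{i}: should be a string, but it contains <{doc[0].tag}>"
        for i, doc in enumerate(root.iter("doc"), start=1)
        if len(doc) > 0  # child elements mean the content is not a string
    ]

def json_doc_errors(json_text):
    """List doc objects whose value is present but not a JSON string."""
    errors = []

    def walk(node):
        if isinstance(node, dict):
            doc = node.get("doc")
            if isinstance(doc, dict) and "value" in doc:
                if not isinstance(doc["value"], str):
                    kind = type(doc["value"]).__name__
                    errors.append(f"doc.value should be a string, but it is {kind}")
            for child in node.values():
                walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)

    walk(json.loads(json_text))  # raises json.JSONDecodeError on invalid JSON
    return errors

print(xml_doc_errors('<alps><doc format="html"><h1>Hi</h1></doc></alps>'))
print(json_doc_errors('{"alps": {"doc": {"value": ["a", "b"]}}}'))
```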

mamund commented 9 years ago

ok, took me a while, but i see your final points as good spec material. the first one ("doc.value MUST be a string") is a definite improvement. the second point may be bashing the reader a bit?

@fosrias: you ok w/ this?

if yes, i say @smizell should feel free to work that up into a PR and submit.

thanks for taking the time to walk me through this.

fosrias commented 9 years ago

The doc element's content MUST be a string.

The ALPS document must be valid XML

The latter is worth saying.

+1 by me.

filip26 commented 4 years ago

Hi, I would just propose a slight change:

The doc element's content MUST be represented as a string.

The wording could be improved, but the motivation is to allow a user to put any content that can be serialized as a string, even binary data (see #91).

mamund commented 3 years ago

cleaning this up in our AOOH...

MUST be valid XML
Discourage placing XML or HTML within the DOC Element
The doc element's content MUST be represented as a string.
Use of CDATA is RECOMMENDED

as far as schema goes, the DOC element is a string.
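For generators that want to follow the CDATA recommendation, here is a hedged Python sketch using the standard library's `xml.dom.minidom`. The helper name `build_alps` is an assumption, and the document it emits is deliberately minimal. The one subtlety it handles is that the sequence `]]>` cannot appear inside a CDATA section, so the text is split across adjacent sections at that point.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

def build_alps(doc_text, doc_format="html"):
    """Serialize a minimal ALPS document whose doc content is CDATA-wrapped."""
    dom = minidom.getDOMImplementation().createDocument(None, "alps", None)
    root = dom.documentElement
    root.setAttribute("version", "1.0")
    doc = dom.createElement("doc")
    doc.setAttribute("format", doc_format)
    # "]]>" would close a CDATA section early, so split the text there and
    # spread the three characters over two adjacent sections.
    parts = doc_text.split("]]>")
    for i, part in enumerate(parts):
        if i > 0:
            part = ">" + part
        if i < len(parts) - 1:
            part = part + "]]"
        doc.appendChild(dom.createCDATASection(part))
    root.appendChild(doc)
    return root.toxml()

xml_text = build_alps("<h1>Title</h1>\n<p>a ]]> b</p>")
print(xml_text)
# A standard XML parser hands the content back as one plain string.
print(ET.fromstring(xml_text).find("doc").text)
```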

mamund commented 3 years ago

closed in pull #105

alps-io / spec

Doc element and embedded HTML/XML #78