Orchestrator 1.0.0-alpha is over-keen on CDATA sections

m8pple commented 3 years ago

When parsing the v4 file one-dev.xml from the PEP20 repository, it complains about missing CDATA sections.

Checking client file /mnt/e/dt10_all/POETS/poets_improvement_proposals/proposed/PIP-0020/xml/ic/tests/valid/L3-compilation/one-dev.xml
against grammar file ../Config/V4Grammar3.ocfg

(lin,col) refers to element closure location in client file

Structural analysis:

  1. (   4, 21) : Element /mnt/e/dt10_all/POETS/poets_improvement_proposals/proposed/PIP-0020/xml/ic/tests/valid/L3-compilation/one-dev.xml.Graphs.GraphType.Properties 
                  contains 0 instance(s) of element "CDATA": this (0) should be exactly 1

  2. (   5, 22) : Element /mnt/e/dt10_all/POETS/poets_improvement_proposals/proposed/PIP-0020/xml/ic/tests/valid/L3-compilation/one-dev.xml.Graphs.GraphType.SharedCode 
                  contains 0 instance(s) of element "CDATA": this (0) should be exactly 1

<snip>

  8. (  14, 34) : Element /mnt/e/dt10_all/POETS/poets_improvement_proposals/proposed/PIP-0020/xml/ic/tests/valid/L3-compilation/one-dev.xml.Graphs.GraphType.DeviceTypes.DeviceType.OnHardwareIdle 
                  contains 0 instance(s) of element "CDATA": this (0) should be exactly 1

  9. (  15, 32) : Element /mnt/e/dt10_all/POETS/poets_improvement_proposals/proposed/PIP-0020/xml/ic/tests/valid/L3-compilation/one-dev.xml.Graphs.GraphType.DeviceTypes.DeviceType.OnDeviceIdle 
                  contains 0 instance(s) of element "CDATA": this (0) should be exactly 1

...Client file /mnt/e/dt10_all/POETS/poets_improvement_proposals/proposed/PIP-0020/xml/ic/tests/valid/L3-compilation/one-dev.xml
 exhibits 9 accumulated (syntax/structure) errors in 4655808 msecs

It also thinks it took 1.2 hours to do so :)

Explanation (mainly aimed at ADB)

CDATA sections are almost never required as part of a grammar, as they are intended to be mostly invisible to the generators and consumers of XML. They are just there to make escaping easier, so they are the equivalent of writing

"My name is "\bob\""

versus

'My name is "bob"'

in python.

Or writing:

"My name is "\bob\""

versus

R"(My name is "bob")";

in C++11.

From a compliant XML parsers point of view, the following should all be treated as equivalent:

<Element />
<Element></Element>
<Element><![CDATA[]]></Element>
<Element><![CDATA[]]><![CDATA[]]></Element> They are all just an element with no children (either text or element). Only if you use deeper parsing APIs does one care whether you have a CDATA section, multiple CDATA sections, or none.

If empty CDATA sections are required then it could become very tricky to generate valid files from languages like python and JavaScript, as they try to hide as much as possible. All you do is append string content to an element, and then the generator decides how to output it.

Fix

The easiest way of fixing this is to change Config/V4Grammar3.ocfg so that instead of:

<Properties Properties="[0..1]"><CDATA CDATA="[1]"/></Properties>

it has:

<Properties Properties="[0..1]"><CDATA CDATA="[0..1]"/></Properties>

No idea if this causes problems later on, but it gets things parsing.

m8pple commented 3 years ago

This was incorrect (https://github.com/POETSII/Orchestrator/issues/201#issuecomment-844425443):

~~Though the suggested fix doesn't deal with completely valid things like handlers which are not in a CDATA section,~~ ~~or multiple CDATA sections that should be appended.~~

m8pple commented 3 years ago

Updated: this was incorrect due to the restriction that there is only one CDATA section, see https://github.com/POETSII/Orchestrator/issues/201#issuecomment-844425443.

It does mean that certain C code cannot be expressed in a handler section but that's not a big problem.

The idea of all text being held in a CDATA seems baked deep into the custom XML parser, for example at:

https://github.com/POETSII/Orchestrator/blob/6c6daa6f29eabebed4516bad25540a51ffde1504/Source/OrchBase/XMLProcessing/DS_XML.cpp#L106

Looking around, I'm not sure if the parser does entity decoding properly either. i.e. if you don't have a CDATA section, does it deal with & and so on?

An interesting test would be whether it would handle code like x=a[b[i]]>3;, which is valid code, but would need to be weirdly escaped in a CDATA. So something like:

<Handler>
<![CDATA[
x=a[b[i]]
]]>
<![CDATA[
>3;
]]>
</Handler>

Normally this would be put outside a CDATA, something like:

<Handler>
x=a[b[i]] &gt; 3;
</Handler>

Or a human would just be a space in, so ]]> converted to ]] >.

m8pple commented 3 years ago

This is partially fixed by 90ecbf7ebfca74c374beb1e8e352f79e2c3ef879

m8pple commented 3 years ago

I was partially talking crap, as a detailed reading of the spec says that if there is source code then it is in a CDATA section:

https://github.com/POETSII/poets_improvement_proposals/blob/2a355282f3aa3912bdacf4ec50bf25199bdf310d/proposed/PIP-0020/virtual-graph-schema-v4.rnc#L111-L126

I think this language was in there specifically because people wanted to roll their own XML parser.

However, the fix suggested above still applies: it is valid for an element with no source code to not have a CDATA section.

heliosfa commented 3 years ago

We will make CDATA optional and look at your fix for the other issue.

m8pple commented 3 years ago

The proposed patch from last month for making CDATA optional is included in #264

heliosfa commented 3 years ago

resolved in #264

POETSII / Orchestrator

Orchestrator 1.0.0-alpha is over-keen on CDATA sections #201

Explanation (mainly aimed at ADB)

Fix