Open jadams-tresys opened 2 years ago
I am aware of this old ticket on Sourceforge: https://sourceforge.net/p/exificient/bugs/31/
Your suggested workaround of setting maxOccurs to unbounded will not work for our application, as the schemas we are creating are used in a secure environment and need to be as restrictive as possible. Also, even significant increases to the stack size did not prevent the overflow from occurring.
From an outsider's perspective it would seem that a relatively easy fix for this would be to treat maxOccurs values above some threshold (5 or 10?) as unbounded internally.
An example of one of our schemas that triggers this error can be found here: https://github.com/DFDLSchemas/NITF
It has an element with maxOccurs="999".
This is correct. We could do so, but it also means deviating from the EXI specification, which requires you to build all 998 grammars (which are the same) plus the last one, which is different.
Having said that, changing 999 to unbounded makes a valid EXI stream for all occurrences up to 998. Only the last occurrence (based on the EXI specification) looks different.
Did you try to increase the thread stack size (-Xss) in Java?
I did try to increase the stack size (setting it to 20M, which I believe is 20x the default). This simply made it take longer before overflowing.
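For reference, a minimal sketch of what the stack-size experiment can look like programmatically, as an alternative to the JVM-wide -Xss flag: the grammar build runs on a worker thread created with an explicitly requested stack size via the standard Thread(ThreadGroup, Runnable, String, long) constructor. The schema file name is only an illustration, and as noted above a larger stack merely delays the overflow rather than preventing it.

```java
import com.siemens.ct.exi.grammars.GrammarFactory;

public class GrammarBuildWithBigStack {
    public static void main(String[] args) throws InterruptedException {
        Runnable build = () -> {
            try {
                // hypothetical schema location, used only for illustration
                GrammarFactory.newInstance().createGrammars("nitf.dfdl.xsd");
                System.out.println("grammar build finished");
            } catch (Throwable t) { // also catches StackOverflowError
                System.err.println("grammar build failed: " + t);
            }
        };
        // request a 256 MB stack for this thread only; this is a hint to the JVM,
        // similar in spirit to starting the whole JVM with -Xss256m
        Thread worker = new Thread(null, build, "grammar-builder", 256L * 1024 * 1024);
        worker.start();
        worker.join();
    }
}
```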
I was unaware that the EXI spec mandated building all of the grammars. Would it be possible to simply reuse the same grammar for the first 998 and then create 1 more for the last one, or would that be considered breaking from the spec as well?
One may argue it is an implementation issue. I am pretty sure the way grammars work in EXIficient could also be implemented differently. In their current form, grammars may point to other grammar states without counting how often a given element has been seen before.
Comment: You could also consider using a slightly different schema without maxOccurs="999" for EXI encoding/decoding (if possible).
Moreover, I don't think it is good practice to use such high limits. Look at Xerces, which is maybe the most prominent XML validator in the Java world; my Oxygen IDE behaves similarly.
Note: I looked at https://github.com/DFDLSchemas/NITF and the schemas you provided. I think they are not complete (see the file nitf_extension_types.dfdl.xsd, which points to <xs:import schemaLocation="com/mitre/jpeg/xsd/jpeg.dfdl.xsd" />, which does not exist).
@danielpeintner the missing JPEG schema is at https://github.com/DFDLSchemas/JPEG.
(Lots of our schemas are layered like this container/payload idiom where one schema includes/imports numerous others that come from completely separate projects.)
Re: unbounded vs. 999 or other finite limit.
In the cyber security world, our products must satisfy certifying authorities who use tools that inspect XML schemas and reject anything unbounded as a potential denial-of-service vulnerability.
For example, for https://github.com/DFDLSchemas/mil-std-2045 schema, there are 7 uses of "unbounded" and we are required to remove them all.
In the short term it would be perfectly fine if EXIficient had a threshold above which a maxOccurs value would be treated as if it were "unbounded". I don't know the code base, but this seems like a relatively small change, along with a way to turn this behavior on.
Does this sound feasible to you?
(Lots of our schemas are layered like this container/payload idiom where one schema includes/imports numerous others that come from completely separate projects.)
Re: unbounded vs. 999 or other finite limit.
Unfortunately this has turned out to be an issue and was (is?) on the "ideas" list for an updated EXI version; see https://www.w3.org/XML/EXI/wiki/EXI2#.28Improvement.29_Memory-sensitive_EXI_Grammar_approach
Whether there will ever be a new version I cannot say.
In the short term it would be perfectly fine if EXIficient had a threshold above which a maxOccurs value would be treated as if it were "unbounded". I don't know the code base, but this seems like a relatively small change, along with a way to turn this behavior on.
Does this sound feasible to you?
Technically it is not a big issue. In fact it could easily be done by replacing any "larger number" with "unbounded" when reading the XML schema files. Having said that, this does not exist yet in the code, and I am not sure it should be part of the code. It could easily be put on top by creating the EXI grammars yourself, or by modifying the InputStream before passing it to https://github.com/EXIficient/exificient-grammars/blob/77413b7978c2b6fa4a97ab0c3debb16f12d2712f/src/main/java/com/siemens/ct/exi/grammars/GrammarFactory.java#L106
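A minimal sketch of that "put it on top" approach, assuming the createGrammars(InputStream) entry point linked above: every maxOccurs value above a chosen threshold is rewritten to "unbounded" before the schema is handed to the GrammarFactory. The regex-based rewrite, the threshold of 10, and the class name are illustrative assumptions only; a robust version would parse the XSD properly.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.siemens.ct.exi.grammars.GrammarFactory;

public class MaxOccursRelaxer {

    // matches maxOccurs="<number>" (single-quoted attributes are not handled here)
    private static final Pattern MAX_OCCURS = Pattern.compile("maxOccurs\\s*=\\s*\"(\\d+)\"");

    // replace every maxOccurs larger than the threshold with "unbounded"
    static String relaxMaxOccurs(String xsd, int threshold) {
        Matcher m = MAX_OCCURS.matcher(xsd);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group(1));
            String replacement = value > threshold ? "maxOccurs=\"unbounded\"" : m.group(0);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xsd = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        InputStream relaxed = new ByteArrayInputStream(
                relaxMaxOccurs(xsd, 10).getBytes(StandardCharsets.UTF_8));
        // same entry point as the GrammarFactory line linked above, just with the modified stream
        GrammarFactory.newInstance().createGrammars(relaxed);
    }
}
```

For schema sets split across several files (like the NITF/JPEG layering mentioned above), the rewrite would have to be applied to every imported or included document as well, or the grammars would have to be created from pre-processed copies of all files.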
Note: There is still the very low likelihood that an encoding with maxOccurs="1000" (changed to unbounded) differs, in the case where there are actually 1000 appearances.
@danielpeintner can you explain this statement just a bit further: Note: There is still the very low likelihood that an encoding with maxOccurs="1000" (changed to unbounded) differs in the case there are actually 1000 appearances. When maxOccurs is N and there are N occurrences, how could that be different from N occurrences when maxOccurs is unbounded?
I looked at that W3C wiki page, and that discussion does not mention such an issue (or I did not see it). It suggests that treating all maxOccurs > 1 as "unbounded" is plausible.
Let me use an XML example element (of an XML schema) to describe the differences.
<xs:element name="foo" minOccurs="0" maxOccurs="1000" />
The EXI spec states clearly that for each state there is a grammar:
g0:
SE(foo) -> g1 // it points to the next grammar
EE
g1:
SE(foo) -> g2 // again next grammar till we end up at 1000 occurrences of foo
EE
...
g999:
EE // no more SE in grammar
Note: The last grammar g999 is different since it ONLY allows for EE (EndElement); there is no more SE (StartElement) for foo. Since grammars cannot count, one needs this set of grammars to make sure that no more than 1000 foo elements appear.
Let's now change the example from maxOccurs 1000 to unbounded:
<xs:element name="foo" minOccurs="0" maxOccurs="unbounded" />
Now the set of grammars is different and very simple (it always points to itself).
g:
SE(foo) -> g // it points to itself
EE
Coming back to the issue I described: since the LAST grammar g999 looks slightly different, it also results in a different EXI encoding. Having said that, this is ONLY the case for the last grammar; all other grammars, like g998, look exactly the same and result in the same encoding.
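Purely as an illustration (toy classes, not EXIficient's actual grammar implementation), the structural difference described above can be sketched like this: the bounded schema needs a chain of distinct grammar states, while the unbounded one needs a single state whose SE(foo) production points back to itself.

```java
import java.util.ArrayList;
import java.util.List;

// toy stand-in for a grammar state: it either allows SE(foo) leading to another
// state (plus EE), or, if nextOnStartElement is null, only EE
class ToyGrammar {
    final String name;
    ToyGrammar nextOnStartElement;

    ToyGrammar(String name) {
        this.name = name;
    }
}

public class GrammarShapes {

    // mirrors the g0 ... g999 chain above: every state but the last points to the
    // next one on SE(foo); the last state has no SE(foo) production left
    static List<ToyGrammar> boundedChain(int numberOfStates) {
        List<ToyGrammar> states = new ArrayList<>();
        for (int i = 0; i < numberOfStates; i++) {
            states.add(new ToyGrammar("g" + i));
        }
        for (int i = 0; i < numberOfStates - 1; i++) {
            states.get(i).nextOnStartElement = states.get(i + 1);
        }
        return states;
    }

    // mirrors the unbounded case: a single grammar g whose SE(foo) points to itself
    static ToyGrammar unboundedLoop() {
        ToyGrammar g = new ToyGrammar("g");
        g.nextOnStartElement = g;
        return g;
    }

    public static void main(String[] args) {
        List<ToyGrammar> bounded = boundedChain(1000);
        ToyGrammar unbounded = unboundedLoop();
        System.out.println("maxOccurs=\"1000\":      " + bounded.size() + " grammar states (g0 .. g999)");
        System.out.println("maxOccurs=\"unbounded\": 1 grammar state, self-loop = "
                + (unbounded.nextOnStartElement == unbounded));
    }
}
```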
I hope this clarifies the issue I was pointing at.
Very excellent explanation. Thanks.