Open jadams-tresys opened 2 years ago
I am aware of this old ticket on Sourceforge: https://sourceforge.net/p/exificient/bugs/31/
Your suggested workaround of setting maxOccurs to unbounded will not work for our application, as the schemas we are creating are used in a secure environment and need to be as restrictive as possible. Also, even significant increases to the stack size did not prevent the overflow from occurring.
From an outsider's perspective it would seem that a relatively easy fix for this would be to treat maxOccurs values above some threshold (5 or 10?) as unbounded internally.
An example of one of our schemas that triggers this error can be found here: https://github.com/DFDLSchemas/NITF
It has an element with maxOccurs="999".
This is correct. We could do so, but it also means deviating from the EXI specification, which requires you to build all 998 grammars (which are the same) plus the last one, which is different.
Having said that, changing 999 to unbounded makes a valid EXI stream for all occurrences up to 998. Only the last occurrence (based on the EXI specification) looks different.
Did you try to increase the thread stack size (-Xss) in Java?
I did try to increase the stack size (setting it to 20M, which I believe is 20x the default). This simply made it take longer before overflowing.
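For reference, a minimal sketch of what the stack-size experiment can look like programmatically, as an alternative to the JVM-wide -Xss flag: the grammar build runs on a worker thread created with an explicitly requested stack size via the standard Thread(ThreadGroup, Runnable, String, long) constructor. The schema file name is only an illustration, and as noted above a larger stack merely delays the overflow rather than preventing it.

```java
import com.siemens.ct.exi.grammars.GrammarFactory;

public class GrammarBuildWithBigStack {
    public static void main(String[] args) throws InterruptedException {
        Runnable build = () -> {
            try {
                // hypothetical schema location, used only for illustration
                GrammarFactory.newInstance().createGrammars("nitf.dfdl.xsd");
                System.out.println("grammar build finished");
            } catch (Throwable t) { // also catches StackOverflowError
                System.err.println("grammar build failed: " + t);
            }
        };
        // request a 256 MB stack for this thread only; this is a hint to the JVM,
        // similar in spirit to starting the whole JVM with -Xss256m
        Thread worker = new Thread(null, build, "grammar-builder", 256L * 1024 * 1024);
        worker.start();
        worker.join();
    }
}
```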
I was unaware that the EXI spec mandated building all of the grammars. Would it be possible to simply reuse the same grammar for the first 998 and then create 1 more for the last one, or would that be considered breaking from the spec as well?
One may argue it is an implementation issue. I am pretty sure the way grammars work in EXIficient could also be implemented differently. In their current form, grammars may point to other grammar states without counting how often a given element has been seen before.
Comment: You could also consider using a slightly different schema without maxOccurs="999" for EXI encoding/decoding (if possible).
Moreover, I don't think it is good practice to use such high limits. Look at Xerces, which is maybe the most prominent XML validator in the Java world; my Oxygen IDE behaves similarly.
Note: I looked at https://github.com/DFDLSchemas/NITF and the schemas you provided. I think they are not complete (see the file nitf_extension_types.dfdl.xsd, which points to <xs:import schemaLocation="com/mitre/jpeg/xsd/jpeg.dfdl.xsd" />, which does not exist).
@danielpeintner the missing JPEG schema is at https://github.com/DFDLSchemas/JPEG.
(Lots of our schemas are layered like this container/payload idiom where one schema includes/imports numerous others that come from completely separate projects.)
Re: unbounded vs. 999 or other finite limit.
In the cyber security world, our products must satisfy certifying authorities who use tools that inspect XML schemas and reject anything unbounded as a potential denial-of-service vulnerability.
For example, for https://github.com/DFDLSchemas/mil-std-2045 schema, there are 7 uses of "unbounded" and we are required to remove them all.
In the short term it would be perfectly fine if EXIficient had a threshold above which a maxOccurs value would be treated as if it were "unbounded". I don't know the code base, but this seems like a relatively small change, along with a way to turn this behavior on.
Does this sound feasible to you?
(Lots of our schemas are layered like this container/payload idiom where one schema includes/imports numerous others that come from completely separate projects.)
Re: unbounded vs. 999 or other finite limit.
Unfortunately this has turned out to be an issue and was (is?) on the "ideas" list for an updated EXI version; see https://www.w3.org/XML/EXI/wiki/EXI2#.28Improvement.29_Memory-sensitive_EXI_Grammar_approach
Whether there will ever be a new version I cannot say.
In the short term it would be perfectly fine if EXIficient had a threshold above which a maxOccurs value would be treated as if it were "unbounded". I don't know the code base, but this seems like a relatively small change, along with a way to turn this behavior on.
Does this sound feasible to you?
Technically it is not a big issue. In fact it could easily be done by replacing any "larger number" with "unbounded" when reading the XML schema files. Having said that, this does not exist yet in the code, and I am not sure it should be part of the code. It could easily be put on top by creating the EXI grammars yourself, or by modifying the InputStream before passing it to https://github.com/EXIficient/exificient-grammars/blob/77413b7978c2b6fa4a97ab0c3debb16f12d2712f/src/main/java/com/siemens/ct/exi/grammars/GrammarFactory.java#L106
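A minimal sketch of that "put it on top" approach, assuming the createGrammars(InputStream) entry point linked above: every maxOccurs value above a chosen threshold is rewritten to "unbounded" before the schema is handed to the GrammarFactory. The regex-based rewrite, the threshold of 10, and the class name are illustrative assumptions only; a robust version would parse the XSD properly.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.siemens.ct.exi.grammars.GrammarFactory;

public class MaxOccursRelaxer {

    // matches maxOccurs="<number>" (single-quoted attributes are not handled here)
    private static final Pattern MAX_OCCURS = Pattern.compile("maxOccurs\\s*=\\s*\"(\\d+)\"");

    // replace every maxOccurs larger than the threshold with "unbounded"
    static String relaxMaxOccurs(String xsd, int threshold) {
        Matcher m = MAX_OCCURS.matcher(xsd);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group(1));
            String replacement = value > threshold ? "maxOccurs=\"unbounded\"" : m.group(0);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xsd = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        InputStream relaxed = new ByteArrayInputStream(
                relaxMaxOccurs(xsd, 10).getBytes(StandardCharsets.UTF_8));
        // same entry point as the GrammarFactory line linked above, just with the modified stream
        GrammarFactory.newInstance().createGrammars(relaxed);
    }
}
```

For schema sets split across several files (like the NITF/JPEG layering mentioned above), the rewrite would have to be applied to every imported or included document as well, or the grammars would have to be created from pre-processed copies of all files.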
Note: There is still the very low likelihood that an encoding with maxOccurs="1000" (changed to unbounded) differs, in the case where there are actually 1000 appearances.
@danielpeintner can you explain this statement just a bit further: Note: There is still the very low likelihood that an encoding with maxOccurs="1000" (changed to unbounded) differs in the case there are actually 1000 appearances. When maxOccurs is N and there are N occurrences, how could that be different from N occurrences when maxOccurs is unbounded?
I looked at that W3C wiki page, and that discussion does not mention such an issue (or I did not see it). It suggests that treating all maxOccurs > 1 as "unbounded" is plausible.
Let me use an XML example element (of an XML schema) to describe the differences.
<xs:element name="foo" minOccurs="0" maxOccurs="1000" />
The EXI spec states clearly that for each state there is a grammar:
g0:
SE(foo) -> g1 // it points to the next grammar
EE
g1:
SE(foo) -> g2 // again next grammar till we end up at 1000 occurrences of foo
EE
...
g999:
EE // no more SE in grammar
Note: The last grammar g999 is different since it ONLY allows for EE (EndElement); there is no more SE (StartElement) for foo. Since grammars cannot count, one needs this set of grammars to make sure that no more than 1000 foo elements appear.
Let's now change the example from maxOccurs 1000 to unbounded:
<xs:element name="foo" minOccurs="0" maxOccurs="unbounded" />
Now the set of grammars is different and very simple (it always points to itself).
g:
SE(foo) -> g // it points to itself
EE
Coming back to the issue I described: since the LAST grammar g999 looks slightly different, it also results in a different EXI encoding. Having said that, this is ONLY the case for the last grammar; all other grammars, like g998, look exactly the same and result in the same encoding.
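Purely as an illustration (toy classes, not EXIficient's actual grammar implementation), the structural difference described above can be sketched like this: the bounded schema needs a chain of distinct grammar states, while the unbounded one needs a single state whose SE(foo) production points back to itself.

```java
import java.util.ArrayList;
import java.util.List;

// toy stand-in for a grammar state: it either allows SE(foo) leading to another
// state (plus EE), or, if nextOnStartElement is null, only EE
class ToyGrammar {
    final String name;
    ToyGrammar nextOnStartElement;

    ToyGrammar(String name) {
        this.name = name;
    }
}

public class GrammarShapes {

    // mirrors the g0 ... g999 chain above: every state but the last points to the
    // next one on SE(foo); the last state has no SE(foo) production left
    static List<ToyGrammar> boundedChain(int numberOfStates) {
        List<ToyGrammar> states = new ArrayList<>();
        for (int i = 0; i < numberOfStates; i++) {
            states.add(new ToyGrammar("g" + i));
        }
        for (int i = 0; i < numberOfStates - 1; i++) {
            states.get(i).nextOnStartElement = states.get(i + 1);
        }
        return states;
    }

    // mirrors the unbounded case: a single grammar g whose SE(foo) points to itself
    static ToyGrammar unboundedLoop() {
        ToyGrammar g = new ToyGrammar("g");
        g.nextOnStartElement = g;
        return g;
    }

    public static void main(String[] args) {
        List<ToyGrammar> bounded = boundedChain(1000);
        ToyGrammar unbounded = unboundedLoop();
        System.out.println("maxOccurs=\"1000\":      " + bounded.size() + " grammar states (g0 .. g999)");
        System.out.println("maxOccurs=\"unbounded\": 1 grammar state, self-loop = "
                + (unbounded.nextOnStartElement == unbounded));
    }
}
```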
I hope this clarifies the issue I was pointing at.
Very excellent explanation. Thanks.