FasterXML / jackson-dataformat-xml

Extension for Jackson JSON processor that adds support for serializing POJOs as XML (and deserializing from XML) as an alternative to JSON
Apache License 2.0
561 stars 221 forks source link

`ArrayIndexOutOfBoundsException` thrown for invalid ending XML string when using JDK default Stax XML parser #618

Closed arthurscchan closed 6 months ago

arthurscchan commented 7 months ago

In XmlTokenStream::_collectUntilTag method, there is an infinite while loop to loop through the provided XML string (through _xmlReader) character by character. The loop only exits by return statements when a valid character (XMLStreamConstants.START_ELEMENT, XMLStreamConstants.END_ELEMENT or XMLStreamConstants.END_DOCUMENT) is found. If the provided XML string is invalid without those characters, it will continue to loop through the whole XML String and eventually throw ArrayIndexOutOfBoundsException when _xmlReader has no more characters that can be returned by the next() method. Besides, there are also some other methods depends on those END_ELEMENT to stop looping out-of-bound. The suggested fix could be simply wrapping the ArrayIndexOutOfBoundsException with the JsonParseException.

        CharSequence chars = null;
        while (true) {
            switch (_xmlReader.next()) {
            case XMLStreamConstants.START_ELEMENT:
                return (chars == null) ? "" : chars.toString();

            case XMLStreamConstants.END_ELEMENT:
            case XMLStreamConstants.END_DOCUMENT:
                return (chars == null) ? "" : chars.toString();

            // note: SPACE is ignorable (and seldom seen), not to be included
            case XMLStreamConstants.CHARACTERS:
            case XMLStreamConstants.CDATA:
                // 17-Jul-2017, tatu: as per [dataformat-xml#236], need to try to...
                {
                    String str = _getText(_xmlReader);
                    if (chars == null) {
                        chars = str;
                    } else  {
                        if (chars instanceof String) {
                            chars = new StringBuilder(chars);
                        }
                        ((StringBuilder)chars).append(str);
                    }
                }
                break;
            default:
                // any other type (proc instr, comment etc) is just ignored
            }
        }
    }

We found this issue by OSS-Fuzz and it is reported in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=64655, https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=64667 and https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=64659.

cowtowncoder commented 7 months ago

Just to be pedantic: loop is not character-by-character but token-by-token. Tokens refer to things like start element (like <tag>) and text segments (CDATA).

But I agree with the fix wrt checking that loop ends if there are no more tokens available.

cowtowncoder commented 7 months ago

After cleaning things up, including removal of try-catch block, this is not reproducible by given test cases.

Looking at https://oss-fuzz.com/testcase-detail/5439718040666112, however, it looks like test is running using JDK Stax implementation (sjsxp) which is not what this module is configured to use -- Woodstox. The problem with this stack trace, however, is that SJSXP code is one throwing exception, not Jackson. As such I am tempted to consider this a problem with OSS-Fuzz set up: it probably should include Woodstox as dependency. No one should really default to SJSXP as it's known to have remaining bugs, not be nearly as XML compliant as Woodstox, and is slower as well.

Now: it'd be nice (but IMO not required) to also work on SJSXP to some degree. I'll see if I can tease out reproduction: possibly by temporarily removing Woodstox dependency. But I don't know yet how to be able to have alternate CI run with such set up (I know how to override dependency versions but not sure how to exclude).