FasterXML / woodstox

The gold standard Stax XML API implementation. Now at Github.
Apache License 2.0
220 stars 81 forks source link

IllegalStateException on Closing Tags #113

Open Josh-ES opened 3 years ago

Josh-ES commented 3 years ago

When parsing an XML feed like the below, with an unbalanced close tag, an error gets thrown at the following line which can be caught, and parsing is able to continue:

https://github.com/FasterXML/woodstox/blob/489149b9f9d5d5a333fdf325513c08b00dd3ea17/src/main/java/com/ctc/wstx/sr/BasicStreamReader.java#L3337

<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS Records</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
  <CD>
    <TITLE>Greatest Hits</TITLE>
    <ARTIST>Dolly Parton</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>RCA</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1982</YEAR>
  </CD>
  <CD>
    <TITLE>Still got the blues</TITLE>
    <ARTIST>Gary Moore</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Virgin records</COMPANY>
    <PRICE>10.20</PRICE>
    <YEAR>1990</YEAR>
  </CD>
</CATALOG>

If, however, an XML feed with an unbalanced close tag that has no preceding white space is encountered, a second exception is thrown which cannot be recovered from.

<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD></CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS Records</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
  <CD>
    <TITLE>Greatest Hits</TITLE>
    <ARTIST>Dolly Parton</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>RCA</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1982</YEAR>
  </CD>
  <CD>
    <TITLE>Still got the blues</TITLE>
    <ARTIST>Gary Moore</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Virgin records</COMPANY>
    <PRICE>10.20</PRICE>
    <YEAR>1990</YEAR>
  </CD>
</CATALOG>

What's happening is that if such an unbalanced close tag error is caught, and then reader.next() is called subsequently, the stream reader's current event type continues to be set to whatever the previous event type was. If an unbalanced close tag was preceded by some whitespace characters, the current event type is set to characters i.e. mCurrToken == XMLEvent.CHARACTERS. Calling reader.next() reads the unbalanced close tag as if they were characters with no problem, and then parsing can continue from there.

If the unbalanced close tag was preceded by another close tag, it treats the unbalanced close tag as a close tag and ends up hitting the below line:

https://github.com/FasterXML/woodstox/blob/489149b9f9d5d5a333fdf325513c08b00dd3ea17/src/main/java/com/ctc/wstx/sr/BasicStreamReader.java#L2737

This throws an IllegalStateException from the line below. If you try to call reader.next() after catching this error, it continues trying to parse the unbalanced close tag as a close tag, and continuously throws this error.

https://github.com/FasterXML/woodstox/blob/489149b9f9d5d5a333fdf325513c08b00dd3ea17/src/main/java/com/ctc/wstx/sr/InputElementStack.java#L372

Our use case is unusual, as we are trying to parse XML documents in fragments and so encounter this case quite regularly. Right now, our workaround is to extend from the Woodstox stream reader and manually set the event type to characters if we encounter this error. That doesn't sound like a solution that would work for all use cases, but it works for us. However, shouldn't it be possible to skip over a close tag if no corresponding opening tag has been seen by the parser, with the default Woodstox parser? If anyone has an idea on how to solve this, I'd be happy to contribute a PR.

try {
    return next();
} catch (IllegalStateException ex) {
    if (!ex.getMessage().equals("Popping from empty stack")) {
        throw ex;
    }
    mCurrToken = XMLEvent.CHARACTERS;
    return mCurrToken;
}
cowtowncoder commented 3 years ago

One quick comment just to make sure I understand this correctly: XML parsers are required to verify well-formedness, so quietly skipping "unmatched" close tags is a no-no and should not be the default behavior. But it is fine to have ways to configure things to allow that for use cases where it is desired (either via new property or, probably better, through error handler).

So... It would be nice of course to make it easier to recover from issues like this, although in general it is not something that can be solved for every case (and cost of trying to keep state recoverable in all cases is pretty steep) But in this particular case perhaps there could be something as simple as reordering state update or something -- and if so, I would accept a patch. And I think that first figuring out where handling would work makes sense, after which specific configuration mechanism can be decided on.

cowtowncoder commented 2 years ago

I don't have time or interest to work on this issue, but if someone wants to try come up with a change to allow better handling of non-well-formed content, would be happy to help get such feature merged.