csingley / ofxtools

Python OFX Library

Parser silently ignores CDATA sections #141

Closed rdsteed closed 2 years ago

rdsteed commented 2 years ago

The parser in OFXTOOLS silently ignores CDATA sections, resulting in information loss.

Example test file available from ofxparse test suite: https://github.com/jseutter/ofxparse/blob/3236cfd96434feb6bc79a8b66f3400f18e2ad3c4/tests/fixtures/suncorp.ofx

With this particular test file, OFXTOOLS silently loses text enclosed within <NAME> and <MEMO> tags.

csingley commented 2 years ago

Yep, that's a bona fide bug all right. It might or might not be easy to fix; it's been a long time since I popped the hood on xml.etree.ElementTree.TreeBuilder and had a good look round.

Is Suncorp actually sending this braindead crap for OFXv1 too? Sometimes the most expedient fix for this kind of thing is to stop asking the FI to send OFXv2 responses, and stick to OFXv1 like a normal client. It very much looks like Quicken plans to obsolete the entire OFX protocol before ever upgrading to version 2.

Still though, if it's in the spec I like to handle it. Stay posted.

rdsteed commented 2 years ago

Thanks for looking at this.

I am exploring moving from OFXPARSE to OFXTOOLS. The suncorp.ofx file is part of the OFXPARSE test suite, and I used it to run OFXTOOLS through its paces.

Unfortunately, CDATA segments are legit in both OFXv1 and OFXv2. Some FIs seem to like to use them to include special characters in text. Makes it easier to have a payee named AT&T rather than AT&amp;T.
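As a stdlib illustration (not ofxtools code) of the escaping that a CDATA section lets a sender avoid:

```python
from xml.sax.saxutils import escape, unescape

# Outside a CDATA section, markup characters must be entity-escaped.
assert escape("AT&T") == "AT&amp;T"
assert unescape("AT&amp;T") == "AT&T"

# Inside a CDATA section, the literal text needs no escaping:
#   <NAME><![CDATA[AT&T]]></NAME>
```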

csingley commented 2 years ago

Well you're welcome to kick the tires around here. I am very pleased to discover Jay's test suite, though - great idea to run it through ofxtools.

Makes it easier to have a payee named AT&T rather than AT&amp;T

Funny you should mention that particular example, which actually appears in section 2.3.2.1 of the spec... right above the part of the spec where they mention CDATA, purely in the context of escaping HTML content (which is sensible)... they seem to have included CDATA in the spec purely to facilitate the lame <MAIL> message set. Which is to say, you'd only really expect to see CDATA in data elements that you wouldn't notice or care about being silently dropped. This would probably explain why no ofxtools user has noticed this bug for the past decade or so. You have to go out of your way to hit Suncorp's XML-native tooling, whose output probably hasn't been validated against Quicken or anything else that end users actually care about.

This is not to say that I don't care about it - complying with the spec is one of the main goals of this library.

Thanks for reporting, and thanks even more for including a test case with the bug report! If you've got any more, keep 'em coming. If I'm going to book a block of hours to yoink back a brace of Adderall and perform open-heart surgery on Fredrik Lundh's code, it'd be nice to take care of multiple issues while I've got the ribcage cracked open.

rdsteed commented 2 years ago

If you are going to do open-heart surgery on TreeBuilder, you might want to try this pure Python lexer/tokenizer recipe:

  https://code.activestate.com/recipes/65125-xml-lexing-shallow-parsing/

I've played around with it, and I have never found a legit OFX v1 or v2 file it failed on. And it's fast.
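For flavor, a much-reduced sketch in the spirit of that shallow-parsing recipe (this is an illustration, not the recipe itself): split a document into markup tokens (tags, CDATA sections, comments) and runs of character data with a single regex.

```python
import re

# Order matters: try CDATA and comments before the generic tag pattern,
# so a ">" inside a CDATA section doesn't terminate a "tag".
LEXER = re.compile(
    r"(<!\[CDATA\[.*?\]\]>"   # CDATA section
    r"|<!--.*?-->"            # comment
    r"|<[^>]+>)",             # any other tag
    re.DOTALL,
)

def tokenize(text):
    """Yield (kind, token) pairs for markup and text runs."""
    for token in LEXER.split(text):
        if token:
            yield ("markup" if token.startswith("<") else "text"), token

tokens = list(tokenize("<MEMO><![CDATA[AT&T]]>\n</MEMO>"))
```

Note that the CDATA section survives as its own token instead of being silently dropped, which is the whole point for this bug.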

csingley commented 2 years ago

Neat! This ticket just keeps getting better.

I like the current ofxtools parsing implementation because it gets the job done in <75 SLOC, with decent performance, coding against a well-known and stable API (i.e. ElementTree) that helps guide future-me/other contributors. However, correctness trumps these virtues. If I can't find a decent way to parse CDATA without reimplementing half of TreeBuilder, I'll definitely be looking at alternatives such as this.

Just need to carve out some time to work on it!

BTW if you're comfortable working on this kind of stuff, I'm quite open to PRs that enhance conformance to the spec. Maybe we'll have a nice canonical reference library fully complete just in time for the entire USA to abandon OFX in favor of a closed spec we can't get access to (FDX).

rdsteed commented 2 years ago

ElementTree normally uses the expat parser, which does handle CDATA. Was there a reason you didn't go with this, but instead used a regular expression to tokenize the OFX file? (Oops - never mind. With v1, expat fails with an unrecoverable exception.)

I can't see where special characters in text are converted to normal text. Or does the name AT&T look like AT&amp;T?

csingley commented 2 years ago

ElementTree normally uses the expat parser, which does handle CDATA

If only we could just handle XML, life would be so easy. If your goal is to handle real OFX data in the wild, you need to focus on what Quicken does. What Quicken does is to use OFXv1.0.3, which isn't XML. There is essentially no real-world OFXv2 traffic, so expat doesn't help us. We need to parse SGML.

I've thought about writing Python bindings for OpenSP or something along those lines. I don't think that would be especially easy, and it would be tougher to distribute, but it would certainly handle CDATA.

I am a profoundly lazy man, though, so my thoughts initially turn to modifying the existing TreeBuilder.regex to capture the CDATA, and then just insert it into the parse stream.
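A minimal sketch of that idea (hypothetical; not the actual ofxtools regex): capture each CDATA section and splice its literal contents back into the stream as text. Naive as written, since any "&" or "<" in the contents would then need re-escaping before an XML parser saw it.

```python
import re

CDATA = re.compile(r"<!\[CDATA\[(?P<cdata>.*?)\]\]>", re.DOTALL)

def splice_cdata(markup):
    """Replace each CDATA section with its literal contents (sketch)."""
    return CDATA.sub(lambda m: m.group("cdata"), markup)

assert splice_cdata("<NAME><![CDATA[AT&T]]></NAME>") == "<NAME>AT&T</NAME>"
```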

I can't see where special characters in text are converted to normal text. Or does the name AT&T look like AT&amp;T?

The TreeBuilder just loads tag soup into an ElementTree data structure.

The second step is to walk the tree, converting each node to the Python representation of its type as given in the OFX spec (decimal.Decimal, datetime.datetime, etc.). The unescaping of special characters happens as part of this type conversion.

ofxtools.Parser.OFXTree.convert() is the top-level entry point. That will lead you to the awful morass of ofxtools.models.Base.Aggregate.from_etree(), thence to ofxtools.Types.String.convert(), which is where the rubber you're searching for hits the actual road.

rdsteed commented 2 years ago

Sorry for the delayed reply. I did dive into the code some more.

It turns out that expat does just fine tokenizing OFX. It is actually ElementTree's TreeBuilder that raises the exception while handling end tags:

  assert self._last.tag == tag, "end tag mismatch (expected %s, got %s)" % (self._last.tag, tag)

It should be possible to provide alternate handling in TreeBuilder for OFX's self closing tags caught by this assertion.

I also ran into an obscure bug in ElementTree itself where the C accelerators in cPython do not properly handle XMLParser instantiation. (https://bugs.python.org/issue45948) Can work around this, but it drove me crazy for a few hours.

csingley commented 2 years ago

It turns out that expat does just fine tokenizing OFX.

Progress! I never tried that.

It should be possible to provide alternate handling in TreeBuilder for OFX's self closing tags caught by this assertion.

Uh-huh, I've done some of this. However, the stack handling logic can get nasty if you're not careful. If it's not too intractable, I definitely agree that this is the right solution to the problem.

It should be possible to provide alternate handling in TreeBuilder for OFX's self closing tags caught by this assertion.

To circle back to the issue raised in that other ticket... and at the risk of lapsing into pedantry... in OFXv1, XML-style self-closing tags (e.g. <STMTRS />) are not a thing. Rather, the tricky bit is that end tags (e.g. </STMTRS>) are optional.

So if we trip the TreeBuilder assertion you quote above, we are probably looking at the opening tag of a child node to be pushed onto the stack. I think.

Thank you for looking at this... I hope to carve out a bit of time to treat it more seriously in the near future!

In the meantime, pour one on the ground for effbot:

https://lwn.net/ml/python-dev/CAP7+vJLbxJY-SPPzaoT4qnhF9MRkXDuhmr8wxZ=Es1X2ob3gMg@mail.gmail.com/

rdsteed commented 2 years ago

Point of clarification - SGML allows for tag omission (both start and end tags).

The elements in OFX 1.x which allow end tag omission are identified in the DTD (411 of them in OFX 1.6).

These elements can also be identified by the fact that they contain text which is not whitespace.

They also contain no other SGML elements (i.e., they are not OFX aggregates).

The omission is optional - the end tags may or may not be there.

Within TreeBuilder, it should be possible to test for the presence of non-whitespace text and/or the known 411 tags in the start and end handlers.
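A sketch of that approach as a TreeBuilder subclass (hypothetical; not ofxtools' implementation): an element that has accumulated non-whitespace text is a leaf, so close it implicitly when the next start tag, or a mismatched end tag, arrives.

```python
import xml.etree.ElementTree as ET

class OFXv1Builder(ET.TreeBuilder):
    """Lenient builder tolerating OFXv1's omitted end tags (sketch)."""

    def __init__(self):
        super().__init__()
        self._open = []          # [tag, has_text] per open element

    def data(self, text):
        if self._open and text.strip():
            self._open[-1][1] = True
        return super().data(text)

    def start(self, tag, attrs=None):
        # A text-bearing leaf followed by a new start tag was never closed.
        if self._open and self._open[-1][1]:
            super().end(self._open.pop()[0])
        self._open.append([tag, False])
        return super().start(tag, attrs or {})

    def end(self, tag):
        # Close any intervening leaves whose end tags were omitted.
        while self._open and self._open[-1][0] != tag:
            super().end(self._open.pop()[0])
        if self._open:
            self._open.pop()
        return super().end(tag)

# Feed SGML-style events with end tags omitted on the leaves:
b = OFXv1Builder()
b.start("BANKACCTFROM")
b.start("BANKID")
b.data("123456789")      # no </BANKID>
b.start("ACCTID")
b.data("0000")           # no </ACCTID>
b.end("BANKACCTFROM")
root = b.close()
```

The tag names here are real OFX elements, but the event sequence is a contrived example rather than output of any particular parser front end.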

csingley commented 2 years ago

Your attention here is much appreciated. Apologies for the lack of bandwidth as the fiscal year draws to a close.

The elements in OFX 1.x which allow end tag omission are identified in the DTD (411 of them in OFX 1.6). These elements can also be identified by the fact that they contain text which is not whitespace.

I'm woefully ignorant. Can you show me how to interpret the DTD in this regard?

They also contain no other SGML elements (ie are not OFX aggregates)

That is in fact correct, and what I said in my previous response about popping the stack is wrong.

Within TreeBuilder, it should be possible to test for the presence of non-whitespace text and/or the known 411 tags in the start and end handlers.

Your understanding of the clear distinction between container "aggregates" and text-data-bearing "elements" is on point. The former test you mention (i.e. nonempty text) seems preferable for a number of reasons - guarantees of correctness baked into the spec, implementation simplicity, extensibility, not to mention performance/efficiency.

rdsteed commented 2 years ago

Interpreting the DTD: here is a sample line:

  <!ELEMENT BALOPEN - o %AMTTYPE>

BALOPEN is the element name. In "- o", the hyphen indicates the start tag is required, and the 'o' indicates the end tag may be omitted. %AMTTYPE is the type of contents for this element.
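For illustration, a hedged sketch of scraping those declarations out of the DTD (the group names are my own; this is not ofxtools code):

```python
import re

# Match <!ELEMENT name - o content>: "- o" marks an omittable end tag.
ELEMENT = re.compile(
    r"<!ELEMENT\s+(?P<name>\S+)\s+-\s+[oO]\s+(?P<content>[^>]+)>"
)

decl = "<!ELEMENT BALOPEN - o %AMTTYPE>"
m = ELEMENT.search(decl)
```

Running `ELEMENT.finditer()` over the whole DTD would yield the full list of end-tag-omittable elements.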

I don't have a strong opinion about non-empty text vs. tag lookup to detect omittable end tags. I do note that extensibility may not matter, since there won't be any more SGML-based versions.

FYI, I've attached a comma-delimited file of the v1.6 elements with omittable end tags: OptionalEndTags.csv. The first field is the tag; the second field is the type.

csingley commented 2 years ago

The hyphen indicates the start tag is required, and the 'o' indicates the end tag may be omitted.

Thanks!

I don't have a strong opinion about non empty text vs tag lookup to detect omittable end tags. I do note that extensibility may not matter since there won't be any more SGML based versions.

There's the OFXv2 spec, which we do support. I think it's about wound up now... but less trivially, all versions of the spec do support proprietary tags that aren't in the DTD (e.g. <INTU.BID>). We DON'T support these at present (they're silently ignored), but it's important that they not crash the parser. Note, though, that these proprietary tags can be discriminated by the presence of the dot delimiting the namespace within... which could be used to overcome this problem if need be.

But as you point out, the strong structural dimorphism present in the OFX taxonomy means that we should be able to simply solve the missing end-tag problem within the TreeBuilder, and still defer knowing anything at all about the tag semantics until the second pass when the tree gets validated & type-converted. I'm advocating for maintaining a pristine decoupling, and keeping the ofxtools model classes as the single source of truth for tag structure.

But, y'know, whatever works best.

rdsteed commented 2 years ago

The spec does allow proprietary tags like <INTU.BID>, but it also mandates they always have end tags, even in the SGML-based 1.x versions. However, I just did a quick internet search and found a sample file with an <INTU.BID> start tag and no end tag. So I agree it would be more robust to test for the presence of non-whitespace characters.

Also, going back to the earlier question about empty tags <tag />, expat will spit that out as a start and an end with no data. An element tree will get built. Then it would be a problem for validation and type conversion at the next step.
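That behavior is easy to confirm with the stdlib, where ElementTree drives expat:

```python
import xml.etree.ElementTree as ET

# An XML-style empty element comes through as a start event immediately
# followed by an end event: the element exists in the tree, but carries
# no text for the later validation/type-conversion pass.
elem = ET.fromstring("<BALAMT />")
assert elem.tag == "BALAMT"
assert elem.text is None
```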

csingley commented 2 years ago

Sounds like we're in violent agreement then.

Also, going back to the earlier question about empty tags <tag />, expat will spit that out as a start and an end with no data. An element tree will get built. Then it would be a problem for validation and type conversion at the next step.

This should be fine.

Currently the regex parses that as TAG /, whereas expat should parse it as TAG. The former is more in line with the spec. The latter will actually get processed the way the sending FI intended, which should make life a little simpler for aclindsa.

I'm not really worried too much about grossly erroneous data such as this. The priority is definitely to replace my regex kludge with a more efficient parsing engine. "Step 1: load the whole dataset into memory; step 2: feed the whole shebang to a regex parser" is a pretty nasty prelude to an algorithm.

Picking up CDATA in the bargain would be the cherry on top.

rdsteed commented 2 years ago

Hit a snag with expat. Turns out it has its own mismatch issues that cannot be overcome by spoofing end tags in the TreeBuilder. Looks like a deal-breaker for this approach. Sorry.

csingley commented 2 years ago

Bummer. I'll still try to fix it with a regex that will pick up CDATA.

csingley commented 2 years ago

OK, I've got more bandwidth to tackle this. An initial approach to handling CDATA is in e91ab8f.

For simplicity, it's based on an assumption that no element will contain both CDATA and regular text. Does that seem reasonable to you?

This modified parser will handle Jay's suncorp.ofx test fixture. He's got some other cases in there that it won't handle... bank_small.ofx, checking.ofx, and ofx-v102-empty-tags.ofx... but those are straight-up illegal AFAICT (although I wouldn't mind relaxing the length check on the account number to throw a warning instead of an error).
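In spirit (a hypothetical sketch, not the code in e91ab8f): under that assumption, an element's value is either one whole CDATA section or plain PCDATA, so the wrapper can simply be unwrapped when present.

```python
import re

CDATA_VALUE = re.compile(r"<!\[CDATA\[(.*?)\]\]>\Z", re.DOTALL)

def element_value(raw):
    """Unwrap a whole-value CDATA section, else return the text as-is."""
    m = CDATA_VALUE.match(raw)
    return m.group(1) if m else raw

assert element_value("<![CDATA[AT&T]]>") == "AT&T"
assert element_value("plain text") == "plain text"
```

A value mixing CDATA and regular text (e.g. `foo <![CDATA[bar]]>`) would fall through the `match` and come back untouched, which is exactly the case the simplifying assumption rules out.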

rdsteed commented 2 years ago

I'm inclined to think that CDATA is fairly rare, but it's probably best to assume it may be mixed. One of the use cases in the documentation was adding spaces in data before &nbsp; was available in later versions. This was particularly useful for leading or trailing spaces, which would normally be stripped.

Possibly the reasonable thing is to not drop CDATA silently, but provide a warning that it may not be handled correctly.

csingley commented 2 years ago

Quicken was written to version 1.0.3, which doesn't mention anything about leading/trailing spaces in the absence of &nbsp;. Also I can't find anything about CDATA in the DTD... indeed, I see the fundamental data types defined as PCDATA instead.

Here's the entirety of the guidance I can find on CDATA in version 1.0.3.

OFXv103, section 2.3.2:

NOTE: Escape sequences are not required when these special characters are used in tag values that accept HTML-formatted strings (for instance, e-mail records). These tags accept SGML-marked section syntax that hides the HTML from the Open Financial Exchange parser. You must prefix the HTML-formatted strings with “<![ CDATA [” and suffix them with “]]>.” Within these bounds, treat the above characters literally without an escape. See Chapter 9 for an example.

OFXv103, section 9.2:

NOTE: When using HTML for the message body, clients and servers are REQUIRED to enclose the HTML in an SGML-marked section to protect the HTML markup: <![ CDATA [ ... html ... ]]>. For an example, see section 9.2.5.

The syntax examples provided conform to this discussion; there is no mixing of PCDATA & CDATA within the same element value.

By the time they released OFXv1.6, they'd added the following language to section 1.2.2:

If white space is desired preceding or following an element value, this is achieved using the CDATA wrapper or, in version 2 message sets and Bill Presentment, the &nbsp; macro.

And they added this language to 2.3.2:

Note: The space macro was not added until OFX 1.5. So when using a version 1 message set (excluding Bill Presentment), there may be both clients and servers that can not process the &nbsp; syntax. In this case, you must prefix strings containing space characters as the first or last characters with “<![CDATA[” and suffix them with “]]>.” See Chapter 9, "Customer to FI Communication," for an example. Note: Space characters in the middle of a value do not require use of the special character macro. If a string value needs to contain space characters as the first or last characters, that's when this macro is needed.

...which seems to be the usage employed by Suncorp in the example from Jay's test suite.

In summary: all extant references to CDATA in the spec appear to use either explicit "prefix"/"suffix" language, or else refer to a CDATA "wrapper", which means the same thing.

It's also the case that the Suncorp example is the only CDATA I've ever seen, in more than a decade of fielding reports about people's screwed-up OFX data... and Suncorp conforms to this usage, too, even though they're doing weird & unnecessary shit not involving HTML.

If we actually see some real-world data that features sudden transitions between CDATA and PCDATA within a single element value... which would be very strange, for a use case involving space-padding or HTML... then we can revisit the issue.

csingley commented 2 years ago

Closing this for now; anybody pls reopen if you encounter CDATA that breaks the parser.