Does not correctly ignore 'bad' base64 whitespace characters RFC-2045

Spiral90210 commented 2 years ago

Honestly, I'm surprised to see this. Got a really unusual error about being unable to decode a base64 transfer encoded part today. The part in question was an image that was wrapped at 998 chars long but indented with a continuation white space, which breaks the base64 decoder.

This is obviously a bug on the author of the mail part. HOWEVER I went and read the spec, RFC-2045 6.8 () and the following is stated:

The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software. In base64 data, characters other than those in Table 1, line breaks, and other white space probably indicate a transmission error, about which a warning message or even a message rejection might be appropriate under some circumstances.

So, that's 2 counts that the part is breaking the spec on (line length and invalid characters) BUT I read this paragraph that we should be able to ignore this padding white space, and decode the part.

I believe this can be achieved by imitating what the base64 decoder does internally with it's newlineFilteringReader, we can wrap the raw reader in one that will do exactly the same, but include FWS (folding white space) characters in the list, so space and tab characters.

Would a PR for this be acceptable?

Spiral90210 commented 2 years ago

Well, I made one anyway!

Spiral90210 commented 2 years ago

Guessing there's not much appetite fo this, so I'm going to submit another that will allow registering custom decoders, similar to how charsets can be mapped - this way the custom decoder can be in the consuming application, and we can handle things that are broken however we need to so this repo can stay opioniated.

emersion commented 2 years ago

Nah, sorry, I don't like this approach.

Spiral90210 commented 2 years ago

That's fair enough, doesn't exactly give me good feelings either but email being email there is all sorts of non spec stuff flying around that is a nightmare to handle (found some microsoft stuff the other day that has a charset of 'unicode' - which turns out to be LE utf-16). Think registering custom decoders would be alright? I'm just about to start testing it now, see if it works ok.

emersion commented 2 years ago

I mean that I don't like the custom decoder approach. I prefer https://github.com/emersion/go-message/pull/152.

Spiral90210 commented 2 years ago

Oh dang it I got confused between PR's and issues 🙄 - sorry I've not replied to that one in a while, I've been using it from my fork and it's been working away so I had to switch to other stuff for a while.

The reason I was even suggesting decoders is that I've found another piece of awful content that needs to be handled, and it's not as clear cut as the b64 one. I have some html content (technically malformed html) where the writer seems to have forgotten it's writing QP encoding, terminating the line with = \r\n (space beteen the = and CRLF). I was taking a similar approach by modifying the existing decoder to be more tolerant of this particular error but although its working ok in my testing, it's a bit hacky and I've not the time to do it properly - I figured this way I'd be able to keep awful code out of the repo!

Spiral90210 commented 2 years ago

Its not that they do this every new line, it's like they just... stopped writing their html? It's weird, literally looks like ght=3D"2" width=3D"2" alt= \r\n, doesn't close the tag it's started, just closing the outer. Nightmare.

emersion / go-message

Does not correctly ignore 'bad' base64 whitespace characters RFC-2045 #151