Improve Quoted-Printable encoding

Maria-12648430 commented 3 years ago

The current implementation of encode_quoted_printable/3 is a bit crude (no offense, really 😜), especially when it comes to lines reaching the limit of 76 characters, resorting to string functions to retrieve the last line and finding whitespaces in there.

The improved version encode_quoted_printable/6 which I'm proposing here works by remembering the occurences of linear whitespaces (including HTABs), and otherwise only look-aheads in the original data to see if a CRLF follows the current character. I also refactored the encoding of characters itself to work via integer operations only, thereby getting rid of all the io:format and lists:flatten calls.

I admit that my implementation isn't exactly easy on the eye, but that is because QP-encoding itself is tricky. And IMO, the current implementation isn't that easy to figure out, either 😉

I left the current tests untouched in what they test, only replaced the calls to the helper function encode_quoted_printable/3 (of which I don't see the point) with calls to the API function encode_quoted_printable/1. I also removed one test, "newline craziness", of which I also fail to see the point 🤔

Finally, I put in a micro-optimization in choose_transformation/1 which avoids division and rounding and only uses multiplication to figure out if the percentage of printable characters in the sample is above 80.

[EDIT]: Just noticed an edge case in which a line could exceed the length limit :laughing: Will fix and update the PR tomorrow.

Maria-12648430 commented 3 years ago

The current quoted_printable has some kind of bug where it strips some trailing(?) whitespaces and tabs. Do you think the new implementation fixes that (so Trim calls could be removed)?

Yes, indeed 😄 The problem with the current encoder is that it does not fulfill the requirements of (3) of RFC 2045 Section 6.7 fully: while it does encode WSPs that are followed by CRLF, it does not encode them when they are the last character in the body. The new implementation handles that properly.

mworrell commented 3 years ago

Nice work!

@seriyps do you want to address the Trim in that one test or shall we accept this as-is?

Maria-12648430 commented 3 years ago

@mworrell working on an update for the PR, almost done. I'll remove the Trim along with that, it's not needed any more, encoding and decoding will match up 😄

Maria-12648430 commented 3 years ago

Last commit fixes the line length bug I had in my first implementation, adds some tests for a few edge cases, augments another test to ensure that whitespaces are encoded when last character, and removes the trimming from the proptest.

Maria-12648430 commented 3 years ago

One last micro-optimization before I leave this in your hands, in choose_transformation: The current implementation determines the number of allowed ASCII characters by calling length/1 on the result of a filtering list comprehension. I changed this to call byte_size/1 on the result of an equivalent binary comprehension. byte_size is a constant-time operation, while length is proportional to the number of list elements.

mworrell commented 3 years ago

I am very happy with this. Tests are also looking good.

@seriyps shall we merge?

seriyps commented 3 years ago

Yes, looks good!

mworrell commented 3 years ago

Merged, thank you @Maria-12648430 !

juhlig commented 3 years ago

Impressive 😮

gen-smtp / gen_smtp

Improve Quoted-Printable encoding #292