dinhvh / libetpan

Mail Framework for C Language
www.etpan.org

Excessive quoting in mailmime_quoted_printable_write_driver() #66

Closed pprindeville closed 11 years ago

pprindeville commented 11 years ago

I'm looking at all of the characters that get written in hex in mailmime_quoted_printable_write_driver() and it seems that this routine is overly aggressive (or pessimistic) in what gets rewritten:

    int mailmime_quoted_printable_write_driver(int (* do_write)(void *, const char *, size_t), void * data,
        int * col, int istext,
        const char * text, size_t size)
    {
      ...
          case '!':
          case '"':
          case '#':
          case '$':
          case '@':
          case '[':
          case '\\':
          case ']':
          case '^':
          case '`':
          case '{':
          case '|':
          case '}':
          case '~':
          case '=':
          case '?':
          case '_':
          case 'F': /* there is no more 'From' at the beginning of a line */
            r = write_remaining(do_write, data, col, &start, &len);
            if (r != MAILIMF_NO_ERROR)
              return r;
            start = text + i + 1;

            snprintf(hexstr, 6, "=%02X", ch);

            r = mailimf_string_write_driver(do_write, data, col, hexstr, 3);
            if (r != MAILIMF_NO_ERROR)
              return r;
            i ++;
            break;

I'm not sure why anything OTHER than '=' needs to be escaped?

All of these are 7-bit safe characters from NVT.

I'd be inclined to remove everything but '=', as that's all that's really required. Per RFC-2045, section 6.7 "Quoted-Printable Content-Transfer-Encoding":

(2)   (Literal representation) Octets with decimal values of
      33 through 60 inclusive, and 62 through 126, inclusive,
      MAY be represented as the US-ASCII characters which
      correspond to those octets (EXCLAMATION POINT through
      LESS THAN, and GREATER THAN through TILDE,
      respectively).
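Rule (2) can be checked mechanically. The following is a minimal sketch, not libetpan code: the helper names `qp_may_be_literal` and `qp_all_literal` are made up for illustration, and the sketch covers only rule (2), not the whitespace or line-length rules of the same section.

```c
/* Rule (2) of RFC 2045, section 6.7: octets 33..60 and 62..126 MAY be
 * written literally. Octet 61 ('=') is excluded because it introduces
 * an escape sequence. Hypothetical helper, not part of libetpan. */
static int qp_may_be_literal(unsigned char c)
{
  return (c >= 33 && c <= 60) || (c >= 62 && c <= 126);
}

/* Returns 1 if every character of the NUL-terminated string s may be
 * written literally under rule (2), and 0 otherwise. */
static int qp_all_literal(const char * s)
{
  for (; *s != '\0'; s++) {
    if (!qp_may_be_literal((unsigned char) *s))
      return 0;
  }
  return 1;
}
```

Passing the punctuation from the switch statement above, minus '=', to `qp_all_literal()` returns 1, which is the point being made here: the RFC permits all of those characters to appear unescaped in a message body.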

However, the now-obsolete RFC-1521, "Appendix B -- General Guidelines For Sending Email Data", says:

  (6) Many mail domains use variations on the ASCII character set,
  or use character sets such as EBCDIC which contain most but not
  all of the US-ASCII characters.  The correct translation of
  characters not in the "invariant" set cannot be depended on across
  character converting gateways.  For example, this situation is a
  problem when sending uuencoded information across BITNET, an
  EBCDIC system.  Similar problems can occur without crossing a
  gateway, since many Internet hosts use character sets other than
  ASCII internally.  The definition of Printable Strings in X.400
  adds further restrictions in certain special cases.  In
  particular, the only characters that are known to be consistent
  across all gateways are the 73 characters that correspond to the
  upper and lower case letters A-Z and a-z, the 10 digits 0-9, and
  the following eleven special characters:

                    "'"  (ASCII code 39)
                    "("  (ASCII code 40)
                    ")"  (ASCII code 41)
                    "+"  (ASCII code 43)
                    ","  (ASCII code 44)
                    "-"  (ASCII code 45)
                    "."  (ASCII code 46)
                    "/"  (ASCII code 47)
                    ":"  (ASCII code 58)
                    "="  (ASCII code 61)
                    "?"  (ASCII code 63)

  A maximally portable mail representation, such as the base64
  encoding, will confine itself to relatively short lines of text in
  which the only meaningful characters are taken from this set of 73
  characters.

Are we really worried about EBCDIC and X.400 compatibility???? This section continues as:

  (7) Some mail transport agents will corrupt data that includes
  certain literal strings.  In particular, a period (".") alone on a
  line is known to be corrupted by some (incorrect) SMTP
  implementations, and a line that starts with the five characters
  "From " (the fifth character is a SPACE) are commonly corrupted as
  well.  A careful composition agent can prevent these corruptions
  by encoding the data (e.g., in the quoted-printable encoding,
  "=46rom " in place of "From " at the start of a line, and "=2E" in
  place of "." alone on a line.

  Please note that the above list is NOT a list of recommended
  practices for MTAs.  RFC 821 MTAs are prohibited from altering the
  character of white space or wrapping long lines.  These BAD and
  illegal practices are known to occur on established networks, and
  implementations should be robust in dealing with the bad effects
  they can cause.
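The two line-start cases from item (7) can be handled without touching any other punctuation. This is a rough sketch under stated assumptions: `qp_line_start_needs_escape` and `qp_protect_line` are hypothetical helpers, not libetpan functions, and `len` is assumed to be the length of one line without its terminator.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the first character of the line should be hex-escaped:
 * the line starts with "From " or consists of a lone ".". */
static int qp_line_start_needs_escape(const char * line, size_t len)
{
  if (len >= 5 && memcmp(line, "From ", 5) == 0)
    return 1;
  if (len == 1 && line[0] == '.')
    return 1;
  return 0;
}

/* Copies one line into dst, escaping only its first character when
 * needed. Returns 0 on success, -1 if dst is too small. */
static int qp_protect_line(const char * line, size_t len, char * dst, size_t dst_size)
{
  int n;

  if (qp_line_start_needs_escape(line, len))
    n = snprintf(dst, dst_size, "=%02X%.*s",
        (unsigned char) line[0], (int) (len - 1), line + 1);
  else
    n = snprintf(dst, dst_size, "%.*s", (int) len, line);

  return (n >= 0 && (size_t) n < dst_size) ? 0 : -1;
}
```

With this sketch, "From me" becomes "=46rom me" and a lone "." becomes "=2E", while a line such as "Fromage" is copied unchanged.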

Given that converting punctuation into HEX is NOT a recommended practice, why are we doing it anyway?

Some of these concerns were a lot more relevant in 1993 when this was written. These days the "^From " bugs, etc. are anachronisms.

dinhvh commented 11 years ago

We are generating the initial message. We are not altering anything, so I think that's fine. For now, it works properly. I'm not sure it's worth changing the behavior to save an insignificant number of bytes.

"From " is still relevant because emails are usually stored in /var/mail/username in mbox format on Unix by default, and using that encoding avoids the email being altered when it is stored in that location.

Moreover, I'm also using quoted-printable for header encoding, and other constraints apply in that case.

My question is the following: What do you want to improve by changing the behavior of that encoding?

pprindeville commented 11 years ago

I'm not really worried about the number of bytes.

I just don't see the point of making conservative workarounds for two scenarios that have almost certainly been overtaken by events.

I personally haven't seen an EBCDIC terminal in more than 22 years.

The last commercial X.400 email services I knew of (Transpac, Compuserve, and HP OpenMail) all died more than a decade ago. Even Exchange support for X.400 was dropped in 2007.

Everything speaks NVT/US-ASCII/UTF-8 these days.

dinhvh commented 11 years ago

Thanks for your concern, but that might be a dangerous change. In particular, see the restrictions in RFC 2047, page 6. It just works. Don't break it.

Again, my question is: What do you want to improve by changing the behavior of that encoding?

pprindeville commented 11 years ago

RFC 2047, page 6, is about headers. I'm talking about encoding the message body.

No such restrictions apply to the body.

Also, the title of that RFC is "MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text"... All of the characters above (EXCLAMATION POINT, AMPERSAND, UNDERSCORE, etc) are ALL ASCII text, so what's the relevance of Non-ASCII text to this issue?
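To make the header/body distinction concrete, here is a sketch of how a single encoder could pick its escape set per context. It assumes the strictest header case, an encoded-word inside a 'phrase' (RFC 2047, section 5, rule (3)), where only letters, digits, "!", "*", "+", "-" and "/" may appear literally. The names `q_phrase_literal`, `qp_body_literal`, and `must_escape` are hypothetical, and whitespace is assumed to be handled by the caller ("_" in headers, rule (3) of RFC 2045 in bodies).

```c
/* Characters that may appear literally in a "Q"-encoded word used in a
 * 'phrase' (RFC 2047, section 5, rule (3)). '=' and '_' are reserved
 * for escapes and for encoded spaces, so they are not literal here. */
static int q_phrase_literal(unsigned char c)
{
  if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
    return 1;
  return c == '!' || c == '*' || c == '+' || c == '-' || c == '/';
}

/* Characters that may appear literally in a quoted-printable body
 * (RFC 2045, section 6.7, rule (2)). */
static int qp_body_literal(unsigned char c)
{
  return (c >= 33 && c <= 60) || (c >= 62 && c <= 126);
}

/* Returns 1 if a non-whitespace octet must be written as "=XX".
 * in_header selects the stricter RFC 2047 set. */
static int must_escape(unsigned char c, int in_header)
{
  return in_header ? !q_phrase_literal(c) : !qp_body_literal(c);
}
```

Under these assumptions, '{' and '?' must be escaped in a header but not in a body, while '=' must be escaped in both.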

dinhvh commented 11 years ago

Sure but I use one unique encoding method that works for both. It helps having something stable.

I'll repeat my question: What do you want to improve by changing the behavior of that encoding?

pprindeville commented 11 years ago

As a S/W engineer and protocol developer I like being able to read raw messages and diagnose issues; this is why SMTP + RFC-2822 is so much better than X.400 and ASN.1... I don't need a scope or a protocol analyzer to troubleshoot it.

The minute that the encoding starts getting gummed up with excessive escaping then it becomes harder to read and troubleshoot.

The only non-control ASCII characters that need to be hexified are '=' and DEL, from a minimalist encoding perspective. Encoding curly braces and tilde might help with EBCDIC... if you actually have an EBCDIC gateway to deal with (and even if you do, the burden should be on them to do the proper transcoding, not us).
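The readability difference is easy to see by running the same text through both escape sets. This is an illustrative sketch only: `escape_conservative` approximates the punctuation listed in the switch statement above (leaving out 'F', which only matters at the start of a line), `escape_minimal` is the minimalist set described here, and `qp_encode_with` is a made-up helper that ignores line wrapping and the whitespace rules.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Escape set approximating the switch statement quoted in this issue. */
static int escape_conservative(unsigned char c)
{
  return c < 32 || c > 126 || strchr("!\"#$@[\\]^`{|}~=?_", c) != NULL;
}

/* Minimalist set: control characters, DEL and 8-bit octets, plus '='. */
static int escape_minimal(unsigned char c)
{
  return c < 32 || c > 126 || c == '=';
}

/* Encodes the NUL-terminated string src into dst, writing "=XX" for
 * every octet the predicate selects. Returns 0 on success, -1 if dst
 * is too small. */
static int qp_encode_with(int (* must_escape)(unsigned char),
    const char * src, char * dst, size_t dst_size)
{
  size_t out = 0;

  for (; *src != '\0'; src++) {
    unsigned char c = (unsigned char) *src;
    size_t need = must_escape(c) ? 3 : 1;

    if (out + need >= dst_size)
      return -1;
    if (need == 3)
      snprintf(dst + out, 4, "=%02X", c);
    else
      dst[out] = (char) c;
    out += need;
  }
  if (out >= dst_size)
    return -1;
  dst[out] = '\0';
  return 0;
}
```

For the input `{"id": 1}`, the minimal predicate leaves the text unchanged, whereas the conservative one produces `=7B=22id=22: 1=7D`.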