issuefiler commented 4 years ago

From RFC-822:

     CR          =  <ASCII CR, carriage return>  ; (     15,      13.)
     LF          =  <ASCII LF, linefeed>         ; (     12,      10.)
     SPACE       =  <ASCII SP, space>            ; (     40,      32.)
     HTAB        =  <ASCII HT, horizontal-tab>   ; (     11,       9.)
     <">         =  <ASCII quote mark>           ; (     42,      34.)
     CRLF        =  CR LF

     LWSP-char   =  SPACE / HTAB                 ; semantics = SPACE

     linear-white-space =  1*([CRLF] LWSP-char)  ; semantics = SPACE
                                                 ; CRLF => folding

From RFC-2047:

2. Syntax of encoded-words

   An 'encoded-word' is defined by the following ABNF grammar.  The
   notation of RFC 822 is used, with the exception that white space
   characters MUST NOT appear between components of an 'encoded-word'.

   encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

   (=?ISO-8859-1?Q?a?= b)                      (a b)

           Within a 'comment', white space MUST appear between an
           'encoded-word' and surrounding text.  [Section 5,
           paragraph (2)].  However, white space is not needed between
           the initial "(" that begins the 'comment', and the
           'encoded-word'.

   (=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)     (ab)

           White space between adjacent 'encoded-word's is not
           displayed.

   (=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=)    (ab)

        Even multiple SPACEs between 'encoded-word's are ignored
        for the purpose of display.

   (=?ISO-8859-1?Q?a?=                         (ab)
       =?ISO-8859-1?Q?b?=)

           Any amount of linear-space-white between 'encoded-word's,
           even if it includes a CRLF followed by one or more SPACEs,
           is ignored for the purposes of display.

In short, adjacent encoded-words may be separated by any amount of linear-white-space, and that white space is ignored when decoding.
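As a reference point, the RFC examples quoted above can be run through Go's standard library decoder, `mime.WordDecoder`. This is only a sketch for comparison: `decodeHeader` is a hypothetical helper introduced here, not anything from go-guerrilla.

```go
package main

import (
	"fmt"
	"mime"
)

// decodeHeader is a hypothetical helper for this sketch. It decodes every
// RFC 2047 encoded-word in s with the standard library's mime.WordDecoder
// and falls back to the raw input if decoding fails.
func decodeHeader(s string) string {
	dec := new(mime.WordDecoder)
	out, err := dec.DecodeHeader(s)
	if err != nil {
		return s
	}
	return out
}

func main() {
	// The four 'comment' examples from RFC 2047 quoted above.
	inputs := []string{
		"(=?ISO-8859-1?Q?a?= b)",
		"(=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)",
		"(=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=)",
		"(=?ISO-8859-1?Q?a?=\r\n    =?ISO-8859-1?Q?b?=)",
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, decodeHeader(in))
	}
}
```

The first input should decode to `(a b)`, because the space separates an encoded-word from ordinary text, while the other three should decode to `(ab)`.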

Problem

When an email header is given as follows,

Subject: =?utf-8?B?YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWg==?=
    =?utf-8?B?MDEyMzQ1Njc4OQ==?=

go-guerrilla decodes the subject as abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789 (note the space before the digits), whereas the correct reading is abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.

Causes

ReadMIMEHeader inserts a single space wherever it unfolds a folded email header.

func (*Reader) ReadMIMEHeader

func (r *Reader) ReadMIMEHeader() (MIMEHeader, error)

ReadMIMEHeader reads a MIME-style header from r. The header is a sequence of possibly continued Key: Value lines ending in a blank line. The returned map m maps CanonicalMIMEHeaderKey(key) to a sequence of values in the same order encountered in the input.

For example, consider this input:
My-Key: Value 1
Long-Key: Even
       Longer Value
My-Key: Value 2
Given that input, ReadMIMEHeader returns the map:
map[string][]string{
  "My-Key": {"Value 1", "Value 2"},
  "Long-Key": {"Even Longer Value"},
}
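To see the unfolding on the Subject header from the Problem section, here is a minimal sketch that feeds it to `net/textproto` directly. `unfoldedSubject` and `rawHeader` are names made up for this illustration; go-guerrilla's own call site is the one linked below.

```go
package main

import (
	"bufio"
	"fmt"
	"net/textproto"
	"strings"
)

// rawHeader is the folded Subject header from the Problem section,
// terminated by the blank line that ends a header block.
const rawHeader = "Subject: =?utf-8?B?YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWg==?=\r\n" +
	"    =?utf-8?B?MDEyMzQ1Njc4OQ==?=\r\n" +
	"\r\n"

// unfoldedSubject (hypothetical helper) parses raw with ReadMIMEHeader
// and returns the unfolded, still-encoded Subject value.
func unfoldedSubject(raw string) (string, error) {
	r := textproto.NewReader(bufio.NewReader(strings.NewReader(raw)))
	h, err := r.ReadMIMEHeader()
	if err != nil {
		return "", err
	}
	return h.Get("Subject"), nil
}

func main() {
	subject, err := unfoldedSubject(rawHeader)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	// The CRLF and the four leading spaces of the continuation line
	// come back as one SPACE between the two encoded-words.
	fmt.Printf("%q\n", subject)
}
```

The unfolded value still contains one space between the two encoded-words, so whatever decodes it afterwards has to drop that space.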

https://github.com/flashmob/go-guerrilla/blob/51f7dda326b1e9878e5f679ccb34a134127951b0/mail/envelope.go#L135-L140

MimeHeaderDecode splits encoded-words on a single SPACE rather than on linear-white-space (line 233), and always preserves those separators in its output (line 240).

https://github.com/flashmob/go-guerrilla/blob/51f7dda326b1e9878e5f679ccb34a134127951b0/mail/envelope.go#L233-L240

issuefiler commented 4 years ago

TL;DR: MimeHeaderDecode does not collapse the white space that separates adjacent encoded-words.

issuefiler commented 4 years ago

An in-the-wild example (guerrillamail.com)

In the email with ID 8d62ebc4c6a8ded43a4b553a5013be1d@grr.la, the subject is displayed with a space that isn’t supposed to be there.

A wrongly inserted space (November 22, 2019)

Subject: =?iso-2022-jp?B?GyRCIVpLXEZ8Om89fCFbPEIkT0lUOk5NUSROJU0lPyROSn0bKEI=?=  =?iso-2022-jp?B?GyRCJCxCPyQkJEckORsoQg==?=

flashmob commented 4 years ago

Thanks.

Todo:

Modify the state machine in MimeHeaderDecode to consume the linear-white-space instead of preserving it (add an additional state that consumes it).
Add a test case for the Japanese example above, and fix any existing test cases that break.

flashmob commented 4 years ago

See #202

MimeHeaderDecode has been reworked. It now skips any spaces/tabs if another encoded-word is ahead. (It is also optimized to avoid allocating any buffers when no encoded-words are found, and the WriteByte(byte) calls have been eliminated!)
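For readers who want to see the idea in isolation, the "skip white space only when another encoded-word follows" rule can be sketched as a pre-pass over the header value. This is not the code from #202: `collapseGaps` and the `encodedWord` pattern are hypothetical, and the regular expression only loosely approximates the RFC 2047 grammar.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// encodedWord loosely matches "=?charset?encoding?encoded-text?=".
// Assumption: this is looser than the RFC 2047 token rules.
var encodedWord = regexp.MustCompile(`=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=`)

// collapseGaps (hypothetical) removes white space that sits between two
// adjacent encoded-words and leaves every other character untouched.
func collapseGaps(s string) string {
	locs := encodedWord.FindAllStringIndex(s, -1)
	if len(locs) < 2 {
		return s // nothing to collapse, so return the input unchanged
	}
	var b strings.Builder
	prev := 0
	for i, loc := range locs {
		gap := s[prev:loc[0]]
		// Keep the text before the first word, and any gap that
		// contains something other than SPACE, HTAB, CR or LF.
		if i == 0 || strings.Trim(gap, " \t\r\n") != "" {
			b.WriteString(gap)
		}
		b.WriteString(s[loc[0]:loc[1]])
		prev = loc[1]
	}
	b.WriteString(s[prev:])
	return b.String()
}

func main() {
	// The in-the-wild subject from earlier in this thread (two SPACEs).
	subject := "=?iso-2022-jp?B?GyRCIVpLXEZ8Om89fCFbPEIkT0lUOk5NUSROJU0lPyROSn0bKEI=?=  =?iso-2022-jp?B?GyRCJCxCPyQkJEckORsoQg==?="
	fmt.Printf("%q\n", collapseGaps(subject))
}
```

Because the gap is only dropped when both neighbours are encoded-words, a value such as `(=?ISO-8859-1?Q?a?= b)` keeps its space.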

flashmob / go-guerrilla

“MimeHeaderDecode” (envelope.go) returns an incorrectly-spaced string. #195
