4teamwork / opengever.core

OneGov GEVER core package
11 stars 5 forks source link

Zip export seems to convert line endings. #5285

Open deiferni opened 5 years ago

deiferni commented 5 years ago

Harvest: OneGov GEVER - Softwarepflege (Refactoring/Bug)

When exporting eml files in a Zip the line endings seem to be rewritten. I suspect we don't always write in binary mode 🤔.

deif@ae35 ~/Downloads $ file vor_zip.eml
vor_zip.eml: SMTP mail text, ASCII text, with CRLF line terminators
deif@ae35 ~/Downloads $ file rrr/nach_zip.eml
rrr/nach_zip.eml: SMTP mail text, ASCII text

See also https://extranet.4teamwork.ch/support/gever-ai/tracker/330

njohner commented 5 years ago

here a branch where I added a test mail with CRLF line endings: https://github.com/4teamwork/opengever.core/tree/nj_zip_crlf

I can confirm that the line endings get modified when writing the zip, but not when downloading the file from GEVER. Not sure yet why that happens.

njohner commented 5 years ago

Ha, it actually is the other way around it seems. When downloading a file, the line endings get rewritten, but not when exporting as ZIP. In that case, line endings are preserved.

Rotonen commented 5 years ago

Is it the browser being helpful versus its idea of your filesystem?

njohner commented 5 years ago

For the download probably. this is directly a stream response, with filename and such in the header. For the zipexport probably not, there we explicitely write the files

Rotonen commented 5 years ago

Verify the browser making the difference against output from curl into a file.

njohner commented 5 years ago

Original ticket with attached E-mails and such: https://extranet.4teamwork.ch/support/gever-ai/tracker/318

njohner commented 5 years ago

Ok so the issue comes from mail-in. Mail-in transforms CRLF to LF. Then zip writes the file as is, whereas download transforms back to CRLF (also via curl). So basically they have a problem with the zip export because the zip export does it right :-).

njohner commented 5 years ago

So the magic happens here: https://github.com/4teamwork/ftw.mail/blob/master/ftw/mail/inbound.py#L57 This then calls python's email.message.Message.as_string, then email.generator.Generator which uses \n as line separators.

Also not that on my machine the E-mail still works fine, even after the line endings have been changed. I'm not too sure whether we want to fix this and how.

njohner commented 5 years ago

Proposed solution: We could transform line endings during zip export.

deiferni commented 5 years ago

We should check if there is an RFC that takes a stance on what line separators shall be used. Or if both are valid. Also rewriting the line separator during download may mess with maximum line length limitations as they add an additional character. That may cause trouble further on.

njohner commented 5 years ago

It seems that CRLF is the standard for E-mail bodies (https://tools.ietf.org/html/rfc2822). Actually python's mail does not touch those, so they come out right, but it does change the line endings between header lines and parts, using \n there. It seems that this used to be acceptable but is not anymore.

njohner commented 5 years ago

Here some relevant parts from the standard:

   Messages are divided into lines of characters.  A line is a series of
   characters that is delimited with the two characters carriage-return
   and line-feed; that is, the carriage return (CR) character (ASCII
   value 13) followed immediately by the line feed (LF) character (ASCII
   value 10).  (The carriage-return/line-feed pair is usually written in
   this document as "CRLF".)

   A message consists of header fields (collectively called "the header
   of the message") followed, optionally, by a body.  The header is a
   sequence of lines of characters with special syntax as defined in
   this standard. The body is simply a sequence of characters that
   follows the header and is separated from the header by an empty line
   (i.e., a line with nothing preceding the CRLF).
2.3. Body

   The body of a message is simply lines of US-ASCII characters.  The
   only two limitations on the body are as follows:

   - CR and LF MUST only occur together as CRLF; they MUST NOT appear
     independently in the body.

   - Lines of characters in the body MUST be limited to 998 characters,
     and SHOULD be limited to 78 characters, excluding the CRLF.

   Note: As was stated earlier, there are other standards documents,
   specifically the MIME documents [RFC2045, RFC2046, RFC2048, RFC2049]
   that extend this standard to allow for different sorts of message
   bodies.  Again, these mechanisms are beyond the scope of this
   document.
4. Obsolete Syntax

   Finally, certain characters that were formerly allowed in messages
   appear in this section.  The NUL character (ASCII value 0) was once
   allowed, but is no longer for compatibility reasons.  CR and LF were
   allowed to appear in messages other than as CRLF; this use is also
   shown here.
phgross commented 5 years ago

In the mail download we manually convert the line endings, see https://github.com/4teamwork/opengever.core/blob/7390b2e4afd830f31cec55285d91bdf490b041fa/opengever/mail/browser/download.py#L17 or https://github.com/4teamwork/opengever.core/pull/638. So my suggestion would be to do the same for zip-export.