Open sebbASF opened 4 years ago
The Python email library code that parses boundary strings strips a surrounding pair of angle brackets (<>). This breaks parsing of some messages. For example, the unit test corpus file tomcat-ancient-boundary.mbox has the following boundary:
Content-Type: multipart/mixed; boundary="<<001-3e1dcd5a-119e>>"
Once parsed, the boundary becomes "<001-3e1dcd5a-119e>", which does not match the delimiter lines in the message body.
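The stripping can be reproduced with email.utils.unquote on its own. This is only a minimal illustration, assuming (as the linked Python bugs suggest) that the boundary value ends up being unquoted a second time after the double quotes have already been removed. The variable names are made up for the example.

```python
import email.utils

# Parameter value as it is written in the Content-Type header.
raw = '"<<001-3e1dcd5a-119e>>"'

# First pass: the surrounding double quotes are removed.
once = email.utils.unquote(raw)

# Second pass: unquote() also strips one leading '<' and one trailing '>'.
twice = email.utils.unquote(once)

print(once)
print(twice)
```

The first value is the boundary that actually appears in the message body; the second is what the parser ends up looking for.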
There are two Python bug reports for this, https://bugs.python.org/issue28945 and https://bugs.python.org/issue29020, but unfortunately no fix is in sight.
It's possible to monkey-patch the library by providing a replacement copy of the function email.utils.collapse_rfc2231_value.
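A rough sketch of such a patch follows. It is not the attached test code: the wrapper name collapse_keep_brackets and the sample message are hypothetical, and it assumes that plain string values reaching this function have already been unquoted by the caller, so they can be returned unchanged. RFC 2231 tuples are still passed to the stock implementation.

```python
import email
import email.utils

# Keep the stock function so encoded (charset, language, text) tuples still work.
_stock_collapse = email.utils.collapse_rfc2231_value


def collapse_keep_brackets(value, errors="replace", fallback_charset="us-ascii"):
    """Hypothetical replacement that skips the extra unquote for plain strings."""
    if isinstance(value, str):
        # Assumption: the caller has already removed the double quotes.
        return value
    return _stock_collapse(value, errors=errors, fallback_charset=fallback_charset)


# Replace the attribute on email.utils; this has to happen before any parsing.
email.utils.collapse_rfc2231_value = collapse_keep_brackets

# Small made-up message using the boundary from the corpus file.
SAMPLE = (
    'Content-Type: multipart/mixed; boundary="<<001-3e1dcd5a-119e>>"\n'
    "\n"
    "--<<001-3e1dcd5a-119e>>\n"
    "Content-Type: text/plain\n"
    "\n"
    "part one\n"
    "--<<001-3e1dcd5a-119e>>\n"
    "Content-Type: text/plain\n"
    "\n"
    "part two\n"
    "--<<001-3e1dcd5a-119e>>--\n"
)

msg = email.message_from_string(SAMPLE)
print(msg.get_boundary())
print(msg.is_multipart(), len(msg.get_payload()))
```

Note that the patch is process-wide: every caller of email.utils.collapse_rfc2231_value sees the new behaviour, which is one reason to keep it behind an option.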
It might make sense to add this as an importer option (at least initially) so that the messages that are currently missing can be imported.
Attached is some test code to demonstrate the fix.
parse_email.py.zip