ePADD / epadd

ePADD is a software package developed by Stanford University's Special Collections & University Archives that supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives.
https://www.epaddproject.org

Characters represented in quoted-printable encoding are sometimes not decoded during import of mbox files since v7.3. #395

Closed jfarwer closed 3 years ago

jfarwer commented 3 years ago

The problem seems to be the method getRawInputStream(), which since v7.3 is used under some circumstances instead of getInputStream() in EmailFetcherThread.java. The decoder (QuotedPrintableCodec.decodeQuotedPrintable(b)) is sometimes unable to decode the byte array read from getRawInputStream() and throws an exception. As a result, the imported email text contains the undecoded quoted-printable sequences rather than the decoded characters (for example, =20 instead of a space).

jfarwer commented 3 years ago

Update

The issue seems to be that the version of the decoder currently in use, QuotedPrintableCodec.decodeQuotedPrintable, does not implement the complete set of rules of the quoted-printable specification. In particular, soft line breaks are not supported. Therefore, if an email in an imported mbox file has an equals sign as the last character of a line (indicating a soft line break), the decoder throws an exception, and the email is then imported without decoding. The decoder supports the full quoted-printable specification from commons-codec version 1.10 onward. Changing the commons-codec version from 1.5 to 1.10 in the corresponding pom file seems to solve the issue.
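
To illustrate what soft line break handling means in practice, here is a minimal, JDK-only sketch of a quoted-printable decoder. It is not the commons-codec implementation and not ePADD code: the class name QuotedPrintableSketch and the method decode are hypothetical, the payload is assumed to be UTF-8, and accepting a bare LF after the equals sign is a lenient assumption of this sketch rather than something the specification requires.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the commons-codec or ePADD implementation.
public class QuotedPrintableSketch {

    /** Decodes quoted-printable text, skipping soft line breaks ("=" at end of line). */
    static String decode(String input) {
        byte[] in = input.getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            byte b = in[i];
            if (b != '=') {
                // Literal octet: copy through unchanged.
                out.write(b);
                i++;
            } else if (i + 2 < in.length && in[i + 1] == '\r' && in[i + 2] == '\n') {
                // Soft line break "=CRLF": emit nothing, the line simply continues.
                i += 3;
            } else if (i + 1 < in.length && in[i + 1] == '\n') {
                // Assumption: also tolerate a bare LF, as found in some mbox files.
                i += 2;
            } else if (i + 2 < in.length) {
                // Escaped octet "=XY", where X and Y are hexadecimal digits.
                int hi = Character.digit(in[i + 1], 16);
                int lo = Character.digit(in[i + 2], 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid escape at offset " + i);
                }
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                throw new IllegalArgumentException("Truncated escape at offset " + i);
            }
        }
        // Assumption: the decoded bytes are UTF-8 text.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("Hello=20World"));
        System.out.println(decode("a long line=\r\nthat continues"));
    }
}
```

A decoder that lacks the two soft-line-break branches would fall through to the hex-digit branch on a trailing equals sign and throw, which matches the failure described above.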