Many of us are using yahoo-group-archiver to back up Yahoo Groups API results. This script takes the output of that tool, and converts it into individual email files, mbox mail folders, and optionally, PDF files.
Mail folders stored as mbox can be imported by a wide range of desktop and server-side email clients, including Thunderbird (Linux, Mac, Windows), Apple Mail.app (Mac), and Microsoft Outlook (Windows and Mac).
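If you want a quick sanity check of an mbox file without opening a mail client, it can also be read programmatically. The following is a minimal sketch using Python's standard mailbox module; the function name summarize_mbox is hypothetical and not part of this tool.

```python
import mailbox


def summarize_mbox(path):
    """Return a (From, Subject) pair for every message in an mbox file."""
    # create=False makes a missing file raise an error instead of
    # silently creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        return [(msg.get("From", ""), msg.get("Subject", "")) for msg in box]
    finally:
        box.close()
```

For example, summarize_mbox("output-dir/mbox/list.mbox") would list the sender and subject of every archived message, assuming that file exists.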
Many non-technical users won't know what to do with an mbox file, but will really appreciate getting a PDF file containing all the emails in the list. You can enable experimental PDF support by installing Andrew Ferrier's email2pdf script. This process is known to be buggy, and your bug reports would be appreciated.
qpdf, to avoid running out of memory (there are packages for Yum/RPM, Debian/Ubuntu, and macOS brew)
mkdir output-dir
yahoo-group-archive-tools.pl --source <archived-input-dir> --destination <output-dir>
Start by installing Andrew Ferrier's email2pdf script. It can be a little complicated to install, but giving someone a PDF file of their list can elicit delight. This is experimental, so bug reports are appreciated.
mkdir output-dir
yahoo-group-archive-tools.pl --source <archived-input-dir> --destination <output-dir> --pdf --email2pdf <path to email2pdf Python script>
The output directory will contain:
email folder containing standalone email files for every email in the archive, e.g. email/1.eml, email/2.eml. The emails won't be pristine, because Yahoo redacts email addresses (see that and other caveats below). The email IDs reflect those downloaded by yahoo-group-archiver, and it's normal to see some gaps in keeping with the original numbering.
mbox/list.mbox, for the entire history of the list
pdf-individual directory containing individual PDFs for every email
pdf-combined directory with a single PDF file containing every email
The Yahoo Groups API redacts emails found in message headers. For example, they'll rewrite ceo@ford.com as "ceo@...".
Why is this bad? ceo@ford.com and ceo@toyota.com look the same if both are truncated to "ceo@...".
Because the API tells us the submitting Yahoo user's username, we can make a fake email domain that preserves the part before the @ in redacted emails, while being unique per user.
ceo@ford.com (Yahoo ID fordfan) emails the list: Yahoo redacts the address to ceo@... and we rewrite it as ceo@fordfan.yahoo.invalid
ceo@toyota.com (Yahoo ID toyotalover123) emails the list: Yahoo redacts the address to ceo@... even though this is a totally different person, and we rewrite it as ceo@toyotalover123.yahoo.invalid, which is different from ceo@fordfan.yahoo.invalid
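The mapping in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's actual code; the function name unredact_address is hypothetical.

```python
def unredact_address(redacted: str, yahoo_id: str) -> str:
    """Turn 'ceo@...' into 'ceo@<yahoo_id>.yahoo.invalid'."""
    local_part, sep, rest = redacted.partition("@")
    if not sep or rest != "...":
        # Not a redacted address, so leave it untouched.
        return redacted
    return f"{local_part}@{yahoo_id}.yahoo.invalid"


print(unredact_address("ceo@...", "fordfan"))         # ceo@fordfan.yahoo.invalid
print(unredact_address("ceo@...", "toyotalover123"))  # ceo@toyotalover123.yahoo.invalid
```

The .invalid top-level domain is reserved and never resolves, so the rewritten addresses cannot collide with real mailboxes.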
We make this change in several headers that include the original sender's email, including From and Message-Id. We save the original redacted version as an X- header. For example, if Yahoo says an email is From: ceo@..., we modify that to From: ceo@ceo123.yahoo.invalid, and save the original as "X-Original-Yahoo-Groups-Redacted-From: ceo@...". If we don't have a Yahoo profile name (e.g. "ceo123"), we use the numeric Yahoo user ID (e.g. "123456789") instead.
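As a rough sketch of the header handling described here, the snippet below rewrites a redacted From header on a Python email.message.Message and keeps the redacted original in the X- header. The function name and its parameters are hypothetical, and the real script may handle more headers and edge cases than this.

```python
from email.message import Message

REDACTED_SUFFIX = "@..."


def rewrite_from_header(msg: Message, profile_name, numeric_user_id) -> None:
    """Replace a redacted From address and keep the original as an X- header."""
    original = msg["From"]
    if original is None or REDACTED_SUFFIX not in original:
        return  # nothing to rewrite
    # Fall back to the numeric Yahoo user ID when no profile name is known.
    user_label = profile_name or str(numeric_user_id)
    rewritten = original.replace(REDACTED_SUFFIX, f"@{user_label}.yahoo.invalid")
    del msg["From"]  # assigning a header appends, so remove the old one first
    msg["From"] = rewritten
    msg["X-Original-Yahoo-Groups-Redacted-From"] = original
```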
The Yahoo Groups API detaches all attachments, and saves them in a separate place.
We try to stitch the emails back together, navigating through the MIME structure to attach the right attachment at the right place. In some cases, we're not able to identify where in the email MIME structure an attachment goes, so we reattach orphaned attachments to the whole email. In some cases, Yahoo doesn't give us the attachment, so we replace the attachment with a text part containing an error message, with original attachment-related headers added (X-Yahoo-Groups-Attachment-Not-Found, X-Original-Content-Type, X-Original-Content-Disposition, X-Original-Content-Id).
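To make the missing-attachment case concrete, here is a minimal Python sketch of building such a placeholder part. The header names come from the description above; the function name, the wording of the error message, and the value stored in X-Yahoo-Groups-Attachment-Not-Found are assumptions made for illustration.

```python
from email.message import Message
from email.mime.text import MIMEText


def missing_attachment_placeholder(original_part: Message, filename: str) -> MIMEText:
    """Build a text part that stands in for an attachment Yahoo did not return."""
    placeholder = MIMEText(
        f"Attachment '{filename}' could not be retrieved from Yahoo Groups.\n",
        "plain",
        "utf-8",
    )
    # Assumed value: the script may store something other than the filename.
    placeholder["X-Yahoo-Groups-Attachment-Not-Found"] = filename
    # Preserve the original attachment headers under X-Original-* names.
    for header in ("Content-Type", "Content-Disposition", "Content-Id"):
        value = original_part.get(header)
        if value is not None:
            placeholder[f"X-Original-{header}"] = value
    return placeholder
```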
The Yahoo Groups API forcibly truncates email messages with over 64 KB in text, and places a truncation message right in the middle of encoded content, e.g. Base64.
Whenever we see an email body that ends with (Message over 64 KB, truncated), we remove that string from the broken message part, and pray that downstream parsers will be able to deal with truncated HTML, Base64, etc. We mark these message parts with an X-Yahoo-Groups-Content-Truncated header.
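The cleanup step can be sketched as follows in Python. The marker string and header name are taken from the text above; the function name and the header value "yes" are assumptions, since the value the script actually writes is not stated here.

```python
from email.message import Message

TRUNCATION_MARKER = "(Message over 64 KB, truncated)"


def strip_truncation_marker(part: Message) -> bool:
    """Remove Yahoo's truncation marker from a message part, if present."""
    payload = part.get_payload()
    if not isinstance(payload, str):
        return False  # multipart container, nothing to strip at this level
    stripped = payload.rstrip()
    if not stripped.endswith(TRUNCATION_MARKER):
        return False
    # Drop the marker but leave the (still truncated) content as it is.
    part.set_payload(stripped[: -len(TRUNCATION_MARKER)])
    part["X-Yahoo-Groups-Content-Truncated"] = "yes"  # assumed header value
    return True
```

Note that this only removes the marker. The remaining Base64 or HTML is still cut short, so downstream parsers need to tolerate incomplete content.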
The Yahoo Groups API appears to be decoding and recoding textual message bodies, because we see Unicode "U+FFFD" replacement characters in the raw RFC822 text that should be 7-bit clean. We're also seeing ^M carriage returns at the end of every header line and MIME body part.
We remove the stray carriage returns and 8-bit characters from 7-bit RFC822 text.
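A minimal Python sketch of this kind of cleanup is shown below. It assumes that offending characters are simply dropped rather than replaced, which may differ from what the script actually does; the function name is hypothetical.

```python
def clean_7bit_text(raw: str) -> str:
    """Drop carriage returns and any non-ASCII characters from 7-bit text."""
    without_cr = raw.replace("\r", "")
    # Keep only code points that fit in 7 bits; this also removes U+FFFD.
    return "".join(ch for ch in without_cr if ord(ch) < 128)


print(repr(clean_7bit_text("Subject: Hi\ufffd there\r\n")))  # 'Subject: Hi there\n'
```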
Installing this script requires installing many CPAN dependencies. If you're confused, feel free to search for things like "installing CPAN modules". (One user had the Text::LevenshteinXS module rather than Text::Levenshtein::XS, so they just changed the dependency in the Perl script to match.)
This tool directly executes the email2pdf script specified by the --email2pdf option. Make sure the #! shebang line is set to the Python interpreter of your choice. You can test email2pdf execution by manually running something like email2pdf --headers -i <a .eml file> --output-file <the-filename-to-write.pdf> on a single email/[number].eml file generated by this script.
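If you would rather script that check, the sketch below reads the shebang line of the email2pdf script and builds the same test command as an argument list. Both function names are hypothetical helpers, and the paths are placeholders you would replace with your own.

```python
def read_shebang(script_path):
    """Return the script's '#!' line, or None if it has no shebang."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace")
    first_line = first_line.rstrip("\r\n")
    return first_line if first_line.startswith("#!") else None


def email2pdf_test_command(script_path, eml_path, pdf_path):
    """Build the manual test invocation described above as an argv list."""
    return [
        str(script_path),
        "--headers",
        "-i", str(eml_path),
        "--output-file", str(pdf_path),
    ]
```

The returned list can be passed to subprocess.run(..., check=True) to run the conversion on a single email file and fail loudly if email2pdf exits with an error.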
Significant changes:
This software is copyright Anirvan Chatterjee, and licensed under the MIT License.
Have questions, bug reports, or suggestions? Feel free to use GitHub's issue tracker. If you need to contact me privately, DM me @anirvan on Twitter.