Faster email parser - Githubissues

rth commented 7 years ago

The default email.Parser (which converts the raw email text to a structured dict) is written in pure Python in the standard library and is somewhat slow. As a result, when threading emails, the performance bottleneck is the email parsing. Here is a benchmark for a dataset of 5,000 emails:

email.Parser: 33.329 s
converting to jwzthreading.Message format: 0.121 s
the JWZ threading algorithm: 0.031 s
sorting of threads: 0.002 s

Possible solutions would be to:

1. use a MIME parser from https://github.com/mailgun/flanker (though it has no PY3 support for the moment and pulls in a lot of additional dependencies)
2. adapt https://github.com/jkr/pygmime (no PY3 support either, and cross-platform support would be difficult)
3. write a custom simplified email.Parser (we only require the References:, In-Reply-To: and Subject: header fields for the JWZ algorithm)
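
For reference, the custom simplified parser option might look roughly like the sketch below. It is only an illustration, not code from jwzthreading: the names `parse_threading_headers` and `WANTED` are hypothetical. It assumes the raw email is already available as a `str`, cuts the text at the first blank line so that the body and attachments are never parsed, and hands the header block to the standard library's `email.parser.HeaderParser`.

```python
import re
from email.parser import HeaderParser

# Hypothetical names, for illustration only (not part of jwzthreading).
WANTED = ("References", "In-Reply-To", "Subject")

# The header block ends at the first empty line (LF or CRLF line endings).
_HEADER_END = re.compile(r"\r?\n\r?\n")


def parse_threading_headers(raw_text):
    """Return only the header fields needed for threading.

    The body (including any attachments) is cut off before parsing,
    so the cost no longer depends on the size of the message.
    Missing headers are returned as None.
    """
    match = _HEADER_END.search(raw_text)
    header_block = raw_text if match is None else raw_text[:match.start()]
    msg = HeaderParser().parsestr(header_block)
    return {name: msg.get(name) for name in WANTED}


raw = (
    "Subject: Hello\n"
    "In-Reply-To: <a@example.org>\n"
    "References: <a@example.org> <b@example.org>\n"
    "\n"
    "Body text, possibly followed by large attachments.\n"
)
headers = parse_threading_headers(raw)
```

Header lookup through `msg.get` is case-insensitive, so `in-reply-to:` and `In-Reply-To:` are treated the same. Whether this is actually faster than the full parser on a given dataset would still need to be measured.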

rth commented 7 years ago

To clarify, this issue arises with large emails that have attachments (the above dataset contained a fair number of those). When processing emails without attachments, the built-in email parser performs quite well. This is rather consistent with the mailgun/flanker benchmarks.

For instance, the fedora-devel mailing list for 2003-2010 (130,000 emails) can be parsed in 96 s and threaded in 1.9 s.

DenisFLASH commented 7 years ago

Given that the first 2 solutions have no PY3 support, are you more likely to select the 3rd one?

rth commented 7 years ago

Well, I don't think it's an issue at the moment; we'll just assume that proper pre-processing is done on emails (in which case the built-in parser works fine). This is just for future reference.

DenisFLASH commented 7 years ago

Got it

FreeDiscovery / jwzthreading

Faster email parser #5