FreeDiscovery / jwzthreading

Implementation of the JWZ threading algorithm for e-mail or newsgroup messages.

Faster email parser #5

Open rth opened 7 years ago

rth commented 7 years ago

The default email.Parser (which converts the raw email text to a structured dict) is written in pure Python in the standard library and is somewhat slow. As a result, when threading emails, the performance bottleneck is the e-mail parsing. Here is a benchmark for a dataset of 5,000 emails,

A solution could be to,

rth commented 7 years ago

To clarify, this issue is with large emails that have attachments (the above dataset contained a fair number of those). When processing emails without attachments, the builtin email parser performs quite well. This is rather consistent with the mailgun/flanker benchmarks.

For instance, the fedora-devel mailing list archive for 2003-2010 (130,000 emails) can be parsed in 96 s and threaded in 1.9 s.
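
Since the slowdown comes from large emails with attachments, one thing that might be worth trying is asking the built-in parser to stop after the header block. Below is a minimal sketch, assuming that threading only needs the Message-ID, In-Reply-To, References and Subject headers. The helper name `parse_threading_headers` and the sample message are hypothetical, and this has not been benchmarked against the dataset above.

```python
from email.parser import Parser

# Headers the JWZ algorithm relies on for linking messages.
THREADING_HEADERS = ("Message-ID", "In-Reply-To", "References", "Subject")


def parse_threading_headers(raw_text):
    """Hypothetical helper: return only the threading-relevant headers."""
    # headersonly=True makes the stdlib parser stop after the header block;
    # the body (including any attachments) is left as one unparsed string.
    msg = Parser().parsestr(raw_text, headersonly=True)
    return {name: msg.get(name) for name in THREADING_HEADERS}


# Made-up multipart message standing in for an email with an attachment.
raw = (
    "Message-ID: <2@example.org>\n"
    "In-Reply-To: <1@example.org>\n"
    "References: <1@example.org>\n"
    "Subject: Re: hello\n"
    "Content-Type: multipart/mixed; boundary=XYZ\n"
    "\n"
    "--XYZ\n"
    "Content-Type: application/octet-stream\n"
    "\n"
    "AAAA\n"
    "--XYZ--\n"
)

headers = parse_threading_headers(raw)
print(headers)
```

Whether this actually removes the bottleneck would need to be measured; it only skips building the MIME tree, the raw text still has to be read in full.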

DenisFLASH commented 7 years ago

Given that the first 2 solutions have no PY3 support, are you more likely to select the 3rd one?

rth commented 7 years ago

Well, I don't think it's an issue at the moment. We'll just assume that proper pre-processing is done on emails (in which case the built-in parser works fine). This is just for future reference.

DenisFLASH commented 7 years ago

Got it