Open rth opened 7 years ago
To clarify, this issue only affects large emails that have attachments (the above dataset contained a fair number of those). When processing emails without attachments, the built-in email parser performs quite well. This is fairly consistent with the mailgun/flanker benchmarks.
For instance, the fedora-devel mailing list 2003-2010 (130,000 emails) can be parsed in 96 s and threaded in 1.9 s.
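For anyone who wants to check the attachment effect locally, here is a minimal sketch using only the standard library. The helper names `make_raw_email` and `time_parse`, the synthetic message, and the 1 MB attachment size are all assumptions made for illustration; they are not taken from the dataset discussed above, and timings will vary by machine.

```python
import time
from email.message import EmailMessage
from email.parser import Parser


def make_raw_email(attachment_size=0):
    """Build a synthetic raw email (hypothetical data), optionally with a binary attachment."""
    msg = EmailMessage()
    msg["Subject"] = "Test message"
    msg["From"] = "alice@example.com"
    msg["To"] = "list@example.com"
    msg["Message-ID"] = "<1@example.com>"
    msg.set_content("Hello, world.\n")
    if attachment_size:
        # Adding an attachment turns the message into multipart/mixed.
        msg.add_attachment(
            b"\0" * attachment_size,
            maintype="application",
            subtype="octet-stream",
            filename="blob.bin",
        )
    return msg.as_string()


def time_parse(raw, repeat=5):
    """Return the mean wall-clock time in seconds to fully parse `raw`."""
    parser = Parser()
    start = time.perf_counter()
    for _ in range(repeat):
        parser.parsestr(raw)
    return (time.perf_counter() - start) / repeat


plain = make_raw_email()
with_attachment = make_raw_email(attachment_size=1_000_000)

print(f"no attachment:   {time_parse(plain):.6f} s")
print(f"1 MB attachment: {time_parse(with_attachment):.6f} s")
```

This only times the full parse of a single message shape, so it is a rough indication rather than a substitute for the benchmark on real mailing list data.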
Given that the first 2 solutions have no PY3 support, are you more likely to select the 3rd one?
Well, I don't think it's an issue at the moment; we'll just assume that proper pre-processing is done on emails (in which case the built-in parser works fine). This is just for future reference.
Got it
The default email.Parser (which converts the raw email text to a structured dict) is written in pure Python in the standard library and is somewhat slow. As a result, when threading emails, the performance bottleneck is the email parsing. Here is a benchmark for a dataset of 5,000 emails:
- email.Parser: 33.329 s
- jwthreading.Message format: 0.121 s

A solution could be to [...] (... References:, In-Reply-To: and Subject header fields, for the JWZ algorithm)
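Since threading only needs a handful of header fields, one way to picture a lighter-weight extraction is the sketch below. It uses `email.parser.HeaderParser` from the standard library, which parses the headers and leaves the body as an unparsed string. The function name `threading_headers` and the sample message are hypothetical, and whether this corresponds to any of the options listed in this issue, or gives a meaningful speed-up on the dataset above, is an assumption that would need measuring.

```python
from email.parser import HeaderParser


def threading_headers(raw):
    """Extract only the header fields needed for threading (hypothetical helper).

    HeaderParser stops interpreting the message after the header block,
    so the MIME structure of the body is never walked.
    """
    msg = HeaderParser().parsestr(raw)
    return {
        "message_id": msg.get("Message-ID"),
        # References is a whitespace-separated list of message IDs.
        "references": (msg.get("References") or "").split(),
        "in_reply_to": msg.get("In-Reply-To"),
        "subject": msg.get("Subject"),
    }


# Hypothetical sample message, for illustration only.
raw = (
    "Message-ID: <3@example.com>\n"
    "In-Reply-To: <2@example.com>\n"
    "References: <1@example.com> <2@example.com>\n"
    "Subject: Re: parser performance\n"
    "\n"
    "Body text that is never inspected.\n"
)

headers = threading_headers(raw)
print(headers["references"])
```

Returning a plain dict keeps the result independent of the `email` message classes, so it could be converted to whatever structure the threading code expects.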