Open rodpayne opened 2 months ago
The code is working fine, at least for my use case (graph mailbox). I have run the following benchmarks to look at the performance.
[general]
n_procs = 4
dns_timeout = 10.0
…
[msgraph]
…
[mailbox]
reports_folder = Inbox/OneDaySample
batch_size = 50
archive_folder = Archive/OneDaySample
Run with batch_size = 50 and dns_timeout = 10.0
Elapsed time: 02:08:41
Rerun with batch_size = 50 and dns_timeout = 6.0
Elapsed time: 01:09:03 * (Can't explain the outlier.)
Elapsed time: 00:39:08
Elapsed time: 00:37:26
Elapsed time: 00:38:25
Rerun with batch_size = 500 and dns_timeout = 6.0
Elapsed time: 00:22:54
Elapsed time: 00:20:04
Run with batch_size = 50 (also, effectively, dns_timeout = 6.0 because of a bug in propagating the setting)
Elapsed time: 01:08:15
Rerun with batch_size = 500
Elapsed time: 00:46:39
With batch_size = 50
Elapsed time: 21:57:12
(Yes, almost a day to process a day's mail messages.)
Hi,
Sorry, I'm just getting around to addressing PRs. Can you rebase this PR and fix the conflicts?
I have been off on sick leave. I should be back to work in the next few weeks, and I will look at it then.
I have been experimenting with multithreading the mail-message processing. Each mail message in a batch is processed "in parallel" so that when one thread is waiting for a DNS timeout or other I/O, another one can keep on processing. I tried multiprocessing too, but could not work out how to share the cache files instead of duplicating (and diluting) them between the processes. On my system at least, the CPU is not a bottleneck, so full multiprocessing does not provide much more benefit.
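To make the idea concrete, here is a minimal sketch of the pattern, not the code in this PR. `parse_message` and `process_batch` are hypothetical names, and the `time.sleep` call only stands in for a DNS lookup or other blocking I/O. Because the threads share one process, any in-memory cache would be shared too, which is the property I could not get with multiprocessing.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def parse_message(message_id: str) -> dict:
    """Hypothetical stand-in for parsing one report message.

    The sleep simulates waiting on a DNS lookup or other I/O; while one
    thread is blocked here, the other threads keep working.
    """
    time.sleep(0.05)
    return {"id": message_id, "ok": True}


def process_batch(message_ids, max_workers=8):
    """Process every message in a batch on a pool of threads.

    pool.map returns results in the same order as the input, even though
    the individual messages may finish in a different order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_message, message_ids))


if __name__ == "__main__":
    batch = [f"msg-{i}" for i in range(8)]
    start = time.perf_counter()
    results = process_batch(batch)
    elapsed = time.perf_counter() - start
    print(f"processed {len(results)} messages in {elapsed:.2f}s")
```

With I/O-bound work like this, the batch takes roughly as long as its slowest message rather than the sum of all of them, which would be consistent with the CPU not being the bottleneck.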
Let me know what you think. There is probably more cleanup to be done. Maybe a little more restructuring to handle saving the results in the thread. This may also play into the goal of saving the results before moving or deleting the mail message.
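One possible shape for saving inside the thread, sketched under assumptions: `parse`, `save`, and `archive` are hypothetical callables passed in for illustration, not functions from this code base. The point is only the ordering, where a message is moved after its results have been saved, and a failed save leaves the message where it is.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def handle_message(message_id, parse, save, archive):
    """Parse, save, then archive one message, all in the worker thread.

    If parse or save raises, archive is never reached, so the message
    stays in the reports folder and can be retried on the next run.
    """
    report = parse(message_id)
    save(report)
    archive(message_id)
    return message_id


def process_batch_safely(message_ids, parse, save, archive, max_workers=8):
    """Run handle_message for each id; return (archived, failed) id lists."""
    archived, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(handle_message, m, parse, save, archive): m
            for m in message_ids
        }
        for future in as_completed(futures):
            message_id = futures[future]
            try:
                future.result()
                archived.append(message_id)
            except Exception:
                # Saving (or parsing) failed; the message was not moved.
                failed.append(message_id)
    return sorted(archived), sorted(failed)
```

The `save` callable would need to be safe to call from several threads at once, which is an assumption this sketch does not enforce.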