domainaware / parsedmarc

A Python package and CLI for parsing aggregate and forensic DMARC reports
https://domainaware.github.io/parsedmarc/
Apache License 2.0

Multithread the mail-message processing #509

Open rodpayne opened 2 months ago

rodpayne commented 2 months ago

I have been experimenting with multithreading the mail-message processing. Each mail message in a batch is processed "in parallel" so that when one thread is waiting for a DNS timeout or other I/O, another one can keep on processing. I tried multiprocessing too, but could not work out how to share the cache files instead of duplicating (and diluting) them between the processes. On my system at least, the CPU is not a bottleneck, so full multiprocessing does not provide much more benefit.

Let me know what you think. There is probably more cleanup to be done. Maybe a little more restructuring to handle saving the results in the thread. This may also play into the goal of saving the results before moving or deleting the mail message.

rodpayne commented 2 months ago

The code is working fine, at least for my use case (graph mailbox). I have run the following benchmarks to look at the performance.

Significant Options:

[general]
n_procs = 4
dns_timeout = 10.0
…
[msgraph]
…
[mailbox]
reports_folder = Inbox/OneDaySample
batch_size = 50
archive_folder = Archive/OneDaySample

The one-day sample has 428 mail messages with 513,330 reports from 4/10/2024.

Version 8.10.3 + #509 (multithreading change)

Run with batch_size = 50 and dns_timeout = 10.0:
Elapsed time: 02:08:41

Rerun with batch_size = 50 and dns_timeout = 6.0:
Elapsed time: 01:09:03 * (Can't explain the outlier.)
Elapsed time: 00:39:08
Elapsed time: 00:37:26
Elapsed time: 00:38:25

Rerun with batch_size = 500 and dns_timeout = 6.0:
Elapsed time: 00:22:54
Elapsed time: 00:20:04

Version 8.10.3 w/o #509 (cache improvements only)

Run with batch_size = 50 (effectively also dns_timeout = 6.0, because of a bug in propagating the setting):
Elapsed time: 01:08:15

Rerun with batch_size = 500:
Elapsed time: 00:46:39

Version 8.6.4 (before cache and multithreading changes)

With batch_size = 50:
Elapsed time: 21:57:12 (Yes, almost a day to process a day's mail messages.)
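To put the elapsed times above side by side, here is a small sketch that converts them to seconds and computes speedup ratios. The helper names are made up for illustration, and the choice of which run represents each configuration (the median of the three non-outlier runs for batch_size = 50, the faster of the two runs for batch_size = 500) is an assumption, not something stated in the benchmarks.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup(before, after):
    """Return how many times faster 'after' is than 'before'."""
    return to_seconds(before) / to_seconds(after)


# Elapsed times copied from the benchmark runs above.
baseline_8_6_4 = "21:57:12"   # 8.6.4, batch_size = 50
cache_only_50 = "01:08:15"    # 8.10.3 without #509, batch_size = 50
threaded_50 = "00:38:25"      # 8.10.3 + #509, batch_size = 50 (assumed representative run)
cache_only_500 = "00:46:39"   # 8.10.3 without #509, batch_size = 500
threaded_500 = "00:20:04"     # 8.10.3 + #509, batch_size = 500 (assumed representative run)

comparisons = [
    ("8.6.4 -> cache improvements (batch 50)", baseline_8_6_4, cache_only_50),
    ("cache only -> multithreaded (batch 50)", cache_only_50, threaded_50),
    ("cache only -> multithreaded (batch 500)", cache_only_500, threaded_500),
]
for label, before, after in comparisons:
    print(f"{label}: {speedup(before, after):.1f}x faster")
```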

seanthegeek commented 1 month ago

Hi,

Sorry, I'm just getting around to addressing PRs. Can you rebase this PR and fix the conflicts?

rodpayne commented 2 weeks ago

I have been off on sick leave. I should be back to work in the next few weeks, and I will look at it then.