domainaware / parsedmarc

A Python package and CLI for parsing aggregate and forensic DMARC reports
https://domainaware.github.io/parsedmarc/
Apache License 2.0

Graph mail connector only pulls 10 messages per run #333

Closed Lauwnch closed 2 years ago

Lauwnch commented 2 years ago

When running parsedmarc with the MSGraph mailbox connector, only 10 messages are processed per execution. I tried setting mailbox["batch_size"] to 50, but still only 10 messages were processed per execution. For reference, I am running the latest version of parsedmarc available on pip, and I am not running it as a service. Maybe this was a bad assumption, but I kind of thought that parsedmarc would process every message in the mailbox I pointed it to.

Per Microsoft documentation, Graph paginates the data returned from the .../messages endpoint, with a default page size of 10 (https://docs.microsoft.com/en-us/graph/api/user-list-messages?view=graph-rest-1.0&tabs=http). Since there is currently no logic to handle pagination in MSGraphConnection.fetch_messages in parsedmarc/mail/graph.py, this is expected behavior on the MS side at least.

I think there are a few options here: have fetch_messages return a generator of result pages (although this might mess up other mailbox connection types), have fetch_messages run through all the pages and return a full list of messages, or let the configured batch_size determine the page size returned from Graph by adding a $top=batch_size parameter to the URL used in fetch_messages.
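To make the options a bit more concrete, here is a rough sketch of how they could fit together. It is not taken from parsedmarc and has not been tested against it: the function names and the `get_json` callable (anything that takes a URL and returns the decoded JSON body, e.g. a thin wrapper around an authenticated GET) are hypothetical. The only Graph behavior it relies on is that each page carries its messages under `value`, that `@odata.nextLink` is present while more pages remain, and that `$top` sets the page size.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Message = Dict[str, Any]


def iter_message_pages(
    get_json: Callable[[str], Dict[str, Any]],  # hypothetical: URL -> parsed JSON
    messages_url: str,
    batch_size: Optional[int] = None,
) -> Iterator[List[Message]]:
    """Yield one list of messages per Graph result page."""
    url: Optional[str] = messages_url
    if batch_size:
        # $top asks Graph for a page size other than the default of 10.
        separator = "&" if "?" in messages_url else "?"
        url = f"{messages_url}{separator}$top={batch_size}"
    while url:
        page = get_json(url)
        yield page.get("value", [])
        # The next-page URL is absent on the last page, which ends the loop.
        url = page.get("@odata.nextLink")


def fetch_all_messages(
    get_json: Callable[[str], Dict[str, Any]],
    messages_url: str,
    batch_size: Optional[int] = None,
) -> List[Message]:
    """Walk every page and return the messages as one flat list."""
    messages: List[Message] = []
    for page in iter_message_pages(get_json, messages_url, batch_size):
        messages.extend(page)
    return messages
```

Returning the flat list is probably the least disruptive choice, since callers would not need to know about paging at all, while the generator keeps memory use down for very large mailboxes.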

If I can find a good way to test this out I would try doing a pull request, I'm not super GitHub literate though and I've already just run parsedmarc multiple times in my own environment to get through our messages :).