domainaware / parsedmarc

A Python package and CLI for parsing aggregate and forensic DMARC reports
https://domainaware.github.io/parsedmarc/
Apache License 2.0

performance issue #107

Closed CodeNameTheOnlyOne closed 5 years ago

CodeNameTheOnlyOne commented 5 years ago

The software is working great, but it seems to have a hard time processing the reports I get from Gmail. They are fairly large and take 4+ hours to process a single file. The box shows no load during this time, and I see little network traffic from DNS queries. Is there something I'm missing here? And when they do finish, the program crashes while trying to move the message (so the process repeats forever).

I have gotten them to go through, but it takes many attempts. The messages are quite large and probably cover 20,000+ emails.

parsedmarc -c /etc/parsedmarc.ini
0it [00:00, ?it/s]
DEBUG:init.py:992:Found 1 messages in INBOX
DEBUG:init.py:996:Processing message 1 of 1: UID 653
DEBUG:init.py:1050:Moving aggregate report messages from INBOX to Archive /Aggregate
DEBUG:init.py:1056:Moving message 1 of 1: UID 653
ERROR:init.py:1063:IMAP error: Error moving message UID 653: Disconnected for inactivity.
ERROR:cli.py:573:IMAP Error: Disconnected for inactivity.

libexpand commented 5 years ago

Is it gmail throttling the connection?

rubicondimitri commented 5 years ago

Hi all,

I did some tests, and what I found is that when a DMARC report contains too many sending IPs without reverse hostnames, the reverse DNS queries take so long that the IMAP TCP connection is "lost".

I tried processing a big DMARC report with this configuration in parsedmarc.ini:

[general]
save_aggregate = True
save_forensic = True
debug = True
offline = True

and it worked well.

Unfortunately, in offline mode I don't get the reverse DNS hostnames for the sending IPs, so it might be harder to identify the sending source.

I will try to play around with the "dns_timeout" parameter in [general].
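A rough back-of-the-envelope sketch of why this adds up. It assumes lookups run one at a time and that every IP without a reverse hostname waits for the full timeout. That is an assumption for illustration, not a statement about how parsedmarc schedules its queries. The function name and the count of 7,200 IPs are made up; the timeout values 2 and 0.1 are the ones mentioned in this thread.

```python
def worst_case_lookup_seconds(unresolved_ips: int, dns_timeout: float) -> float:
    """Upper bound on time spent waiting for reverse DNS.

    Assumes sequential lookups where each unresolved IP blocks
    for the full timeout (an illustrative assumption).
    """
    return unresolved_ips * dns_timeout


if __name__ == "__main__":
    # Hypothetical report: 7,200 sending IPs with no reverse hostname.
    for timeout in (2.0, 0.1):
        seconds = worst_case_lookup_seconds(7200, timeout)
        print(f"dns_timeout={timeout}: about {seconds / 60:.0f} minutes")
```

Under these assumed numbers, a 2 second timeout means roughly four hours of waiting, while 0.1 seconds brings it down to around twelve minutes. Either way, an IMAP session left idle for that long can be dropped by the server.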

CodeNameTheOnlyOne commented 5 years ago

@Vico999 There is no connection to Gmail; they only send me the report. @rubicondimitri I would like to keep the reverse lookups if I can, and also the GeoIP info, as that is really helpful. I am trying now with a 0.1 DNS timeout and am seeing better results. By the way, do you know what units dns_timeout is in? I'm assuming seconds, as I saw somewhere that it defaults to 2.

Also, it looks like this error needs to be handled better. It could try to reconnect to the IMAP server instead of just giving up.
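A minimal sketch of what "reconnect and retry" could look like. This is not parsedmarc's code: `run_with_reconnect`, `action`, and `reconnect` are hypothetical names, and the caller would supply callables that, for example, move the message and log in to the IMAP server again.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    action: Callable[[], T],
    reconnect: Callable[[], None],
    retries: int = 3,
    delay: float = 1.0,
    errors: Tuple[Type[BaseException], ...] = (ConnectionError, OSError),
) -> T:
    """Run action(); on a connection-type error, reconnect and try again.

    Re-raises the last error once `retries` extra attempts are used up.
    """
    for attempt in range(retries + 1):
        try:
            return action()
        except errors:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)  # brief pause before reconnecting
            reconnect()
    raise AssertionError("unreachable")
```

With the standard library's imaplib, a dropped connection surfaces as `imaplib.IMAP4.abort`, so a caller using that library could pass `errors=(imaplib.IMAP4.abort,)`. Other IMAP clients raise their own exception types.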

rubicondimitri commented 5 years ago

I now download the e-mails to a folder, then I run the command:

parsedmarc -c parsedmarc.ini emails/*

and now it's working fine

CodeNameTheOnlyOne commented 5 years ago

I have had no issues since setting a 0.1 DNS timeout.

rubicondimitri commented 5 years ago

Hi Code,

How big are your DMARC reports? I'm testing with about 3 thousand emails, but I would be glad to know how long it takes to process a DMARC report with more than 300k emails.

thanks!

CodeNameTheOnlyOne commented 5 years ago

The last one I got from Google was 13k. It's hard to say how long it takes, as I no longer have to babysit it. With the default DNS timeout I was waiting 4+ hours; with a 0.1 DNS timeout I think they take half an hour. It seems like they have slowed down with the reports (or the spammers have slowed down).