Fix: Less error-prone code to minimize the risk of program crash

domainaware / parsedmarc

A Python package and CLI for parsing aggregate and forensic DMARC reports

https://domainaware.github.io/parsedmarc/

Apache License 2.0

995 stars 213 forks source link

Fix: Less error-prone code to minimize the risk of program crash #417

Closed andersnauman closed 1 year ago

andersnauman commented 1 year ago

Double check if org_name exist. Empty name will crash Elastissearch's 'exist' search
Move try-statement to include open() to catch if files do not exist
Enclose Elasticsearch's execute in a try-statement to catch any invalid searches when variables are empty/missing

qwell commented 1 year ago

I'm not sure if it's the same issue that prompted you to make this PR, but I was just having some problems with reports from Yahoo, where org_name is Yahoo, rather than yahoo.com or yahooinc.com. When get_base_domain() gets called, it returns None. The fallback of using the domain from email happens before checking whether org_name is None, which means a fallback never happens for an "invalid" org_name.

For reference, my data is <feedback><report_metadata><org_name>Yahoo</org_name><email>dmarchelp@yahooinc.com</email><snipped/></report_metadata><snipped/></feedback>

andersnauman commented 1 year ago

@qwell It is in the ballpark of being the same issue but I believe your issue have been fixed in 8.6.1 while the version on PyPI is currently 8.6.0. This PR is more aimed to make the program not crash if it hits these cases where parsing for any reason goes wrong.

The new code in 8.6.1 should never overwrite org_name with None, although it can still be missing a value since org_name and email could potentially have empty values in the XML. For atleast an elasticsearch implementation, these empty values will crash the code and unfortunately also ignore any reports that came after the crashing one. This PR makes sure that the program do not crash and continues on with the remaining reports.

Until PyPI is updated, I use git to create patches for the missing code and then apply those patches into the Docker container. A more suitable way for most people would probably be using git clone + pip install /path/to/git-repo

gdchamal commented 1 year ago

I encounter today same issue where elasticsearch raise an uncaught exception doing search with a missing value.

+1 for this PR ;)

seanthegeek commented 1 year ago

@andersnauman For some reason this is failing unit testing. I haven't looked into it yet.

andersnauman commented 1 year ago

@seanthegeek You are correct :), although, I think those tests needs some overhaul since they have been lurking for failure for some time now. def testAggregateSamples(self): reads test-files which raises InvalidDMARCReport, but those exception are never caught in the test.

You will need to make the test more specific to some of the files, like the example below:

if sample_path == "samples/broken/report":
    with self.assertRaises(parsedmarc.InvalidDMARCReport):
        parsed_report = ...