domainaware / parsedmarc

A Python package and CLI for parsing aggregate and forensic DMARC reports
https://domainaware.github.io/parsedmarc/
Apache License 2.0

corrupted files/input/aggregate.json #447

Open Pascal76 opened 10 months ago

Pascal76 commented 10 months ago

Hello,

I often end up with corrupted files/input/aggregate.json files. I have to fix the JSON file manually :(

(screenshots attached: 2023-12-07_20h38_19, 2023-12-07_20h39_00)

regards, Pascal

AnaelMobilia commented 8 months ago

Hello @Pascal76 ,

Could you please share a DMARC report that produces this issue? I just tried on my own data and the output looks clean (screenshot attached).

As the "xml_schema" doesn't have the same value as in your screenshot, I suspect this could be an issue in a linked library.

Pascal76 commented 7 months ago

I sent you a report by email weeks ago ... did you receive it?

I fix the file like that :

```php
<?php
// Attempt to repair files/input/aggregate.json in place (max two passes)
$json_file = file_get_contents(__DIR__ . "/files/input/aggregate.json");

if ($json_file === '[]') {
    print date('Y-m-d H:i:s') . " | Nothing to do\n";
    exit;
}

$max_fix_attempts = 2;
$nb_of_fixes = 0;

for ($i = 1; $i <= $max_fix_attempts; $i++) {
    $json = json_decode($json_file, true);
    if ($json) {
        print date('Y-m-d H:i:s') . " | json OK (errors: $nb_of_fixes)\n";
        exit;
    }
    print date('Y-m-d H:i:s') . " | # $i | json KO\n";
    if ($i === 1 && preg_match("/^\[\],/", $json_file)) {
        // Stray "[]," prefix: collapse it into a single opening bracket
        $json_file = preg_replace("/^\[\],/", "[", $json_file);
        $nb_of_fixes++;
        continue;
    }
    // A "}\n],\n" sequence mid-file: rejoin the split arrays
    $json_file = preg_replace("/\s+}\n\],\n\s+/", "\n },\n ", $json_file);
    $nb_of_fixes++;
}

print date('Y-m-d H:i:s') . " | Could not fix the file :(\n";
```

fourjay commented 7 months ago

I think I've run into something very similar. The fix that seems to work is to find lines that begin with, and contain only, `],`, delete that line, and add (move?) the comma to the closing curly brace above it. FWIW, I suspect it is triggered by yahoo.com's reports.
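A minimal Python sketch of the line-level fix described above (the function name and the sample input are mine, not part of parsedmarc): drop any line consisting solely of `],` and move the comma onto the closing brace above it.

```python
import json

def repair(text: str) -> str:
    """Drop lines containing only '],' and move the comma to the '}' above."""
    out = []
    for line in text.splitlines():
        if line.strip() == "],":
            if out and out[-1].strip() == "}":
                out[-1] += ","  # re-attach the comma to the previous object
            continue  # skip the stray '],' line
        out.append(line)
    return "\n".join(out)

# Example: two appended report batches glued together incorrectly
broken = '[\n  {\n    "x": 1\n  }\n],\n  {\n    "x": 2\n  }\n]'
print(json.loads(repair(broken)))  # → [{'x': 1}, {'x': 2}]
```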

Pascal76 commented 6 months ago

These are my latest logs:

```
2024-03-08 22:01:08 - BEGIN parsedmarc -c /apache_sites/jbm/dmarc/parsedmarc.ini
INFO:cli.py:1018:Starting parsedmarc
DEBUG:init.py:1343:Found 1 messages in INBOX
DEBUG:init.py:1351:Processing 1 messages
DEBUG:init.py:1355:Processing message 1 of 1: UID 2994312
INFO:init.py:1024:Parsing mail from dmarc-noreply@linkedin.com on 2024-03-08 20:57:32+00:00
DEBUG:init.py:1399:Deleting message 1 of 1: UID 2994312
2024-03-08 22:01:10 - END parsedmarc -c /apache_sites/jbm/dmarc/parsedmarc.ini
```

Looking at the directories, I see that there is an aggregate.json file containing [] instead of no file at all. => For me, the issue is not Yahoo.

Pascal76 commented 6 months ago

This issue happens daily now :(

seanthegeek commented 6 months ago

If I recall correctly, LinkedIn is one of the few services that also sends forensic/ruf reports back. If the only email is a ruf report, aggregate.json will be an empty list, [], and the results will be placed in a list in forensic.json instead.
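Downstream scripts can guard for that legitimately empty case before attempting any repair. A small sketch (the helper name is mine, not a parsedmarc API):

```python
import json

def has_aggregate_reports(raw: str) -> bool:
    """True only when the file body is a non-empty JSON list of reports."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # genuinely corrupted; needs repair
    return isinstance(data, list) and len(data) > 0

print(has_aggregate_reports("[]"))           # → False: only ruf reports arrived
print(has_aggregate_reports('[{"id": 1}]'))  # → True
```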

restena-pyg commented 3 months ago

As @Pascal76 reported, I encounter the same JSON file corruption. The first parsed report generates valid JSON content; subsequent runs make it invalid. As a workaround, I have set batch_size to 1 and use a wrapper script based on jq which produces output that fluent-bit can read/tail/parse.

```shell
#!/bin/bash
# Truncate the output files, run one parsedmarc batch, then flatten each
# JSON array into one compact object per line so fluent-bit can tail it.
OUTPUT_DIR=/opt/parsedmarc/output
AGGREGATE_FILE=aggregate.json
FORENSIC_FILE=forensic.json
> "$OUTPUT_DIR/$AGGREGATE_FILE"
> "$OUTPUT_DIR/$FORENSIC_FILE"
/opt/parsedmarc/venv/bin/parsedmarc -c /etc/parsedmarc.ini
jq -cMr '.[]' "$OUTPUT_DIR/$AGGREGATE_FILE" >> "$OUTPUT_DIR/fixed$AGGREGATE_FILE"
jq -cMr '.[]' "$OUTPUT_DIR/$FORENSIC_FILE" >> "$OUTPUT_DIR/fixed$FORENSIC_FILE"
```
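The jq flattening step can also be done in Python if jq is not available; a sketch under the same assumptions (a JSON array in, one compact object per output line):

```python
import json

def to_ndjson(raw: str) -> str:
    """Convert a JSON array into newline-delimited JSON (NDJSON)."""
    return "\n".join(json.dumps(item, separators=(",", ":"))
                     for item in json.loads(raw))

# Prints two lines: {"a":1} then {"b":2}
print(to_ndjson('[{"a": 1}, {"b": 2}]'))
```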