Packet-Clearing-House / DNSAuth

Go app that aggregates DNS queries into one-minute buckets and writes them to InfluxDB
MIT License

Delete file option doesn't delete all files #26

Open · mrjones-plip opened 6 years ago

mrjones-plip commented 6 years ago

I'm using the just released 1.2.0 and have set up DNSAuth to delete my files. Here's my dnsauth.toml:

customer-db = "root:@(prefix-dev.pch.net)/dns_query"
customer-refresh = 10
influx-db = "http://127.0.0.1:8086/write?db=authdns"
watch-dir = "/home/dnsauth/"
cleanup-action = "delete"
cleanup-dir = "/tmp"

I was using Faux-Logs to push a couple hundred log files into /home/dnsauth every minute. This worked great! Every time I saw DNSAuth run, I would see the file count drop via ls -ahl /home/dnsauth/|wc -l.

However, letting it run overnight, I see that there are a lot of files in the directory:

ls -ahl /home/dnsauth/|wc -l
113719

This means that DNSAuth isn't deleting all the files.

Steps to reproduce:

  1. Do a clean install of dnsauth and start it. Ensure that in dnsauth.toml you have:
    watch-dir = "/home/dnsauth/"
  2. Download Faux-Logs
  3. Use this for config.php:
    <?php
    $fl_config['pre'] = array(
        array('1.1.1.1'),
        array('asia.afilias-nst.asia.', 'asia.', 'apia.asia.', 'apix.asia.', 'smivision.asia.', 'ituneslive.asia.'),
        "int::0-1",
        array('1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','1.2.3.4','2001:0db8:0a0b:12f0:0000:0000:0000:0001'),
        array('1','2','5', '6', '12'),
    );
    $fl_config['log'] = array(
        "str::Q", // Q/R, flag for query or response
        "str:: ",
        "pre::3", // source IP of host making the query
        "str:: ",
        "pre::0", // nameserver IP, used to determine customer
        "str:: ",
        "pre::2", // Proto; 0=UDP, 1=TCP
        "str:: ",
        "str::0", // opCode; 0=Query, 4=Notify, 5=Update, etc.
        "str:: ",
        "pre::4", // qType; 1=A, 2=NS, 5=CNAME, 6=SOA, 12=PTR, etc
        "str:: ",
        "pre::1", // query string
        "str:: ",
        "int::50-69", // packet size in bytes (used by roots)
        FAUX_NL,
        "str::R", // Q/R, flag for query or response
        "str:: ",
        "pre::3", // source IP of host making the query
        "str:: ",
        "pre::0", // nameserver IP, used to determine customer
        "str:: ",
        "pre::2", // Proto; 0=UDP, 1=TCP
        "str:: ",
        "str::0", // opCode; 0=Query, 4=Notify, 5=Update, etc.
        "str:: ",
        "pre::4", // qType; 1=A, 2=NS, 5=CNAME, 6=SOA, 12=PTR, etc
        "str:: ",
        "pre::1", // query string
        "str:: ",
        "int::400-700", // packet size in bytes (used by roots)
        "str:: ",
        array('0','2','3'), // Response code if field 1 was an R; 0=NOERROR, 3=NXDOMAIN, 2=SERVFAIL, etc.
    );
  4. Use this for config2.php. You may have to add more entries; I have 104 entries in my config2 file:
    <?php
    $files[] = 'SZC_mon-01.acc.foo.net';
    $files[] = 'SZC_mon-01.akl.foo.net';
    $files[] = 'SZC_mon-01.ams.foo.net';
    $files[] = 'SZC_mon-01.ark.foo.net';
    $files[] = 'SZC_mon-01.atl.foo.net';
    $files[] = 'SZC_mon-01.ber.foo.net';
    $files[] = 'SZC_mon-01.bey.foo.net';
    $files[] = 'SZC_mon-01.bjl.foo.net';
    $files[] = 'SZC_mon-01.bom.foo.net';
  5. Run Faux-Logs with: php -f multi-file.gzip.php /home/dnsauth/ 30 610000

Expected: dnsauth ingests all the files into InfluxDB and deletes them all; at any given point there should only be a couple hundred files in the watch-dir between dnsauth runs while files are being created.

Actual: after several hours there are over a hundred thousand files in the watch-dir.

Update - changing dnsauth.toml to use cleanup-action = "move", I end up with an exact duplicate of my watch directory in my cleanup-dir. We can re-title this bug as needed.

mrjones-plip commented 6 years ago

FYI @unix1

mrjones-plip commented 6 years ago

Update: I've been running this setup for 2 hours. I started with 0 files and now I have 9,138 that dnsauth hasn't processed. Faux-Logs has written 203,910 files in those two hours.

unix1 commented 6 years ago

This is not good. The problem could be either in the cleanup or somewhere prior. If you, by any chance, still have the output and files (unlikely, given this was a few days ago), a couple of additional pieces to observe would be:

If none of the above are observed, it's likely there's a bug in the timing or in the way the files are picked up; one possible safeguard is sketched below. It certainly should be looked into.
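
For illustration, if pickup timing is the culprit, a scan could consider only files that haven't been modified for a grace period, so it never races a writer that's still appending to a .gz. A minimal Go sketch - hypothetical, not DNSAuth's actual code; stableFiles, the directory, and the 5-second grace are all made up:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// stableFiles returns the files in dir that have not been modified for
// at least grace, i.e. files we can assume are fully written and closed.
func stableFiles(dir string, grace time.Duration) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	cutoff := time.Now().Add(-grace)
	var ready []string
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue // file vanished between ReadDir and Info; skip it
		}
		if info.ModTime().Before(cutoff) {
			ready = append(ready, filepath.Join(dir, e.Name()))
		}
	}
	return ready, nil
}

func main() {
	files, err := stableFiles("/home/dnsauth/", 5*time.Second)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, f := range files {
		fmt.Println(f) // ingestion and cleanup would run here
	}
}

Too short a grace risks the same race; too long merely delays ingestion by a cycle.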

mrjones-plip commented 6 years ago

Tomorrow morning I'll start clean, follow my steps exactly (above) and post the remaining un-ingested logs and the dnsauth error log in hopes that it illuminates the problem. Stay tuned!

mrjones-plip commented 6 years ago

Ack! I forgot to do this today. I'll grab it tomorrow or Wed and report back.

mrjones-plip commented 6 years ago

OK! I'm on this now. I've recorded how I set up my test, but it was basically: stop dnsauth, empty syslog, empty the dnsauth log dir, cat the config for dnsauth, cat the 2 configs for Faux-Logs, start dnsauth, start Faux-Logs. See issue.26.preflight.zip. I'll stop this after about an hour and send my zipped syslog and the last line from Faux-Logs to tell you approx how many files it wrote. As well, if it's not too big, I'll send the contents of the dnsauth log dir. In the time since I started this about 5 min ago, there have already been about 15k files written by Faux-Logs (and presumably ingested and deleted by dnsauth ; )

Stay tuned!

mrjones-plip commented 6 years ago

Cool - we didn't need to wait very long for the problem to manifest. The setup above ran for about 15-20 min and Faux-Logs thinks it wrote 1M+ lines across just under 40k files:

Files Written: 39,500, Lines Written: 1,185,000

After stopping Faux-Logs, waiting 2 full minutes for dnsauth to iterate a few more times, and then stopping dnsauth, there were exactly 1,800 files that didn't get deleted:

root@dnsauth:# ls /home/dnsauth/|wc -l                                                  
1800 

I then archived the remaining log files and syslog:

cp -r /home/dnsauth/ remaining.logfiles.issue.26
zip -r remaining.logfiles.issue.26.zip remaining.logfiles.issue.26/
cp /var/log/syslog syslog.issue.26
zip syslog.issue.26.zip syslog.issue.26

I've attached these files here: syslog.issue.26.zip remaining.logfiles.issue.26.zip

mrjones-plip commented 5 years ago

@unix1 - I think this might be a red herring. Per my email the other day, I created a prototypical pcap file (proto.log.gz) and then wrote a bash script to generate a valid file name and copy it into the DNSAuth directory:

#!/bin/bash
# Copy a prototypical gzipped log into the DNSAuth watch-dir under a
# fresh, valid-looking file name roughly every 50ms until interrupted.
TARGET_DIR="/home/dnsauth/"
SOURCE_FILE="proto.log.gz"
echo "will put a lot of log files up in your grill 'til you stop me..."
while true; do
        sleep 0.05 &
        DAY=$(date '+%d')
        MONTH=$(date '+%m')
        YEAR=$(date '+%Y')
        HR=$(date '+%H')
        MN=$(date '+%M')
        NEW_FILE="$RANDOM$RANDOM-SZC_mon-01.zrh.woodynet.net_$YEAR-$MONTH-$DAY.$HR-$MN.dmp.gz"
        cp "${SOURCE_FILE}" "${TARGET_DIR}${NEW_FILE}"
        wait # for the background sleep, which paces the loop
done

After running it for 20 min or so, I was unable to reproduce the problem.

If you agree, I think this is specific to the way Faux-Logs writes its files, and we should close this issue as "can't reproduce". Lemme know!

Update - lemme update the bash script to have an inner loop that creates 100 files per outer loop - one per POP that Faux-Logs uses - in case the number of POPs has anything to do with it. Stay tuned!

mrjones-plip commented 5 years ago

OK, this doesn't seem to reproduce with a LOT of POPs:

#!/bin/bash
# Same idea, but write one file per POP each iteration, in case the
# number of POPs matters.
# Example name: mon-01.xyz.foonet.net_2017-10-17.17-07.dmp.gz
TARGET_DIR="/home/dnsauth/"
SOURCE_FILE="proto.log.gz"
echo "will put a lot of log files up in your grill 'til you stop me..."
declare -a POPS=("acc" "akl" "ams" "ark" "atl" "ber" "bey" "bjl" "bom" "bur" "bze" "cai" "cdg" "cmb" "coo" "cor" "cpt" "dar" "dfw" "dub" "dur" "dxb" "edi" "ewr" "eze" "fih" "fra" "gbe" "gnd" "gye" "gza" "iad" "icn" "jax" "jkt" "jnb" "kbp" "kgl" "kin" "kla" "klu" "ktm" "kye" "lad" "lba" "lbv" "lfw" "lga" "los" "lpb" "lys" "man" "mba" "mex" "mgm" "mia" "mke" "mnl" "nbe" "nrt" "ord" "pao" "pap" "pcb" "pdx" "per" "phx" "pnh" "pos" "prg" "ric" "rno" "rob" "scl" "sdb" "sea" "sfo" "sgu" "sin" "sjo" "slu" "sna" "sof" "syd" "szg" "tgu" "tnr" "tpa" "tun" "vli" "wdh" "wlg" "yhz" "yow" "yul" "ywg" "yxe" "yyc" "yyz" "zrh")
while true; do
        sleep 1.0 &
        DAY=$(date '+%d')
        MONTH=$(date '+%m')
        YEAR=$(date '+%Y')
        HR=$(date '+%H')
        MN=$(date '+%M')
        for POP in "${POPS[@]}"; do
                NEW_FILE="$RANDOM$RANDOM-SZC_mon-01.$POP.woodynet.net_$YEAR-$MONTH-$DAY.$HR-$MN.dmp.gz"
                cp "${SOURCE_FILE}" "${TARGET_DIR}${NEW_FILE}"
        done
        wait # for the background sleep, which paces the loop at ~1s
done

I'll let this run for ~1 hour and report back.

mrjones-plip commented 5 years ago

@unix1 - OK, I'm gonna close this. I let the above bash script run for an hour and there wasn't the overrun of files like with Faux-Logs. Sorry for the false alarm! On the off chance that this breaks in production, I'll re-open this ticket!

mrjones-plip commented 5 years ago

Oh yeah, testing a bit more just now: changing sleep 1.0 & to sleep 4.0 &, I actually see the file count via ls -hl /home/dnsauth/|wc -l go to zero for a sec between the bash script's sleep and dnsauth's 30-second ingestion loop.

mrjones-plip commented 5 years ago

@unix1 - bad news! I finally got around to pushing the new version of DNSAuth to production and this problem is present. Reopening; ping me about access to prod so we can troubleshoot.

mrjones-plip commented 5 years ago

@unix1 - oh wait! I see now that all the files it's skipping are only 20 bytes compressed and 0 bytes when uncompressed. However, I feel like this is still a bug! What DNSAuth should do is process the file and delete it anyway, possibly noting that it had no lines - something like the sketch below. Do we want to track this in this ticket or a new one?
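
To illustrate, a minimal Go sketch - hypothetical, not DNSAuth's actual code (countLines and the example path are made up) - that treats zero lines as a valid result and runs cleanup regardless:

package main

import (
	"bufio"
	"compress/gzip"
	"log"
	"os"
)

// countLines returns the number of lines in a gzipped dump; zero is a
// valid answer for an empty (20-byte) .gz file, not an error.
func countLines(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	gz, err := gzip.NewReader(f)
	if err != nil {
		return 0, err
	}
	defer gz.Close()
	n := 0
	sc := bufio.NewScanner(gz)
	for sc.Scan() {
		n++
	}
	return n, sc.Err()
}

func main() {
	path := "/home/dnsauth/example.dmp.gz" // hypothetical file name
	n, err := countLines(path)
	if err != nil {
		log.Printf("leaving %s in place: %v", path, err)
		return
	}
	if n == 0 {
		log.Printf("%s: no lines, cleaning up anyway", path)
	}
	os.Remove(path) // cleanup runs whether or not the file had lines
}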

unix1 commented 5 years ago

@Ths2-9Y-LqJt6 I'd suggest a new ticket would be better for handling empty files. The original problem described in this issue seemed a lot more substantial.

jumpifnotzero commented 5 years ago

I'd prefer DNSAuth moved files that cause an issue (unexpected data, etc.) to another directory for subsequent analysis.

This'll likely require a supervisory component to protect against disk resource exhaustion. A cheap version may be to move files only while the directory contains fewer than X files; a rough sketch follows. I assume we wouldn't encounter different errors at the same time, so one directory may be enough.
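
Roughly like this - a hypothetical Go sketch, not actual DNSAuth code; quarantine, maxQuarantined, and the paths are made-up names. When the cap is hit it falls back to deleting so the disk can't fill:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// maxQuarantined is the "X" above: an assumed cap on how many problem
// files we keep around for analysis before falling back to deletion.
const maxQuarantined = 100

// quarantine moves a problem file into dir for later analysis, unless
// dir already holds maxQuarantined files, in which case it deletes the
// file instead. (os.Rename assumes dir is on the same filesystem as
// path; a cross-filesystem move would need a copy-then-delete.)
func quarantine(path, dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	if len(entries) >= maxQuarantined {
		return os.Remove(path)
	}
	return os.Rename(path, filepath.Join(dir, filepath.Base(path)))
}

func main() {
	err := quarantine("/home/dnsauth/bad.dmp.gz", "/home/dnsauth-quarantine")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}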
