StackExchange / dnscontrol

Infrastructure as code for DNS!
https://dnscontrol.org/
MIT License
3.07k stars 389 forks source link

RFC: Get rid of spfcache.json because it creates more outages than it could ever have prevented #1497

Open tlimoncelli opened 2 years ago

tlimoncelli commented 2 years ago

The whole spfcache.json concept was done for misguided reasons. It adds complexity to protect us against a situation that has literally never happened. It was added to placate a particular gaslighting SRE, not for valid business reasons. In the years since it was implemented, it has created more work and more outages than it was intended to prevent. This cache isn't needed and should be removed from the code.

Now the long version...

What was the concern? The SPF optimizer introduced a dependency on DNS servers we don't control. While doing its job, it needs to gather the SPF records of all the domains involved in our SPF record, which includes domains outside of our control. Let's suppose our SPF record refers to the SPF record of Google. If 100 percent of Google's DNS servers were down AND the related DNS caches had expired the SPF optimizer would fail. In that situation, we would not be able to do a "dnscontrol push".

The concern was that we might have an outage or other emergency where we absolutely needed to do a "dnscontrol push" AND the right combination of DNS servers were down (basically, all DNS servers related to one of the domains we now depend on). In that case, we wouldn't be able to do a "dnscontrol push".

What was our solution? We stored any third-party DNS records that optimizer needed in spfcache.json so that we'd have access to the records if they weren't accessible.

Has that situation occurred? Never. Zero. Zip. Nada. Not at all. It would require an extremely popular domain's DNS servers to all be down, and the caching DNS servers that we use to also be down. That just doesn't happen.

Why is this an issue? If the spfcache.json file is out of date, dnscontrol refuses to run. Our CI/CD system can't update that file in an automated fashion, so instead it fails until a human can renew the cache. This requires a PR. Since it is a manual process, it gets ignored and other updates start failing.

In other words, we've had more outages because of this cache than it was intended to prevent.

What's the proposed solution? Remove the spfcache.json concept from the SPF optimizer. Do not generate the file. Do not output warnings/errors if the file is out of date. Just get rid of it entirely.

There is no backwards compatibility issue here. I had considered a "we like the old way" flag but, no.

extremeshok commented 2 years ago

My workflow has always been to delete the spfcache.json before every run.

Not removing the file created more issues than it solved. At least from my usage of dnscontrol over the years.

costasd commented 2 years ago

Haven't used the optimizer that much so can't speak for or against its cache deprecation; that said, I do agree with the general sentiment that it makes things more complicated without bringing a lot of value back.

One thing to note though, is that probably some part of the cache functionality has to remain in pkg/spflib to accommodate offline parser tests ie. the issue described in #1071. And then there is #1348, which is a different issue almost requiring the spfcache... :shrug:

patschi commented 2 years ago

I've been using and updating the spfcache during the CI process automatically. I run dnscontrol check, which causes the SPF-cache to be updated. A script then detects if the file has changed and automatically performs a git commit to the repo. So any sequential run, the file is up-to-date.

So I never had any issue with the spfcache on my end.

tlimoncelli commented 2 years ago

@patschi Ah! Very interesting! Where I work, we have an unofficial policy that a CI process can't check things into a git repo.

patschi commented 2 years ago

Interesting, didn't know there're policies for such things. I only use dnscontrol for my private lab, so no restrictions there.

If anyone is interested, that's the script I've built:

#!/usr/bin/env bash

SCRIPTPATH="$( cd -P "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd $SCRIPTPATH/../

set -x

if [ -f "spfcache.updated.json" ]; then
    echo "SPFCache was updated...";

    echo "Getting git access ready..."
    mkdir ~/.ssh/
    set +x
    echo "$(echo "${CI_COMMIT_KEY}" | base64 -d)" > ~/.ssh/id_rsa
    set -x
    # ^ ENV Variable must be created using `cat private_key | base64 --wrap 0` and added as masked into ENV variables.
    chmod 600 ~/.ssh/id_rsa

    echo "Preparing SPFCache file..."
    rm spfcache.json
    mv spfcache.updated.json spfcache.json

    echo "Preparing git commit..."
    git add spfcache.json
    git config --global user.name "dnscontrol CI Bot"
    git config --global user.email "<>" # set no email address for bot
    git config --global core.sshCommand 'ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no' # its only internal network anyway
    git remote set-url --push origin "git@${CI_SERVER_HOST}:${CI_PROJECT_PATH}.git" # force to use SSH

    echo "Committing and pushing updated SPFCache..."
    git commit -m "[BOT] Updated SPFCache" spfcache.json
    git push -o ci.skip origin HEAD:${CI_COMMIT_BRANCH}
fi

exit 0
mdallaire commented 1 year ago

I just wanted to share my approach to the spfcache.json problem in CI. I run on Gitlab and instead of committing the file to the repo I use the cache functionality to persist the spfcache. The test stage ensure that the dnscontrol.js is valid and also update the spfcache.json file if needed. The cached version is then reused in the deploy stage.

That being said, even though spfcache.json is not a pain point for me I would not be sad if it is removed.

Here is a short version of the .gitlab-ci.yml I am using:

stages:
 - test
 - deploy

test:
    image:
        name: stackexchange/dnscontrol:v3.20.0
    stage: test
    cache:
        - key: spfcache
          paths:
            - spfcache.json
    script:
        - dnscontrol check
        - |
            if [ -f "spfcache.updated.json" ];
            then 
                mv spfcache.updated.json spfcache.json && dnscontrol check;

            fi

deploy:
    stage: deploy
    image:
        name: stackexchange/dnscontrol:v3.20.0
    cache:
        - key: spfcache
          paths:
            - spfcache.json
    needs: [test]
    rules:
     - if: '$CI_COMMIT_BRANCH == "main"'
    script:
        - dnscontrol push
cafferata commented 1 year ago

From the first moment I read the documentation I asked myself, is this a real-world problem that we have never experienced? Then I thought, okay fine, let's apply this method. As far as I'm concerned, the removal of the spf cache would not cause any problems for us.

martinstadelmann commented 11 months ago

Another caveat is that the cache is per dnsconfig.js including the SPF_BUILDER(). So having multiple configs in different folders (e.g. foo/dnsconfig.js and bar/dnsconfig.js) will not work if executed from the same root (with the --config flag) since the spfcache.json is expected to be in the current working directory. Unlike the config's location there doesn't seem to be a flag to change this either.

An option to ignore (or only warn) if an included DNS server is down and not writing a cache file would be welcome. IMO it could be removed altogether as well.