Cache could contain sensitive data

philandstuff commented 4 years ago

This is perhaps a vague issue but I think it's worth raising. In short, I think the semi semantic cache has a hazard where it will store sensitive data such as passwords or keys.

Generally when I'm dealing with a configuration management tool (which is most of my use of YAML), I want to be able to inject secrets in certain places (cryptographic keys, passwords, access tokens, etc). I also want to have control over whether and where those sensitive data items are stored. Typically, sensitive data items can be introduced to a dhall program as an import, normally environment variable or local file.

Prior to the semi-semantic caching (#1098) we would only cache imports that were frozen, and typically you would not freeze these sensitive items. However, since we introduced semi-semantic caching, we cache everything (if i have understood correctly). This means that your $XDG_CACHE_HOME/dhall-haskell directory could gradually acquire plaintext secrets, even if your normal process is to store these secrets encrypted, for example in a KeePass store.

I think @Gabriel439 made some interesting points in this comment: https://github.com/dhall-lang/dhall-haskell/issues/1098#issuecomment-511883292 where he distinguished between "securing imports" and "caching imports"; previously we used import hashes to achieve both purposes. I guess my issue is that sometimes caching imports is undesirable if the import contains sensitive data, and we no longer have a way to say "please don't cache this".

Gabriella439 commented 4 years ago

@philandstuff: Would locking down the permissions of the ~/cache/.dhall{,haskell} directories to something like 0700 permissions be sufficient protection?

philandstuff commented 4 years ago

That would help but I don’t think it would be enough in all use cases.

At my employer, we prefer not to keep secrets unencrypted on developer laptops as much as possible. We use tools like pass and sops to ensure that secrets are not unencrypted on disk; and we have deployment pipelines that fetch from the password store and pass the secret into the deployment (eg via https://github.com/camptocamp/terraform-provider-pass ).

If we introduced dhall at work for doing our deployments, and our secrets started appearing on file system unencrypted, that would be a regression from our current security practices. (Note: we don’t currently use dhall at my employer so this is somewhat moot)

I guess my preference would be some combination of the following:

don’t cache imports without hashes (probably not ideal for other reasons)
don’t cache environment variable imports specifically
don’t cache local file imports
have a way of explicitly disabling caching for a particular import (not my favourite option because people will forget)
thoroughly document that the cache exists and that there is a hazard

I think my favourite options here are disabling caching of environment variable and possibly also local imports. My assumption for environment variables is that they are disproportionately likely to be sensitive and disproportionately unlikely to benefit from caching. I think something similar is true for local imports - what benefit is there for reading from the cache rather than from another local file on the file system?

Thoughts?

ari-becker commented 4 years ago

@philandstuff: Would locking down the permissions of the ~/cache/.dhall{,haskell} directories to something like 0700 permissions be sufficient protection?

@Gabriel439 I think that your underlying argument is, if 0700 is sufficient to protect SSH and GPG private keys, then shouldn't that be sufficient for protecting other secrets?

But: a) there's a difference between a user having access to unencrypted credentials that identify the user personally, and a user having access to credentials that do not identify the user but which nonetheless the user is authorized to access, and b) there is a gradation of protection of credentials, from unencrypted cleartext on disk in 777 (basically unprotected) to 700 and possibly related SELinux protections, to encrypting the secrets on disk, to storing cleartext secrets in memory but never persisting them to disk, to encrypting secrets in memory, to keeping data encrypted even within the CPU (i.e. homomorphic encryption, which is basically still an open area of research). An org will come up with a level of security which becomes their standard, and their tools need to meet that standard.

So I think the more fundamental question is, what standard should Dhall aspire to support (at least at this stage), given that higher levels of security have a real engineering cost.

philandstuff commented 4 years ago

Thanks @ari-becker, yes that is broadly in alignment with my thinking. I would add:

even when GPG keys are on disk in a 700 directory, they are also typically protected by a strong passphrase; this is not true of the dhall cache
you can (and I do) store your GPG key in a hardware module like a yubikey via OpenPGP or PKCS11 protocols, thereby keeping it off the file system entirely

Gabriella439 commented 4 years ago

@philandstuff: Note that dhall freeze does not freeze environment variables or local files by default, because:

They are usually volatile and don't make sense to pin to a specific snapshot
They don't give a performance benefit to cache them

ari-becker commented 4 years ago

@Gabriel439 we rarely use environment variables in our codebase, because if the environment variable is embedded within the body of a function, the function will refuse to normalize if the environment variable is not defined. Considering that we ship functions between environments (and I imagine that to be a fairly normal usecase - the ability to get type-safe output in each environment based on what changes in each environment is a large part of the value proposition of Dhall in the first place), this is unacceptable. Therefore we generally have secrets be parameters to Dhall functions, and the wrapping Bash scripts that call the dhall executable handle extracting the secrets from environment variables and passing them into the Dhall functions as parameters.

philandstuff commented 4 years ago

@ari-becker oh, neat! That sounds like a good workflow that would satisfy my needs. Maybe there’s a way we could document this pattern?

ari-becker commented 4 years ago

@philandstuff I wouldn't call it ideal, because its the kind of pattern which led to the request #1294 , which deals with censoring console output. Censoring the cache, so to speak, is a different issue, particularly if secrets from intermediate normalizations are being saved to the cache as the script progresses.

I'm not sure how much there is to document? Functions are valid Dhall just like anything else - instead of saving a record in a file with a .dhall extension, you save a function in a file with a .dhall extension, and write in your script something to the effect of dhall text <<< './my-function.dhall ./my-parameter.dhall ./another-parameter.dhall' > ./output.txt.

ari-becker commented 4 years ago

@philandstuff sorry, within the context of this conversation:

\(secret : Text) -> "Secret Key: ${secret}"

# presume a global environment variable exists called MY_SECRET
dhall text <<< "./my-function.dhall (\"${MY_SECRET}\")" > ./output.txt

If MY_SECRET is abcdef then the contents of output.txt then become Secret Key: abcdef. Note that bash requires you to escape the quotation marks that Dhall uses to parse strings, and that if you're using output from elsewhere, I would generally consider it to be good practice to surround the input with parentheses to ensure that parsing happens correctly (short story - one of our scripts determines whether or not a build is a PR and passes the PR number into the function if it exists, so the parameter is of type Optional Natural, if there is a PR then it gets set to say Some 3 and on a release build it gets set to None Natural. So if in the one-liner above, there weren't parentheses, it would be substituted as ./my-function.dhall None Natural, which wouldn't parse).

Gabriella439 commented 4 years ago

@ari-becker: Is there any reason you are not doing:

$ dhall text <<< './my-function.dhall (env:MY_SECRET as Text)'

ari-becker commented 4 years ago

@Gabriel439 specifically for the example context I gave, you're absolutely right - that would be an acceptable substitute. I'm so used to working around Dhall's intentional lack of support for variable import paths by doing, in CI:

library=$(realpath "$1")
dhall <<< "(${library}/my-function.dhall) (\"${GLOBAL}\")"

and having some other script for local purposes call with the correct directory on my local machine. So because I already had to pull out ${library} I just automatically used the same pattern for ${GLOBAL}.

The main case we actually have where there would be an issue is that we do break up the computation into multiple dhall calls. For example, we have a typical pattern with a type hierarchy, EnvParams -> Metadata -> Rendered, where Rendered is the final set of (a large number of) configuration files, Metadata holds information which is common to everything in Rendered, and EnvParams holds inputs for Metadata to make sure that Metadata holds the correct common data for that environment. So we have at least two functions that are called - an EnvParams -> Metadata function, and a Metadata -> Rendered function, which are generally chained.

However, in CI, we exploit Dhall's strong normalization by following the following pattern:

env_params='{ example = "parameters" }'
echo >&2 "Checking EnvParams..."
env_params=$(dhall <<< "(${env_params}) : ${library}/EnvParams.dhall")
echo >&2 "Generating configuration..."
config=$(dhall-to-json --omitNull <<< "${library}/render.dhall (${library}/metadata.dhall (${env_params}))" | jq -c '.')

If env_params was instead something like { example = env:GLOBAL as Text }, then its status as an environment variable would be understood in the first dhall invocation, but then it would be normalized, and could be persisted to cache as part of the second call to dhall-to-json (and sometimes we have multiple libraries, with mutiple layers, so there's often more than a couple of dhall calls per CI build).

We intentionally separate out the call which checks EnvParams because it makes it clearer to the user that, if there is an error, that the error is with the user-provided inputs and not with the underlying functions.

The other case I can think of (which we don't have) is if the CI were to write a secret to "disk" (bearing in mind that the disk here is the ephemeral layer of the build container, which is defacto RAM), e.g. an SSH key, with the understanding that this secret would be wiped away with the rest of the container when the build is finished. If, however, we configure a directory to be a cache, then that directory is persisted to an actual disk, and if we were to use that secret as part of a Dhall expression, then it wouldn't be an environment variable and it would end up persisted to the cache.

sjakobi commented 4 years ago

This seems like a pretty important issue to me! I think we should implement some of @philandstuff's suggestions:

don’t cache environment variable imports specifically

…and imports that depend on environment variable imports transitively.

Maybe it would already be enough not to cache env:X as Text imports though?!

thoroughly document that the cache exists and that there is a hazard

:+1:

I think we should also document best practices for working with secrets.

Gabriella439 commented 4 years ago

Generally, if something is a security vulnerability, I'd prefer to amend the standard to plug up the vulnerability rather than rely on documentation. I would like Dhall to be as secure as reasonably possible even in the hands of inexperienced users.

sjakobi commented 4 years ago

Generally, if something is a security vulnerability, I'd prefer to amend the standard to plug up the vulnerability rather than rely on documentation.

I agree but I don't understand how it applies in this case where the primary problem is with the non-standard semi-semantic cache.

Gabriella439 commented 4 years ago

@sjakobi: I'm also fine amending the Haskell implementation to secure the semi-semantic cache, too

ari-becker commented 4 years ago

Generally, if something is a security vulnerability, I'd prefer to amend the standard to plug up the vulnerability rather than rely on documentation. I would like Dhall to be as secure as reasonably possible even in the hands of inexperienced users.

If so, then I would think that the best way to deal with this is probably to to add a keyword like sensitive which can prepend any value - if a value is marked as sensitive then it should not be cached (attempting to freeze the hash of a sensitive value should be illegal) and its output should be stricken from error output (which would make this more useful, more targeted, and more implementation-independent than the --censor flag added in #1294 ).

philandstuff commented 4 years ago

That’s an interesting thought. I also wonder if it might be worth adding at the type level - sensitive "foo" : Sensitive Text - and then when an application wants a password in its record of configuration values, it can specify that the password is sensitive but the username is not. A Sensitive value would explicitly not have a CBOR representation- and so could not be cached - and could be censored by default on output.

This is very much a thought experiment at this stage, it might be a bad idea for other reasons.

ari-becker commented 4 years ago

Agree on it being a thought experiment at this stage.

One question I think it would raise is, how would sensitive work with networked expressions? If sensitive cannot be cached, then referring to it over a network would mean that a networked expression with an embedded sensitive value is uncacheable. If a user attempts to dhall freeze such an expression, should it be an error? A warning? Should an expression be suffixed with sha256:uncached or similar to indicate that the user expects dhall freeze to leave the expression alone? How do we communicate this to the user so that the user isn't surprised by the performance hit?

Definitely needs more people's opinions weighing in.

dhall-lang / dhall-haskell

Cache could contain sensitive data #1535