ansible-collections / community.hashi_vault

Ansible collection for managing and working with HashiCorp Vault.
https://docs.ansible.com/ansible/devel/collections/community/hashi_vault/index.html
GNU General Public License v3.0

Support "caching" token after initial login #104

Open dsafanyuk opened 3 years ago

dsafanyuk commented 3 years ago
SUMMARY

To avoid multiple logins, it'd be great if the lookup module supported a token cache after initial login

ISSUE TYPE
COMPONENT NAME

lookup

ADDITIONAL INFORMATION

Currently, it seems for multiple lookups, each lookup does log in + reads from vault.

Would you be open to supporting a feature that adds a token cache?

Here's an example of a now deprecated module that had this feature: https://github.com/jhaals/ansible-vault/blob/17fdb6fde3cb441bc7737770591b5ab5c79a3fe3/vault.py#L100-L108

dsafanyuk commented 3 years ago

EDIT: in the mentioned module, seems like it may not have worked: https://github.com/jhaals/ansible-vault/issues/61

briantist commented 3 years ago

Hi @dsafanyuk , thanks for opening this feature request!

As it happens, this is something I've been thinking about for quite a while. It's definitely not ideal to be doing logins for every retrieval. So I'll briefly say that this will be addressed in some form or another, though it may not be exactly what you had in mind.

Workarounds

First I'll talk about what can be done today to work around this situation.

Because the plugin supports token authentication already, if you acquire your token outside the plugin, you can (re)use the plugin with that token.

One example would be using the Vault CLI to execute a vault login, and have the token written out to the default token helper file location (or a custom location, as we can control where the plugin looks for a token file).

The call to vault login could be done outside of the Ansible run, or it could be done within Ansible, with ansible.builtin.command, ansible.builtin.shell, or the pipe lookup.

With those latter options, there's also the possibility of using -format=json to programmatically retrieve the token within Ansible and pass it to the plugin directly (to avoid using the token sink file).
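As a sketch of that pipe-lookup variant (the role/secret IDs and secret path here are illustrative placeholders, not part of the collection):

```yaml
# Hedged sketch: acquire a token with the Vault CLI once, then pass it
# to the lookup directly, avoiding the token sink file entirely.
- name: Log in once and capture the token
  ansible.builtin.set_fact:
    vault_token: >-
      {{ (lookup('ansible.builtin.pipe',
                 'vault login -format=json -method=approle role_id=myroleid secret_id=mysecretid')
         | from_json).auth.client_token }}

- name: Reuse the token for subsequent reads
  ansible.builtin.debug:
    msg: "{{ lookup('community.hashi_vault.hashi_vault', 'secret/foo:value', token=vault_token) }}"
```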

Another workaround is using Vault Agent with auto-auth, and pointing the plugin to contact the agent rather than the Vault server directly. A recent contribution (#80) enabled this functionality via the none auth type.

Vault Agent can also be used with token auth, by pointing token_path and token_file to the agent's token sink.
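For that token-auth variant, a hedged sketch (the sink directory and file name are illustrative; `token_path` is the directory to search and `token_file` is the file name within it):

```yaml
# Hedged sketch: point the lookup at a Vault Agent's token sink file,
# so the agent handles auth and renewal while the lookup just reads the token.
- name: Read a secret using the agent-maintained token
  ansible.builtin.debug:
    msg: >-
      {{ lookup('community.hashi_vault.hashi_vault', 'secret/foo:value',
                auth_method='token',
                token_path='/run/vault-agent',
                token_file='token') }}
```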

Short(-ish) Term Plans

I'm still working through reorganization of the internals, but the goal is that this collection won't be just this single plugin.

The first additional plugin(s) and/or module(s) will almost certainly be centered around doing auth only, and returning a token.

That will allow selectively controlling auth, and the resulting token use and lifecycle, separately from Vault operations (and it won't require the Vault CLI/binary).

What's the hold up?

Moving the pieces of the existing code, most of which revolve around auth (and connections), into shared libraries so that 1) it's not duplicated around the collection, 2) it behaves consistently within the collection, and 3) each part can be tested independently without repeating tests or strongly coupling tests for different content.

Long Term

Various other ideas might be possible, for example an option that allows for persisting auth after it's done, in any of the plugins (writing it to a file), or perhaps using a token helper (#91) more generally. It's too early to say for sure, but to me this sounds like a good option to have specifically on an auth plugin/module, and perhaps not on any plugin that does "integrated" auth, we'll see.

I've also considered adding a vars plugin that might be better suited to consolidating vault access by managing its own auth and/or token lifecycle, while an Ansible author need only define what they need upfront, and then can reference secrets as needed as ansible vars. A lot of research and testing will be needed for this, but I'll be taking some inspiration from community.sops.sops. Similarly, that sops collection has other interesting ideas like the action plugin.

I'd also like to look at ways that token lifecycle could be handled, either manually or transparently, throughout an Ansible run, to account for token ttl, renewal, etc. But that too will require a decent amount of research and testing. I think that will become clearer as the other pieces fall into place.

Difficulties

There are a few things that make caching problematic, some of which have been talked about above. One thing we have to keep in mind is that tokens are secrets too, so we have to be somewhat careful about how and where they are cached, to avoid exposing them unintentionally, and critically, giving the user a lot of control over how that happens.

In the case of controller side content (plugins/actions) that can be somewhat straightforward, while in modules, which execute on the remote side, there are other issues to consider, like whether a cache system only works on the target host, or whether it exfiltrates the token back to the controller; stuff like that.


I'm still interested in hearing your thoughts on the above, and any other ideas you want to discuss. Thanks again!

dsafanyuk commented 3 years ago

Hey, thanks for the explanation! I can go a bit into our scenario:

We're currently using AWX (15.0.1) + custom credential types that inject the AppRole credentials as env vars. I can't share code, but here's the tutorial we followed: link

We previously did use a plain Vault Token that we retrieved by auth-ing using approle credentials. I am trying to steer away from this because we had to have a lot of automation surrounding this (renewing the token, updating the AWX credential with a fresh credential, etc.)

/scenario

One solution to my specific problem was forking this collection and adding a sleep between authorizing and reading the variables. This seems like a band-aid, and I'd rather help contribute a solution that everyone can use.

Another option was to use pre_task to log into vault using the approle creds, then use the resulting token as a variable to community.hashi_vault. Not sure if this makes sense but something like:

# pre_task
- name: get a vault token
  set_fact:
    ansible_vault_token: "{{ lookup('community.hashi_vault.hashi_vault', 'auth_method=approle role_id=myroleid secret_id=mysecretid') }}"
    # notice the lack of "secret=foo"

# or

- hosts: localhost
  vars:
    ansible_vault_token: "{{ lookup('community.hashi_vault.hashi_vault', 'auth_method=approle role_id=myroleid secret_id=mysecretid') }}"

  tasks:
    - name: debug
      debug:
        msg: "{{ lookup('hashi_vault', 'secret=secret/foo:value') }}"
briantist commented 3 years ago

Thanks for adding some explanation!

For your AWX setup (using env vars) you may want to take a look at #49 and #86 as some of the env vars you are using may be deprecated, and AWX doesn't allow the new env vars (those beginning with ANSIBLE_). Use of the ansible var injection version of AWX credential types ought to work better.

We previously did use a plain Vault Token that we retrieved by auth-ing using approle credentials. I am trying to steer away from this because we had to have a lot of automation surrounding this (renewing the token, updating the AWX credential with a fresh credential, etc.)

If I understood this correctly, it sounds like someone manually logs into Vault with AppRole auth to retrieve a long-lived token, and that token is then used until it expires (or rather than manual, this was what all of the unwanted automation was doing).

That of course is less than ideal, as a token is meant to be ephemeral.

One solution to my specific problem was forking this collection and adding a sleep between authorizing and reading the variables. This seems like a bandaid, and i'd rather help contribute a solution that everyone can use.

I may need more explanation of this, I'm not following how adding a sleep would help?

Another option was to use pre_task to log into vault using the approle creds, then use the resulting token as a variable to community.hashi_vault.

I think this is similar to one of the workarounds I mentioned, which is to call vault login within a task. Though it doesn't necessarily need to be in pre_tasks, it could just be a task in tasks that comes before the use of the lookup.

In your example though, with "notice the lack of secret=foo" I don't understand how that could work as the existing lookup never returns a token. Was that meant to be a feature of your fork?

It would however fit the newer plugin I was mentioning which could do auth and return a token, that's exactly what I had in mind.

dsafanyuk commented 3 years ago

If I understood this correctly, it sounds like someone manually logs into Vault with AppRole auth to retrieve a long-lived token, and that token is then used until it expires (or rather than manual, this was what all of the unwanted automatio was doing).

correct

I may need more explanation of this, I'm not following how adding a sleep would help?

Our specific issue is that our Vault instance cannot replicate the token to the standby nodes fast enough. So the team that manages Vault has observed that by putting in a delay between auth/reads, it gave enough time to replicate the token across the vault cluster.

In your example though, with "notice the lack of secret=foo" I don't understand how that could work as the existing lookup never returns a toke... It would however fit the newer plugin I was mentioning which could do auth and return a token, that's exactly what I had in mind.

Yes, I'd love that feature!

briantist commented 3 years ago

Our specific issue is that our Vault instance cannot replicate the token to the standby nodes fast enough. So the team that manages Vault has observed that by putting in a delay between auth/reads, it gave enough time to replicate the token across the vault cluster.

Oh that's interesting. Have you engaged with HashiCorp support by any chance? I am wondering if this would be considered an issue related to the cluster configuration, or if such a delay is expected to be handled on the client side, or something else.

BTW you might benefit from the latest feature added, which is retries, see #71 , along with the docs for the retries option and the newly minted User Guide which goes into some more detail on the feature.

I suspect this would solve (or at least work around) that issue, as the login request would succeed, and then the failing secret read would be the request that is retried. You may need to adjust the retry parameters in terms of backoff, number of retries, perhaps even changing the HTTP status codes that are retried, to fit your particular situation; however, the retries option was built with all that in mind.
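A minimal sketch of enabling retries on a lookup (the secret path and retry count are illustrative; `retries` also accepts a dict for finer control per the collection's docs):

```yaml
# Hedged sketch: retry failed Vault requests at the connection level,
# so a read that hits a not-yet-replicated token is retried with the same token.
- name: Read a secret with connection-level retries
  ansible.builtin.debug:
    msg: >-
      {{ lookup('community.hashi_vault.hashi_vault', 'secret/foo:value',
                retries=5) }}
```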

I'd love to know if that works for you.

briantist commented 2 years ago

@dsafanyuk I know it's been a while but I wanted to give some updates.

Although we don't have built-in caching for the login from one plugin call to another, we now have the vault_login lookup plugin and module which can be used to explicitly perform a login and return a token, which you can then use in subsequent calls. The examples show usage.
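A minimal sketch of that explicit-login pattern (auth values and secret path are illustrative; the lookup returns the full login response, with the token under `auth.client_token`):

```yaml
# Hedged sketch: log in once with vault_login, then reuse the token
# for any number of subsequent reads without re-authenticating.
- name: Perform an explicit login and keep the resulting token
  ansible.builtin.set_fact:
    login_result: >-
      {{ lookup('community.hashi_vault.vault_login',
                auth_method='approle', role_id='myroleid', secret_id='mysecretid') }}

- name: Use the token for subsequent reads
  ansible.builtin.debug:
    msg: >-
      {{ lookup('community.hashi_vault.hashi_vault', 'secret/foo:value',
                token=login_result.auth.client_token) }}
```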

We also have the none auth method, which can be used in combination with a local Vault Agent that's configured for auto-auth. With this setup, the agent can be configured to accept HTTP requests and inject the auto-auth token it maintains into any requests it proxies to Vault.

I'm using this setup in production with (Ansible + none auth) + (Agent + auto-auth) and it works very nicely.

I also want to let you know that we did in fact encounter a token replication issue in production just as you described. We did ultimately work that problem out (underlying storage performance was not up to par), but what's notable is that our entire Ansible fleet largely did not notice it, because we had it configured to perform connection retries.

Because this option works at the connection level, we were not stuck in a loop of retries performing new logins; the subsequent requests were retried with the same original token, so they eventually succeeded, and all it did was slow down our Ansible runs a little and emit some warnings.

That being said, HashiCorp also introduced some new eventual consistency mitigations, and I have an open issue (#170) to see if it makes sense to implement anything like that within this collection, or in hvac, or both.

One other requested feature was support for token helpers (#91), but it does have some of the same difficulties as implementing central caching.


I'd be interested to know how you're getting along with Vault and Ansible lately.

dsafanyuk commented 2 years ago

Hey @briantist ! I actually moved from my company and we don't use Ansible/Vault at my new one :/

I do appreciate the write up; we can close this issue if you think it's prudent.

briantist commented 2 years ago

Thanks for the quick response @dsafanyuk ! I think I will leave it open for now for the sake of others, but I may close it in the future. Feel free to unsubscribe in case you'd like to avoid future notifications. Thanks for opening it; take care!