thaumos commented 7 years ago

Proposal: Standardize usage of _facts

Author: Dylan Silva <@thaumos> IRC: thaumos

Date: 2016/03/01

Status: New
Proposal type: Core Design
Targeted Release: 2.4
Associated PR: https://github.com/ansible/ansible/pull/20399
Estimated time to implement: <X days, weeks, etc.>

Motivation

There are many modules that have implemented different approaches to getting information about an object. Some users are starting to overload state= to gather information. In other cases, _facts modules are used to gather data. There are also lookup plugins.

Define the standard approach to this. So for any new PR's, there will be a single approach to this.

Problems

What problems exist that this proposal will solve?

Solve the inconsistency in Ansible for the one true way of approaching creating, updating, deleting, and providing information of objects.

There are also misconceptions how *_facts modules are related to ansible_facts.

Solution proposal

There are multiple approaches to this. We need to come up with one way. Listed are all the discussed options.

Continue to use *_facts modules for info gathering
_info modules
Make lookup plugins the standard for gathering info.
state= such as (list|current|query)
query=true|false

Documentation (optional)

Update documentation to reflect the new decided standard.

Anything else?

This proposal is meant to solicit discussion and come to an agreement.

After the decision is made work with community in adopting new standard and towards migrating existing approaches to the new standard.

thaumos commented 7 years ago

I am for a lookup plugin based approach. This can be used per task rather than having an explicit task to gather facts to be used later.

sivel commented 7 years ago

I think the problem is that there is not a 1 size fits all approach. The history behind this is roughly adding list or query functionality to existing modules.

In many cases, I expect creating a new _facts module is the correct decision, since we need remove execution. Example: apt_package_facts or apt_facts
Where the data is able to be queried remotely, not requiring remote execution, a lookup is the best. Example: lookup("aws_lambda_functions")
In the example where it is not facts about the host doing the querying, but requires you to use remote execution, a new _info module should be created. Some contrived example, you need to query an app only accessible from specific machines, to get some arbitrary non-host information. Example: archive_info, git_info

We have to account for a number of places where remote execution is needed, but not host facts, due to places where there is limited access to the service/system that is being queries.

dagwieers commented 7 years ago

I think we need a broader approach, as there is also a need to make the facts-gathering process modular. And I think this fits into the same need. We need a pluggable facts-gathering, that is also independently addressable (so some facts plugins you like to trigger in advance during the facts_gathering stage, other facts plugins you may wish to run on-demand, possibly with options)

bcoca commented 7 years ago

So I mostly agree with @sivel

_facts modules should be used to gather 'info associated to a host' from/for remote systems
pure db/api queries should use lookups instead
for 'non host specific' information we can add _info as the standard way of collecting data.

bcoca commented 7 years ago

@dagwieers different discussion, this is on a standard on how to establish a pattern for modules to gather information.

refactoring gather_facts to be modular would be a different proposal

dagwieers commented 7 years ago

@bcoca Disagree, it has the same purpose, getting information from the system. Some stuff you want in the default namespace, other stuff you may want to register. A proper modular approach could have a single interface for everybody's needs. Whether it relates to any object (system, network, application, vm, vm snapshot, partition, ...) should not matter.

sivel commented 7 years ago

I'm in agreement with @bcoca, this is a proposal discussing the approach of effectively doing list or query operations. This likely doesn't involve any change to existing code, at least not immediately, instead it is just coming to an agreement on naming and how to implement going forward.

Making a change to the fact gathering step, to make it pluggable, should be a separate proposal.

dagwieers commented 7 years ago

It seems pointless if the other proposal would get rid of _facts or _stat modules altogether. But have it your way.

bcoca commented 7 years ago

@dagwieers then make that proposal, it is 'related' but still not the same as this one. If it makes this one obsolete, so be it, but lets not hijack this discussion, make the proposal.

tima commented 7 years ago

In general, I agree what @sivel said. I don’t believe there is one solution though I think we can cut down on the number we have in the wild. I think clarifying the use cases will help a lot here.

I really dislike the idea of state and query params in modules and support killing those practices as bad form.

I like the proposed _facts (facts that are host specific) & _info (not host specific) approach.

On the topic of “info” modules, I’ve always wondered where is the right place to store facts/info that are not host specific. That was an issue I raised over a year ago to the core team after a long engagement at a client doing a lot with AWS. They were repetitively storing vpc, security group etc data about the cloud environment in every host. While it worked it made for a lot of duplication and unnecessary processing and resource consumption.

Like @dagwieers I think we should be considering what this may mean to making the facts-gathering process modular. There needs to be a better and declarative way to define which fact modules need to run rather than individual tasks. Calling that discussion "hijacking" here is a bit harsh.

One area I disagree with @sivel is his reasoning for point 2 and the use of lookups. The seperation by local or remote execution seems like an implementation detail that is irrelavant. To me lookups are best suited for real-time access of specific pieces of data. “Facts” without remote execution can be implemented in a module easily and should be mostly transparent to an end user. I haven't been following all the work on the internals lately -- has such a non-host-specific fact store been added?

This all said, I’m not a fan of lookups in general because they’ve always felt like a kludge. They get too close to programming constructs bleeding thru into the declarative space of playbooks for my tastes. To me _facts and _info modules are more Ansible like IMHO – course grain and declarative than the lookup mechanism.

sivel commented 7 years ago

To be fair, I'm not a huge fan of lookups either. Other than for the simple use cases they provide for with_ loops. But many others seem to like them, so after discussion, and understanding what some see their use as, I included it in my run down.

sivel commented 7 years ago

Also, and because I can't get all my thoughts together at once to include it in one comment...

I think the heart of this is really how to implement list or query operations where people would normally think to make those a status of a pre-existing module.

How we go about making the fact gathering step more pluggable, although related, I don't think is necessary for this discussion, but I may be missing something there.

tima commented 7 years ago

In speaking to a lot of users in the field I've noticed there is a close correlation between people who like lookups and people who want to write playbooks in python. ;D

I've always been under the impression there is a lot of confusion with developers in the space of facts/metadata/info gathering -- when to pull the different levers -- where to store what comes back. The list and status issue is just one symptom of that overall.

thaumos commented 7 years ago

@dagwieers, I think the point that @sivel and @bcoca are trying to make is let's talk about this first for now, and work towards a better approach later.

We all want to make a better system, but for now let's agree on an approach for the interim.

_facts is the what the community is used to. We're starting to see an overload of parameters, because the current approach is broken in some way to the user, or they're just not aware of what options they do have because it's not documented.

I personally like lookups because I feel they are more flexible, and when it comes to anything outside of a host, it's a remote call to something. That's why _info doesn't make as much sense to me. They're pretty much the same in my book. _facts have always been host specific to me.

That all being said, I think @bcoca's list is the approach I'd like to take. I added my thumbs up to it. Feel free to +1 or -1 it.

bcoca commented 7 years ago

@tima as for 'where to store', currently when returning from a task you can 'register' a var, this is still 'host specific', the advantage of lookups is that you can setup a var at any time:

vars:
   ec2_sec_groups: "{{lookup('ec2', 'object=security_groups')}}"

The modularization of fact gathering is orthogonal to this discussion as this pertains to the 'pattern' we want modules returning information to follow (facts vs info vs query/list states), the other deals with making 'gather_facts' more user customizable.

tima commented 7 years ago

Perhaps I don't fully understand your answer about where to store info @bcoca. So I need to get the info about the VPC in AWS that I'm running in -- I have to remember to stick a register on that task? I also have to remember that name of the variable and make sure any other tasks/roles etc are using that same arbitrary variable? We can't so better for the end user and let the module automatically store that info in some standardized place? Really, what use is such a module if I'm fetching and storing it for later use?

That said, I'm fine with @bcoca's suggestion (which is just about the same as @sivel's?). Seems like we are in agreement that state=status or query need to be killed?

sivel commented 7 years ago

I don't know that I was explicit in calling this out, however the approach we discussed in the meeting was that the use of an _info module, would operate like any other module, and have no special internal handling. Should you want to store that data for future use, it would involve use of register.

That is really no different than the use of a list/query state. This is us effectively just saying that list/query are not states, and as such, those things should live as separate modules.

If it is a host fact, then a _facts module, returning ansible_facts is fine.

tima commented 7 years ago

I did get that @sivel. I was calling out this shortcoming I've experience working with users for a few years now. That's where the AWS VPC example came from -- "facts" that don't belong to a specific host but the environment/stack that are stable during a playbook run. It always seemed clunky and unnecessary work for end users that they often getting wrong -- like retrieving and storing the same info in each host's facts/variables.

bcoca commented 7 years ago

@tima are you suggesting that _info use a new 'auto merge' namespace as _facts does? ansible_info.var_from_module ?

tima commented 7 years ago

I'm suggesting that there needs to be a standard place to store non-host specific facts that _info modules can automatically stash their return values like a module that returns ansible_facts. How you manage/merge/munge those under the hood -- I don't have a specific opinion.

sivel commented 7 years ago

I am a big -1 on _info auto stashing data. My preference is sticking with register.

Otherwise, lets just use _facts, since there wouldn't be any difference in functionality.

I am a huge fan of being explicit, and I feel like we often get too implicit in implementations. We've gone over things like this before, where it was recommended that we 'auto' stash every module return, and making it available as _ or something, and we ultimately turned it down.

tima commented 7 years ago

I am a big -1 on _info auto stashing data. My preference is sticking with register.

Can you elaborate on why? Why would your typical end user not want that? (Note: this elite team of contributors is not your typical end user of Ansible)

Auto stashing every task return is excessive that I agree with not implementing that. But an _info module? What other purpose does that module provide then to return information to be accessed later? How would auto-stashing _info be inconsistent with _fact modules that we'd use a different implementation?

sivel commented 7 years ago

My opinion here is that auto stashing actually makes things more complicated. Imagine coming to look at a new playbook, and trying to find where a variable is set, and not being able to find it. Being less explicit, makes playbooks less readable.

How would auto-stashing _info be inconsistent with _fact modules that we'd use a different implementation?

I don't think I said that. I said I want it to be inconsistent with _fact modules. Otherwise, let's just use _fact. Why add another thing that serves the same purpose as an existing thing for sake of a new name?

I try to apply "The Zen of Python" to most of my decisions, as it makes a lot of sense to me pretty much everywhere. Based on this, auto-stashing feels to be in opposition to my general stance.

tima commented 7 years ago

My opinion here is that auto stashing actually makes things more complicated. Imagine coming to look at a new playbook, and trying to find where a variable is set, and not being able to find it. Being less explicit, makes playbooks less readable.

By that logic then we should deprecate auto-stashing of facts. This is exactly the same learning curve every user goes thru learning what and how to utilize facts. Trust me, I'm out in the field teaching new users of Ansible all the time now.

I don't think I said that. I said I want it to be inconsistent with _fact modules. Otherwise, let's just use _fact. Why add another thing that serves the same purpose as an existing thing for sake of a new name?

I thought the difference that we defined between _facts and _info modules is that one set of information is host-specific while the other is not. It is not how they are stored.

I try to apply "The Zen of Python" to most of my decisions, as it makes a lot of sense to me pretty much everywhere. Based on this, auto-stashing feels to be in opposition to my general stance.

This is what I meant when I said we here are not the common Ansible user. Most users of Ansible are not python programmers. Hell a significant and fast growing number aren't programmers at all. I appreciate the zen of python, but let's lose sight of our user base and where it's best applied.

sivel commented 7 years ago

I empathize with your points, but I fundamentally disagree with them. I'm going to let others respond here. My view points have been recorded.

bcoca commented 7 years ago

IF we do autostash, it MUST be namespaced, otherwise we end up with 'top level pollution as we have with facts currently.

Doing this on a 'non host' basis is a bit confusing as plays iterate over host ... so we execute the task for each host but then use global scope? what happens when hosts get conflicting data? i.e hostA is in us-east-1, hostB is in us-west-1, security group info is different or do you want _info to automatically bypass the host loop? then we have the opposite issue ...

bcoca commented 7 years ago

@dagwieers also you might be more interested in https://github.com/ansible/proposals/issues/17 , which already assumes a 'user defined' fact gathering settable per host

tima commented 7 years ago

@bcoca: I agree with these "info facts" being namespaced. That's been a given in my mind to avoid the exact sort of conflicts you jumped in with. The "info" is auto stashed/merged/updated in the global (I have no specific preference on the name) namespace and that one alone. If a user wants to access those global facts/info they specifically ask for them like they would the facts of a different host. They would not be merged and conflicts resolved with host facts by the system. That's treating "info" and "default facts" and that would be messy. That's not what I'm thinking here -- just a place to stash information in bulk discovered by modules that is not host-specific but rather utilized by all.

bcoca commented 7 years ago

but that is the basic issue, task/actions are always host specific

tima commented 7 years ago

Understood and correct. So how are we recommending facts that don't belong to a host (for instance some resource in a virtual/cloud environment) get retrieved and potentially stored by an end user?

They use lookups per host per fact -- "info" modules being banished from practice.
They use an "info" module that is run that the user registers the return per host. They use lookups for querying databases and registries etc in real-time.
They use an "info" module because they are just fact modules and get stored/merged into the facts of each host or they elect to stash them one arbitrary host like localhost. This is the status quo. lookups the same as 2.
They use an "Info" modules that are run once and are stored in a globally accessible place (namespaced but not belonging to a specific host) for all hosts to utilize. Lookups same as 2.

Did I miss an option?

bcoca commented 7 years ago

@tima, the 'non autoregister' combinations, this seems to work fine with some modules, like find.

We also have 'non _facts' modules returning facts, like getent or modules returning facts as a side effect i.e. docker_container.

I would add the category of 'allowed to add facts' as a side effect of modifying host, most 'create cloud instance/resource' would qualify.

tima commented 7 years ago

I don't mean to be obtuse, but I'm not following your point.

find is a query and host-specific. I'm not familiar with genet but that seems to be host specific also. (The advocates of lookups could argue those shouldn't be modules I suppose.) Stuff like VPC, security group, an ELB is not.

Here is an example from the docs:

- ec2_vpc_subnet_facts:
    filters:
      vpc-id: vpc-abcdef00
      "tag:Name": "{{ item }}"
  with_items:
    - publicA
    - publicB
    - publicC
  register: subnet_facts

This task will be run for each host in the play and each will have a subnet_facts with a copy of the same data in it's namespace. As a user, I have to have the presence of mind to use run_once (a separate play and stash them with a set_fact on some arbitrary host) so this task doesn't waste time and resources.

I also have to remember what var name I'm using and communicate that to the rest of my team and company if that info is going to be reused like in a role. If some other developer decides to use elb_subnet_facts because they don't know about subnet_facts or TL;DR the other developer's docs.

If this module just auto-stashed it's facts in a standard place under a namespace where all hosts can access it the need for a register, run_once and coordinating the naming of variables are all non-issues.

bcoca commented 7 years ago

i was pointing out find as a module that returns host specific info (facts?) but does not autoregister, getent does similar but DOES register facts, there are many existing cases.

ec2_vpc_subnet_facts is badly named, as it does not return facts, for me this would make a good case for a lookup. As for run_once + set_fact ... i think you missed the part that run_once registers facts for ALL hosts in the play.

vars:
  vpc_abcdef00_subnets: "{{lookup('ec2_vpc', id='vpc-abcdef00', tags=....}}"

abadger commented 7 years ago

I mostly agree with @sivel but am -1 on the proliferation of lookup plugins for this. The problem is that the author of the functionality usually cannot tell whether the code will have to be executed on a remote box or not. That's something that only the site-administrator can tell. So most things should be created as modules to give the site-administrator the flexibility to run this locally or on a remote box that can access the resource.

bcoca commented 7 years ago

+1 for adding lookups, but not for removing modules, as many have mentioned, the controller is not always the machine that has access to query the data.

that still leaves making these _info modules and how we deal with registration of the data.

tima commented 7 years ago

@bcoca:

i was pointing out find as a module that returns host specific info (facts?) but does not autoregister, getent does similar but DOES register facts, there are many existing cases.

That sounds like a mess and exactly why we need clearer guidance. Both of these examples are queries though and not fact gathering. _facts and _info modules should be able to run without any params.

ec2_vpc_subnet_facts is badly named, as it does not return facts, for me this would make a good case for a lookup.

Agreed that is badly named or (ideally) should be changed to return facts -- another example of why clear guidance is needed. I don't agree with that being a lookup for reasons I've already stated.

As for run_once + set_fact ... i think you missed the part that run_once registers facts for ALL hosts in the play.

The run_once and set_fact clauses are an OR not an AND. You either have to remember to stick a run_once on there OR run a separate play and stash them with a set_fact on some single arbitrary host like localhost. (How do you do the latter in the context of a role?) Most users won't be that savvy and will inadvertently run unnecessarily eat up time memory with unnecessary processing and data.

  vpc_abcdef00_subnets: "{{lookup('ec2_vpc', id='vpc-abcdef00', tags=....}}"

That looks plain awful and is exactly why it should be a module. We've said expanded/native YAML syntax for our docs and in our best practices and training. This runs contrary to the reasoning we've gone that way with the user base -- better readability.

Putting that aside, wouldn't the above lookup run every time the vpc_abcdef00_subnets variable is accessed? I supposed a lookup plugin could implement their own sort of caching but do we really want to go there?

@abadger:

I mostly agree with @sivel but am -1 on the proliferation of lookup plugins for this. The problem is that the author of the functionality...

Excellent point @abadger. I feel even stronger on the point and reiterate my -1 on the proliferation of lookup plugins for this also.

sivel commented 7 years ago

Interestingly enough, over on the bindep PR the consensus was to not make it a facts module because people don't like facts modules.

tima commented 7 years ago

I can see why they wouldn't use facts there. I'm not familiar with the bindep util but reading "retrieve a list of packages that are not already installed based on..." sounds like a query though I'm a bit unsure. None of the params are required, but then again every example use one or more params.

bcoca commented 7 years ago

@tima, to clarify, find and getent are not clearly 'query' modules as they do get 'host specific info', unlike ec2_vpc_subnet_facts which returns the 'subnets available' not the ones associated to a host. But yes, this mess is WHY this thread was started.

Also a note on set_fact as a way to run a template once (won't get reevaluated on var use):

set_fact:
  vpc_abcdef00_subnets: "{{lookup('ec2_vpc', id='vpc-abcdef00', tags=....}}"

Another thing we have been thinking is a !nonlazy var definition, but that is vaporware right now.

bcoca commented 7 years ago

I've also been toying with a better 'lookup sytnax':

vars:
   myquery: !lookup
       name: consul
       args:
           usrl: consul://here/is/it
           key:  mysecret

alikins commented 7 years ago

For '_info' data (data objects that are not host specific), I like the idea of '*_info' modules. I also agree they need to be namespaced. For '_info' data, if there is a name/id/address that can be used to find the values, '_info' access would be similar to fact access, but using the '_info' id instead of the host name.

The list of _info identifiers would act much like an 'object inventory'. var manager would have a data structure for the 'info' namespace, and the keys of vars['info'] would be the _info identifiers (which I'll call the 'info_id' for now). Getting the values of a info_id from the vars 'info' data would be an analog to getting the value of a fact for a particular host.

More or less inventory+host_vars but for things that are not hosts.

module return values could auto populate the 'info' data much like facts do. A current pb/play/block/task would need to know what 'info_id' it is acting on (similar to how they need to know the target host to properly set facts). This could be set in playbooks explicitly (a 'info_objects' keyword analog to 'hosts'). For _info only tasks (tasks that use an _info action/module) the 'info_objects' value could be used in lieu of the task target hostname.

In this scenario, _info gather is very similar to _fact gathering. Once finer grained fact gathering is available, ideally _info gather would share much of the design and implementation and would also gain fine grained gathering.

[Eventually, facts and inventory hosts data etc could be considered as just more specific instances of objects in 'info', where the 'info_idenditifier' just happens to be a hostname...]

sivel commented 7 years ago

One thing I've recently thought about for auto stashing, is in regards to consecutive runs of the info module. We have to curate and ensure that every info module returns a properly name spaced return for every use. Otherwise we would potentially overwrite what we auto stashed.

tima commented 7 years ago

Probably better if that namespacing was handled by the core engine rather than relying on each developer to execute it consistently and correctly.

sivel commented 7 years ago

If the core engine namespaces, and further does so to avoid collisions between consecutive runs, wouldn't that make it very difficult for the user to know where the data is implicitly stored?

We'd end up having to return the namespace as a register-able variable so that people could find it, at which point auto stashing doesn't make much sense.

maxamillion commented 6 years ago

@abadger I'm be interested in your thoughts for what a reasonable alternative solution to lookup plugins would be.

bcoca commented 6 years ago

I was thinking that if we can get consensus on the 3 points above we can also look at including future features to make this easier for users to control:

new task keyword aggregate: no|yes which controls if result overwrites registered variables/facts or merges with existing
add new config setting that allows 'autonamespacing' by 'action' under ansible_facts or just updating ansible_facts directly. ie ansible_facts.ec2_facts.<varname> vs ansilbe_facts.<varname>

felixfontein commented 5 years ago

In the last two Public Ansible Project Meetings, it has been decided that _facts modules should always return ansible_facts, and that _facts-like modules which don't return ansible_facts should be called _info.

nitzmahone commented 5 years ago

In today's meeting, it was also clarified that these policies will be applied starting in Ansible 2.9.

ansible / proposals

Standardize on usage of _facts #56

Proposal: Standardize usage of _facts

Motivation

Problems

Solution proposal

Documentation (optional)

Anything else?