Closed thaumos closed 4 years ago
I am for a lookup plugin based approach. This can be used per task rather than having an explicit task to gather facts to be used later.
I think the problem is that there is not a 1 size fits all approach. The history behind this is roughly adding list
or query
functionality to existing modules.
_facts
module is the correct decision, since we need remove execution. Example: apt_package_facts
or apt_facts
lookup("aws_lambda_functions")
_info
module should be created. Some contrived example, you need to query an app only accessible from specific machines, to get some arbitrary non-host information. Example: archive_info
, git_info
We have to account for a number of places where remote execution is needed, but not host facts, due to places where there is limited access to the service/system that is being queries.
I think we need a broader approach, as there is also a need to make the facts-gathering process modular. And I think this fits into the same need. We need a pluggable facts-gathering, that is also independently addressable (so some facts plugins you like to trigger in advance during the facts_gathering stage, other facts plugins you may wish to run on-demand, possibly with options)
So I mostly agree with @sivel
@dagwieers different discussion, this is on a standard on how to establish a pattern for modules to gather information.
refactoring gather_facts to be modular would be a different proposal
@bcoca Disagree, it has the same purpose, getting information from the system. Some stuff you want in the default namespace, other stuff you may want to register. A proper modular approach could have a single interface for everybody's needs. Whether it relates to any object (system, network, application, vm, vm snapshot, partition, ...) should not matter.
I'm in agreement with @bcoca, this is a proposal discussing the approach of effectively doing list
or query
operations. This likely doesn't involve any change to existing code, at least not immediately, instead it is just coming to an agreement on naming and how to implement going forward.
Making a change to the fact gathering step, to make it pluggable, should be a separate proposal.
It seems pointless if the other proposal would get rid of _facts or _stat modules altogether. But have it your way.
@dagwieers then make that proposal, it is 'related' but still not the same as this one. If it makes this one obsolete, so be it, but lets not hijack this discussion, make the proposal.
In general, I agree what @sivel said. I don’t believe there is one solution though I think we can cut down on the number we have in the wild. I think clarifying the use cases will help a lot here.
I really dislike the idea of state and query params in modules and support killing those practices as bad form.
I like the proposed _facts (facts that are host specific) & _info (not host specific) approach.
On the topic of “info” modules, I’ve always wondered where is the right place to store facts/info that are not host specific. That was an issue I raised over a year ago to the core team after a long engagement at a client doing a lot with AWS. They were repetitively storing vpc, security group etc data about the cloud environment in every host. While it worked it made for a lot of duplication and unnecessary processing and resource consumption.
Like @dagwieers I think we should be considering what this may mean to making the facts-gathering process modular. There needs to be a better and declarative way to define which fact modules need to run rather than individual tasks. Calling that discussion "hijacking" here is a bit harsh.
One area I disagree with @sivel is his reasoning for point 2 and the use of lookups. The seperation by local or remote execution seems like an implementation detail that is irrelavant. To me lookups are best suited for real-time access of specific pieces of data. “Facts” without remote execution can be implemented in a module easily and should be mostly transparent to an end user. I haven't been following all the work on the internals lately -- has such a non-host-specific fact store been added?
This all said, I’m not a fan of lookups in general because they’ve always felt like a kludge. They get too close to programming constructs bleeding thru into the declarative space of playbooks for my tastes. To me _facts and _info modules are more Ansible like IMHO – course grain and declarative than the lookup mechanism.
To be fair, I'm not a huge fan of lookups either. Other than for the simple use cases they provide for with_
loops. But many others seem to like them, so after discussion, and understanding what some see their use as, I included it in my run down.
Also, and because I can't get all my thoughts together at once to include it in one comment...
I think the heart of this is really how to implement list
or query
operations where people would normally think to make those a status
of a pre-existing module.
How we go about making the fact gathering step more pluggable, although related, I don't think is necessary for this discussion, but I may be missing something there.
In speaking to a lot of users in the field I've noticed there is a close correlation between people who like lookups and people who want to write playbooks in python. ;D
I've always been under the impression there is a lot of confusion with developers in the space of facts/metadata/info gathering -- when to pull the different levers -- where to store what comes back. The list
and status
issue is just one symptom of that overall.
@dagwieers, I think the point that @sivel and @bcoca are trying to make is let's talk about this first for now, and work towards a better approach later.
We all want to make a better system, but for now let's agree on an approach for the interim.
_facts is the what the community is used to. We're starting to see an overload of parameters, because the current approach is broken in some way to the user, or they're just not aware of what options they do have because it's not documented.
I personally like lookups because I feel they are more flexible, and when it comes to anything outside of a host, it's a remote call to something. That's why _info doesn't make as much sense to me. They're pretty much the same in my book. _facts have always been host specific to me.
That all being said, I think @bcoca's list is the approach I'd like to take. I added my thumbs up to it. Feel free to +1 or -1 it.
@tima as for 'where to store', currently when returning from a task you can 'register' a var, this is still 'host specific', the advantage of lookups is that you can setup a var at any time:
vars:
ec2_sec_groups: "{{lookup('ec2', 'object=security_groups')}}"
The modularization of fact gathering is orthogonal to this discussion as this pertains to the 'pattern' we want modules returning information to follow (facts vs info vs query/list states), the other deals with making 'gather_facts' more user customizable.
Perhaps I don't fully understand your answer about where to store info @bcoca. So I need to get the info about the VPC in AWS that I'm running in -- I have to remember to stick a register on that task? I also have to remember that name of the variable and make sure any other tasks/roles etc are using that same arbitrary variable? We can't so better for the end user and let the module automatically store that info in some standardized place? Really, what use is such a module if I'm fetching and storing it for later use?
That said, I'm fine with @bcoca's suggestion (which is just about the same as @sivel's?). Seems like we are in agreement that state=status
or query
need to be killed?
I don't know that I was explicit in calling this out, however the approach we discussed in the meeting was that the use of an _info
module, would operate like any other module, and have no special internal handling. Should you want to store that data for future use, it would involve use of register
.
That is really no different than the use of a list/query state. This is us effectively just saying that list/query are not states, and as such, those things should live as separate modules.
If it is a host fact, then a _facts
module, returning ansible_facts
is fine.
I did get that @sivel. I was calling out this shortcoming I've experience working with users for a few years now. That's where the AWS VPC example came from -- "facts" that don't belong to a specific host but the environment/stack that are stable during a playbook run. It always seemed clunky and unnecessary work for end users that they often getting wrong -- like retrieving and storing the same info in each host's facts/variables.
@tima are you suggesting that _info use a new 'auto merge' namespace as _facts does?
ansible_info.var_from_module
?
I'm suggesting that there needs to be a standard place to store non-host specific facts that _info modules can automatically stash their return values like a module that returns ansible_facts
. How you manage/merge/munge those under the hood -- I don't have a specific opinion.
I am a big -1 on _info
auto stashing data. My preference is sticking with register
.
Otherwise, lets just use _facts
, since there wouldn't be any difference in functionality.
I am a huge fan of being explicit, and I feel like we often get too implicit in implementations. We've gone over things like this before, where it was recommended that we 'auto' stash every module return, and making it available as _
or something, and we ultimately turned it down.
I am a big -1 on _info auto stashing data. My preference is sticking with register.
Can you elaborate on why? Why would your typical end user not want that? (Note: this elite team of contributors is not your typical end user of Ansible)
Auto stashing every task return is excessive that I agree with not implementing that. But an _info
module? What other purpose does that module provide then to return information to be accessed later? How would auto-stashing _info
be inconsistent with _fact
modules that we'd use a different implementation?
My opinion here is that auto stashing actually makes things more complicated. Imagine coming to look at a new playbook, and trying to find where a variable is set, and not being able to find it. Being less explicit, makes playbooks less readable.
How would auto-stashing _info be inconsistent with _fact modules that we'd use a different implementation?
I don't think I said that. I said I want it to be inconsistent with _fact
modules. Otherwise, let's just use _fact
. Why add another thing that serves the same purpose as an existing thing for sake of a new name?
I try to apply "The Zen of Python" to most of my decisions, as it makes a lot of sense to me pretty much everywhere. Based on this, auto-stashing feels to be in opposition to my general stance.
My opinion here is that auto stashing actually makes things more complicated. Imagine coming to look at a new playbook, and trying to find where a variable is set, and not being able to find it. Being less explicit, makes playbooks less readable.
By that logic then we should deprecate auto-stashing of facts. This is exactly the same learning curve every user goes thru learning what and how to utilize facts. Trust me, I'm out in the field teaching new users of Ansible all the time now.
I don't think I said that. I said I want it to be inconsistent with _fact modules. Otherwise, let's just use _fact. Why add another thing that serves the same purpose as an existing thing for sake of a new name?
I thought the difference that we defined between _facts
and _info
modules is that one set of information is host-specific while the other is not. It is not how they are stored.
I try to apply "The Zen of Python" to most of my decisions, as it makes a lot of sense to me pretty much everywhere. Based on this, auto-stashing feels to be in opposition to my general stance.
This is what I meant when I said we here are not the common Ansible user. Most users of Ansible are not python programmers. Hell a significant and fast growing number aren't programmers at all. I appreciate the zen of python, but let's lose sight of our user base and where it's best applied.
I empathize with your points, but I fundamentally disagree with them. I'm going to let others respond here. My view points have been recorded.
IF we do autostash, it MUST be namespaced, otherwise we end up with 'top level pollution as we have with facts currently.
Doing this on a 'non host' basis is a bit confusing as plays iterate over host ... so we execute the task for each host but then use global scope? what happens when hosts get conflicting data? i.e hostA is in us-east-1, hostB is in us-west-1, security group info is different or do you want _info to automatically bypass the host loop? then we have the opposite issue ...
@dagwieers also you might be more interested in https://github.com/ansible/proposals/issues/17 , which already assumes a 'user defined' fact gathering settable per host
@bcoca: I agree with these "info facts" being namespaced. That's been a given in my mind to avoid the exact sort of conflicts you jumped in with. The "info" is auto stashed/merged/updated in the global (I have no specific preference on the name) namespace and that one alone. If a user wants to access those global facts/info they specifically ask for them like they would the facts of a different host. They would not be merged and conflicts resolved with host facts by the system. That's treating "info" and "default facts" and that would be messy. That's not what I'm thinking here -- just a place to stash information in bulk discovered by modules that is not host-specific but rather utilized by all.
but that is the basic issue, task/actions are always host specific
Understood and correct. So how are we recommending facts that don't belong to a host (for instance some resource in a virtual/cloud environment) get retrieved and potentially stored by an end user?
Did I miss an option?
@tima, the 'non autoregister' combinations, this seems to work fine with some modules, like find
.
We also have 'non _facts' modules returning facts, like getent
or modules returning facts as a side effect i.e. docker_container
.
I would add the category of 'allowed to add facts' as a side effect of modifying host, most 'create cloud instance/resource' would qualify.
I don't mean to be obtuse, but I'm not following your point.
find
is a query and host-specific. I'm not familiar with genet
but that seems to be host specific also. (The advocates of lookups could argue those shouldn't be modules I suppose.) Stuff like VPC, security group, an ELB is not.
Here is an example from the docs:
- ec2_vpc_subnet_facts:
filters:
vpc-id: vpc-abcdef00
"tag:Name": "{{ item }}"
with_items:
- publicA
- publicB
- publicC
register: subnet_facts
This task will be run for each host in the play and each will have a subnet_facts
with a copy of the same data in it's namespace. As a user, I have to have the presence of mind to use run_once
(a separate play and stash them with a set_fact on some arbitrary host) so this task doesn't waste time and resources.
I also have to remember what var name I'm using and communicate that to the rest of my team and company if that info is going to be reused like in a role. If some other developer decides to use elb_subnet_facts
because they don't know about subnet_facts
or TL;DR the other developer's docs.
If this module just auto-stashed it's facts in a standard place under a namespace where all hosts can access it the need for a register
, run_once
and coordinating the naming of variables are all non-issues.
i was pointing out find as a module that returns host specific info (facts?) but does not autoregister, getent does similar but DOES register facts, there are many existing cases.
ec2_vpc_subnet_facts is badly named, as it does not return facts, for me this would make a good case for a lookup. As for run_once + set_fact ... i think you missed the part that run_once registers facts for ALL hosts in the play.
vars:
vpc_abcdef00_subnets: "{{lookup('ec2_vpc', id='vpc-abcdef00', tags=....}}"
I mostly agree with @sivel but am -1 on the proliferation of lookup plugins for this. The problem is that the author of the functionality usually cannot tell whether the code will have to be executed on a remote box or not. That's something that only the site-administrator can tell. So most things should be created as modules to give the site-administrator the flexibility to run this locally or on a remote box that can access the resource.
+1 for adding lookups, but not for removing modules, as many have mentioned, the controller is not always the machine that has access to query the data.
that still leaves making these _info modules and how we deal with registration of the data.
@bcoca:
i was pointing out find as a module that returns host specific info (facts?) but does not autoregister, getent does similar but DOES register facts, there are many existing cases.
That sounds like a mess and exactly why we need clearer guidance. Both of these examples are queries though and not fact gathering. _facts
and _info modules
should be able to run without any params.
ec2_vpc_subnet_facts is badly named, as it does not return facts, for me this would make a good case for a lookup.
Agreed that is badly named or (ideally) should be changed to return facts -- another example of why clear guidance is needed. I don't agree with that being a lookup for reasons I've already stated.
As for run_once + set_fact ... i think you missed the part that run_once registers facts for ALL hosts in the play.
The run_once
and set_fact
clauses are an OR not an AND. You either have to remember to stick a run_once
on there OR run a separate play and stash them with a set_fact on some single arbitrary host like localhost. (How do you do the latter in the context of a role?) Most users won't be that savvy and will inadvertently run unnecessarily eat up time memory with unnecessary processing and data.
vpc_abcdef00_subnets: "{{lookup('ec2_vpc', id='vpc-abcdef00', tags=....}}"
That looks plain awful and is exactly why it should be a module. We've said expanded/native YAML syntax for our docs and in our best practices and training. This runs contrary to the reasoning we've gone that way with the user base -- better readability.
Putting that aside, wouldn't the above lookup run every time the vpc_abcdef00_subnets
variable is accessed? I supposed a lookup plugin could implement their own sort of caching but do we really want to go there?
@abadger:
I mostly agree with @sivel but am -1 on the proliferation of lookup plugins for this. The problem is that the author of the functionality...
Excellent point @abadger. I feel even stronger on the point and reiterate my -1 on the proliferation of lookup plugins for this also.
Interestingly enough, over on the bindep PR the consensus was to not make it a facts module because people don't like facts modules.
I can see why they wouldn't use facts there. I'm not familiar with the bindep util but reading "retrieve a list of packages that are not already installed based on..." sounds like a query though I'm a bit unsure. None of the params are required, but then again every example use one or more params.
@tima, to clarify, find and getent are not clearly 'query' modules as they do get 'host specific info', unlike ec2_vpc_subnet_facts which returns the 'subnets available' not the ones associated to a host. But yes, this mess is WHY this thread was started.
Also a note on set_fact
as a way to run a template once (won't get reevaluated on var use):
set_fact:
vpc_abcdef00_subnets: "{{lookup('ec2_vpc', id='vpc-abcdef00', tags=....}}"
Another thing we have been thinking is a !nonlazy
var definition, but that is vaporware right now.
I've also been toying with a better 'lookup sytnax':
vars:
myquery: !lookup
name: consul
args:
usrl: consul://here/is/it
key: mysecret
For '_info' data (data objects that are not host specific), I like the idea of '*_info' modules. I also agree they need to be namespaced. For '_info' data, if there is a name/id/address that can be used to find the values, '_info' access would be similar to fact access, but using the '_info' id instead of the host name.
The list of _info identifiers would act much like an 'object inventory'. var manager would have a data structure for the 'info' namespace, and the keys of vars['info'] would be the _info identifiers (which I'll call the 'info_id' for now). Getting the values of a info_id from the vars 'info' data would be an analog to getting the value of a fact for a particular host.
More or less inventory+host_vars but for things that are not hosts.
module return values could auto populate the 'info' data much like facts do. A current pb/play/block/task would need to know what 'info_id' it is acting on (similar to how they need to know the target host to properly set facts). This could be set in playbooks explicitly (a 'info_objects' keyword analog to 'hosts'). For _info only tasks (tasks that use an _info action/module) the 'info_objects' value could be used in lieu of the task target hostname.
In this scenario, _info gather is very similar to _fact gathering. Once finer grained fact gathering is available, ideally _info gather would share much of the design and implementation and would also gain fine grained gathering.
[Eventually, facts and inventory hosts data etc could be considered as just more specific instances of objects in 'info', where the 'info_idenditifier' just happens to be a hostname...]
One thing I've recently thought about for auto stashing, is in regards to consecutive runs of the info module. We have to curate and ensure that every info module returns a properly name spaced return for every use. Otherwise we would potentially overwrite what we auto stashed.
Probably better if that namespacing was handled by the core engine rather than relying on each developer to execute it consistently and correctly.
If the core engine namespaces, and further does so to avoid collisions between consecutive runs, wouldn't that make it very difficult for the user to know where the data is implicitly stored?
We'd end up having to return the namespace as a register-able variable so that people could find it, at which point auto stashing doesn't make much sense.
@abadger I'm be interested in your thoughts for what a reasonable alternative solution to lookup plugins would be.
I was thinking that if we can get consensus on the 3 points above we can also look at including future features to make this easier for users to control:
aggregate: no|yes
which controls if result overwrites registered variables/facts or merges with existingansible_facts
or just updating ansible_facts
directly. ie ansible_facts.ec2_facts.<varname>
vs ansilbe_facts.<varname>
In the last two Public Ansible Project Meetings, it has been decided that _facts
modules should always return ansible_facts
, and that _facts
-like modules which don't return ansible_facts
should be called _info
.
In today's meeting, it was also clarified that these policies will be applied starting in Ansible 2.9.
Proposal: Standardize usage of _facts
Author: Dylan Silva <@thaumos> IRC: thaumos
Date: 2016/03/01
Motivation
There are many modules that have implemented different approaches to getting information about an object. Some users are starting to overload state= to gather information. In other cases, _facts modules are used to gather data. There are also lookup plugins.
Define the standard approach to this. So for any new PR's, there will be a single approach to this.
Problems
What problems exist that this proposal will solve?
Solve the inconsistency in Ansible for the one true way of approaching creating, updating, deleting, and providing information of objects.
There are also misconceptions how *_facts modules are related to ansible_facts.
Solution proposal
There are multiple approaches to this. We need to come up with one way. Listed are all the discussed options.
Documentation (optional)
Update documentation to reflect the new decided standard.
Anything else?
This proposal is meant to solicit discussion and come to an agreement.
After the decision is made work with community in adopting new standard and towards migrating existing approaches to the new standard.