ansible / proposals

Repository for sharing and tracking progress on enhancement proposals for Ansible.

Hold off and redesign Declarative Intent #70

Closed dagwieers closed 6 years ago

dagwieers commented 6 years ago

Proposal: Redesign Declarative Intent

Author: Dag Wieers <@dagwieers> IRC: dag

Date: 2017/09/15

Motivation

During AnsibleFest SF 2017 it became clear that the Declarative Intent functionality that was designed as part of the ansible-network roadmap is flawed in a few areas.

This proposal does not question the merit of this needed functionality, but some design decisions require improvements before this is ready for wider consumption.

Problems

What problems exist that this proposal will solve?

Solution proposal

Documentation (optional)

Hold off with documenting and promoting this as part of v2.4, so we don't have to support this going forward in light of a user-breaking redesign.

Anything else?

The above feedback was provided during the AnsibleFest SF 2017 Network Working Group and was largely unanimous among participants.

This proposal was made to collect feedback from other participants in order to design a better user interface for this functionality before it becomes a wider standard.

Also collecting pros/cons for each solution in https://public.etherpad-mozilla.org/p/ansible-proposal-70. Please add your thoughts here and they will be rolled into this proposal doc by @gundalow.

jmcgill298 commented 6 years ago

+1

I agree with the language of intended vs expected and that there needs to be great consideration given to how these two are integrated.

bcoca commented 6 years ago

i think this is a bit too much of a semantic discussion 'expectations' vs 'intentions' ... you 'expect' things to result as you 'intend'.

In any case this already boils down to what the main project has always advocated, 'declare' vs 'mandate' or that we phrase options and develop actions in a way that is declarative over the traditional imperative.

So anything in a 'state=present' is expected AND intended to have those values, i see no need for an 'expect' extra keyword as all modules should already 'declare expected/intended state'.
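
For illustration, the usual declarative pattern already reads like a statement of intended state (a minimal sketch; package and service names are arbitrary):

- name: ensure nginx is installed
  package:
    name: nginx
    state: present

- name: ensure nginx is running and enabled
  service:
    name: nginx
    state: started
    enabled: yes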

itdependsnetworks commented 6 years ago

@bcoca This is why this discussion needs to happen. Configuration state is not operational state, and thus the confusion. Please do not dismiss this as semantic discussion.

dagwieers commented 6 years ago

@bcoca: Except that in this case state and tx_rate are outside of our control. We cannot control them; hence, it is not an intention. The documentation states "Ansible is not a monitoring tool", but that is exactly what these parameters are for: to monitor whether these values are exactly as we expect them to be.

(And yes, the use of state is highly confusing in these examples, which makes it even worse overall, but a good example of all the things that are wrong with this functionality.)

bcoca commented 6 years ago

@itdependsnetworks that is the one part I see having merit, as 'intended' vs 'active' state is normally not as prevalent in servers as it is in appliances, i.e. normally we restart services as soon as possible to remove the difference.

@dagwieers if you are using parameters for comparison and not in an effort to affect the actual state, I would argue that is an incorrect usage of state, which is declarative.

dagwieers commented 6 years ago

@bcoca So you agree with me, without knowing it :-)

There are 3 options:

Right now it is bolted on without understanding the consequences, long-term vision or cross-domain insights.

itdependsnetworks commented 6 years ago

@bcoca I think the networking space is largely different, and your word carries a lot of weight, so I expect "not as prevalent in servers as it is in appliances" to be taken as true often. Whether it is this, or calling filters that I use daily "extremely specialized", it ultimately can undermine Ansible producing the best product possible.

dagwieers commented 6 years ago

@bcoca From the DI documentation

Declarative intent modules also take a set of options that allow playbook designers to specify the desired or intended operational state of the resource. These options allow tasks to define the normalized operating state of the resource once the configuration has been completed. State options do not make configuration changes to the resource. They serve to validate that a given resource has normalized its operating environment to reflect the intended state. If the operating state does not reflect the intended (or desired) state of the resource as described in the playbook task, the module is considered failed and remediation actions can be taken in the playbook during the playbook execution.

So it's more like an assert built into the module. We might as well do:

- name: configure interface
  net_interface:
    name: GigabitEthernet0/2
    description: public interface configuration
    enabled: yes
  assertions:
    link: connected
    tx_rate: 10gb

And make this directive part of every module. Would make writing integration tests easier :-) (i.e. no need to register and program conditionals in failed_when)
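
For comparison, the way you would express the same check today is roughly the following (a sketch; the facts module name and return keys are assumptions, not an existing interface):

- name: configure interface
  net_interface:
    name: GigabitEthernet0/2
    description: public interface configuration
    enabled: yes

- name: gather interface facts
  net_interface_facts:
  register: iface

- name: assert the operational state
  assert:
    that:
      - iface.link == 'connected'
      - iface.tx_rate == '10gb'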

dagwieers commented 6 years ago

@itdependsnetworks I don't agree, even on servers there are states that we cannot influence directly. The state of a cluster, the network link or negotiated speed, ... anything that is a given and could be changed outside of our control is worthwhile to validate. The question is if this should be some special parameter, or a return value that we can check as part of the task, or a separate assert-task (as we do with integration tests).

bcoca commented 6 years ago

I think my point was missed, that we are getting stuck in the semantics vs the 2 real differences we have here.

jmcgill298 commented 6 years ago

@bcoca I think the people joining this thread are so far in agreement that there is an issue with creating a "check against" type, and are trying to find a better solution that 1) treats configuration separately from operation, and 2) makes it easier for users not only to ensure the configuration was made, but also that the configuration resulted in the correct operational state.

Not to give any preference to @dagwieers suggestion, but I do see where the assertions appended to the task are easier to consume and do not create an additional type.

bcoca commented 6 years ago

my preference is to just use include_role/tasks to do the same operations, as this is basically 'get facts + compare/assert' ... I know they are not w/o problems, but I would prefer to put the effort into finishing bringing them up to spec vs creating a whole new breed.
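
A rough sketch of what such a reusable tasks file could contain (role layout, facts module and variable names are illustrative only):

# roles/verify_net_interface/tasks/main.yml (illustrative)
- name: gather current interface facts
  net_interface_facts:
    name: "{{ interface_name }}"
  register: iface_facts

- name: compare against the intended operational state
  assert:
    that:
      - iface_facts.link == expected_link
      - iface_facts.tx_rate == expected_tx_rate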

privateip commented 6 years ago

Just a quick historical note: DI was conceived and much of it built before we had persistent connections. It was needed to overcome some of the limitations we had with the connection module (pre-persistence). Now that is no longer the case, we have the opportunity to reconsider this from the ground up. I think I would be more in favor of @bcoca's suggestion of better facts + asserts long term.

dagwieers commented 6 years ago

@privateip This is confusing me, because we announced this at AnsibleFest SF last week for the complete audience as part of the v2.4 innovations in the networking space. So are we announcing things we already know we don't want to do going forward ?

privateip commented 6 years ago

No the work was done and we decided to move forward with it for networking modules. My comments are in the context of considering a more broadly implementable solution across all of Ansible.

dagwieers commented 6 years ago

That's what I think is not a good idea. I don't think we want different Ansible concepts depending on the related technology domain. It's too confusing IMO.

privateip commented 6 years ago

Agreed ... if we adopt an Ansible wide solution to this I'm fully on board to refactoring the network modules to align with the strategy.

gundalow commented 6 years ago

The Core Network team has agreed that we will not add the technical guide (https://github.com/ansible/ansible/pull/26954) for Declarative Intent into 2.4.0 to avoid confusing the end users.

ganeshrn commented 6 years ago

@bcoca Can you please elaborate on the include_role/tasks approach that you proposed in the above comment?

bcoca commented 6 years ago

@ganeshrn basically, just use tasks to do the same thing vs having to encode it into action/module.

abadger commented 6 years ago

I like both bcoca's ideas and dag's ideas at the moment. bcoca's idea could be done without code changes. But it means we need to start shipping and supporting pre-baked roles. What does that look like?

dag's idea is a much nicer UI than the current declarative intent proposal. It requires changes to the playbook language. However, there's no information on implementation so there's more design work to do before we could spec out just how much work this would be (and the caveats).

samdoran commented 6 years ago

The main debate seems to center around creating new functionality built into modules, relying on existing capabilities that can be handled by registering the output of a task and testing it with another task, or coming up with an elegant way to leverage existing capabilities in Ansible but with a better UI for modules.

Validating the operational state of a system after Ansible makes changes has always been possible, but it's not done as commonly on non-network operating systems. Just trust the module to change the system, and trust the system to change its state.

For the times when that doesn't get the job done (problematic services that don't really start, or DBs that take a while to become ready), we have register, assert, until, wait_for, etc. The downside to these is that they put the burden on the playbook author to put all the pieces of validation together. This can become repetitive very quickly if you need to validate the operational state after every task, which is much more common on network operating systems.
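
For the "DB that takes a while" case, those existing building blocks look roughly like this (a sketch; the service name and readiness command are examples, timings arbitrary):

- name: restart the database
  service:
    name: postgresql
    state: restarted

- name: wait until the database accepts connections
  command: pg_isready -q
  register: db_ready
  until: db_ready.rc == 0
  retries: 30
  delay: 10
  changed_when: false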

We already have the trappings of this type of thing with validate on some modules and wait_for on networking modules (I think that's still a thing). Rather than inventing a third "check against" type of module, how can we unify these existing features and present a better UI to accomplish configuration change and state validation succinctly? We have the technology, as they say.

I think this could be a shorthand/alternative syntax for register + assert/wait_for but reuse existing functionality in Ansible without having to create something new or retool modules (maybe that's far too lofty, but one can dream).


Etherpad for reference/real time collaboration.

bcoca commented 6 years ago

I would say modules already should check that the desired state is achieved; in most cases this is done by checking the 'success' of the tool/api/function call invoked and trusting THAT. What they don't normally do is arbitrary 'extra checks' outside of the commands/API they use, i.e. after systemctl restart tomcat it doesn't make sense that the systemd module also checks http://localhost:8080/status.xml for a 200 http response.
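
If you do want that extra check, today it is simply a second task (a sketch re-using the URL from the example above):

- name: restart tomcat
  systemd:
    name: tomcat
    state: restarted

- name: verify tomcat answers on its status page
  uri:
    url: http://localhost:8080/status.xml
    status_code: 200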

The validate option is not exactly what most think; it is pretty limited. It does NOT check that the module did its work (that copy/template produced a file), but that the 'work product' is accepted/usable by the destination tool BEFORE putting it into place. This is not the same as the features described above.
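
For reference, the documented validate pattern looks like this (the template name is illustrative; the visudo check is the standard example):

- name: deploy a sudoers drop-in, refusing to install it if visudo rejects it
  template:
    src: admins.sudoers.j2
    dest: /etc/sudoers.d/admins
    validate: /usr/sbin/visudo -cf %s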

I'm hesitant about adding something like this built into the language as it will always be more limited than what a 'full fledged task' can do. Even something like an 'assert strategy' that can interject tasks before/after execution of each task would be a lot more flexible to use and implement (openshift/stack has one already).

dagwieers commented 6 years ago

@bcoca I agree with most of what you say; my only objections (and for me these trump mostly everything else) are convenience and complexity.

Writing a role per task for this seems inconvenient. Even if these roles already exist (for every network device?), there is the burden of selecting, trusting and managing roles, and the complexity of understanding and troubleshooting issues when they arise. If you abstract this away through roles and it fails, the error throws you into that complexity.

And the great strength of Ansible is that it makes hard/complex tasks easy through a simple interface. Do we move the complexity into roles (user-facing) or into the core (abstracted away in a single task)?

If my example was too limited compared to assert that's solved by doing this:

- name: configure interface
  net_interface:
    name: GigabitEthernet0/2
    description: public interface configuration
    enabled: yes
  assertions:
  - this.link == 'connected'
  - this.tx_rate == '10gb'

True, it doesn't offer the same as one or more tasks could offer, but maybe it fits 80% of the use-cases, and the remaining part can still add the required tasks.

mikedlr commented 6 years ago

Specific example of situation where the result of the module may not be what we want:

When we create an RDS instance, we have the option to wait, with a timeout. If we do wait, sometimes the RDS instance is up and running as expected within the timeout; sometimes it takes longer, up to over 10 minutes.

It would be very useful to have actions like creating a database in the DBMS instance happen at the point when the RDS instance has finally started working. I currently do exactly that: create the instance without waiting, then do some other slow actions (create ec2 instances), then come back and wait for the instance to be up before continuing with further dependent configuration.
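
A rough sketch of that ordering with the rds module (arguments abbreviated; the facts return structure is an assumption):

- name: start creating the RDS instance without waiting for it
  rds:
    command: create
    instance_name: mydb
    db_engine: MySQL
    size: 10
    instance_type: db.m1.small
    username: admin
    password: "{{ db_password }}"
    wait: no

# ... other slow tasks (e.g. creating ec2 instances) run here ...

- name: come back and wait until the RDS instance is available
  rds:
    command: facts
    instance_name: mydb
  register: rds_info
  until: rds_info.instance.status == 'available'   # return structure assumed
  retries: 40
  delay: 30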

mikedlr commented 6 years ago

My proposal would be something like:

- name: check that interface is ready
  assert:
    module: net_interface_facts
      link: connected
      tx_rate: 10gb
    retry_timeout: 200
    assertion:
      - "succeeded|true"

- name: configure interface
  net_interface:
    name: GigabitEthernet0/2
    description: public interface configuration
    enabled: yes

The "assert" (not sure if it should share a name with normal assert, I currently think so) becomes a plugin that runs other modules. The modules can be fact modules or standard modules run in check mode. It keeps retrying until it has success or times out. By running it without retry it simply fails and stops the script.

This can then be used with any existing module which supports check mode (or is a proper facts module) and at the same time solves both the verification and the make sure it's happened cases.

bcoca commented 6 years ago

@abadger, @dagwieers one of the advantages of implementing this in roles/includes is that users don't need a programmer to extend or add these behaviors, a few 'reference implementations' is all that is needed for a much wider audience to start producing coverage for more cases.

That said, looking at dag's example, yes it is nice and concise, but i would argue that it introduces many 'uncontrolled' questions such as:

  • which other module do i execute to gather the facts?
  • do i use the same connection/become information?
  • how do i handle errors caused with this 'hidden task'?
  • do i expose the results? how?
  • loops?

I would argue there is a simpler way to do the same thing, which just needs some extra support in modules. I'll use systemd as an example as it already returns 'state after action' information in its response:

- name: restart tomcat
  systemd:
    name: tomcat
    state: restarted
    enabled: yes
  failed_when:   # basically a 'reverse assert'
    - result.ActiveState not in ['Active', 'Activating']
    - result.UnitFileState != "enabled"
    - result.UMask != "0022"

Like with @mikedlr's 'assert wrapping modules', this would require minor modifications to modules to return the 'after action state', probably with a toggle as this might be pretty large and time consuming in some cases. If we really need to 'build in assert', I believe this is probably the most effective way and should be less onerous on users as it does not duplicate failed_when with a new assertions keyword.

privateip commented 6 years ago

+1 to @bcoca's proposal. We are essentially doing a form of this today with *_command modules (see the wait_for argument here). However, making this a task directive instead of baking it into the module seems to align much more universally across the playbook while still maintaining the visual separation of config vs state.

dagwieers commented 6 years ago

@bcoca This is very close to what I am proposing. The difference between failed_when and assertions is that assertions would allow you to use a different failure-path (e.g. ignore them and continue with the playbook). Since we don't control this state, we may not want to fail on them in certain cases.

It would mean you can control how ansible-playbook behaves by default, and on-demand. (See discussion above about this being the state of the cluster, or the state of an interface link).

  • which other module do i execute to gather the facts?

This example expects the information to come from the module. So yes the module exposes this additional information relating the object being managed. (In case of network modules, it would implement the required delay to intercept possible changes related to the task, just as it was with DI)

  • do i use the same connection/become information?

It is still one task, so yes I guess.

  • how do i handle errors caused with this 'hidden task'?

That is up to the user if he wants this to be a possible failure case or not. I can see use-cases for both fail on assertion failure, or continue with the playbook.

  • do i expose the results? how?

Either it works or it fails (and the user wants it to fail). If the user does not want to fail on assertions, exposing a warning seems useful.

  • loops?

Would work because it's just a task with some additional handling on exit.

dagwieers commented 6 years ago

BTW I always wanted to make the proposal to introduce this as the default registered variable for the task result values, which you could use with changed_when, failed_when and now assertions.

- command: foo bar
  changed_when: this.rc in [2, 3]
  failed_when: this.rc >= 4

Avoids having to register and is much cleaner if you don't need to expose it to other tasks.

bcoca commented 6 years ago

@dagwieers so i think mixing up 3 features takes us nowhere:

dagwieers commented 6 years ago

@bcoca We need a way to influence whether the assertions fail or not at run-time. Sometimes you want to test the operational state as part of the playbook (because you want to verify things are exactly as you like), but at other times you want to perform all tasks regardless of the operational state, because you can fix those after running that playbook.

This means we need a run-time flag to influence the behavior (fail, warn or ignore ?), and a default behaviour.

failed_when is pretty specific to that, it defines when the task must fail. But if we have no direct control over the operational state we are testing, we may not want to fail.

dagwieers commented 6 years ago

Talking about use-cases and user interfaces makes perfect sense to get to something useful; it doesn't matter that we are building new concepts that are useful to a wider audience. Not sure why you consider this taking us nowhere.

Kind of condescending, especially if you ignore half of what I am saying and adopt the other half :-)

I will make the proposal for this and we can consider it as a given here.

bcoca commented 6 years ago

This means we need a run-time flag to influence the behavior (fail, warn or ignore ?), and a default behaviour.

ignore_errors?

I'm not sure how that is condescending. I'm just pointing out that if you keep adding features on top of the existing discussion, it will extend infinitely; discussing each feature separately will limit the scope and length of such discussions.

dagwieers commented 6 years ago

You can't design an equivalent solution for DI by discussing each "feature" independently. That will bring us nowhere, because we don't know yet what we want. (Look at how this evolved from top to bottom).

The condescending part is that you ignore half of what I am saying.

No, ignore_errors does not help control these assertions at run-time. It does not separate what is a real module failure from operational state checks.

Take the original example:

- name: configure interface
  net_interface:
    name: GigabitEthernet0/2
    description: public interface configuration
    enabled: yes
  assertions:
  - this.link == 'connected'
  - this.tx_rate == '10gb'

You may want to run it with the behavior that a failed assertion is ignored, or only produces a warning. (Some flag; I don't know how to name it, e.g. --failed-assertions=warn. I am sure people can come up with a better name; I am not married to a specific name either. We called it expectations before.)

In your case:

- name: configure interface
  net_interface:
    name: GigabitEthernet0/2
    description: public interface configuration
    enabled: yes
  failed_when:
  - this.link != 'connected'
  - this.tx_rate != '10gb'

What if there is no link on one interface, you're running this in production, and a lot of important configuration follows? What do you do if this is an urgent operation in a maintenance window? Sure, the fact that we have a link down is something that needs to be looked at, but does it warrant stopping the complete execution? I don't think so. A bad cable should not block you from performing other actions.

So let's add ignore_errors in the playbook; now it will continue even if the module fails hard for things we do control and that should work. Operational states are a different beast than configuration items we control ourselves, and they may need to be handled differently.

bcoca commented 6 years ago

Well, ignore_errors is also templatable and can handle complex conditions, but there is no need to go that far when failed_when can also do the same.

  failed_when:
  - assertions_fatal|bool and (this.link != 'connected' and this.tx_rate != '10gb') or this|failed

In any case, your examples did not show an additional 'toggle' for assertions being fatal or not, which should not be limited to a global scope as you might want to handle that at a block/role/include/task level also, so you need another keyword for that too.

dagwieers commented 6 years ago

Right, but then it becomes an inconvenient and very custom thing. Again, that's where designing this as a new concept in core makes it easier and valuable for people to use. That's how we evolved from custom shell scripts, to templated playbooks, to using Jinja2, to proper YAML playbooks and various neat features.

The same reason why I think using roles, while technically possible, is inconvenient and overly complex.

The proposal is still growing as we discuss, and it depends on naming, functionality and use-cases. Whether we want to handle this per block/role/include/task is something to discuss. My use-cases wouldn't require it, but I don't want to force my use-case on everyone and wouldn't mind if this is configurable at every level.

(Remember this was all to find an alternative and more versatile solution for Declarative Intent, so I would like to learn from the Networking people how they would be using this before making any decisions. However I can see myself using this in playbooks and for integration testing.)

itdependsnetworks commented 6 years ago

Agreed with @dagwieers it is certainly possible to do this with roles, just inconvenient and overly complex.

bcoca commented 6 years ago

i would argue it is less complex with roles; you are just shifting the group of people that can handle this from people that can write plays/roles to developers of the modules and core engine.

No matter which avenue we take, we need to write this for ALL the OSs/functions/etc; i find that creating 'reference roles' as I mention above will get us there a lot sooner than 'reference modules' and core features.

gundalow commented 6 years ago

Also collecting pros/cons for each solution in https://public.etherpad-mozilla.org/p/ansible-proposal-70

itdependsnetworks commented 6 years ago

I think one thing to remember is that this functionality already exists. @bcoca, is your intention to remove this functionality?

gundalow commented 6 years ago

What would be useful is if everyone[1] describes what they believe the problem we are trying to solve is.

[1] Yes, that includes you

gundalow commented 6 years ago

As a network engineer running a Playbook, how can I ensure that at the end of a Playbook run my network is configured correctly? Given that my network contains things that Ansible cannot configure (external factors), such as ensuring that a physical cable is plugged in and linked to a switch, and that the link is UP.

If a check fails (cable unplugged) I want the ability to either roll back configuration or fail at that point

itdependsnetworks commented 6 years ago

1 - Updating the operational state to be a dict while keeping the same functionality
2 - Make this an abstract enough feature that can be applied to other modules

gundalow commented 6 years ago

All, We plan to conclude the discussion on Wednesday 4th October, so please ensure all your thoughts are added by then.

rcarrillocruz commented 6 years ago

I'm going to drop my thoughts:

  1. While this can be done with roles, I don't think we should rely on roles. I tend to see roles as last-resort thing to use when modules don't give me all options, and this is a feature that is general enough to put into modules IMHO.
  2. I really want this to be a general thing, not a bolt-on networking thing.
  3. I like where the whole 'assertions' thing is going, but as @bcoca said I'm a bit torn about adding a new keyword, as there's overlap with failed_when. I would rather augment failed_when's expressiveness than add a new one.

How about something like:

- name: configure interface
  net_interface:
    name: GigabitEthernet0/2
    description: public interface configuration
    enabled: yes
  failed_when:
    stop_on_failure: no
    assert:
      - stop_on_failure: yes
        that: "this.link != 'connected'"
      - that: this.tx_rate != '10gb'
        warn_msg: "ZOMG, your link is not 10Gb, however you can move on"
      - that: this.rx_rate != '10gb'

That way we could use failed_when as the current 'fail in case this list of assertions fails', but we also augment it with finer-grained control to say 'ok, in general I don't want to stop the play if asserts are false, just warn (the message could be configured on an assert-by-assert basis), BUT treat each one of the assertions individually as they may change behaviour as needed'. This also uses the assert/that keywords that are already available in the Ansible language, making it easier to learn for existing users. I expect we could also reuse/refactor logic from the assert module and librify that functionality to use it here.

  4. While we can template ignore_errors, it becomes complex quickly; we should be able to express things more easily, as in the example above.

bcoca commented 6 years ago

A few things about the 'roles approach':

So i would dedicate a short time to creating a couple of these roles, share them on galaxy and see what happens. Even if you want to build this into the core language and update every module to support it, the roles would still work regardless and provide feedback from users in a way that is quick and easy to respond to.
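
A consuming playbook would then stay short, something like the following (role and variable names are purely illustrative):

- name: configure interface
  net_interface:
    name: GigabitEthernet0/2
    description: public interface configuration
    enabled: yes

- name: verify the interface reached its intended operational state
  include_role:
    name: verify_net_interface
  vars:
    interface_name: GigabitEthernet0/2
    expected_link: connected
    expected_tx_rate: 10gb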

rcarrillocruz commented 6 years ago

Also, failed_when could expose a delay param that would effectively mean 'wait $delay before gathering facts and checking assertions'. Like:

- name: configure interface
  net_interface:
    name: GigabitEthernet0/2
    description: public interface configuration
    enabled: yes
  failed_when:
    stop_on_failure: no
    delay: 30
    assert:
      - stop_on_failure: yes
        that: "this.link != 'connected'"
      - that: this.tx_rate != '10gb'
        warn_msg: "ZOMG, your link is not 10Gb, however you can move on"
      - that: this.rx_rate != '10gb'

privateip commented 6 years ago

@dag This is no longer being considered a viable implementation. We are moving to roles to handle this, which will be decoupled from Ansible core.

Any objections to closing this proposal at this time?

dagwieers commented 6 years ago

@privateip No, makes perfect sense.