ClusterLabs / OCF-spec

http://standards.clusterlabs.org
20 stars 11 forks source link

Proposed changes for OCF RA API 1.1 #21

Closed kgaillot closed 3 years ago

kgaillot commented 5 years ago

(EDITED)

This is intended to be the last batch of changes before releasing the 1.1 version of the OCF Resource Agent API. Some of these are simple clarifications corresponding to existing practice, but several important choices should be discussed:

tomjelinek commented 5 years ago

Re: https://github.com/ClusterLabs/OCF-spec/pull/21/commits/92138a3da45718ca9df218e006438bde5c31c87e ra/next: Document new "deprecated" attribute in API:

There is already a mechanism for deprecating parameters established in fence agents since fence-agents-4.0.11-57.el7 which is quite different. It uses attributes deprecated and obsoletes:

<parameter name="old-param" deprecated="1"/>
<parameter name="new-param" obsoletes="old-param"/>

This is already supported by pcs-0.10 including chained deprecations like this:

<parameter name="old-param" deprecated="1" />
<parameter name="new-param" deprecated="1" obsoletes="old-param" />
<parameter name="newer-param" obsoletes="new-param" />

This has never made it to OCF standard, the same as some other features in fence-agents metadata. I guess that's just the way fence-agents development works / worked.

The question is: can OCF adopt what fence-agents already have and drop changes proposed in https://github.com/ClusterLabs/OCF-spec/pull/21/commits/92138a3da45718ca9df218e006438bde5c31c87e?

krig commented 5 years ago

@tomjelinek There is a long discussion in #15 about the attribute, for context.

tomjelinek commented 5 years ago

What I'm missing here (correct me if this has already been implemented / proposed) is support for listing allowed values of enum parameters. So that instead of listing allowed values in a description like this:

<parameter name="no-quorum-policy" unique="0">
    <shortdesc lang="en">What to do when the cluster does not have quorum</shortdesc>
    <content type="enum" default="stop"/>
    <longdesc lang="en">What to do when the cluster does not have quorum  Allowed values: stop, freeze, ignore, suicide</longdesc>
</parameter>

they could be listed like this:

<parameter name="no-quorum-policy" unique="0">
    <shortdesc lang="en">What to do when the cluster does not have quorum</shortdesc>
    <content type="enum" default="stop">
        <option value="stop" />
        <option value="freeze" />
        <option value="ignore" />
        <option value="suicide" />
    </content>
</parameter>

This is again something fence-agents implemented a long time ago without pushing it to OCF.

tomjelinek commented 5 years ago

@tomjelinek There is a long discussion in #15 about the attribute, for context.

Thanks, I wasn't aware of that discussion. So it seems the fence-agents developer implemented deprecation the way I described and asked us to support it in pcs. And then he proposed and managed to get accepted a different syntax without even bothering of letting us know. And if I'm not mistaken, fence-agents still implement and use the original obsoletes way.

krig commented 5 years ago

@tomjelinek TBH as the crmsh maintainer, my perspective is that I would rather focus on making the specification as good as possible than on what is easier or more convenient to support in the shell. If there is a technical reason to prefer the current deprecation syntax of fence-agents then lets discuss those, but otherwise maybe the better solution would be to move fence-agents to the new OCF syntax instead?

tomjelinek commented 5 years ago

@krig In general, I agree with you. On the other hand, if there is a solution already implemented, a possibility of going with that should be at least considered. If what is implemented is reasonably good, the benefits of changing it should outweigh or at least equal amount of time and work needed for the change. Let's hear what others have to say and then we can all agree on how to proceed.

Also, my point wasn't to say: use what we have implemented in pcs at whatever cost. I wanted to show what the situation is, how or why we get there, and provide additional inputs and aspects to be considered.

That being said, it would be good to know whether crmsh already supports parameters deprecation and if so then which of the two syntaxes.

There is also an alternative of metadata consumers (pcs, crmsh, etc.) deciding what to do based on which OCF version the agent's metadata follow. So for OCF-1.0 agents unique and obsoletes would be processed, while for OCF-1.1 agents group and replaced-with would matter instead. That would, however, require agents to specify in their metadata which version of OCF they follow.

krig commented 5 years ago

@tomjelinek crmsh has some basic support for the fence-agents style deprecation, yeah.

oalbrigt commented 5 years ago

+1 regarding adding enum parameters as proposed above, and consider using already implemented obsoletes/replaced-wth syntax from fence-agents already implemented in pcs and crmsh.

jnpkrn commented 5 years ago

PROPOSAL START

Quick unrelated (and possibly already redundant) proposal that I'd put more effort into proper wording of when the feedback is good:

Beside the verbs appointed in this standard with a clear semantics, there's, since 1.1., a set of specifically excluded verbs marked as reserved so as to provide this standard with contingency regarding future extensions.

Agents compliant with this revision shall hence refrain from overloading them with their private (and very likely conflicting in the future) semantics. In general, it is recommended that any verbs used for the beyond-OCF-scope purposes are (assuming they are lower-cased) prefixed with x-, e.g., x-diagnose.

List of reserved verbs as of the revision 1.1:

  • crash
  • \<bring your own>

Disclaimer: I was once meant to dedicate crash for the correct point of authority driven "chaos testing" of the cluster responses. While this can be synthetised in the core layers, I presume this very leaf based delegation for that purpose could eventually turn handy (when we have a better understanding of what to expect from this kind of survival testing).

PROPOSAL END

kgaillot commented 5 years ago

Re: deprecation syntax, the proposal in #15 (which hasn't been adopted into 1.1 yet) documented in 92138a3 here was intended to address deficiencies in the fence-agents syntax needed by UIs, i.e. to give a textual reason for the deprecation (possibly with translations) and to keep the replacement parameter name in the same place for easier processing. If pcs/crmsh/etc. developers prefer compatibility to those benefits, I'm fine with changing it. (Pacemaker ignores the hints.)

Re: enum parameters, there is already a proposal from #6 merged (though not adopted in 1.1 yet), which is as you describe except using type="select" rather than "enum". This is in the schema and example meta-data but not the markdown.

BTW, one source of confusion is that the standard currently consists of the markdown document plus the RNG schema, and some syntax is in the schema but not described in the markdown. We probably should make the markdown definitive, with the schema provided only for convenience -- but for the time being, maybe we could just mention this situation in the markdown.

There is also an alternative of metadata consumers (pcs, crmsh, etc.) deciding what to do based on which OCF version the agent's metadata follow. So for OCF-1.0 agents unique and obsoletes would be processed, while for OCF-1.1 agents group and replaced-with would matter instead. That would, however, require agents to specify in their metadata which version of OCF they follow.

Yep -- from the beginning, the schema has required agents to specify the OCF version via the version element, and the markdown has said "The minor number can be used by both sides to see whether a certain additional feature is supported by the other party." As an aside, I think we should rename this element ocf-version since it's easily confused with the resource-agent element's version attribute (for the agent version), but that would have to wait for a theoretical 2.0 since it's backward-incompatible.

Re: action name restrictions, I have no problem reserving specific names we think will be needed in the future, but I'd recommend against restrictions on custom action names. One of the explicit goals of the OCF spec is to allow the same script to be used as an OCF agent and an LSB script, which means an agent may have various actions supported by LSB but not OCF (status, force-reload, etc.). Another explicit goal is to allow the use of arbitrary custom actions, which we want to make as easy as possible for users, so if name collisions are a practical concern (which I do not think they are), I would rather reserve a prefix for future OCF actions than impose a restriction on agent developers. It's noteworthy that LSB allows init scripts to define arbitrary additional actions as well ("Other init-script actions may be defined by the init script") but does not reserve any names.

jnpkrn commented 5 years ago

On 22/02/19 16:01 +0000, Ken Gaillot wrote:

Re: action name restrictions, I have no problem reserving specific names we think will be needed in the future, but I'd recommend against restrictions on custom action names. One of the explicit goals of the OCF spec is to allow the same script to be used as an OCF agent and an LSB script, which means an agent may have various actions supported by LSB but not OCF (status, force-reload, etc.). Another explicit goal is to allow the use of arbitrary custom actions, which we want to make as easy as possible for users, so if name collisions are a practical concern (which I do not think they are), I would rather reserve a prefix for future OCF actions than impose a restriction on agent developers.

works as well, indeed, and perhaps better. Partly wanted to raise awareness wrt. future-proofing in general along that specific proposal. And partly I imagined people running the actions manually. Note that initscript != agent so it's quite OK to claim sovereignty on selected parts of the namespace and make others to use a dedicated subnamespace if they want to stay safe going forward for sure. Announcement of what's reserved was meant to indicate what's at risk with the most imminent (order of years, anyway) probability.

It might make sense, though, to use the prefixes with the proposed "profiles", see the other issue here. And perhaps "crash" action in particular would be specific enough to go under "pacemaker" (or even dedicated "chaos") profile.

It's noteworthy that LSB allows init scripts to define arbitrary additional actions as well ("Other init-script actions may be defined by the init script") but does not reserve any names.

-- Poki

tomjelinek commented 5 years ago

Re: deprecation syntax, it's not just about the syntax, the two variants are not equally powerful. The obsoletes variant allows one parameter to be deprecated by several parameters. That is not possible in the replaced-with syntax. I actually consider this to be a bug (or at least a feature which was not aimed for) of the obsoletes way. Also the replace-with syntax is cleaner as it keeps all info at one place and provides means for specifying description.

Since a new parser for OCF-1.1 has to be implemented anyway for all the other OCF-1.1 features, I'm fine with the replace-with way being accepted for OCF-1.1. For OCF-1.0 agents pcs will keep working with the unstandardized obsoletes way.

Re: enum parameters, the proposal from https://github.com/ClusterLabs/OCF-spec/pull/6 works for me. It is already implemented in fence-agents and I don't see a reason for changing it. That enum was my mistake, I took it from pacemaker's metadata.

That would, however, require agents to specify in their metadata which version of OCF they follow.

Yep -- from the beginning, the schema has required agents to specify the OCF version via the version element

Fence-agents do not seem to do that, hence my note. Good to know.

krig commented 5 years ago

The obsoletes variant allows one parameter to be deprecated by several parameters. That is not possible in the replaced-with syntax.

Hrm, yes, that looks like a bug.. It would make sense to enable both several parameters to be deprecated by one parameter and the opposite. If replace-with is a contained tag, multiple such tags could be allowed.. like

<deprecated>
<replaced-with id="foo"/>
<replaced-with id="bar"/>
</deprecated>
kgaillot commented 5 years ago

The obsoletes variant allows one parameter to be deprecated by several parameters. That is not possible in the replaced-with syntax.

Hrm, yes, that looks like a bug.. It would make sense to enable both several parameters to be deprecated by one parameter and the opposite. If replace-with is a contained tag, multiple such tags could be allowed..

That actually came up in the discussion for #15 -- see the comments starting May 1. The submitter decided on the current approach with such situations described in the desc. Multiple replaced-with elements makes sense to me though, as long as that isn't more trouble for GUIs to use. As suggested in the other PR, I'd use name instead of id, otherwise it looks like the id of the replaced-with element itself.

Note that the fence agents approach ("obsoletes=" in the new parameter) doesn't allow for the reverse situation, i.e. a single new parameter obsoleting multiple old ones.

jnpkrn commented 5 years ago

On 25/02/19 15:33 -0800, Ken Gaillot wrote:

Note that the fence agents approach ("obsoletes=" in the new parameter) doesn't allow for the reverse situation, i.e. a single new parameter obsoleting multiple old ones.

FWIW, this is least of a problem, since the trend is to either have data as well-structured as possible (which this discussed sub-point of the changes is a proof of), or to have notations formed to concatenate a set of related parameters in an application-specific way (JDBC connection strings, etc.). But since the latter is typically fully equivalent to the structured data approach, these two can live together as mutual alternatives, without a need to deprecate anything since both are justifiable as a standardized configuration interface.

Now, to express this alternation and other similar rather complex relationships, I've suggested to use the power of RelaxNG we already have good familiarity with together with appropriate transformers between RNG-verifiable XML document and more appropriate consumption/actionable formats, such as plain "meta-data" output: https://lists.clusterlabs.org/pipermail/users/2016-March/009785.html

-- Jan (Poki)

krig commented 5 years ago

@jnpkrn A somewhat related situation which comes up here and there (for example in the apache agent) is where you have two sets of conflicting parameters (for example, either configure ip AND address OR configure url, but never ip and url).

This is something that I would like to be able to express, but I'm not sure what the syntax for that might look like. But from the UI perspective, being able to reject invalid combinations with a clear error message would be valuable.

lge commented 5 years ago

from the UI perspective, being able to reject invalid combinations with a clear error message would be valuable.

Isn't that what the validate-all action was intended for? Validate parameter settings, and accept or reject (with a reason)?

Less complex situations can maybe described in meta data and resolved by UI tools without asking the agent. But there will always be that weird case where you'd need a "programming language" to describe it, and we already have that: rather just ask the agent...

ondrejmular commented 5 years ago

Re: c6988da. I would suggest to change new attribute group to something like unique-group as it might be more descriptive in what it really means.

tomjelinek commented 5 years ago

from the UI perspective, being able to reject invalid combinations with a clear error message would be valuable.

Isn't that what the validate-all action was intended for? Validate parameter settings, and accept or reject (with a reason)?

Agreed. And agents should also report usage of deprecated parameters, i.e. "Warning: Parameter 'foo' is deprecated, use 'bar' instead" or "Error: Cannot set both 'foo' and 'bar', because 'foo' deprecates 'bar' so they mean the same thing". I mean, with simple deprecation syntax (one new parameter deprecates one old parameter) UI could easily do these validations by itself. But if one parameter can be deprecated by several others or several parameters can deprecate one parameter, then such relations could get quite complex and would have to be expressed in meta-data. Especially for parameters with required=1.

krig commented 5 years ago

@tomjelinek @lge Yes, but in practice, validate-all is no good. It requires having everything needed to actually start/stop the resource in place (like other dependent resources), and it requires root to run. So to do validation/usage help in a UI where you want to be able to do it from a client application, that's not very usable.

kgaillot commented 3 years ago

@tomjelinek @ondrejmular @idevat @oalbrigt @wenningerk @lge @gao-yan @digimer,

This PR has been massively updated. I believe it is ready to be merged, and then I'll bring it to the mailing lists to discuss adoption as the OCF 1.1 standard. Comments welcome.

tomjelinek commented 3 years ago

Looks good to me.

lge commented 3 years ago

@l-mb may remember why OCF_CHECK_LEVEL / depth, and may have an opinion on how to go forward?

kgaillot commented 3 years ago

Before merging this, I realized one potential issue: currently resource meta-data for monitors can specify role=Master or role=Slave to recommend values for particular roles, for example:

<action name="monitor" timeout="20s" interval="10s" role="Master" />

If we simply replace "Master" with "promoted", then there could be some confusion for a while -- some installed agents might use the 1.1 roles, some the old roles (all of resource-agents would be updated at once, but users have custom agents too), and systems might have old versions of software that uses the meta-data (I know pcs does, maybe crmsh too, not sure about anything else) and only looks for the old names.

A couple of possibilities:

I don't have a strong preference. Other ideas welcome, too. I don't believe there are any other similar issues in the proposal.

@oalbrigt @tomjelinek @gao-yan @ondrejmular @idevat @zzhou1

tomjelinek commented 3 years ago

If I'm not mistaken, the action elements are not used directly by pacemaker or agents themselves. Pcs considers them to be default operations configuration and puts them into CIB when creating a new resource based on a particular agent, if pcs users don't say otherwise. Since pcs must be able to figure out if the underlying pacemaker supports Master/Slave or promoted/demoted roles and transform one into the other (e.g. pacemaker supports promoted/demoted but an agent uses Master/Slave) anyway, I don't see much of a problem here. Other tools may be in the same position. If we are about to change the role names, then I think the new OCF version is the best opportunity to do so and I would include such a change in the 1.1 version. Old pcs won't be able to parse the 1.1 version correctly anyway due to other changes implemented in it.

Perhaps I'm missing the point of what the problem is. If that is the case, let me know.

kgaillot commented 3 years ago

If I'm not mistaken, the action elements are not used directly by pacemaker or agents themselves. Pcs considers them to be default operations configuration and puts them into CIB when creating a new resource based on a particular agent, if pcs users don't say otherwise. Since pcs must be able to figure out if the underlying pacemaker supports Master/Slave or promoted/demoted roles and transform one into the other (e.g. pacemaker supports promoted/demoted but an agent uses Master/Slave) anyway, I don't see much of a problem here. Other tools may be in the same position. If we are about to change the role names, then I think the new OCF version is the best opportunity to do so and I would include such a change in the 1.1 version. Old pcs won't be able to parse the 1.1 version correctly anyway due to other changes implemented in it.

Perhaps I'm missing the point of what the problem is. If that is the case, let me know.

Right, as long as pcs understands both terms there's no problem. However users can get resource agents from multiple sources or write their own, and a resource agent could get updated to the new terms before pcs has been upgraded to understand them. How would current pcs handle role=promoted in meta-data?

I'm also unsure whether anything besides pcs uses it.

kgaillot commented 3 years ago

One more bit of feedback wanted: everyone likes "promoted", but "unpromoted" is kind of blah. I like it better when I see it in actual use -- here are a few alternatives for comparison:

  * Clone Set: myclone [myrsc] (promotable):
    * Masters: [ node1 ]
    * Slaves: [ node2 ]

  * Clone Set: myclone [myrsc] (promotable):
    * Promoted: [ node1 ]
    * Unpromoted: [ node2 ]

  * Clone Set: myclone [myrsc] (promotable):
    * Promoted: [ node1 ]
    * Nonpromoted: [ node2 ]

  * Clone Set: myclone [myrsc] (promotable):
    * Promoted: [ node1 ]
    * Base: [ node2 ]

  * Clone Set: myclone [myrsc] (promotable):
    * Promoted: [ node1 ]
    * Basic: [ node2 ]

  * Clone Set: myclone [myrsc] (promotable):
    * Promoted: [ node1 ]
    * Standard: [ node2 ]

pcs resource update meta target-role="Slave"
pcs resource update meta target-role="unpromoted"
pcs resource update meta target-role="nonpromoted"
pcs resource update meta target-role="base"
pcs resource update meta target-role="basic"
pcs resource update meta target-role="standard"

Any final thoughts before we go with "unpromoted"?

kgaillot commented 3 years ago

I kind of like "standard" too, and with the same number of letters as "promoted" it's easier to align output ...

tomjelinek commented 3 years ago

Any final thoughts before we go with "unpromoted"?

It was my impression that the term demoted had been already established. Have I missed anything? Anyway, pcs does not use demoted anywhere except documentation, code comments and a few internal functions, so it's trivial to change it to anything else.

It seems to me there may be an issue with standard, basic and base: unless one knows the context (standard as the opposite of promoted), it's not obvious what the term means. It may be also confusing when talking about promotable vs. clone or primitive resources. I can imagine someone using the term standard for primitive or clone resources to make them distinct from promotable resources. Therefore, I would rather go with a term which is clear and self-explanatory even when there's not much context, such as unpromoted, nonpromoted or demoted.

tomjelinek commented 3 years ago

Right, as long as pcs understands both terms there's no problem. However users can get resource agents from multiple sources or write their own, and a resource agent could get updated to the new terms before pcs has been upgraded to understand them. How would current pcs handle role=promoted in meta-data?

Pcs doesn't really care about rolein action in agents' meta-data. Whatever is there, pcs puts it into CIB. So if users are running agents and pacemaker aware of the new role names, pcs version should not matter. Except that such an old pcs may not be able to parse OCF 1.1 meta-data correctly anyway.

kgaillot commented 3 years ago

Right, as long as pcs understands both terms there's no problem. However users can get resource agents from multiple sources or write their own, and a resource agent could get updated to the new terms before pcs has been upgraded to understand them. How would current pcs handle role=promoted in meta-data?

Pcs doesn't really care about rolein action in agents' meta-data. Whatever is there, pcs puts it into CIB. So if users are running agents and pacemaker aware of the new role names, pcs version should not matter.

That's great, that avoids the issues I was concerned about.

Separately, I just realized pacemaker's current schema restricts what can be in the role field. :-/ Rolling upgrades of any cluster component should cause no problems for existing cluster resources, but adding a new resource using an upgraded agent and an older pacemaker would be rejected as nonconforming, so pacemaker should be updated before (or at the same time as) any resource agents are updated. Hopefully that's acceptable ... we can set "Conflicts: pacemaker < 2.1.0" for agent packages that we control.

Except that such an old pcs may not be able to parse OCF 1.1 meta-data correctly anyway.

That shouldn't be a problem; 1.1 is supposed to be fully backward compatible. As long as pcs ignores unrecognized meta-data attributes, it should be fine.

kgaillot commented 3 years ago

Any final thoughts before we go with "unpromoted"?

It was my impression that the term demoted had been already established. Have I missed anything? Anyway, pcs does not

We had decided against "demoted" to avoid implying that the resource had the "demote" action run on it, or had ever been previously promoted. A resource will be in the base role even if it's only been started and has never been promoted or demoted.

But that's not a deal-breaker. If "demoted" makes more sense, we could still use it.

* Clone Set: myclone [myrsc] (promotable):
    * Promoted: [ node1 ]
    * Demoted: [ node2 ]

use demoted anywhere except documentation, code comments and a few internal functions, so it's trivial to change it to anything else.

It seems to me there may be an issue with standard, basic and base: unless one knows the context (standard as the opposite of promoted), it's not obvious what the term means. It may be also confusing when talking about promotable vs. clone or primitive resources. I can imagine someone using the term standard for primitive or clone resources to make them distinct from promotable resources. Therefore, I would rather go with a term which is clear and self-explanatory even when there's not much context, such as unpromoted, nonpromoted or demoted.

Good point. I guess we're still at "unpromoted" ...

tomjelinek commented 3 years ago

We had decided against "demoted" to avoid implying that the resource had the "demote" action run on it, or had ever been previously promoted. A resource will be in the base role even if it's only been started and has never been promoted or demoted.

Ok, that makes perfect sense. I agree with your reasoning.

I guess we're still at "unpromoted" ...

It seems so.

Hm, one more wild idea. How about calling the role started? If I remember correctly, starting a promotable resource is a two-phase process: first, the resource is started (in the unpromoted mode) and then it is promoted on some nodes. So, do we really need a special name for the 'started - unpromoted' state?

kgaillot commented 3 years ago

Hm, one more wild idea. How about calling the role started? If I remember correctly, starting a promotable resource is a two-phase process: first, the resource is started (in the unpromoted mode) and then it is promoted on some nodes. So, do we really need a special name for the 'started - unpromoted' state?

I thought the same thing initially :)

The problem is target-role=Started in pacemaker. That's the default, and it allows a promotable resource to be in either role. By contrast, target-role=Slave prevents the resource from being promoted (e.g. if someone wants to temporarily make a database read-only).

tomjelinek commented 3 years ago

The problem is target-role=Started in pacemaker. That's the default, and it allows a promotable resource to be in either role. By contrast, target-role=Slave prevents the resource from being promoted.

I was sure that there would be a catch like this. I just couldn't figure it out at the moment. :)

kgaillot commented 3 years ago

Based on the feedback so far, I'm going to merge this. As a reminder, it remains a proposal (in the "ra/next/" directory). The next step will be a new pull request to pull these recommendations into the ra/1.1/ directory and remove the "DRAFT" designation, which I'll do shortly. I'll bring it up on the users@clusterlabs.org list for wider discussion at that point.

gao-yan commented 3 years ago

Thanks for bringing this up, @kgaillot.

Before merging this, I realized one potential issue: currently resource meta-data for monitors can specify role=Master or role=Slave to recommend values for particular roles, for example:

<action name="monitor" timeout="20s" interval="10s" role="Master" />

If we simply replace "Master" with "promoted", then there could be some confusion for a while -- some installed agents might use the 1.1 roles, some the old roles (all of resource-agents would be updated at once, but users have custom agents too), and systems might have old versions of software that uses the meta-data (I know pcs does, maybe crmsh too, not sure about anything else) and only looks for the old names.

crmsh does parse this part of meta data also. On tab completions, it lists the monitors with different roles and advised options for users to pick and use on configuring.

A couple of possibilities:

  • We go ahead anyway, and users (or packagers) are responsible for ensuring that all their cluster software is at a version that understands the 1.1 standard names.
  • Or, leave role as deprecated meta-data and add a new one like "only-role" or something like that, so agents could specify both (eventually dropping role), like:

I don't have a strong preference. Other ideas welcome, too. I don't believe there are any other similar issues in the proposal.

It shouldn't be difficult to make crmsh understand both terminologies... And it should be fine for packagers to make sure the components get updated with the support enabled in the right order or with appropriate dependencies.

@liangxin1300 @zzhou1 , do you have any concerns?