ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

exportfs/pgsql: validate-all fixes #1834

Closed by oalbrigt 1 year ago

HideoYamauchi commented 1 year ago

Hi Oyvind

We don't see the need for this change in pgsql. What was the problem and why was this fix added?

Best Regards, Hideo Yamauchi.

oalbrigt commented 1 year ago

It is needed for the new pcs feature that runs validate-all on pcs resource create, when it doesn't have these settings set: https://github.com/ClusterLabs/resource-agents/pull/1826

HideoYamauchi commented 1 year ago

Hi Oyvind,

Thanks for your comment.

> It is needed for the new pcs feature that runs validate-all on pcs resource create, when it doesn't have these settings set: #1826

I would like to know a few more details. What behavior of the new pcs causes problems with the current pgsql? Also, OCF_CHECK_LEVEL is a parameter for changing the check depth of actions such as monitor, but I think the way it is used in this change differs from the original specification. Will the specification change in the future?

Best Regards, Hideo Yamauchi.

oalbrigt commented 1 year ago

You can find more info here: https://bugzilla.redhat.com/show_bug.cgi?id=1816852 https://github.com/ClusterLabs/OCF-spec/blob/main/ra/1.1/resource-agent-api.md#check-levels

HideoYamauchi commented 1 year ago

Hi Oyvind,

> You can find more info here: https://bugzilla.redhat.com/show_bug.cgi?id=1816852 https://github.com/ClusterLabs/OCF-spec/blob/main/ra/1.1/resource-agent-api.md#check-levels

Thanks!

I will check the contents. We will also contact you if we have any questions.

Best Regards, Hideo Yamauchi.

HideoYamauchi commented 1 year ago

Hi Oyvind,

I understand that pcs will run validate because data may be corrupted when pcs creates or updates some resources.

I didn't understand from Bugzilla...

Best Regards, Hideo Yamauchi.

tomjelinek commented 1 year ago

Hi @HideoYamauchi

> Is it correct to understand that validate is not executed when outputting a file with pcs -f xxxxxx?

No, that is not correct. Pcs commands pcs (resource | stonith) (create | update) will execute the validation even if -f is specified. That is one of the reasons validation in agents needs to be fixed - when OCF_CHECK_LEVEL is set to 0 or not set, agents cannot assume they run in a cluster and / or on the node the resource will run eventually.
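The distinction between check levels can be sketched in shell roughly as follows. This is a minimal illustration, not the actual pgsql agent code: the function names `validate_params`, `validate_node_state`, and `validate_all` are hypothetical, and the specific checks are assumptions chosen only to show the gating on `OCF_CHECK_LEVEL`.

```shell
#!/bin/sh
# Hypothetical sketch of gating validate-all on OCF_CHECK_LEVEL.
# Return codes follow the OCF resource agent API.
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
OCF_ERR_CONFIGURED=6

# Level 0 or unset: look only at the parameter values themselves.
validate_params() {
    # pgport must be a non-empty string of digits.
    case "${OCF_RESKEY_pgport:-}" in
        ''|*[!0-9]*) return $OCF_ERR_CONFIGURED ;;
    esac
    return $OCF_SUCCESS
}

# Level 10: checks that only make sense on the node that will run the resource.
validate_node_state() {
    [ -d "${OCF_RESKEY_pgdata:-}" ] || return $OCF_ERR_INSTALLED
    return $OCF_SUCCESS
}

validate_all() {
    validate_params || return $?
    # Touch the node's environment only when level 10 is requested.
    if [ "${OCF_CHECK_LEVEL:-0}" -eq 10 ]; then
        validate_node_state || return $?
    fi
    return $OCF_SUCCESS
}
```

With this shape, a call made by pcs with `-f` (level 0 or unset) never looks at the data directory, so it does not matter whether the node can see it.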

> Also, we will pcs cluster cib-push this file output, but is it correct to understand that validate is not executed during cib-push?

That is correct, pcs cluster cib-push doesn't run the validation.

> Is there a fixed version of pcs-0.10.14-6.el8.x86_64 for RHEL8 listed in Bugzilla somewhere available? If there is, we also want to check the operation in advance.

I'm not sure if that specific package is available. You can either build pcs from upstream or use CI repository.

Please note that there was another change in pcs: 2159455 - Do not call agents 'validate-all' action in 'pcs (resource|stonith) (create|update)' commands by default

Regards, Tomas

HideoYamauchi commented 1 year ago

Hi Tomas, Hi Oyvind,

> Hi @HideoYamauchi
>
> > Is it correct to understand that validate is not executed when outputting a file with pcs -f xxxxxx?
>
> No, that is not correct. Pcs commands pcs (resource | stonith) (create | update) will execute the validation even if -f is specified. That is one of the reasons validation in agents needs to be fixed - when OCF_CHECK_LEVEL is set to 0 or not set, agents cannot assume they run in a cluster and / or on the node the resource will run eventually.

I understand that validate-all is executed even with -f. But if so, I still think pgsql needs a fix.

There is no problem when pgsql uses a local pgdata area on each node, but if you place the pgdata area on shared storage and create the resource with -f, the shared storage is not mounted, so the pgsql RA's validate-all always fails on the postgresql.conf check. (I also get an error from "cat $OCF_RESKEY_pgdata/PG_VERSION" etc.)

This is very problematic when using pgsql RA with -f.

> > Also, we will pcs cluster cib-push this file output, but is it correct to understand that validate is not executed during cib-push?
>
> That is correct, pcs cluster cib-push doesn't run the validation.

Okay!

> > Is there a fixed version of pcs-0.10.14-6.el8.x86_64 for RHEL8 listed in Bugzilla somewhere available? If there is, we also want to check the operation in advance.
>
> I'm not sure if that specific package is available. You can either build pcs from upstream or use CI repository.

Okay! If necessary, I will create a package and check it.

> Please note that there was another change in pcs: 2159455 - Do not call agents 'validate-all' action in 'pcs (resource|stonith) (create|update)' commands by default

I understand that there are still RAs that do not support validate-all calls such as apache.

Best Regards, Hideo Yamauchi.

oalbrigt commented 1 year ago

I've pushed a commit to solve the issue for pgsql on shared storage.

The apache agent doesn't implement the action yet, but simply returns success, so that just leaves room for improvement for that agent.

HideoYamauchi commented 1 year ago

Hi Oyvind,

> I've pushed a commit to solve the issue for pgsql on shared storage.
>
> The apache agent doesn't implement the action yet, but simply returns success, so that just leaves room for improvement for that agent.

Thanks for the fix.

I will check and let you know the result. Please give me some time.

Best Regards, Hideo Yamauchi.

HideoYamauchi commented 1 year ago

Hi Oyvind,

In a shared disk configuration, /dbfp/pgdata/data can only be seen after being mounted by Filesystem RA, so the following will result in an error.

    version=`cat $OCF_RESKEY_pgdata/PG_VERSION`

Also, when an error occurs, this $version seems to be empty, but since the version is unknown, I think it is better not to check $version in validate-all.
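One way to avoid the failing `cat` is to read the version only when the file is actually readable. The sketch below is an assumption about how that could look; `read_pg_version` is a hypothetical helper name, not a function from the pgsql agent.

```shell
# Hypothetical helper: print the contents of PG_VERSION if it is readable.
# Prints nothing (and does not fail) when the data directory is not visible,
# for example because the shared storage is not mounted on this node.
read_pg_version() {
    if [ -r "$1/PG_VERSION" ]; then
        cat "$1/PG_VERSION"
    fi
}
```

A caller would then treat an empty result as "version unknown" and skip any version-dependent check rather than report an error.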

Likewise, if /dbfp/pgdata/data is only visible after being mounted by the Filesystem RA, the test -w check below will also fail during validate-all, so this check must be avoided as well.

            if ! runasowner "test -w $OCF_RESKEY_pgdata"; then
                ocf_exit_reason "Directory $OCF_RESKEY_pgdata is not writable by $OCF_RESKEY_pgdba"
                return $OCF_ERR_PERM;
            fi

In a PGREX (streaming replication) configuration, the local /dbfp area is used, but if /dbfp is itself a mounted device, the new pcs -f runs validate-all before the device is mounted. The same error occurs as above because /dbfp/pgdata/data is not visible.

Considering the above, I think validate-all can only check that the execution user exists (getent passwd) and that the $OCF_RESKEY parameters are set correctly.
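A validation limited to those two kinds of checks might look like the sketch below. It is illustrative only: `validate_static` is a hypothetical name, the default user `postgres` is an assumption, and the accepted `rep_mode` values (none, async, sync, slave) are assumed rather than taken from this thread.

```shell
# Hypothetical sketch: checks that do not depend on the data directory.
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
OCF_ERR_CONFIGURED=6

validate_static() {
    # The OS user that runs PostgreSQL must exist.
    if ! getent passwd "${OCF_RESKEY_pgdba:-postgres}" >/dev/null 2>&1; then
        return $OCF_ERR_INSTALLED
    fi
    # rep_mode, when set, must be one of the assumed known values.
    case "${OCF_RESKEY_rep_mode:-none}" in
        none|async|sync|slave) ;;
        *) return $OCF_ERR_CONFIGURED ;;
    esac
    return $OCF_SUCCESS
}
```

Neither check reads anything under `$OCF_RESKEY_pgdata`, so the result is the same whether or not the Filesystem RA has mounted the storage.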

Best Regards, Hideo Yamauchi.

oalbrigt commented 1 year ago

I ended up moving all these checks to their own function, and it should now work for all shared storage cases too.

HideoYamauchi commented 1 year ago

Hi Oyvind,

> I ended up moving all these checks to their own function, and it should now work for all shared storage cases too.

Thanks for the fix.

After reviewing the changes, I will simulate the new pcs command by running crm_resource --validate to check the behavior. I will contact you again when the check is complete.

Many thanks, Hideo Yamauchi.

HideoYamauchi commented 1 year ago

Hi Oyvind,

I confirmed your fix.

I ran crm_resource --validate with both the shared disk and the PGREX parameters, and no error occurred.

[root@rh87-01 ~]# /usr/sbin/crm_resource --validate --output-as xml --class ocf --agent pgsql --provider heartbeat --option pgctl="/usr/pgsql-14/bin/pg_ctl" psql="/usr/pgsql-14/bin/psql" pgdata="/dbfp/pgdata/data" pgdba="postgres" pgport="5432" pgdb="template1"
<pacemaker-result api-version="2.25" request="/usr/sbin/crm_resource --validate --output-as xml --class ocf --agent pgsql --provider heartbeat --option pgctl=/usr/pgsql-14/bin/pg_ctl psql=/usr/pgsql-14/bin/psql pgdata=/dbfp/pgdata/data pgdba=postgres pgport=5432 pgdb=template1">
  <resource-agent-action action="validate" class="ocf" type="pgsql" provider="heartbeat">
    <overrides>
      <override name="pgdba" value="postgres"/>
      <override name="pgport" value="5432"/>
      <override name="pgdb" value="template1"/>
      <override name="psql" value="/usr/pgsql-14/bin/psql"/>
      <override name="pgdata" value="/dbfp/pgdata/data"/>
    </overrides>
    <agent-status code="0" message="ok" execution_code="0" execution_message="complete"/>
    <command code="0">
      <output source="stderr">error: Could not connect to controller: Connection refused
</output>
    </command>
  </resource-agent-action>
  <status code="0" message="OK"/>
</pacemaker-result>

[root@rh87-01 ~]# /usr/sbin/crm_resource --validate --output-as xml --class ocf --agent pgsql --provider heartbeat --option pgctl="/usr/pgsql-14/bin/pg_ctl" psql="/usr/pgsql-14/bin/psql" pgdata="/dbfp/pgdata/data" pgdba="postgres" pgport="5432" pgdb="template1" rep_mode="sync" node_list="hpdb0101 hpdb0201" master_ip="192.168.108.35" restore_command="/bin/cp /dbfp/pgarch/arc1/%f %p" repuser="repuser" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" stop_escalate="0" xlog_check_count="0"
<pacemaker-result api-version="2.25" request="/usr/sbin/crm_resource --validate --output-as xml --class ocf --agent pgsql --provider heartbeat --option pgctl=/usr/pgsql-14/bin/pg_ctl psql=/usr/pgsql-14/bin/psql pgdata=/dbfp/pgdata/data pgdba=postgres pgport=5432 pgdb=template1 rep_mode=sync node_list=hpdb0101 hpdb0201 master_ip=192.168.108.35 restore_command=/bin/cp /dbfp/pgarch/arc1/%f %p repuser=repuser primary_conninfo_opt=keepalives_idle=60 keepalives_interval=5 keepalives_count=5 stop_escalate=0 xlog_check_count=0">
  <resource-agent-action action="validate" class="ocf" type="pgsql" provider="heartbeat">
    <overrides>
      <override name="pgdba" value="postgres"/>
      <override name="pgdb" value="template1"/>
      <override name="pgdata" value="/dbfp/pgdata/data"/>
      <override name="primary_conninfo_opt" value="keepalives_idle=60"/>
      <override name="master_ip" value="192.168.108.35"/>
      <override name="rep_mode" value="sync"/>
      <override name="psql" value="/usr/pgsql-14/bin/psql"/>
      <override name="repuser" value="repuser"/>
      <override name="xlog_check_count" value="0"/>
      <override name="stop_escalate" value="0"/>
      <override name="node_list" value="hpdb0101"/>
      <override name="pgport" value="5432"/>
      <override name="restore_command" value="/bin/cp"/>
    </overrides>
    <agent-status code="0" message="ok" execution_code="0" execution_message="complete"/>
    <command code="0">
      <output source="stderr">error: Could not connect to controller: Connection refused
</output>
    </command>
  </resource-agent-action>
  <status code="0" message="OK"/>
</pacemaker-result>
[root@rh87-01 ~]# 

I think there is no problem.
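For repeated checks, the agent's result can be pulled out of the XML output instead of being read by eye. The following is a rough sketch using `sed`; `agent_status_code` is a hypothetical helper, and it assumes the `<agent-status>` element appears on a single line with `code` as its first attribute, as in the output above.

```shell
# Hypothetical helper: read crm_resource XML on stdin and print the
# value of the code attribute of the <agent-status> element.
agent_status_code() {
    sed -n 's/.*<agent-status code="\([0-9][0-9]*\)".*/\1/p'
}
```

It would be used by piping the validate output into it, for example `/usr/sbin/crm_resource --validate --output-as xml ... | agent_status_code`, and comparing the printed value with 0.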

Best Regards, Hideo Yamauchi.

oalbrigt commented 1 year ago

Great. Thank you for helping iron out the initial issues.