YanChii / ansible-role-postgres-ha

Create postgresql HA auto-failover cluster using pcs, pacemaker and PAF
Apache License 2.0
33 stars, 22 forks

Error: You must set meta parameter notify=true for your master resource #22

Closed benmccown closed 5 years ago

benmccown commented 5 years ago

I am testing further with Postgres 10 (using `postgres_ha_paf_geo_patch: true`). The playbook runs fine until it gets to the "check if all slaves are connected" step, which fails all of its retries.

Checking the servers, it appears the pcs resources failed to start. Running `pcs resource debug-start postgres`, I get the following output:

[root@postgres-test-01 ~]# pcs resource debug-start postgres
Operation start for postgres:0 (ocf:heartbeat:pgsqlms) returned: 'not installed' (5)
 >  stderr: ocf-exit-reason:You must set meta parameter notify=true for your master resource

[root@postgres-test-01 ~]# pcs resource show
 pg-vip (ocf::heartbeat:IPaddr2):       Stopped
 Master/Slave Set: postgres-ha [postgres]
     Stopped: [ postgres-test-01.example.net postgres-test-02.example.net postgres-test-03.example.net ]

So it appears that the "create master DB resource" step is having some kind of problem getting the notify=true parameter to stick.

I ran the following manually, which seemed to fix it: `pcs resource update postgres-ha notify=true`

I am wondering if there is some kind of problem with the pcs_resource module. Here's the output of that part of the playbook (below): no errors resulted, just a "changed" status, but something is being lost in translation.

{
    "_ansible_parsed": true,
    "invocation": {
        "module_args": {
            "operations": null,
            "group": null,
            "name": "postgres-ha",
            "resource_id": "postgres-ha",
            "disabled": true,
            "ms_name": "postgres",
            "command": "master",
            "type": null,
            "options": "master-node-max=\"1\" master-max=\"1\" clone-max=\"3\" notify=\"True\" clone-node-max=\"1\""
        }
    },
    "changed": true,
    "_ansible_no_log": false,
    "msg": "Running cmd: pcs resource master postgres-ha postgres master-node-max=\"1\" master-max=\"1\" clone-max=\"3\" notify=\"True\" clone-node-max=\"1\" --disabled"
}

Any ideas I could try?

Thanks for your help.

Regards, Ben

YanChii commented 5 years ago

Hi @MooseBeanDev

It looks to me as if the PAF patch for checking the notify parameter was not applied. This is the corresponding check: https://github.com/YanChii/ansible-role-postgres-ha/blob/master/files/pgsqlms-2.2.0-fix-pg10#L1274

Can you please check if the line is the same in your PAF installation? The file is /usr/lib/ocf/resource.d/heartbeat/pgsqlms.

Also, what is the output of this command: `crm_resource --resource postgres-ha --meta --get-parameter notify`? It is the same check as mentioned in ClusterLabs/PAF#141.
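Both checks together, as a sketch (assuming the resource name `postgres-ha` and the agent path from your output above):

```shell
# Inspect the installed PAF agent for the notify check
# (the patched file should contain the check linked above).
grep -n 'notify' /usr/lib/ocf/resource.d/heartbeat/pgsqlms

# Query the meta attribute that PAF tests for; once set
# correctly, it is expected to print "true".
crm_resource --resource postgres-ha --meta --get-parameter notify
```

Both commands only read state, so they are safe to run on a live cluster.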

Thank you.

Jan

benmccown commented 5 years ago

Hi Jan,

Thanks for your response.

The "apply geo-HA patches to DB failover agent" section didn't appear to be correctly recognizing the way I was setting the `postgres_ha_paf_geo_patch` variable (via an Ansible Tower survey). I changed the conditionals to the ones below, and it seems to work for me.

- name: apply PAF v2.2.0 fix for newest pacemaker
  copy:
    src: 'pgsqlms-2.2.0-fix-pg10'
    dest: '/usr/lib/ocf/resource.d/heartbeat/pgsqlms'
  args:
    owner: root
    group: root
    mode: '0555'
  when: postgres_ha_paf_version == '2.2.0' and
        not postgres_ha_paf_geo_patch

- name: apply geo-HA patches to DB failover agent
  copy:
    src: 'pgsqlms-{{ postgres_ha_paf_version }}-geo-patched'
    dest: '/usr/lib/ocf/resource.d/heartbeat/pgsqlms'
  args:
    owner: root
    group: root
    mode: '0555'
  when: postgres_ha_paf_geo_patch

Now that your patch is getting applied properly for the notify parameter, I am almost to a healthy state, but not quite there.

My PCS resources are still not healthy after running the playbook. If I reboot the master or one of my nodes, it comes up healthy with the pcs resources running. A `pcs resource debug-start` will also fire postgres right up. Below are the "Failed Actions" from a `pcs status` command before I rebooted one of the problem nodes.

Failed Actions:
* postgres_monitor_15000 on postgres-test-01.example.net 'unknown error' (1): call=102, status=Timed Out, exitreason='',                                 
    last-rc-change='Mon Jan 14 21:35:29 2019', queued=0ms, exec=10001ms

This is consistent behavior for me; I'm able to reproduce it every time I run the playbook.

I am unable to continue troubleshooting today. I will dig into more logs tomorrow and see if I can get more verbose information than the above.

Thank you, Ben

YanChii commented 5 years ago

Hi @MooseBeanDev

thank you for reporting. I've just fixed the variable checking as you suggested. Sorry for the close/open; I mentioned "fixes" in the commit message, so the automation did its work :).

Regarding your current issue, I'm not sure what the cause is. I've tested it now and the cluster assembled itself perfectly. To better understand your issue: does the role finish without error? If so, the DB slaves are correctly connected to the master and you should see the resources in good health; otherwise that check would not pass. If it does not finish cleanly, where does the error occur?

Maybe try a completely fresh run, with a cleanup before running the role as described here.

Thank you for more info.

Jan

benmccown commented 5 years ago

Hi @YanChii

Apologies for more delays. Quite a few other tasks have required my attention.

To summarize the issue: the playbook would indeed fail on the "check if all slaves are connected" task. That was due to the pcs resources not starting up properly even though they were enabled. The resources would start correctly with `pcs resource debug-start postgres` and would also come up correctly after rebooting my cluster VMs, leaving me a little clueless as to the root cause.

In my troubleshooting I stumbled across `pcs resource refresh`, which seems to resolve my issue. Unfortunately I cannot shed more light on why it behaves this way. But since there was already a "refresh pcs resource" task in the finalize.yml file, I have added the pcs command there.
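For anyone hitting the same symptom, the manual equivalent is roughly this (a sketch, assuming the master resource name `postgres-ha` from my earlier output):

```shell
# Re-probe the live state of the resource on all nodes and clear
# stale failure records, so the cluster notices the resource can
# actually run.
pcs resource refresh postgres-ha

# Confirm the Master/Slave set is running again.
pcs status
```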

Pull request: https://github.com/YanChii/ansible-role-postgres-ha/pull/23

Thanks for all your help with this.

Regards, Ben

YanChii commented 5 years ago

Hi Ben,

Thank you for your analysis! Looking into the pcs man page, the refresh command does exactly the same thing as cleanup plus re-detecting the state of the resource. Seems like this is the command we needed here.

Can you please replace the cleanup command with refresh in your PR? It's not necessary to stress the cluster by resetting the resource state twice.

Thank you again.

Jan

benmccown commented 5 years ago

Hi Jan,

Thanks for looking into that and filling me in. I should have dug into the man pages as well. I have done as requested and replaced pcs cleanup with pcs refresh in my PR.

Appreciate your help again.

Regards, Ben

YanChii commented 5 years ago

Hi Ben,

You need to specify `postgres_ha_cluster_pg_HA_res_name` after refresh; otherwise it will reset only the fencing devices.

And I've just realized that order might matter as well: manage should probably come first, then cleanup, with refresh as the last, ultimate one.

Can you please update that?

Thank you very much for your help.

Jan

benmccown commented 5 years ago

Hi Jan,

Understood. I have added one more commit.

Thank you, Ben

YanChii commented 5 years ago

Perfect! Merged to master, auto closed. Thank you & enjoy the fruit of your work :). Jan