ansible-collections / ibm_zos_core

Red Hat Ansible Certified Content for IBM Z

[Bug] "BGYSC0808E Unknown error from JMONCONS" when running zos_operator on two LPARs of the same sysplex concurrently #937

Open LuiggiTorricelli opened 1 year ago

LuiggiTorricelli commented 1 year ago

Is there an existing issue for this?

Are the dependencies a supported version?

IBM Z Open Automation Utilities

v1.2.4

IBM Enterprise Python

v3.11.x

IBM z/OS Ansible core Version

v1.6.0

ansible-version

v2.15.x

z/OS version

v2.5

Ansible module

zos_operator

Bug description

Hi team, here is a brief summary of the z/OS environments:

This is the playbook being run:

```YAML
---
- hosts: AMXT
  collections:
    - ibm.ibm_zos_core
  gather_facts: no
  environment: "{{ environment_vars }}"

  tasks:

    - name: Run command
      ibm.ibm_zos_core.zos_operator:
        cmd: "D IPLINFO"
        verbose: true
      register: out
      ignore_errors: true

    - debug:
        var: out
```

When the zos_operator module runs on both LPARs concurrently in the same playbook, the following error is returned:

```
TASK [debug] ********************************************************************************************************************************************************
ok: [NWRD] => {
    "out": {
        "changed": true,
        "cmd": "D IPLINFO",
        "content": [
            "NWRD       2023230  16:46:15.00             ISF031I CONSOLE LW090000 ACTIVATED",
            "NWRD       2023230  16:46:15.00            -D IPLINFO ",
            "NWRD       2023230  16:46:15.00             IEE254I  16.46.14 IPLINFO DISPLAY 220",
            "                                            SYSTEM IPLED AT 21.48.10 ON 08/15/2023",
            "                                            RELEASE z/OS 02.05.00    LICENSE = z/OS",
            "                                            USED LOADFB IN SYS0.IPLPARM ON 04025",
            "                                            ARCHLVL = 2   MTLSHARE = N",
            "                                            VALIDATED BOOT: NO",
            "                                            IEASYM LIST = (XX,9X,9D,L)",
            "                                            IEASYS LIST = (9D,00,9X,9D) (OP)",
            "                                            IODF DEVICE: ORIGINAL(04025) CURRENT(04025)",
            "                                            IPL DEVICE: ORIGINAL(04004) CURRENT(04004) VOLUME(SRB9X1)",
            "BGYSC0804I Using timeout of 100 centiseconds.",
            "BGYSC0801I CONSBUFPGNUM=128 - Console command output buffer memory size to be allocated: 131072 bytes",
            "BGYSC0802I Console command returned output string size of 371 bytes."
        ],
        "elapsed": 1.31,
        "failed": false,
        "rc": 0,
        "wait_time_s": 1
    }
}
ok: [NWRE] => {
    "out": {
        "changed": false,
        "failed": true,
        "msg": "OperatorCmdError('D IPLINFO', 12, ['', 'Out: ', 'Err: BGYSC0804I Using timeout of 100 centiseconds.', 'BGYSC0801I CONSBUFPGNUM=128 - Console command output buffer memory size to be allocated: 131072 bytes', 'BGYSC0808E Unknown error from JMONCONS.', '', 'Ran: D IPLINFO'])"
    }
}
```

Sometimes the first host (NWRD) gets the OperatorCmdError and the second host (NWRE) completes successfully; other times it is the other way around.

Setting `wait: true` and `wait_time_s` made no difference; the task failed the same way.

With `throttle: 1` set on the task, both hosts complete successfully.
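For anyone reproducing the workaround, a minimal sketch is the same play with the task-level `throttle: 1` keyword on the `zos_operator` task, so only one host runs the operator command at a time. It assumes the same `AMXT` group and `environment_vars` shown in this report, and it only serializes the command; it does not fix the underlying console conflict.

```YAML
---
- hosts: AMXT
  collections:
    - ibm.ibm_zos_core
  gather_facts: no
  environment: "{{ environment_vars }}"

  tasks:

    - name: Run command
      ibm.ibm_zos_core.zos_operator:
        cmd: "D IPLINFO"
        verbose: true
      register: out
      ignore_errors: true
      # Workaround: at most one host executes this task at any moment,
      # so the two LPARs never issue the console command concurrently.
      throttle: 1

    - debug:
        var: out
```

`throttle` is used here instead of play-level `serial: 1` because it limits only this task, so the other tasks in the play still run on both LPARs in parallel.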

Playbook verbosity output.

playbook_out.txt

Ansible configuration.

```
luiggi@LAPTOP-O3KP4MD1:~/ansible$ ansible-config view
# Since Ansible 2.12 (core):
# To generate an example config file (a "disabled" one with all default settings, commented out):
#               $ ansible-config init --disabled > ansible.cfg
#
# Also you can now have a more complete file by including existing plugins:
# ansible-config init --disabled -t all > ansible.cfg

# For previous versions of Ansible you can check for examples in the 'stable' branches of each version
# Note that this file was always incomplete  and lagging changes to configuration settings

# for example, for 2.9: https://github.com/ansible/ansible/blob/stable-2.9/examples/ansible.cfg
```

### Contents of the inventory

```YAML
# Inventory for American Express systems
all:
  hosts:
    localhost:
      vars:
        ansible_connection: local
  children:
    sandbox:
      children:
        AMXT:
          hosts:
            NWRD:
            NWRE:
    # Plex definitions - Development
    development:
      children:
        ADC:
          hosts:
            DIPD:
            DIPN:
            DIPO:
        EDEV:
          hosts:
            DIPQ:
        IPCD:
          hosts:
            NWRC:
        PDEV:
          hosts:
            DIPM:
    # Plex definitions - Pre-production (ARENA)
    arena:
      children:
        ARENA_EPRD:
          hosts:
            ADPU:
        ARENA_IPRD:
          hosts:
            AMFS:
            AMFT:
        ARENA_LPRD:
          hosts:
            ADPK:
        ARENA_MPRD:
          hosts:
            ADPJ:
        ARENA_PPRD:
          hosts:
            ADPS:
        ARENA_SROC:
          hosts:
            ADPE:
            ADPL:
        ARENA_WROC:
          hosts:
            ADPA:
            ADPI:
    # Plex definitions - Production
    production:
      children:
        EPRD:
          hosts:
            DIPU:
            DIPY:
            DIPZ:
        IPRD:
          hosts:
            PMFS:
            PMFT:
        LPRD:
          hosts:
            DIPK:
            DIPR:
        MPRD:
          hosts:
            DIPJ:
        NPRD:
          hosts:
            NMNC:
            NWRA:
        PPRD:
          hosts:
            DIPS:
            DIPT:
        SROC:
          hosts:
            DIPE:
            DIPG:
            DIPH:
            DIPL:
        WROC:
          hosts:
            DIPA:
            DIPB:
            DIPC:
            DIPI:
  vars:
    ansible_ssh_pipelining: false
```

### Contents of `group_vars` or `host_vars`

**group_vars/AMXT.yml**

```YAML
PYZ: "/usr/lpp/IBM/cyp/v3r11/pyz"
ZOAU: "/usr/lpp/IBM/zoautil"
ansible_python_interpreter: "{{ PYZ }}/bin/python3"
environment_vars:
    ZOAU_HOME: "{{ ZOAU }}"
    #PYTHONPATH: "{{ ZOAU }}/lib"
    LIBPATH: "{{ ZOAU }}/lib:{{ PYZ }}/lib:/usr:/usr/lib:/lib"
    PATH: "{{ ZOAU }}/bin:{{ PYZ }}/bin:/bin:/usr/sbin:/usr/bin"
    _BPXK_AUTOCVT: "ON"
    _CEE_RUNOPTS: "FILETAG(AUTOCVT,AUTOTAG) POSIX(ON)"
    _TAG_REDIR_ERR: "txt"
    _TAG_REDIR_IN: "txt"
    _TAG_REDIR_OUT: "txt"
    LANG: "C"
```

**host_vars/NWRD.yml**

```YAML
ansible_host: nwrd.ipc.us.aexp.com
```

**host_vars/NWRE.yml**

```YAML
ansible_host: nwre.ipc.us.aexp.com
```

LuiggiTorricelli commented 1 year ago

Sorry, some content in the Ansible configuration section is displayed incorrectly because of missing Markdown code fences.

ddimatos commented 11 months ago

@LuiggiTorricelli thank you for reporting this. Let's start with the first point in the issue:

ansible_ssh_pipelining is set to False due to previous issues with non-UTF8 encoding characters.

This should have been resolved with a new environment variable PYTHONSTDINENCODING that is available in PTFs:

Property PYTHONSTDINENCODING should be set to the encoding that UNIX System Services is configured to use; supported encodings are ASCII or EBCDIC. This environment variable tells Ansible which encoding to use when piping content to Python's STDIN (standard input) when `pipelining=true` is set in ansible.cfg. It applies only when using IBM Enterprise Python 3.10 or later; otherwise, it is ignored. For example: `PYTHONSTDINENCODING: "cp1047"`.
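Applied to the setup in this report, that could look like the following sketch of `group_vars/AMXT.yml`. It assumes the PTF providing `PYTHONSTDINENCODING` is installed and that UNIX System Services on these LPARs is configured for EBCDIC code page IBM-1047 (hence `cp1047`); both are assumptions to verify per system. Because the inventory sets `ansible_ssh_pipelining: false` under `all`, and an inventory variable overrides the `pipelining` setting in `ansible.cfg`, the sketch re-enables pipelining at the `AMXT` group level, which takes precedence over `all`.

```YAML
# group_vars/AMXT.yml (sketch): the reporter's variables plus the encoding setting
PYZ: "/usr/lpp/IBM/cyp/v3r11/pyz"
ZOAU: "/usr/lpp/IBM/zoautil"
ansible_python_interpreter: "{{ PYZ }}/bin/python3"
# Assumption: re-enable pipelining for this group; overrides the `all` group's false
ansible_ssh_pipelining: true
environment_vars:
    ZOAU_HOME: "{{ ZOAU }}"
    LIBPATH: "{{ ZOAU }}/lib:{{ PYZ }}/lib:/usr:/usr/lib:/lib"
    PATH: "{{ ZOAU }}/bin:{{ PYZ }}/bin:/bin:/usr/sbin:/usr/bin"
    _BPXK_AUTOCVT: "ON"
    _CEE_RUNOPTS: "FILETAG(AUTOCVT,AUTOTAG) POSIX(ON)"
    _TAG_REDIR_ERR: "txt"
    _TAG_REDIR_IN: "txt"
    _TAG_REDIR_OUT: "txt"
    LANG: "C"
    # Assumption: USS uses EBCDIC IBM-1047; passed to the module via the play's `environment:`
    PYTHONSTDINENCODING: "cp1047"
```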

You can see an example in our samples and discussion here.

If you can try that, it may address your pipelining comment.

ddimatos commented 11 months ago

@LuiggiTorricelli for the second issue, you provided some good data points, particularly the use of throttle and the fact that this happens when running the playbook against two LPARs. I have an idea why the issue is happening, but I must confirm it first: we will need to recreate it, and then I will report back.

For now, I have put it in our backlog to be recreated; it could take some time.

For internal reference, it might be related to tracker 10091.

ddimatos commented 5 months ago

I discussed this with the ZOAU team, and we concluded that it might be related to the console ID. We are thinking of backlogging this work item until issue #1308 is complete; that issue will create a unique console ID per playbook, which could solve this problem. The discussion was recorded internally at: [archives/C037EFBNPAN/p1712295106635239]