ansible-collections / ibm_zos_core

Red Hat Ansible Certified Content for IBM Z

[Bug] [zos_backup_restore] add parameter to tolerate enqueue #531

Closed azarrafa closed 1 week ago

azarrafa commented 1 year ago

Bug description

When you run the zos_backup_restore module with operation: backup against an enqueued data set, it fails. Is there a way to add a parameter to tolerate the enqueue failure, i.e. TOLERATE(ENQF)?

Playbook verbosity output

fatal: [XXXX]: FAILED! => { "backup_name": "", "changed": false, "invocation": { "module_args": { "backup_name": "/u/users/XXXXX/zowev2/temp_backup.dzp", "data_sets": { "exclude": null, "include": "PSP.AAM." }, "full_volume": false, "hlq": null, "operation": "backup", "overwrite": true, "recover": false, "sms_management_class": null, "sms_storage_class": null, "space": 300, "space_type": "M", "temp_volume": null, "volume": null } }, "message": "", "msg": "ZOAUException(\"BGYSC3904E Failed to dump contents for zip file, see RC for MVSCmd return code\n1PAGE 0001 5695-DF175 DFSMSDSS V2R05.0 DATA SET SERVICES 2022.287 10:09\n- DUMP OUTDD(ARCHIVE) OPTIMIZE(4) DS(INCL( -\n PSP.AAM., -\n ) -\n )\n ADR101I (R/I)-RI01 (01), TASKID 001 HAS BEEN ASSIGNED TO COMMAND 'DUMP '\n ADR109I (R/I)-RI01 (01), 2022.287 10:09:11 INITIAL SCAN OF USER CONTROL STATEMENTS COMPLETED\n ADR016I (001)-PRIME(01), RACF LOGGING OPTION IN EFFECT FOR THIS TASK\n0ADR006I (001)-STEND(01), 2022.287 10:09:11 EXECUTION BEGINS\n0ADR788I (001)-DIVSM(03), PROCESSING COMPLETED FOR CLUSTER PSP.AAM.CSI, 4914 RECORD(S) PROCESSED, REASON 0\n0ADR412E (001)-DTDSC(03), DATA SET PSP.AAM.CFF7LINK IN CATALOG ICF.PROD.USERCAT ON VOLUME PRD004 FAILED \n SERIALIZATION\n0ADR412E (001)-DTDSC(03), DATA SET PSP.AAM.PARMLIB IN CATALOG ICF.PROD.USERCAT ON VOLUME PRD003 FAILED SERIALIZATION\n0ADR801I (001)-DTDSC(01), 2022.287 10:09:23 DATA SET FILTERING IS COMPLETE. 
23 OF 25 DATA SETS WERE SELECTED: 2 \n FAILED SERIALIZATION AND 0 FAILED FOR OTHER REASONS\n0ADR454I (001)-DTDSC(01), THE FOLLOWING DATA SETS WERE SUCCESSFULLY PROCESSED\n0 PSP.AAM.AFF7HFS\n0 PSP.AAM.AFF7JCL0\n0 PSP.AAM.AFF7MOD0\n0 PSP.AAM.AFF7OPTN\n0 PSP.AAM.AFF7PROC\n0 PSP.AAM.AFF7XML\n0 PSP.AAM.CFF7JCL0\n0 PSP.AAM.CFF7OPTN\n0 PSP.AAM.CFF7PROC\n0 PSP.AAM.CFF7XML\n0 PSP.AAM.ERROR.HOLDDATA\n0 PSP.AAM.FF720G0.SMPMCS\n0 PSP.AAM.SAMPJCL\n0 PSP.AAM.SMPLOG\n0 PSP.AAM.SMPLOGA\n0 PSP.AAM.SMPLTS\n0 PSP.AAM.SMPMTS\n0 PSP.AAM.SMPPTS\n1PAGE 0002 5695-DF175 DFSMSDSS V2R05.0 DATA SET SERVICES 2022.287 10:09\n- PSP.AAM.SMPPTS1\n0 PSP.AAM.SMPSCDS\n0 PSP.AAM.SMPSTS\n0 CLUSTER NAME PSP.AAM.CFF7USS\n0 CATALOG NAME ICF.PROD.USERCAT\n0 COMPONENT NAME PSP.AAM.CFF7USS.DATA\n0 CLUSTER NAME PSP.AAM.CSI\n0 CATALOG NAME ICF.PROD.USERCAT\n0 COMPONENT NAME PSP.AAM.CSI.DATA\n0 COMPONENT NAME PSP.AAM.CSI.INDEX\n0ADR006I (001)-STEND(02), 2022.287 10:09:23 EXECUTION ENDS\n0ADR013I (001)-CLTSK(01), 2022.287 10:09:23 TASK COMPLETED WITH RETURN CODE 0008\n0ADR012I (SCH)-DSSU (01), 2022.287 10:09:23 DFSMSDSS PROCESSING COMPLETE. HIGHEST RETURN CODE IS 0008 FROM:\n TASK 001\n,RC=8\")" }
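For readers triaging similar output, the following is a minimal sketch of how the DFSMSdss messages above could be summarized. It is not part of the collection or of ZOAU; the function name `summarize_dss_output` is hypothetical, and the patterns are based only on the ADR412E and ADR012I lines shown in this log.

```python
import re

# ADR412E lines name each data set that failed serialization (was enqueued).
_ADR412E = re.compile(r"ADR412E\b.*?DATA SET\s+(\S+)")
# ADR012I reports the highest return code of the whole DFSMSdss run.
_HIGHEST_RC = re.compile(r"HIGHEST RETURN CODE IS\s+(\d+)")


def summarize_dss_output(text):
    """Return (highest_rc, data_sets_that_failed_serialization).

    Hypothetical helper: highest_rc is None when no ADR012I summary is found.
    """
    # The module's msg field can carry literal "\n" sequences; normalize them.
    text = text.replace("\\n", "\n")
    failed = _ADR412E.findall(text)
    match = _HIGHEST_RC.search(text)
    highest_rc = int(match.group(1)) if match else None
    return highest_rc, failed
```

Applied to the output above, this would report the two data sets flagged by ADR412E together with the highest return code from the ADR012I line.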

Contents of ansible.cfg

No response

Contents of the inventory

No response

Contents of group_vars or host_vars

No response

Ansible version

ansible [core 2.12.7]
  config file = /home/zowe/.ansible.cfg
  configured module search path = ['/home/zowe/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.8/site-packages/ansible
  ansible collection location = /home/zowe/.ansible/collections:/home/zowe/.ansible/collections/ansible_collections/ibm/ibm_zos_core
  executable location = /usr/bin/ansible
  python version = 3.8.13 (default, Jun 24 2022, 15:27:57) [GCC 8.5.0 20210514 (Red Hat 8.5.0-13)]
  jinja version = 2.11.3
  libyaml = True

IBM z/OS Ansible core Version

v1.3.5

IBM ZOAU version

v1.2.0

z/OS version

No response

Ansible module

zos_backup_restore

ddimatos commented 1 year ago

Hi @azarrafa - Thank you for opening this issue. It is a reasonable ask, but it will require some investigation on our part; I can't claim to have ever run ADRDSSU with TOLERATE(ENQFAILURE). We'll investigate and report back here. On another note, I have honestly been questioning the value of this module over separate archive and un-archive utilities; if you have any thoughts on that point, it would be great to hear them. In the meantime, we will look into this and get back to you. We have a pretty full quarter, so let me see what we can do.

couckearthur commented 5 months ago

@ddimatos I think this was implemented?

If I use 'recover: true', the TOL(ENQF IOER) option is used. But I have another problem: the module still fails on RC=4 from ADRDSSU, which it should not do.

I have done some debugging and I think the problem is in ZOAU (I'm using 2023/11/10 18:41:54 CUT V1.3.0.0 a683b8eb 4420 HAL5130 1064 343c1cf2).

I ran the native ZOAU dzip utility with the following command: dzip -v -d -D -f on a data set that is in use. ADRDSSU gives return code 4, but that is what we expect (when running ADRDSSU in JCL jobs we would let this cancel the job only in very rare cases). The operation is actually successful, keeping in mind that we used '-f' (recover: true in zos_core).

Unfortunately, the ZOAU code is deleting the dump data set:

+ exit 0004

What should I do with it? Open a GitHub issue here or open a ticket with IBM support? (debug.txt)

As a side note, I don't really like the term 'zip' being used in this context. We are dumping with ADRDSSU and tersing the result afterwards. It is important that we native mainframers can still follow what is happening. The word 'zip' makes me think of compressing on USS (but I guess this comment is also one for ZOAU).

couckearthur commented 5 months ago

I've opened an IBM support ticket for the ZOAU team, as I realised the zos_core team can't do much about it.

ddimatos commented 5 months ago

@couckearthur - I think the response from us was overdue; it could be that the label "waiting for response" was left on accidentally. Anyhow, to your points: I did review the source, and under the right conditions either TOL(ENQF IOER) or TOL(IOER) is used, so yes, this was implemented.

As for the RC 4, I agree that it is generally an acceptable RC (even 8 in some cases), so I think a reasonable ask would be to increase the accepted value to 4 or allow it to be configurable. I will go look for the support ticket and see what approach they are going to take so Ansible can plan for a change or not; we regularly interlock with the team and drive requirements.
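To make the "configurable" idea concrete, here is a minimal sketch of a return code check with a tolerated maximum. It is only an illustration, not the module's or ZOAU's actual logic; `is_acceptable_rc` and `max_rc` are hypothetical names.

```python
def is_acceptable_rc(rc, max_rc=0):
    """Return True when a utility return code is within the tolerated limit.

    max_rc is a hypothetical, user-configurable threshold: 0 is strict,
    while 4 would also accept the RC 4 discussed in this thread.
    """
    if rc < 0 or max_rc < 0:
        raise ValueError("return codes must be non-negative")
    return rc <= max_rc
```

With the default, an RC of 4 is treated as a failure; passing `max_rc=4` accepts it while still rejecting RC 8.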

As for your original question about what to do if this is ZOAU: we used to have verbiage on this in our issues, and updated documentation addressing support will be out soon. In general, the community support available here is limited to this collection only, not its dependencies or the rest of the Ansible stack (e.g., ansible-core, Ansible Automation Platform, Execution Environments, ZOAU, Python), so you would need the ability to open a support case on ZOAU through IBM and/or Ansible Automation Platform. Community support comes with no SLA and no ability to set the severity; the team decides the severity and weighs it against planned work items, S&S cases and commitments. The team triages all open issues each Thursday.

That being said, bugs that come into this community are taken seriously even if there is no S&S, and if the team sets the severity high enough, the team will engage ZOAU, as it is impacting the offering. For example, the fix for the non-printable character issue coming from submitting jobs was released in ZOAU 1.3, but the team drove the backport to ZOAU 1.2 in our last 1.9.0 GA release.

I will reach out to ZOAU and see what their plan is on this item.

couckearthur commented 4 months ago

@ddimatos My apologies for the late reply, I must have missed the notification.

In the meantime I had the IBM support case. I've explained the issue and the solution. The fix should be implemented in 1.3.2.

For now I could work around it by allowing RC 4 with some minor changes in the dziphelper.

With kind regards, Arthur Coucke

ddimatos commented 3 months ago

ZOAU is providing a fix for this in ZOAU 1.3.2; the case is TS015848350. For Ansible, this is a validation task: a test case that uses our lock to ensure a data set is enqueued while performing a backup with 'recover: true', to confirm the keyword TOL(ENQF IOER) is exercised.

This will need triage to determine when it can be done.

ddimatos commented 1 week ago

I did not share the JIRA that corresponds to the IBM case TS015848350; it is NAZARE-10554.