SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator

CCKD dasd corruption for unknown reason (was: cckd trace difficulties) #575

Closed davekreiss closed 1 year ago

davekreiss commented 1 year ago

I am fighting an I/O problem. I am running z/OS 2.1 under Hercules version 4.6.0.10941-SDL-g65c97fd6 on Windows 10 Home 64-bit 19045.3086 (though Hercules 4.5 has the same difficulties).

Introduction

The program in question issues a WTOR to stop the program at a point prior to the problem.

At that point I start a GTF I/O trace for the device involved.

Then I enter the Hercules command cckd trace=200000,debug=1. (I've also tried 190000 and 100000 and debug 1/0 in many different combinations)

Then respond to the WTOR which lets the program run.

When the program reaches the end of the section of code in which the I/O issue occurs, another WTOR is issued.

I then stop GTF and enter Hercules command k to dump/display the cckd trace table.

One time I received a trace that covered half the time between the WTORs, but in the current instance there is no trace output at all covering the WTORs period.

As you will see in the enclosed log, sometimes I get a trace and sometimes I get nothing. (I started cckd trace several times and tried many k commands.)

So why all this process to get the trace?

The I/O problem I am trying to solve involves some form of corruption of a z/OS data set. The program being debugged is an MVS 3.8 version of SMP, run to assemble 1230 programs and then link those programs into distribution libraries. I have front-ended the linkage editor with my own program, which determines which link edit is the flawed one and issues the first WTOR upon detecting it. When that link edit is complete, the second WTOR is issued.

The corruption is to a data set that is:

  1. not a load library so shouldn't be referenced as part of link edit output, and:

  2. though there is a DD statement for that data set in the JCL, SMP doesn't invoke the link edit with that data set as the output data set.

Interestingly, the corruption is to a non-load library data set - AHELP, a simple 80 byte fixed block PDS containing HELP info. The corruption occurs in the directory portion of the AHELP data set, and is very consistent through all runs. It is reproducible and always occurs between the WTORs.

I have verified that the GTF trace contains no Define Extent covering the directory blocks being overwritten. I can provide more information about why and what was corrupted.
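As an illustration of the kind of check described above, here is a minimal sketch that tests whether any traced extent covers a given track. It assumes the begin and end addresses of each Define Extent have already been pulled out of the GTF trace as (cylinder, head) pairs; the function names and the sample values are hypothetical and are not taken from the actual traces.

```python
from typing import Iterable, List, Tuple

Track = Tuple[int, int]          # (cylinder, head); hypothetical representation
Extent = Tuple[Track, Track]     # (begin, end), both inclusive


def extent_covers(begin: Track, end: Track, target: Track) -> bool:
    """Return True if target lies inside the inclusive extent begin..end.

    Tuples compare cylinder first and head second, which matches the
    linear track order on the volume.
    """
    return begin <= target <= end


def extents_touching(extents: Iterable[Extent],
                     targets: Iterable[Track]) -> List[Extent]:
    """Return every extent that covers at least one of the target tracks."""
    wanted = list(targets)
    return [
        (begin, end)
        for begin, end in extents
        if any(extent_covers(begin, end, t) for t in wanted)
    ]
```

An empty result from `extents_touching` for the tracks holding the corrupted directory blocks would correspond to the observation above: no traced channel program was allowed to write there.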

There are several interesting variations to the corruption:

  1. it didn't happen years ago (can't remember when or versions of Hercules it was successful), and:

  2. running the same process on MVS 3.8 (both TK3 and TK4-) the corruption occurs on other data sets.  (complex Hercules versions here as well, and I haven't yet done testing with older Hercules versions.)

Those corrupted data sets also seem to be consistent from run to run, but I have concentrated on the z/OS 2.1 corruption since it is easier to research using its various diagnostic tools. Also, I can't fall back to some ancient Hercules version, as I am now running with CCKD64 disks, though the disk with the corruption is not a CCKD64 disk.

For now, all I want to do is correlate the cckd trace with the GTF trace. I wrote a program which does that, but only once did I get a matching GTF and cckd trace. Unluckily, that cckd trace didn't include the complete time frame between the two WTORs.
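The correlation step can be sketched roughly as follows. This is not the program mentioned above, only an illustration under assumptions: each trace is assumed to be already parsed into a time-ordered list of (timestamp, text) pairs with timestamps on a common clock, and the names `merge_traces` and `trace_covers_window` are made up for this example.

```python
import heapq
from typing import List, Sequence, Tuple

Record = Tuple[float, str]  # (timestamp in seconds, trace text); hypothetical


def merge_traces(gtf: Sequence[Record],
                 cckd: Sequence[Record]) -> List[Tuple[float, str, str]]:
    """Interleave two time-ordered traces into one timeline.

    Each output row is (timestamp, source, text), where source is
    "GTF" or "CCKD".
    """
    tagged_gtf = ((ts, "GTF", text) for ts, text in gtf)
    tagged_cckd = ((ts, "CCKD", text) for ts, text in cckd)
    return list(heapq.merge(tagged_gtf, tagged_cckd, key=lambda row: row[0]))


def trace_covers_window(trace: Sequence[Record],
                        start: float, end: float) -> bool:
    """Return True if the trace spans the whole window start..end.

    A trace that begins after the first WTOR or ends before the second
    one only covers part of the period of interest.
    """
    if not trace:
        return False
    return trace[0][0] <= start and trace[-1][0] >= end
```

Checking `trace_covers_window` before merging makes it obvious when a cckd trace table has wrapped or was dumped too early, which is the situation described above.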

What is enclosed:

Hercules log of latest attempted tracing:

Problem Summary:

So, there are two problems:

  1. the cckd trace command

  2. corruption, where I have so far been unable to pinpoint who is responsible:

     a. Hercules, or:

     b. whatever software is running on z/OS (and MVS 3.8).

This story is a lot longer, but for now I'd like to get a complete cckd trace between the two WTORs so I can do further analysis. I did read about tracing instructions to a file and was hoping for the same feature with cckd, but the help doesn't document such a feature.

wrljet commented 1 year ago

Does that mean, this:

Fixed by commit https://github.com/SDL-Hercules-390/hyperion/commit/bd8362828d379105f7629c3b97df000dd17d1071.

... is a red herring?

Fish-Git commented 1 year ago

Does that mean, this:

Fixed by commit bd83628.

... is a red herring?

NO! That was the real bug!

I was talking about the other bug that I thought existed but turns out it actually doesn't.

If you follow the link provided in my comment, you will see which "bug" (that turns out wasn't actually a bug at all) I was referring to.


To be absolutely clear...

... for the benefit of others reading through this long and perhaps confusing GitHub Issue thread, the PRIMARY (i.e. ACTUAL) bug that this GitHub Issue reported, was indeed identified and fixed by commit bd8362828d379105f7629c3b97df000dd17d1071.

The Secondary (false) bug that I originally believed to exist, is the one that turned out to be a red herring.

Again, this GitHub Issue itself does indeed describe a legitimate bug (dasd corruption) which was indeed found and fixed. While chasing this bug, I mistakenly believed I might have found another bug. It was this secondary bug that was the one that turned out to not exist. The primary bug DID exist (and was fixed). The secondary bug did not and was not.