SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator
Other
246 stars 92 forks source link

CCKD dasd corruption for unknown reason (was: cckd trace difficulties) #575

Closed davekreiss closed 1 year ago

davekreiss commented 1 year ago

I am fighting a I/O problem. Running z/OS 2.1 under Hercules version 4.6.0.10941-SDL-g65c97fd6 on Windows 10 Home 64 bit 19045.3086 (though Hercules 4.5 has same difficulties).

Introduction

The program in question has a WTOR to stop program at point prior to problem.

At that point I start an GTF IO trace for the device involved.

Then I enter the Hercules command cckd trace=200000,debug=1. (I've also tried 190000 and 100000 and debug 1/0 in many different combinations)

Then respond to the WTOR which lets the program run.

When the program hits the end of the area in code after where the IO issue has occurred another WTOR is issued.

I then stop GTF and enter Hercules command k to dump/display the cckd trace table.

One time I received a trace that covered half the time between the WTORs, but in the current instance there is no trace output at all covering the WTORs period.

As you will see in the enclosed log, sometimes I get a trace and sometimes I get nothing. (I started cckd trace several times and tried many k commands.)

So why all this process to get the trace?

The I/O problem I am trying to solve involves some form of corruption of a z/OS data set. The program being debugged is a MVS 3.8 version of SMP run assembling 1230 programs followed by SMP then linking those programs to distribution libraries. I have front ended the linkage editor with my own program which determines which link is the flawed one and issue the first WTOR upon detecting the flawed link edit. When that link edit is complete the second WTOR is issued.

The corruption is to a data set that is:

  1. not a load library so shouldn't be referenced as part of link edit output, and:

  2. though there is a DD statement for that data set in the JCL, SMP doesn't invoke link edit with that data set as any data set as the output data set.

Interestingly, the corruption is to a non-load library data set - AHELP, a simple 80 byte fixed block PDS containing HELP info. The corruption occurs in the directory portion of the AHELP data set, and is very consistent through all runs. It is reproducible and always occurs between the WTORs.

I have verified the GTF trace contains no define extent for where the directory blocks being overwritten. More info as to why and what was corrupted can be shown.

There are several interesting variations to the corruption:

  1. it didn't happen years ago (can't remember when or versions of Hercules it was successful), and:

  2. running the same process on MVS 3.8 (both TK3 and TK4-) the corruption occurs on other data sets.  (complex Hercules versions here as well, and I haven't yet done testing with older Hercules versions.)

Those seem corrupted data sets are consistent, but I have concentrated on the z/OS 2.1 corruption since it easier to research using its various diagnostic tools. Also, I can't fall back to some ancient Hercules version as I am now running with CCKD64 disk, though the disk in question with the corruption is not a CCKD64 disk.

For now, all I wanted to do was correlate the cckd trace with the GTF trace. I wrote a program which does that, but only once did I get a matching GTF and cckd trace. Unluckily the cckd that trace didn't include the complete time frame between the two WTORs.

What is enclosed:

Hercules log of latest attempted tracing:

Problem Summary:

So, there are two problems:

  1. the cckd trace command

  2. corruption, but I have to this point been unable to pinpoint who is responsible: a. Hercules, or: b. whatever software running on z/OS (and MVS 3.8).

This story is a lot longer but for now I'd like to be able to get a complete cckd trace between the two WTORs working so I can do further analysis. I did read about the instruction trace to a file and was hoping for the same feature with cckd but help doesn't doc that feature.

Fish-Git commented 1 year ago

I am only slightly familiar with the cckd tracing command only insomuch as it provides additional details regarding a given I/O above and beyond what is otherwise provided by a simple CCW trace. It seems to trace the internal functioning of Hercules CCKD logic, tracing each major CCKD function that is called during the processing of a given CCW I/O. I am otherwise not very familiar with it at all. The cckd command (and specifically the tracing side of it) was written by the same Hercules developer who wrote Hercules's support for CCKD (Compressed CKD) itself. Prior to that we only had plain CKD. He designed and wrote ALL of our compressed CKD (CCKD) logic years ago, and unfortunately he's dead, having passed away from rectal cancer.   :(

After spending a few minutes looking at the code, it appears that the cckd command's trace= option controls the size of the consolidated, system wide internal CCKD trace table that accumulates formatted CCKD trace entries for any/all CCKD I/O for any/all devices. That is to say, when a non-zero trace= value is specified for the cckd command, CCKD tracing entries are saved in the table for any/all CCKD devices, which can then of course be dumped/displayed at any time via the 'k' command.

The cckd debug= option however, controls whether a given internal CCKD trace entry is either, or also, immediately printed at the moment the trace entry is created, depending on whether CCW tracing is enabled for that specific device.

So the two things -- cckd internal trace table, and cckd debug option -- were apparently designed for two completely different things. The cckd trace=nnnnnn option was designed to keep a record of all internal CCKD activity for any/all devices independently from normal CCW tracing. (Apparently so the Hercules developer could see how CCKD processing for one device interacted/occurred while CCKD activity was simultaneously taking place for other devices).

The cckd debug=1 option however, was designed to be used in conjunction with normal CCW tracing for a specific device. That is to say, the same exact CCKD tracing message is formed (formatted/created) for each of the two options, but for debug=1 the CCKD trace message is printed immediately at the moment it is formed, but only if CCW tracing is enabled for that particular device (i.e. only if the t+dev command was given for that device).

Does that make sense?

So my advice to you is, since, by your own words, you know which dataset is being corrupted (and thus which specific device's I/O is involved with the corruption):

Interestingly, the corruption is to a non-load library data set - AHELP, a simple 80 byte fixed block PDS containing HELP info. The corruption occurs in the directory portion of the AHELP data set, and is very consistent through all runs. It is reproducible and always occurs between the WTORs.

I would do a normal t+dev CCW trace for the device in question (where "dev" of course is the device number of the device holding the AHELP dataset's directory) along with cckd debug=1,trace=0.   (Then use cckd debug=0 and t-dev to disable tracing once you've captured the problem.)

This should provide a very detailed trace for just that one specific device where the corruption is obviously occurring, and should (HOPEFULLY!) tell us what's going on (i.e. should hopefully tell us whether the problem is with Hercules or with your guest).

Does that make sense? Does that help?

Once you create the trace, post your Hercules log here, and we'll take a look at it.

Thanks.

davekreiss commented 1 year ago

Ok, I set up the CCW trace, and believe I found something interesting. A little background of this JOB: The first JOB allocates a series of data sets across 9 mod 3 3390 volumes. The data set in question always ends up in the same cylinder on device 3004 since the volume always starts out empty. This data set always gets corrupted.

In the trace I see this, which seems wrong!

18:00:51 HHC00435I 0:3004 CKD file C:/R/BLDDLB.192: cyl 98 head 0 record 255 kl 255 dl 65535 of 0

Record 255,  Key length 255,  Data length 65535,  all look invalid!

  Next entry in trace:

18:00:51 HHC00431I 0:3004 CKD file C:/R/BLDDLB.192: seeking to cyl 31 head 0

Cylinder 31 head 0 is exactly the corrupted first track in the AHELP data set.

  Then we see this a little farther:

18:00:51 HHC00434I 0:3004 CKD file C:/R/BLDDLB.192: read count orientation data
18:00:51 HHC00435I 0:3004 CKD file C:/R/BLDDLB.192: cyl 31 head 0 record 3 kl 8 dl 256 of 0
18:00:51 HHC00440I 0:3004 CKD file C:/R/BLDDLB.192: updating cyl 31 head 0 record 3 kl 8 dl 256

(i.e. updating the AHELP data set...)

  Examining the define extent CCW, shows cyl 1F (31) is out of that range of cyl 62 head 0 to cyl 6B head E:

18:00:51 HHC01334I 0:3004 CHAN: ORB: IntP:00F535A8 Key:5 LPM:80 Flags:0C280 ....FP....H. L....... CCW:7F29D120
18:00:51 HHC00396I 0:3004 start i/o file[0] bufcur 1471 cache[91]
18:00:51 HHC01315I 0:3004 CHAN: ccw 63400010 7F29D138=>10CC0000 00000000 00620000 006B000E .............,..
18:00:51 HHC01315I 0:3004 CHAN: ccw 63400010 7F29D138=>10CC0000 00000000 00620000 006B000E .............,..
18:00:51 HHC01312I 0:3004 CHAN: stat 0C00, count 0000
18:00:51 HHC01315I 0:3004 CHAN: ccw 08000000 7F29D348
18:00:51 HHC01315I 0:3004 CHAN: ccw 47400010 7B43E128=>01810031 00620000 00620000 07220108 .a..............

  Enclosing the complete from IPL to shutdown Hercules log, and a smaller subset where these trace records are:

I can attach a GTF trace printed out if necessary, as well as a track print if needed, or any other info needed.

A simple volume mapping program (my own utility) of that volume may help in understanding the layout:

                     BLDDLB (3390-1)
--Start-- ---End--- ----Extent---- Data Set Name
CCCC HHHH CCCC HHHH     Size     #
0000 0000 0000 0000        1 T     Volume Label
0000 0001 0000 000E       14 T   1 VTOC
0001 0000 0014 000E       20 C   1 DSK.MVS.NEW.ACMDLIB
0015 0000 001E 000E       10 C   1 DSK.MVS.NEW.AGENLIB
001F 0000 0020 000E        2 C   1 DSK.MVS.NEW.AHELP
0021 0000 0024 000E        4 C   1 DSK.MVS.NEW.AIMAGE
0025 0000 0038 000E       20 C   1 DSK.MVS.NEW.ALPALIB
0039 0000 0047 000E       15 C   1 DSK.MVS.NEW.AMACLIB
0048 0000 0051 000E       10 C   1 DSK.MVS.NEW.AMODGEN
0052 0000 0059 000E        8 C   1 DSK.MVS.NEW.AOSA0
005A 0000 005B 000E        2 C   1 DSK.MVS.NEW.AOSA1
005C 0000 005F 000E        4 C   1 DSK.MVS.NEW.AOSBN
0060 0000 0061 000E        2 C   1 DSK.MVS.NEW.AOSB0
0062 0000 006B 000E       10 C   1 DSK.MVS.NEW.AOSB3
006C 0000 006D 000E        2 C   1 DSK.MVS.NEW.AOSCA
006E 0000 0077 000E       10 C   1 DSK.MVS.NEW.AOSCD
0078 0000 0079 000E        2 C   1 DSK.MVS.NEW.AOSCE

So it definitely looks like Hercules took a wrong turn somewhere. Where and why, I'll let you figure out!  :)

Thanks.

Fish-Git commented 1 year ago

In the trace I see this, which seems wrong!

18:00:51 HHC00435I 0:3004 CKD file C:/R/BLDDLB.192: cyl 98 head 0 record 255 kl 255 dl 65535 of 0

Record 255,  Key length 255,  Data length 65535,  all look invalid!

  I would have to agree!  :(  

Enclosing the complete from IPL to shutdown Hercules log, and a smaller subset where these trace records are:

Thanks. I'll take a look at it the very first moment I get.

I'm kind of in the middle of working on something else right now, and I'd like to finish it up before tackling this problem. I've just made a major breakthrough, and have, I believe, finally solved it. All I need to do now is write up my findings for peer review, which is going to take me a short(?) while. (There's a lot to document!)

Once I've posted my findings though, I'll jump on your (this) issue right away, I promise! Cool?

p.s. I don't suppose there's a fast and easy way for some idiot like me (who knows nothing about MVS / z/OS) to be able to recreate/reproduce this problem, is there? If there is, I'd very much appreciate you passing it along! It would sure help to expedite resolving this issue! Thanks!

davekreiss commented 1 year ago

I don't think there is an easy way to reproduce this problem since it takes multiple jobs which runs a total of over 40 minutes.

I believe that the channel program is related to STOW SVC which is recording the directory entry for a load module given the data recorded in the wrong extent. There are a lot of libraries involved in the process and it of course isn't simple.

I would be willing to run any tests you wish with test versions of Hercules if that is needed. The data sets used for these job streams are the publicly available TK3/TK4- MVS 3.8 source, target and distribution libraries.

I could construct a complete TK3 system with all it takes to run it with simple commands but that would take a few days at a minimum.

Fish-Git commented 1 year ago

I don't think there is an easy way to reproduce this problem since it takes multiple jobs which runs a total of over 40 minutes.

"40 minutes"?! Sheesh! That sucks.  :(

How fast if your Hercules host (i.e. host hardware)? Maybe it would run faster on my system? My system has 3.0GHz Intel Xeon processors. Is your host's processor(s) faster or slower than that? (If it's a newer system I'm guessing it's probably faster. My system is quite a few years old.)

I could construct a complete TK3 system with all it takes to run it with simple commands but that would take a few days at a minimum.

I'm not in any rush. Are you?

Fish-Git commented 1 year ago

FYI:  Other distracting issue resolved. Have begun looking into your issue now. But it's late in the day now, so I'll starting digging into it in earnest first thing tomorrow.

Fish-Git commented 1 year ago

David,

In lieu of constructing for me a complete TK3 system, would it be possible for you to upload your device 3004 "C:/R/BLDDLB.192" CCKD file (along with any shadow files you might have for it, if you're using shadow files that is) to my softdevlabs.com ftp site, at the point immediately before the problem occurs?

Since we've already captured the I/O sequence (series of I/O commands and associated channel programs) that triggers the problem, I'm wondering if maybe I might be able to quickly(?) reproduce the problem here on my system by writing a simple(?) stand alone program that simply issues the exact same I/O sequence(s) (chain of CCWs) that we can obviously see from your trace triggers the problem.

Of course, having a complete TK3 system "with all it takes to run it with simple commands" would be the ideal, but I'm just trying to think of possible alternatives. Technically, the only thing I would technically need would be your device 3004 "C:/R/BLDDLB.192" CCKD file (along with any associated shadow files you might have for it) in the state it was in immediately before the problem occurred. Then if I issue the same sequence of channel programs against it, I should technically then be able to reproduce (trigger) the problem. Hopefully.

Anyway, it's just a thought. I would prefer the complete TK3 system of course (so that I can reproduce the problem at will anytime I want), but as I said, I'm just trying to think of alternative approaches.

Thanks.

davekreiss commented 1 year ago

So, the TK3 corruption doesn't show itself in the same way as 2.1 corruption. I haven't quite narrowed down the set links that the corruption occurs in, and it is on a different volume and different data set as well. Still working on that but it is a process of narrowing down the specific links and takes a little time.

So, I have included the BLDDLB volume at 3004:

So, the job is a series of link edits that produce the channel program in question. At the point of my intercept just before invoking IEWL (linkage editor) a WTOR is issued. At that point I cancelled the job, shutdown the OS and then zipped up the volume. There are no shadows for this volume as all the data there is totally reproducible.

History of how that volume is created: As part of the JCL stream is a step which removes all data sets from that volume and reallocates and populates them via SMP.

I've included a printout of the GTF trace for which I've already uploaded the CCW trace. I found the first occurrence of the channel program in question searching for 8D400108 7B441654. The GTF CCW trace contains dump of all the data along with the channel program:

davekreiss commented 1 year ago

I have reproduced the same strange "record 255 kl 255 dl 65535" on TK3, though it occurs on a different volume and data set.

I will setup the TK3 system so a single job is ready in the queue with a WTOR to the console just before the link edit occurs. At that point, a t+193 can be done. Then any response to the WTOR will run that specific link edit.

Once that link edit has completed, another WTOR will come up. At that point, enter t-193 to stop the trace, and you have a reproduction.

I will include an example so you can see how it works along with a .zip file of all the volumes etc. to run it. It will take a little time to get that together, but I am including the Hercules log with the reproduction I just recreated.

No problem on how quick you can get this together since I have had this problem probably since at least Hercules 4.41, and maybe as far back as v4.00. I always assumed it was an OS problem.

I have created a TK3 system with instructions, but the resulting two zip files are over 4G (split in two zip files). The limit here is 25M, so we will need to find a way to drop these files to you.

Fish-Git commented 1 year ago

I have created a TK3 system with instructions, but the resulting two zip files are over 4G (split in two zip files). The limit here is 25M, so we will need to find a way to drop these files to you.

You can either FTP upload it to my SoftDevLabs FTP Server's "incoming" folder:

and then let me know when you've done so, so that I can then download it.

Note:  the "incoming" folder is write-only. As a result, you will not be able to "see" what you've uploaded, because doing a 'dir' (or 'ls') to list the directory's contents is necessarily a read operation, and as I just said, the folder (directory) is marked write-only. It does not have read permission.

OR... you can of course upload it to some other place of your own choosing if you wish, and just give me the URL so that I can then download it from there.

Your choice.

davekreiss commented 1 year ago

Never FTPed to a write only server and couldn't figure out how. I dropped on dropbox. Hope these links work.

TK3-1.ZIP (file://DAVES5510/Users/Dave/Dropbox/TK3-1.ZIP) TK3-2.ZIP (file://DAVES5510/Users/Dave/Dropbox/TK3-2.ZIP)

Dave

davekreiss commented 1 year ago

Well that was futile my Dropbox only takes 2G and I never use it and didn’t realize that. I need a lesson on FTP in order to send the files. Send me a windows procedure and let’s see if I can follow them. Dave

Fish-Git commented 1 year ago

(unrelated to FTP)

FYI: Please see this comment. Thanks.

Fish-Git commented 1 year ago

Never FTPed to a write only server and couldn't figure out how.

I need a lesson on FTP in order to send the files.

FTPing to a write-only folder (directory) is no different than FTPing anywhere else. The procedure is exactly the same. The only difference is, as I explained, that you won't be able to list the directory, that's all.

Send me a windows procedure and let’s see if I can follow them.

Command line, or GUI? Because I personally don't do FTPing via the command line. I prefer to use a GUI application on Windows. They're much more user friendly and easier to use IMO. The one I use is called FileZilla. But then I use FileZilla because I do a lot of FTPing too.

If you don't (and I suspect that's likely true), then doing it via the command line is almost as easy. (There's nothing to install; all Windows systems come with ftp.)

Below is a sample command line FTP session with some blank lines inserted for readability.

Notice how you login as user "anonymous". Also notice how you can list any directory's contents via ls -al except for the "incoming" folder's (since it's read-only). But uploading files to that directory (via the put command) still works just fine.

PLEASE ALSO NOTICE how you must use the binary command before issuing your put command!

C:\Users\Fish> ftp www.softdevlabs.com

Connected to softdevlabs.com.
220 (vsFTPd 2.2.2)
User (softdevlabs.com:(none)): anonymous

230 Login successful.

ftp> ls

200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.

AeroAdmin
Fish Ubuntu 18.04.3
SCSI
dosvs
incoming
links
manuals
misc
networking
politics
schuknecht
shedlock

226 Directory send OK.
ftp: 121 bytes received in 0.00Seconds 121000.00Kbytes/sec.

ftp> ls -al

200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.

drwxrwxr-x   14 500      500          4096 Mar 30 23:04 .
drwxrwxr-x   14 500      500          4096 Mar 30 23:04 ..
drwxrwxr-x    2 500      500          4096 Jun 02  2018 AeroAdmin
drwxrwxr-x    3 500      500          4096 Mar 30 23:04 Fish Ubuntu 18.04.3
drwxrwxr-x    4 500      500          4096 Aug 13  2014 SCSI
drwxrwxr-x    3 500      500          4096 Jan 24  2016 dosvs
drwx-wx-wx    2 500      500          4096 Jun 15 01:40 incoming
drwxrwxr-x    2 500      500          4096 Aug 13  2014 links
drwxrwxr-x    5 500      500          4096 Jan 20  2020 manuals
drwxrwxr-x    2 500      500          4096 Jun 06  2016 misc
drwxrwxr-x    4 500      500          4096 Sep 13  2014 networking
drwxrwxr-x    3 500      500          4096 Jun 29  2022 politics
drwxrwxr-x    2 500      500          4096 Mar 19  2015 schuknecht
drwxrwxr-x    3 500      500          4096 May 26  2018 shedlock

226 Directory send OK.
ftp: 912 bytes received in 0.00Seconds 912000.00Kbytes/sec.

ftp> cd manuals

250 Directory successfully changed.

ftp> ls -al

200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.

drwxrwxr-x    5 500      500          4096 Jan 20  2020 .
drwxrwxr-x   14 500      500          4096 Mar 30 23:04 ..
drwxrwxr-x    2 500      500          4096 Sep 17  2016 DOS
drwxrwxr-x    2 500      500          4096 Sep 17  2016 DOS-VS
drwxrwxr-x    2 500      500          4096 Sep 17  2016 DOS-VSE
-rw-rw-r--    1 500      500      16340053 Sep 17  2016 GC24-5140-00 DOS-VSE Macro Reference.pdf
-rw-rw-r--    1 500      500      11112344 Sep 17  2016 GC28-6515-06 System-360 FORTRAN IV Language.pdf
-rw-rw-r--    1 500      500      26797143 Sep 17  2016 GC28-6515-11 System-360 and System-370 FORTRAN IV Language.pdf
-rw-rw-r--    1 500      500            79 Jan 24  2016 _IBM Manuals at BitSavers.Org.url

226 Directory send OK.
ftp: 723 bytes received in 0.00Seconds 723000.00Kbytes/sec.

ftp> cd ..

250 Directory successfully changed.

ftp> cd incoming

250 Directory successfully changed.

ftp> ls -al

200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
226 Transfer done (but failed to open directory).

ftp> binary           <-----------------  VERY IMPORTANT COMMAND! --------------<<<<

200 Switching to Binary mode.

ftp> put c:\your-path-to-your-file\whatever.xxx

200 PORT command successful. Consider using PASV.
150 Ok to send data.
226 Transfer complete.
ftp: 2432 bytes sent in 0.10Seconds 24.32Kbytes/sec.

ftp> ls -al

200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
226 Transfer done (but failed to open directory).

ftp> bye

221 Goodbye.

C:\Users\Fish>

Hope that helps!

davekreiss commented 1 year ago

Well tried ftp and not successful. So I turned off both webroot firewall and also Microsoft's firewall with no success as well.

PS C:\Users\Dave\downloads> ftp www.softdevlabs.com Connected to softdevlabs.com. 220 (vsFTPd 2.2.2) 200 Always in UTF8 mode. User (softdevlabs.com:(none)): anonymous 230 Login successful. ftp> cd incoming 250 Directory successfully changed. ftp> binary 200 Switching to Binary mode. ftp> put TK3-1.ZIP 200 PORT command successful. Consider using PASV. 425 Failed to establish connection. ftp> ftp> put TK3-1.ZIP 200 PORT command successful. Consider using PASV. 553 Could not create file. ftp> put TK3-2.ZIP 200 PORT command successful. Consider using PASV. 425 Failed to establish connection. ftp> ls -al 200 PORT command successful. Consider using PASV. 425 Failed to establish connection. ftp> ftp> put TK3-1.ZIP 200 PORT command successful. Consider using PASV. 553 Could not create file. ftp> put TK3-1-try2.ZIP 200 PORT command successful. Consider using PASV. 425 Failed to establish connection. ftp> bye 221 Goodbye.

But I was able, as it turns out, to get one file to dropbox and one to idrive. Here are links. I hope you can access them.

https://1drv.ms/u/s!Ajp_FMPdPVCk0TSs4LE9nu6YxRTw https://www.dropbox.com/s/ab0gjwobl5ju4u6/TK3-2.ZIP

Dave

Fish-Git commented 1 year ago

Are you sure your .zip files exist in your local "C:\Users\Dave\downloads" directory?

Or do they maybe exist somewhere else?

Try specifying its full path, e.g.

put "C:\my\tk3\zip\dir\TK3-1.ZIP"

I hate Window 10/11!  :(

Fish-Git commented 1 year ago

But I was able, as it turns out, to get one file to dropbox and one to idrive. Here are links. I hope you can access them.

https://1drv.ms/u/s!Ajp_FMPdPVCk0TSs4LE9nu6YxRTw https://www.dropbox.com/s/ab0gjwobl5ju4u6/TK3-2.ZIP

I was able to download the "TK3-2.ZIP" from OneDrive, but when I click on the link for DropBox, I get a "This item was deleted" web page.

davekreiss commented 1 year ago

I re-fetched the link and it is different than first:

In the TK3 directory you will find this TK3 IPL and CCW trace instructions.txt. I made a copy of the dasd directory and copied the disks there and used that to restore between IPLs so I had a clean set of disks for each test. Easier than running zip to unzip from two zip files.

Another piece of info - I found a 3.07 version of Hercules and the same error appears to be happening.

Fish-Git commented 1 year ago

I re-fetched the link and it is different than first:

Got it! Thanks!

Things are starting to get a little crazy right now (talking to a lot of people via email, responding to other GitHub issues, especially #572, which I still have a minor issue/bug with that I need to get fixed, etc), but I assure you, Dave, this issue of yours is still at the top of my list, and I promise I will get to it just as soon as I possibly can! Your patience is appreciated.

Fish-Git commented 1 year ago

DAMMIT!  I'm sorry Dave, but as is too frequently the case, something else has once again popped up to temporarily take my focus away from your issue.  :(

GitHub Issue #572 has been re-opened due to a reported problem preventing z/VM from being able to IPL, so working on and resolving it has temporarily jumped ahead of your issue in my list of priorities. But I promise you I have not forgotten about you and you're still right at the top, next in line.

Fish-Git commented 1 year ago

David (@davekreiss),

I began (finally!) working on your issue today. I apologize for having taken so long.

I am unable to reproduce the problem as described, using the TK3 system you sent me.

I suspect the reason may be due to the fact that the TK3 dasd volumes you sent me, are already corrupted!

I need you to send me uncorrupted (i.e. pristine/clean) copies of the TK3 BLDDLB and BLDTGT volumes, so that I can try to catch the corruption in action.

The simple test that I ran:

First, I marked all of the dasds you sent me as read-only, and then created a Hercules configuration file that specified use of shadow files for each of them:

Doing it this way ensures that I can run each of my tests over and over again as needed, starting from the exact same starting point (i.e. state) each time.

After each test, if I want to run it again or try a different technique (test) to try and catch the bug, I simply delete the old set of shadow files from the previous test run before I start Hercules for the new test run, thereby leaving behind just the original unaltered as-delivered dasd volumes and voila! -- the new/next test begins with all of my dasd volumes in their original unaltered as-delivered state.

Afterwards, I can then examine the shadow files to see if there were any updates to the track where the corruption supposedly occurs (cylinder 31 head 0, yes?).

Before powering on Hercules and IPLing (which runs your built-in automated test automatically and supposedly recreates the problem (i.e. the corruption), yes?), I first run some Hercules tools (utilities) called cckddiag and cckdmap to view the current state of the affected dasds.

(Note: it's still unclear to me exactly which dasd volume it is that is supposed to "become corrupted" as a result of your test, but I'm guessing it is the BLDTGT volume, yes? Please confirm this for me! Thanks.)

Here are the cckddiag reports for both Cylinder 31 Head 0 and Cylinder 31 Head 1 for both BLDDLB and BLDTGT as they look BEFORE running your test (i.e. before IPLing the system):

CCKDDIAG -t -a 31 0 Q:/TK3/Dasd/BLDDLB.cckd32

17:30:10.005 CCKDDIAG BLDDLB.cckd32 started; process-id = 6992 (0x00001B50)
17:30:10.036 HHC02499I Hercules utility CCKDDIAG - CCKD/CFBA diagnostic program - version 4.6.0.10941-SDL-g65c97fd6
17:30:10.036 HHC01414I (C) Copyright 1999-2023 by Roger Bowler, Jan Jaeger, and others
17:30:10.036 HHC01417I ** The SDL 4.x Hyperion version of Hercules **
17:30:10.036 HHC01415I Build date: Jun  8 2023 at 15:48:12
17:30:10.036 
17:30:10.036 HHC02609I CC 31 HH 0 = reltrk 465; L1 index = 1, L2 index = 209
17:30:10.036 HHC02610I L1 index 1 = L2TAB offset 6612 (0x00000000000019D4)
17:30:10.036 
17:30:10.036 HHC02611I L2 index 209 = L2TAB entry: 8 bytes:
17:30:10.036 +0000 FE1DB000 E504E504                    ....V.V.
17:30:10.036 
17:30:10.036 HHC02613I TRKHDR offset 11542014 (0x0000000000B01DFE); length 1253 (0x000004E5) bytes
17:30:10.036 HHC02612I TRKHDR track 465:
17:30:10.036 +0000 01001F00 00                          .....
17:30:10.036 
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[00/0] kl[0] dl[8]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[01/1] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[02/2] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[03/3] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[04/4] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[05/5] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[06/6] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[07/7] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[08/8] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[09/9] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0A/10] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0B/11] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0C/12] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0D/13] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0E/14] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0F/15] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[10/16] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[11/17] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[12/18] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[13/19] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[14/20] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[15/21] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[16/22] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[17/23] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[18/24] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[19/25] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1A/26] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1B/27] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1C/28] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1D/29] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1E/30] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1F/31] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[20/32] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[21/33] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[22/34] kl[8] dl[256]
17:30:10.036 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[23/35] kl[8] dl[256]
17:30:10.052 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[24/36] kl[8] dl[256]
17:30:10.052 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[25/37] kl[8] dl[256]
17:30:10.052 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[26/38] kl[8] dl[256]
17:30:10.052 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[27/39] kl[8] dl[256]
17:30:10.052 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[28/40] kl[8] dl[256]
17:30:10.052 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[29/41] kl[8] dl[256]
17:30:10.052 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[2A/42] kl[8] dl[256]
17:30:10.052 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[2B/43] kl[8] dl[256]
17:30:10.052 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[2C/44] kl[8] dl[256]
17:30:10.052 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[2D/45] kl[8] dl[256]
17:30:10.052 HHC02601I End of track
17:30:10.052 CCKDDIAG BLDDLB.cckd32 ended; rc=0

CCKDDIAG -t -a 31 1 Q:/TK3/Dasd/BLDDLB.cckd32

17:30:18.933 CCKDDIAG BLDDLB.cckd32 started; process-id = 6880 (0x00001AE0)
17:30:18.964 HHC02499I Hercules utility CCKDDIAG - CCKD/CFBA diagnostic program - version 4.6.0.10941-SDL-g65c97fd6
17:30:18.964 HHC01414I (C) Copyright 1999-2023 by Roger Bowler, Jan Jaeger, and others
17:30:18.964 HHC01417I ** The SDL 4.x Hyperion version of Hercules **
17:30:18.964 HHC01415I Build date: Jun  8 2023 at 15:48:12
17:30:18.964 
17:30:18.964 HHC02609I CC 31 HH 1 = reltrk 466; L1 index = 1, L2 index = 210
17:30:18.964 HHC02610I L1 index 1 = L2TAB offset 6612 (0x00000000000019D4)
17:30:18.964 
17:30:18.964 HHC02611I L2 index 210 = L2TAB entry: 8 bytes:
17:30:18.964 +0000 05D32900 3F223F22                    .L......
17:30:18.964 
17:30:18.964 HHC02613I TRKHDR offset 2740997 (0x000000000029D305); length 8767 (0x0000223F) bytes
17:30:18.964 HHC02612I TRKHDR track 466:
17:30:18.964 +0000 01001F00 01                          .....
17:30:18.964 
17:30:18.964 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[00/0] kl[0] dl[8]
17:30:18.964 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[01/1] kl[0] dl[0]
17:30:18.964 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[02/2] kl[0] dl[1360]
17:30:18.964 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[03/3] kl[0] dl[0]
17:30:18.964 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[04/4] kl[0] dl[5840]
17:30:18.964 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[05/5] kl[0] dl[0]
17:30:18.964 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[06/6] kl[0] dl[27920]
17:30:18.964 HHC02601I End of track
17:30:18.964 CCKDDIAG BLDDLB.cckd32 ended; rc=0

CCKDDIAG -t -a 31 0 Q:/TK3/Dasd/BLDTGT.cckd32

17:30:37.074 CCKDDIAG BLDTGT.cckd32 started; process-id = 4320 (0x000010E0)
17:30:37.105 HHC02499I Hercules utility CCKDDIAG - CCKD/CFBA diagnostic program - version 4.6.0.10941-SDL-g65c97fd6
17:30:37.105 HHC01414I (C) Copyright 1999-2023 by Roger Bowler, Jan Jaeger, and others
17:30:37.105 HHC01417I ** The SDL 4.x Hyperion version of Hercules **
17:30:37.105 HHC01415I Build date: Jun  8 2023 at 15:48:12
17:30:37.105 
17:30:37.105 HHC02609I CC 31 HH 0 = reltrk 465; L1 index = 1, L2 index = 209
17:30:37.105 HHC02610I L1 index 1 = L2TAB offset 7341 (0x0000000000001CAD)
17:30:37.105 
17:30:37.105 HHC02611I L2 index 209 = L2TAB entry: 8 bytes:
17:30:37.105 +0000 BF6BC400 3D083D08                    .,D.....
17:30:37.105 
17:30:37.121 HHC02613I TRKHDR offset 12872639 (0x0000000000C46BBF); length 2109 (0x0000083D) bytes
17:30:37.121 HHC02612I TRKHDR track 465:
17:30:37.121 +0000 01001F00 00                          .....
17:30:37.121 
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[00/0] kl[0] dl[8]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[01/1] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[02/2] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[03/3] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[04/4] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[05/5] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[06/6] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[07/7] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[08/8] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[09/9] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0A/10] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0B/11] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0C/12] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0D/13] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0E/14] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[0F/15] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[10/16] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[11/17] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[12/18] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[13/19] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[14/20] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[15/21] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[16/22] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[17/23] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[18/24] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[19/25] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1A/26] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1B/27] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1C/28] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1D/29] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1E/30] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[1F/31] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[20/32] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[21/33] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[22/34] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[23/35] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[24/36] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[25/37] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[26/38] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[27/39] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[28/40] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[29/41] kl[8] dl[256]
17:30:37.121 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[2A/42] kl[8] dl[256]
17:30:37.137 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[2B/43] kl[8] dl[256]
17:30:37.137 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[2C/44] kl[8] dl[256]
17:30:37.137 HHC02605I Track 465 COUNT cyl[001F/31] head[0000/0] rec[2D/45] kl[8] dl[256]
17:30:37.137 HHC02601I End of track
17:30:37.137 CCKDDIAG BLDTGT.cckd32 ended; rc=0

CCKDDIAG -t -a 31 1 Q:/TK3/Dasd/BLDTGT.cckd32

17:30:47.922 CCKDDIAG BLDTGT.cckd32 started; process-id = 6032 (0x00001790)
17:30:47.937 HHC02499I Hercules utility CCKDDIAG - CCKD/CFBA diagnostic program - version 4.6.0.10941-SDL-g65c97fd6
17:30:47.937 HHC01414I (C) Copyright 1999-2023 by Roger Bowler, Jan Jaeger, and others
17:30:47.937 HHC01417I ** The SDL 4.x Hyperion version of Hercules **
17:30:47.937 HHC01415I Build date: Jun  8 2023 at 15:48:12
17:30:47.937 
17:30:47.953 HHC02609I CC 31 HH 1 = reltrk 466; L1 index = 1, L2 index = 210
17:30:47.953 HHC02610I L1 index 1 = L2TAB offset 7341 (0x0000000000001CAD)
17:30:47.953 
17:30:47.953 HHC02611I L2 index 210 = L2TAB entry: 8 bytes:
17:30:47.953 +0000 A0E3BD00 00180018                    .T].....
17:30:47.953 
17:30:47.953 HHC02613I TRKHDR offset 12444576 (0x0000000000BDE3A0); length 6144 (0x00001800) bytes
17:30:47.953 HHC02612I TRKHDR track 466:
17:30:47.953 +0000 01001F00 01                          .....
17:30:47.953 
17:30:47.953 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[00/0] kl[0] dl[8]
17:30:47.953 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[01/1] kl[0] dl[0]
17:30:47.953 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[02/2] kl[0] dl[10560]
17:30:47.953 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[03/3] kl[0] dl[0]
17:30:47.953 HHC02605I Track 466 COUNT cyl[001F/31] head[0001/1] rec[04/4] kl[0] dl[27920]
17:30:47.953 HHC02601I End of track
17:30:47.953 CCKDDIAG BLDTGT.cckd32 ended; rc=0

As you can see, Cylinder 31 Head 0 for both volumes appear to already be corrupted! (whereas Cylinder 31 Head 1 on each appears to still be intact).

Each data block on track 0 are all 256 bytes in size (I'm presuming that's the "corrupted state", yes?) whereas each block on track 1 is always a multiple of 80 bytes (which I presume is how all tracks for the dataset are supposed to look, yes?).

Sooooo....   As I said in my opening comments, it appears the volumes you sent me are already corrupted to begin with.

 

But whatever. I started Hercules and IPLed anyway and let your test do its thing. When the *00 Pause message appeared, I responded as instructed (0) and a split second later the Link ran and ended (*01 Pause).

AT THIS POINT  I then immediately shutdown Hercules and examined the shadow files for each of the two volumes to see if there were any records (any writes/updates) for Cylinder 31 Head 0 and Head 1.

There were NONE.

Here is are the cckddiag and detailed cckdmap output for each of the two shadow files:

CCKDDIAG -t -a 31 0 Q:/TK3/Dasd/BLDDLB_Shadow_1.cckd32

17:35:53.301 CCKDDIAG BLDDLB_Shadow_1.cckd32 started; process-id = 13836 (0x0000360C)
17:35:53.317 HHC02499I Hercules utility CCKDDIAG - CCKD/CFBA diagnostic program - version 4.6.0.10941-SDL-g65c97fd6
17:35:53.317 HHC01414I (C) Copyright 1999-2023 by Roger Bowler, Jan Jaeger, and others
17:35:53.317 HHC01417I ** The SDL 4.x Hyperion version of Hercules **
17:35:53.333 HHC01415I Build date: Jun  8 2023 at 15:48:12
17:35:53.333 
17:35:53.333 HHC02609I CC 31 HH 0 = reltrk 465; L1 index = 1, L2 index = 209
17:35:53.333 HHC02610I L1 index 1 = L2TAB offset 4294967295 (0x00000000FFFFFFFF)
17:35:53.333 HHC02618S L2tab for track 465 not found
17:35:53.333 HHC00007I Previous message from function 'main' at cckddiag.c(883)
17:35:53.333 CCKDDIAG BLDDLB_Shadow_1.cckd32 ended; rc=7

CCKDDIAG -t -a 31 1 Q:/TK3/Dasd/BLDDLB_Shadow_1.cckd32

17:36:06.098 CCKDDIAG BLDDLB_Shadow_1.cckd32 started; process-id = 8576 (0x00002180)
17:36:06.129 HHC02499I Hercules utility CCKDDIAG - CCKD/CFBA diagnostic program - version 4.6.0.10941-SDL-g65c97fd6
17:36:06.129 HHC01414I (C) Copyright 1999-2023 by Roger Bowler, Jan Jaeger, and others
17:36:06.129 HHC01417I ** The SDL 4.x Hyperion version of Hercules **
17:36:06.129 HHC01415I Build date: Jun  8 2023 at 15:48:12
17:36:06.129 
17:36:06.129 HHC02609I CC 31 HH 1 = reltrk 466; L1 index = 1, L2 index = 210
17:36:06.129 HHC02610I L1 index 1 = L2TAB offset 4294967295 (0x00000000FFFFFFFF)
17:36:06.129 HHC02618S L2tab for track 466 not found
17:36:06.129 HHC00007I Previous message from function 'main' at cckddiag.c(883)
17:36:06.145 CCKDDIAG BLDDLB_Shadow_1.cckd32 ended; rc=7

CCKDDIAG -t -a 31 0 Q:/TK3/Dasd/BLDTGT_Shadow_1.cckd32

17:36:21.171 CCKDDIAG BLDTGT_Shadow_1.cckd32 started; process-id = 11592 (0x00002D48)
17:36:21.186 HHC02499I Hercules utility CCKDDIAG - CCKD/CFBA diagnostic program - version 4.6.0.10941-SDL-g65c97fd6
17:36:21.186 HHC01414I (C) Copyright 1999-2023 by Roger Bowler, Jan Jaeger, and others
17:36:21.186 HHC01417I ** The SDL 4.x Hyperion version of Hercules **
17:36:21.186 HHC01415I Build date: Jun  8 2023 at 15:48:12
17:36:21.186 
17:36:21.186 HHC02609I CC 31 HH 0 = reltrk 465; L1 index = 1, L2 index = 209
17:36:21.186 HHC02610I L1 index 1 = L2TAB offset 4294967295 (0x00000000FFFFFFFF)
17:36:21.186 HHC02618S L2tab for track 465 not found
17:36:21.186 HHC00007I Previous message from function 'main' at cckddiag.c(883)
17:36:21.202 CCKDDIAG BLDTGT_Shadow_1.cckd32 ended; rc=7

CCKDDIAG -t -a 31 1 Q:/TK3/Dasd/BLDTGT_Shadow_1.cckd32

17:36:32.084 CCKDDIAG BLDTGT_Shadow_1.cckd32 started; process-id = 4116 (0x00001014)
17:36:32.116 HHC02499I Hercules utility CCKDDIAG - CCKD/CFBA diagnostic program - version 4.6.0.10941-SDL-g65c97fd6
17:36:32.116 HHC01414I (C) Copyright 1999-2023 by Roger Bowler, Jan Jaeger, and others
17:36:32.116 HHC01417I ** The SDL 4.x Hyperion version of Hercules **
17:36:32.116 HHC01415I Build date: Jun  8 2023 at 15:48:12
17:36:32.116 
17:36:32.116 HHC02609I CC 31 HH 1 = reltrk 466; L1 index = 1, L2 index = 210
17:36:32.116 HHC02610I L1 index 1 = L2TAB offset 4294967295 (0x00000000FFFFFFFF)
17:36:32.116 HHC02618S L2tab for track 466 not found
17:36:32.116 HHC00007I Previous message from function 'main' at cckddiag.c(883)
17:36:32.116 CCKDDIAG BLDTGT_Shadow_1.cckd32 ended; rc=7

CCKDMAP -v Q:/TK3/Dasd/BLDDLB_Shadow_1.cckd32

17:45:41.998 CCKDMAP -v BLDDLB_Shadow_1.cckd32 started; process-id = 10712 (0x000029D8)
17:45:42.029 HHC02499I Hercules utility CCKDMAP - Compressed dasd file map - version 4.6.0.10941-SDL-g65c97fd6
17:45:42.029 HHC01414I (C) Copyright 1999-2023 by Roger Bowler, Jan Jaeger, and others
17:45:42.029 HHC01417I ** The SDL 4.x Hyperion version of Hercules **
17:45:42.029 HHC01415I Build date: Jun  8 2023 at 15:48:12
17:45:42.029 HHC03020I
17:45:42.029 HHC03021I CCKDMAP of:     "Q:/TK3/Dasd/BLDDLB_Shadow_1.cckd32"
17:45:42.029 HHC03007I File size:      (4,803 bytes)
17:45:42.029 HHC03022I
17:45:42.029 HHC03022I   dh_devid:      CKD_S370        (32-bit CCKD shadow file)
17:45:42.029 HHC03022I   dh_heads:      15
17:45:42.029 HHC03022I   dh_trksize:    56832
17:45:42.029 HHC03022I   dh_devtyp:     0x90            (3390)
17:45:42.029 HHC03022I   dh_fileseq:    0x00
17:45:42.029 HHC03022I   dh_highcyl:    0
17:45:42.029 HHC03022I   dh_serial:
17:45:42.029 HHC03023I
17:45:42.029 HHC03023I   cdh_vrm:       0.3.1
17:45:42.029 HHC03023I   cdh_opts:      0x41
17:45:42.029 HHC03023I   num_L1tab:     66
17:45:42.029 HHC03023I   num_L2tab:     256
17:45:42.029 HHC03023I   cdh_cyls:      1113           (16,695 tracks)
17:45:42.029 HHC03023I   cdh_size:      0x00000012C3    (4,803 bytes)
17:45:42.029 HHC03023I   cdh_used:      0x00000012C3    (4,803 bytes)
17:45:42.029 HHC03023I   free_off:      0x0000000000    (old format)
17:45:42.029 HHC03023I   free_total:    0x0000000000    (0 bytes)
17:45:42.029 HHC03023I   free_largest:  0x0000000000    (0 bytes)
17:45:42.029 HHC03023I   free_num:      0
17:45:42.029 HHC03023I   free_imbed:    0
17:45:42.029 HHC03023I   cdh_nullfmt:   1               (ha r0)
17:45:42.029 HHC03023I   cmp_algo:      1               (zlib)
17:45:42.029 HHC03023I   cmp_parm:      -1              (default)
17:45:42.029 HHC03020I
17:45:42.029 HHC03043I Total active tracks    = 2 tracks
17:45:42.045 HHC03044I Avg. L2-to-track seek  = 0.001 MB
17:45:42.045 HHC03020I
17:45:42.045 HHC03040I          File offset    Size (hex)         Size  track(s)
17:45:42.045 HHC03020I
17:45:42.045 HHC03042I devhdr   0x0000000000  0x0000000200         512
17:45:42.045 HHC03042I cdevhdr  0x0000000200  0x0000000200         512
17:45:42.045 HHC03042I l1       0x0000000400  0x0000000108         264
17:45:42.045 HHC03041I ***********************************************************
17:45:42.045 HHC03042I L2LOWER  0x0000000508  0x0000000000           0
17:45:42.045 HHC03042I trk      0x0000000508  0x000000016D         365  2
17:45:42.045 HHC03042I l2       0x0000000675  0x0000000800        2048  0 - 255
17:45:42.045 HHC03041I ***********************************************************
17:45:42.045 HHC03042I L2UPPER  0x0000000D08  0x0000000000           0
17:45:42.045 HHC03042I trk      0x0000000E75  0x000000044E        1102  1
17:45:42.045 HHC03042I eof      0x00000012C3  0x00000012C3        4803
17:45:42.045 HHC03020I
17:45:42.045 CCKDMAP -v BLDDLB_Shadow_1.cckd32 ended; rc=0

CCKDMAP -v Q:/TK3/Dasd/BLDTGT_Shadow_1.cckd32

17:45:58.287 CCKDMAP -v BLDTGT_Shadow_1.cckd32 started; process-id = 7576 (0x00001D98)
17:45:58.318 HHC02499I Hercules utility CCKDMAP - Compressed dasd file map - version 4.6.0.10941-SDL-g65c97fd6
17:45:58.318 HHC01414I (C) Copyright 1999-2023 by Roger Bowler, Jan Jaeger, and others
17:45:58.318 HHC01417I ** The SDL 4.x Hyperion version of Hercules **
17:45:58.318 HHC01415I Build date: Jun  8 2023 at 15:48:12
17:45:58.318 HHC03020I
17:45:58.318 HHC03021I CCKDMAP of:     "Q:/TK3/Dasd/BLDTGT_Shadow_1.cckd32"
17:45:58.318 HHC03007I File size:      (725,428 bytes)
17:45:58.318 HHC03022I
17:45:58.318 HHC03022I   dh_devid:      CKD_S370        (32-bit CCKD shadow file)
17:45:58.318 HHC03022I   dh_heads:      15
17:45:58.318 HHC03022I   dh_trksize:    56832
17:45:58.318 HHC03022I   dh_devtyp:     0x90            (3390)
17:45:58.318 HHC03022I   dh_fileseq:    0x00
17:45:58.318 HHC03022I   dh_highcyl:    0
17:45:58.318 HHC03022I   dh_serial:
17:45:58.318 HHC03023I
17:45:58.318 HHC03023I   cdh_vrm:       0.3.1
17:45:58.318 HHC03023I   cdh_opts:      0x41
17:45:58.318 HHC03023I   num_L1tab:     66
17:45:58.318 HHC03023I   num_L2tab:     256
17:45:58.318 HHC03023I   cdh_cyls:      1113           (16,695 tracks)
17:45:58.318 HHC03023I   cdh_size:      0x00000B11B4    (725,428 bytes)
17:45:58.318 HHC03023I   cdh_used:      0x00000AA94A    (698,698 bytes)
17:45:58.318 HHC03023I   free_off:      0x0000027F95    (NEW format)
17:45:58.318 HHC03023I   free_total:    0x000000686A    (26,730 bytes)
17:45:58.318 HHC03023I   free_largest:  0x00000059E3    (23,011 bytes)
17:45:58.318 HHC03023I   free_num:      4
17:45:58.318 HHC03023I   free_imbed:    0
17:45:58.318 HHC03023I   cdh_nullfmt:   1               (ha r0)
17:45:58.318 HHC03023I   cmp_algo:      1               (zlib)
17:45:58.318 HHC03023I   cmp_parm:      -1              (default)
17:45:58.318 HHC03020I
17:45:58.318 HHC03043I Total active tracks    = 47 tracks
17:45:58.318 HHC03044I Avg. L2-to-track seek  = 0.130 MB
17:45:58.318 HHC03020I
17:45:58.318 HHC03040I          File offset    Size (hex)         Size  track(s)
17:45:58.318 HHC03020I
17:45:58.318 HHC03042I devhdr   0x0000000000  0x0000000200         512
17:45:58.318 HHC03042I cdevhdr  0x0000000200  0x0000000200         512
17:45:58.318 HHC03042I l1       0x0000000400  0x0000000108         264
17:45:58.318 HHC03041I ***********************************************************
17:45:58.318 HHC03042I L2LOWER  0x0000000508  0x0000000000           0
17:45:58.318 HHC03042I trk      0x0000000508  0x00000026D2        9938  2755
17:45:58.333 HHC03042I l2       0x0000002BDA  0x0000000800        2048  2560 - 2815
17:45:58.333 HHC03042I trk      0x00000033DA  0x00000033A5       13221  2756
17:45:58.333 HHC03041I ***********************************************************
17:45:58.333 HHC03042I L2UPPER  0x0000003D08  0x0000000000           0
17:45:58.333 HHC03042I trk      0x000000677F  0x0000003192       12690  2757
17:45:58.333 HHC03042I trk      0x0000009911  0x0000002FAB       12203  2758
17:45:58.333 HHC03042I trk      0x000000C8BC  0x00000033BB       13243  2759
17:45:58.333 HHC03042I trk      0x000000FC77  0x0000003485       13445  2760
17:45:58.333 HHC03042I trk      0x00000130FC  0x0000003242       12866  2761
17:45:58.333 HHC03042I trk      0x000001633E  0x00000031A7       12711  2762
17:45:58.333 HHC03042I trk      0x00000194E5  0x0000003364       13156  2763
17:45:58.333 HHC03042I trk      0x000001C849  0x000000338C       13196  2764
17:45:58.333 HHC03042I trk      0x000001FBD5  0x0000002616        9750  2766
17:45:58.333 HHC03042I trk      0x00000221EB  0x00000032FF       13055  2765
17:45:58.333 HHC03042I trk      0x00000254EA  0x000000140A        5130  2767
17:45:58.333 HHC03042I trk      0x00000268F4  0x00000016A1        5793  2768
17:45:58.333 HHC03042I free     0x0000027F95  0x00000003B5         949
17:45:58.333 HHC03042I l2       0x000002834A  0x0000000800        2048  0 - 255
17:45:58.333 HHC03042I trk      0x0000028B4A  0x000000190B        6411  2535
17:45:58.333 HHC03042I l2       0x000002A455  0x0000000800        2048  2304 - 2559
17:45:58.333 HHC03042I free     0x000002AC55  0x000000071C        1820
17:45:58.333 HHC03042I trk      0x000002B371  0x00000039C4       14788  2004
17:45:58.333 HHC03042I l2       0x000002ED35  0x0000000800        2048  1792 - 2047
17:45:58.333 HHC03042I trk      0x000002F535  0x0000002DAE       11694  2005
17:45:58.333 HHC03042I trk      0x00000322E3  0x00000025FC        9724  2006
17:45:58.333 HHC03042I trk      0x00000348DF  0x0000002116        8470  2007
17:45:58.333 HHC03042I trk      0x00000369F5  0x0000001AA3        6819  2008
17:45:58.333 HHC03042I trk      0x0000038498  0x0000003148       12616  2003
17:45:58.333 HHC03042I trk      0x000003B5E0  0x0000000487        1159  1995
17:45:58.333 HHC03042I trk      0x000003BA67  0x0000001300        4864  2009
17:45:58.333 HHC03042I trk      0x000003CD67  0x0000003218       12824  1107
17:45:58.333 HHC03042I l2       0x000003FF7F  0x0000000800        2048  1024 - 1279
17:45:58.333 HHC03042I trk      0x000004077F  0x00000012BC        4796  900
17:45:58.333 HHC03042I l2       0x0000041A3B  0x0000000800        2048  768 - 1023
17:45:58.333 HHC03042I free     0x000004223B  0x00000059E3       23011
17:45:58.333 HHC03042I trk      0x0000047C1E  0x0000005C66       23654  1380
17:45:58.333 HHC03042I l2       0x000004D884  0x0000000800        2048  1280 - 1535
17:45:58.333 HHC03042I trk      0x000004E084  0x0000007DDE       32222  1381
17:45:58.333 HHC03042I trk      0x0000055E62  0x0000005E59       24153  1382
17:45:58.333 HHC03042I trk      0x000005BCBB  0x00000075B9       30137  1383
17:45:58.333 HHC03042I trk      0x0000063274  0x00000068CE       26830  1384
17:45:58.333 HHC03042I trk      0x0000069B42  0x0000004ACB       19147  1385
17:45:58.333 HHC03042I trk      0x000006E60D  0x0000004915       18709  1386
17:45:58.333 HHC03042I trk      0x0000072F22  0x00000054FC       21756  1387
17:45:58.333 HHC03042I trk      0x000007841E  0x000000471B       18203  1388
17:45:58.333 HHC03042I trk      0x000007CB39  0x00000059A0       22944  1389
17:45:58.333 HHC03042I trk      0x00000824D9  0x0000005A75       23157  1390
17:45:58.333 HHC03042I trk      0x0000087F4E  0x0000005023       20515  1391
17:45:58.333 HHC03042I trk      0x000008CF71  0x00000056BC       22204  1392
17:45:58.333 HHC03042I trk      0x000009262D  0x0000004E55       20053  1393
17:45:58.333 HHC03042I trk      0x0000097482  0x000000480F       18447  1394
17:45:58.333 HHC03042I trk      0x000009BC91  0x0000004DD6       19926  1395
17:45:58.333 HHC03042I trk      0x00000A0A67  0x0000003CE2       15586  1396
17:45:58.333 HHC03042I trk      0x00000A4749  0x0000003413       13331  1397
17:45:58.333 HHC03042I free     0x00000A7B5C  0x00000003B6         950
17:45:58.333 HHC03042I trk      0x00000A7F12  0x0000001599        5529  1200
17:45:58.333 HHC03042I trk      0x00000A94AB  0x000000102B        4139  901
17:45:58.333 HHC03042I trk      0x00000AA4D6  0x0000000364         868  1
17:45:58.333 HHC03042I trk      0x00000AA83A  0x000000697A       27002  1108
17:45:58.333 HHC03042I eof      0x00000B11B4  0x00000B11B4      725428
17:45:58.333 HHC03020I
17:45:58.333 CCKDMAP -v BLDTGT_Shadow_1.cckd32 ended; rc=0

As you can see, neither Cylinder 31 Head 0 nor Head 1 was ever written to on EITHER of the two volumes in question, which tells me the problem you described is simply NOT HAPPENING.

No "corruption" is taking place.

As I said in my opening comment at the beginning (top) of this GitHub comment, I suspect the reason it might not be happening MAY be because the corruption has already occurred at some point in the (distant?) past.

So I ask again whether it's possible for you to send me copies of the two volumes (BLDDLB and BLDTGT) that are not already corrupted? You mentioned earlier I believe that the very first part of your test/procedure that supposedly causes the corruption begins with first allocating all of the needed datasets on an "empty volume" (volumes?) such that the test/procedure then always starts "clean". Then, you claim, at some point, the AHELP(?) dataset becomes corrupted:

Ok, I set up the CCW trace, and believe I found something interesting. A little background of this JOB: The first JOB allocates a series of data sets across 9 mod 3 3390 volumes. The data set in question always ends up in the same cylinder on device 3004 since the volume always starts out empty. This data set always gets corrupted.

Well, I'm not seeing that.  :(

What I'm seeing is an already corrupted dataset that is never updated at all, causing it to remain corrupted (apparently forever?).

That's it. That's my testing and analysis of the situation so far.

Soooo..... Where do we go from here?

Do I need to wait for you to send me updated (un-pre-corrupted) dasd volumes to hopefully catch the corruption taking place?

Or do you think that MAYBE, at some point in the (distant?) past, you accidentally did something that caused the corruption and simply never noticed it until now? *`()`**

I await your reply.

Thanks.


*`()`** Either because you failed to use shadow files or failed to make backups of your Hercules dasds so you could restore them if anything went wrong like it obviously did at some point?

davekreiss commented 1 year ago

First about "corruption". There is no corruption of the PC DASD file. The corruption is Hercules updating the wrong track.

In the z/OS "corruption", the define extent says the I/O is limited to CCHH range 0062-0000 to 006B-000E ("CCW Trace Hercules log.log", line 541393):

   18:00:51 HHC01315I 0:3004 CHAN: ccw 63400010 7F29D138=>10CC0000 00000000 00620000 006B000E

However, within the domain of that define extent, Hercules wrote to CCHH 001F 0000 ("CCW Trace Hercules log.log", line 541753):

  18:00:51 HHC00440I 0:3004 CKD file C:/R/BLDDLB.192: updating cyl 31 head 0 record 1 kl 8 dl 256

So, the data written at CCHH 001F 0000 "corrupts" the data set at that disk address, namely the AHELP data set, because it belongs within the data set in the define extent that is AOSB3.

If it helps, what is written is a directory entry. A directory entry is a keyed file with kl=8 and dl=256. What is being written is a load module directory entry, but it is written to the AHELP data set directory.

Here is a subset of the map of that volume posted above on Jun 20th:

                     BLDDLB (3390-1)
--Start-- ---End--- ----Extent---- Data Set Name
CCCC HHHH CCCC HHHH     Size     #
001F 0000 0020 000E        2 C   1 DSK.MVS.NEW.AHELP   <-- the updating of cyl 31 (X'1F') outside the define extent domain
0062 0000 006B 000E       10 C   1 DSK.MVS.NEW.AOSB3   <-- the define extent maps to this data set

Back to TK3 - I know the wrong track is being updated but I can't locate where it is happening. There is no define extent in MVS 3.8 channel programs, and I spent a lot of time over the weekend trying to find where the wrong track is updated in the trace but got nowhere.

Concerning the TK3 DASD files there are no DASD PC file corruption. As I said I start with clean volumes, and they are allocated and updated via a batch stream of SMP which copies/updates/assembles/links to those volumes. No shadows as I can destroy/recreate/whatever these volumes at will with no data loss.

Unless I can identify where the wrong track is updated, I think the TK3 doesn't help isolate the problem.

Fish-Git commented 1 year ago

In the z/OS "corruption", the define extent says the I/O is limited to CCHH range 0062-0000 to 006B-000E ("CCW Trace Hercules log.log", line 541393):

   18:00:51 HHC01315I 0:3004 CHAN: ccw 63400010 7F29D138=>10CC0000 00000000 00620000 006B000E

However, within the domain of that define extent, Hercules wrote to CCHH 001F 0000 ("CCW Trace Hercules log.log", line 541753):

  18:00:51 HHC00440I 0:3004 CKD file C:/R/BLDDLB.192: updating cyl 31 head 0 record 1 kl 8 dl 256

Yes, I originally saw that. But when I tried it on your TK3 system that you sent me, I decided to try and catch the bug from a different angle rather than wade through thousands of lines of a CCW trace. And as I've proven, the bug is simply NOT happening!   .......   (Or at least not as I understood what the the bug was, which I can now see was very wrong.)

Back to TK3 - I know the wrong track is being updated but I can't locate where it is happening. There is no define extent in MVS 3.8 channel programs, and I spent a lot of time over the weekend trying to find where the wrong track is updated in the trace but got nowhere.

And THAT is what I originally failed to catch (failed to realize/understand).

Based on earlier comments:

Those seem corrupted data sets are consistent, ...

... is very consistent through all runs. It is reproducible and always occurs between the WTORs.

p.s. I don't suppose there's a fast and easy way for some idiot like me (who knows nothing about MVS / z/OS) to be able to recreate/reproduce this problem, is there? If there is, I'd very much appreciate you passing it along! It would sure help to expedite resolving this issue! Thanks!

I could construct a complete TK3 system with all it takes to run it with simple commands but that would take a few days at a minimum.

I was thus under the impression that the problem was reproducible, as described, on TK3 too. Or, perhaps more correctly, that the problem was always at the same place (same CCHH) on both (i.e. at the same Cyl 31 Head 0 on both z/OS and TK3 too).

I now understand that that belief was horribly wrong. I now understand that the corruption always occurs at the same place (cyl 31 head 0) only on z/OS 2,1!  (Yes? Is that correct?)

But, while the "same corruption" also always occurs on TK3 too, it doesn't happen at the same place (at the same CCHH) on TK3!  (Yes? True? Am I understanding things correctly now?)

Soooo............  Where does that leave us?

Well, apparently it leaves me with a TK3 system that supposedly can reproduce the corruption, but precisely where that corruption occurs (i.e. at what cylinder and head) we currently don't know yet.  :(

QUESTION:  Does this corruption always occur in the AHELP directory? If you sent me a volume map like the one you provided before on June 20 (but it was for your z/OS 2.1 system, yes?) but for this TK3 system you sent me, wouldn't that tell us the CCHH of the AHELP dataset and thus where the problem should be occurring? Yes? No?

As you can tell, I'm trying like hell to somehow get this TK3 system to still be useful to me for catching this bug. I'm honestly quite disappointed that we're unable to pinpoint precisely where (the precise CCHH) the corruption is occurring on TK3. That would make catching this bug so much easier!

QUESTION #2:

I happen to have z/OS 2.1 on my system. It's a stock/vanilla ADCD system that IBM gave me. (No! Really!) Is it possible to reproduce the problem on my z/OS 2.1 system via, say, the simple submission of a job stream you could send me? Is that possible?

Or in lieu of that, how difficult/complicated would it be for you to write up a set of detailed instructions for how I could reproduce the problem here on my system? Possible? Or impossible (i.e. too difficult)?

FINALLY,  I would be remiss to mention to you that, since you originally reported this problem, several important fixes have been committed to our development repository for the E7 (Prefix), 63 (Define Extent), 47 (Located Record), 4B (Locate Record Extended) and 27 (Perform Subsystem Function) CCW logic in Hercules. (As it turns out, that's what was delaying me from getting started on YOUR problem.)

Are you able to build Hercules for yourself from source? We have tools available (a nice script called "Hercules Helper" written by Bill Lewis, one of our developers) that makes doing so incredibly easy. You might consider building the latest develop branch of Hercules to see if it fixes your problem.  (It's worth a shot anyway!)

If it doesn't, then we can maybe try the z/OS 2.1 technique. Otherwise it looks like we're stuck with the TK3 technique, and I'm going to have to wait for you to determine where specifically (which cyl/hd) the corruption is occurring on TK3.

Does that sound reasonable?

I want you to know I really appreciate your patience with me on this, Dave. I really want to nail this issue. I really do! Data corruption is not a good thing (duh!), but I can't do it without your help, so I really appreciate your continued patience with me, and hope that we can work together to figure this damn thing out.

Thanks!

Fish-Git commented 1 year ago

Another quick question:

You mentioned that the corruption also occurs on TK3. Yes? My question to you then is, how may I visually see evidence of this corruption for myself on this TK3 system you sent me?

When I attach a second terminal to TK3, I notice that a TK3 logon screen appears. Only problem is, I don't know what userid and password to use to login with!

Then, once logged in, where do I go? What menu entries do I need to select to reach the screen where I can then actually see the corrupted directory entry? How can I prove to myself with my own eyes that the corruption has indeed occurred? That is to say, how can I manually verify the corruption for myself?

Note: I'm already familiar with z/OS's various TSO screens (I'm no expert, mind you, but I can get around okay!), but I'm unfamiliar with TK3's various TSO screens. I suspect they're quite a bit different. Yes?

After providing me with a userid and password that I can use to logon with, can you then walk me through the steps needed to reach the TSO screen where evidence of the corruption can easily (and obviously?) be seen?

Thanks!

davekreiss commented 1 year ago

The TK3 system is slightly broken but just enter HERC01 as user id and there is no need for password. A report showing the errors are a part of the job stream that encountered the error. I enclose an example further below.

Get into edit on HERC01.TEST.CNTL (RFE option 2) and pick FMTVTOC, my own VTOC formatting program, to get the CCHH extents of data sets on the BLDTGT volume. It will format the VTOC for the volume selected.

Sample EDIT (option 2) input below:

OTHER DATA SET:                         
   DATA SET NAME ===> 'herc01.test.cntl'

So, for example MVSSRC.BLD.LPALIB has one and only extent. That is also in the enclosed report:

  DS1EXT1...... 81 00 003C 0000 004F 000E.

Of course, you could do a LISTVTOC via IEHLIST sample in HERC01.TEST.CNT' member IEHLIST, which lists MVSSRC.BLD.NUCLEUS, but change to LPALIB if needed.

Notice cylinder/track numbers are decimal:

            EXTENTS  NO  LOW(C-H)   HIGH(C-H)
                      0   105   0     126  14

Output goes to the file allocated to device 00E.

Hope that helps. See "Some Sample Reports.txt" enclosed:


P.S. I am a rookie when it comes to git and tried to follow the instructions for build and failed off the bat. What/where/how do I use wget to get what I need to build - "wget https://github.com/wrljet/hercules-helper-windows/archive/master.zip". I see Linux examples but here again a rookie when it comes to Linux. (No Linux systems in this house either - no idea what to do with it.)

Reading instructions tells me I don't have much of anything on my Windows 11 system to do any of this - REXX or VS Studio or git. So, it isn't quite so simple to this rookie. Almost as complex as z/OS. Do you just have a windows version of your latest build I could just upload?

Fish-Git commented 1 year ago

The TK3 system is slightly broken but just enter HERC01 as user id and there is no need for password. A report showing the errors are a part of the job stream that encountered the error. I enclose an example further below.

Thanks. I'll take a look.

P.S. I am a rookie when it comes to git and tried to follow the instructions for build and failed off the bat. . . . Reading instructions tells me I don't have much of anything on my Windows 11 system to do any of this - REXX or VS Studio or git. So, it isn't quite so simple to this rookie.

I was under the impression that Bill's "Hercules Helper" script was supposed to do all of that for you! I was under the impression that Hercules Helper would automatically install ALL of the needed pieces, including git as well as Visual Studio itself and anything else too!

Bill? (@wrljet)

Can you help Dave with this?

In the mean time, I will upload a current Windows build of the latest development version of Hercules to my web site and send you the URL.

*What's your direct email address?  Mine is "fish@softdevlabs.com"_`()`_**


*`()`** Fuck the spam harvesters! All my mail goes through gmail, which does a pretty darn good job of identifying and filtering out all the garbage, so I don't mind posting my email address in a public forum any damn time I want.

Fish-Git commented 1 year ago

Get into edit on HERC01.TEST.CNTL (RFE option 2) and pick FMTVTOC, my own VTOC formatting program, to get the CCHH extents of data sets on the BLDTGT volume. It will format the VTOC for the volume selected.

Okay.................. But what am I looking for?

You have the exact same TK3 system I do. Yes? Where is the visual evidence of the corruption? What am I looking for?

You still haven't answered that.

davekreiss commented 1 year ago

I did some multiple attempts to make a short z/OS, with as little data to push around, but it was all futile. Only the real job stream does recreate the problem.

The program that is involved is purely the linkage editor (binder) PGM=IEWL, either TK3 or 2.1. As I said before, TK3 symptoms and location are different than 2.1. The 2.1 error is quite obvious - writing to wrong CCHHR per trace. The TK3 is not so clear where things go wrong. But both are very consistent.

So, the output looks something like this from FMTVTOC:

Format 1 DSCB CCHHR=0000 0001 0A
      +0 DS1DSNAM..... MVSSRC.BLD.LPALIB
     +3B DS1NOEPV..... 01                           <--- in one extent
     +69 DS1EXT1...... 81 00 003C 0000 004F 000E    <--- that is the data sets only extent

And that is where the data set that got damaged is.

The damage I think is in the directory, but I haven't quite figured out where yet.

An analysis shows a number (210) of members whose alias exist but the real is no longer present. I am going to compare the two directories - TK3 and 2.1 to see if that helps identify where the problem is since the problem doesn't on 2.1 in the LPALIB data set.

*** WARNING *** AHLACFV  is an alias but has no real entry
*** WARNING *** AHLDMPMD is an alias but has no real entry
*** WARNING *** AHLDSP   is an alias but has no real entry

You can reach me at my gmail email address davekreiss with the normal gmail stuff after.

Fish-Git commented 1 year ago

I'm still confused about this issue.

Let's step back and start over.

You created this issue because you wanted to learn how to do a cckd trace. Correct?

Okay. So the next logical question is, WHY did you want to do a cckd trace?

Answer: Because you claim corruption was occurring, and you wanted to catch it.

Okay. So the next logical question becomes, how did you discover this corruption? How did you notice it? How did you detect it?

THAT is the question I am now asking!

I am not interested in the traces you've done that you claim provides evidence that Hercules "is doing the wrong thing".

I am only interested at this point, in actually seeing actual hard evidence of the corruption. Traces are not hard evidence to me. Hercules's tracing could be faulty and misleading. Hercules's CCKD logic uses multiple threads to improve efficiency. Hercules also has special logic that, when it gets very busy (i.e. "is under stress"), it delays writing tracks to the actual dasd until later when it is not so busy. Instead, it keeps them in cache, so if the same track it updated again (or a track needs to be read), it can retrieve it from cache. And of course it also has logic that when its cache becomes full (or passes a certain threshold), it begins flushing its oldest cache entries out to disk to make room for new cache entries. It is very complex.

Having looked at many traces over the years (and you've seen them too!), one thing I've noticed is that it does not report which thread is doing the tracing. It might be that what you are seeing in your ckd traces is Hercules simply doing a stressed write. That "rogue" seek to cyl 31 head 0 that you saw could have been Hercules simply flushing its cache from an old write that was done long ago in the past. Maybe. I don't know. Because it's trace messages do not include the thread id.

THAT is why I cannot fully and completely trust a CCW trace as evidence of corruption occurring.

And that is why I want to see FOR MYSELF the actual corruption. I want to physically examine the actual record(s) on the dasd that are supposedly corrupted. I want to see it with my own eyes!

I want you to tell me how YOU know "it's" corrupted. (whatever "it" happens to be).

I want to know what a good, uncorrupted data block looks like, and what a corrupted data block looks like.

I want to know basically how you DISCOVERED the corruption, way back, long ago, that caused you to embark on this journey. I want you to tell me what it was that you saw, or noticed, or discovered that caused you to say to yourself: "What?! That's not right! That's wrong! That's been corrupted!"

I want you to tell me THAT. I want to see what YOU originally saw year(s) ago that told you that corruption was occurring.

THAT is what I am wanting to know today.

Do I need to examine each member of each dataset to verify "they look okay"? And then maybe see one of them "that looks wrong"? Do I need to do that? Because I will!

We have a utility called dasdcat that can dump or display each member of a given dataset. It reads the dataset's directory to know where the member is on disk, and reads it and then displays or dumps it. If a directory entry for a given dataset is corrupted, I would expect it to fail. Yes?

Do you understand now what I am asking for?

I am looking for a way to detect the corruption that does NOT involve looking at a CCW trace. Or a GTF trace. Or ANY trace. Instead, I want to actually SEE the corruption for myself. I want to be able to detect it when it has occurred.

Can you help me with that?

Please?

Fish-Git commented 1 year ago

You can reach me at my gmail email address davekreiss with the normal gmail stuff after.

Email sent with download URL for Hercules version 4.7.0.10982-SDL-DEV-g81a7bf7f.

Please give it a try.

If we're LUCKY, it just might fix your problem?   <me: fingers crossed>

wrljet commented 1 year ago

P.S. I am a rookie when it comes to git and tried to follow the instructions for build and failed off the bat. . . . Reading instructions tells me I don't have much of anything on my Windows 11 system to do any of this - REXX or VS Studio or git. So, it isn't quite so simple to this rookie.

I was under the impression that Bill's "Hercules Helper" script was supposed to do all of that for you! I was under the impression that Hercules Helper would automatically install ALL of the needed pieces, including git as well as Visual Studio itself and anything else too!

Bill? (@wrljet)

Can you help Dave with this?

Sure

Dave will have to show me exactly what he tried to do.

Bill

wrljet commented 1 year ago

The TK3 system is slightly broken but just enter HERC01 as user id and there is no need for password. A report showing the errors are a part of the job stream that encountered the error. I enclose an example further below.

Just my editorial comment... Stuff like this that might require running several or many times is much easier to work with if JCL can just be sent in through the card reader, and not involve TSO or 3270s in any way.

Bill

davekreiss commented 1 year ago

So, instructions first step says:

or, if you don't have git, or simply prefer: wget https://github.com/wrljet/hercules-helper-windows/archive/master.zip

No clue where to do this command. I tried a command prompt on my windows 11 system since it looked like a command but no luck. Not sure what to do with wget. Treat me like I am in kindergarten or maybe beginner 1st grade.

wrljet commented 1 year ago

OK, you can also paste that URL into a web browser and let it download it for you.

wrljet commented 1 year ago

BTW that wget suggestion was taken from the *nix instructions, without thinking on my part. I will correct/improve that first step.

davekreiss commented 1 year ago

Ok @wrljet I unzipped the mentioned zip file followed instructions and did a successful build. Thanks for assistance.

@Fish-Git Took your zipped Hercules 4.7 installed and ran both 2.1 and TK3 with similar results as related to 2.1: Damage to AHELP.

I also used the above Hercules build (which I suspect is identical to your zipped Hercules) with the same 2.1 and TK3 results.

Back to TK3 damage: First, a confession. I believe the "damage" was self-inflicted on TK3. I can find no sign of a directory block being written to the wrong CCHH as in the 2.1 situation.

However, the 2.1 damage is real. I have dumped the tracks (using DFDSS in the OS) in question, and they are indeed in the wrong extent as has been described. Not sure how to get a dump of a dasd image using any Hercules utility.

So how do I know the data (directory entry) is written to the wrong CCHH?

The AHELP data set is a DSORG=PO RECFM=FB LRECL=80 "text" data set which contains MVS help members. As is case with MVS 3.8 there is little to no extra directory data (user data). This data set directory starts at CCHH 001F 0000.

The AOSB3 data set is a DSORG=PO RECFM=U load library. Load modules contain much more directory "user data". A set of these members' directory entries for this library is written to the directory for AHELP. This is identified by the member names and the load library user data in the directory entries. This directory starts at CCHH 0062 0000.

Back to traces we have seen the define extent covers the AOSB3, but yet we see in CCW trace (even though as you say these trace entries aren't reliable) showing that I/O occurred to CCHH 001F 0000, and sure enough, a physical dump show these directory entries are there via a DFDSS dump track.

I thought I could reproduce this by simply running a link edit using the same input/output outside of the SMP job. Didn't produce the update to wrong CCHH. Maybe what I suspect goes back to your comment about caching and "multi tasking" going on in Hercules.

Using TK3 seems to be out of the question, as no reproduction seems to occur there. Anyway, that is where I am at this point. Hope this clarifies your questions. Let me know if you need more info and how we can approach reproduction and or diagnosing.

Fish-Git commented 1 year ago

The AHELP data set is a DSORG=PO RECFM=FB LRECL=80 "text" data set which contains MVS help members. As is case with MVS 3.8 there is little to no extra directory data (user data).

The AOSB3 data set is a DSORG=PO RECFM=U load library. Load modules contain much more directory "user data".

Okay........ So what you are saying is (I want to understand this!), the individual directory entries for the two different types of datasets look very different from each other. That YOU can look at a hex dump of a 256-byte directory block, and easily tell whether it is a directory block for a "text"-type dataset containing "text"-type directory entries, and a 256-byte directory block for a "load library"-type dataset containing "load library"-type directory entries. Yes? Is that what you are saying? That the directory blocks and their individual directory entries within them, look different? One type contains more/different information than the other? Yes?

That if we do a hex dump of all of the 256-byte directory blocks for the "text"-type AHELP dataset after the bug/corruption has occurred, that you can tell which of those 256-byte directory blocks are damaged/wrong. Yes? That is what you are claiming, yes?

So okay then... This is how I would like to proceed:

  1. Do a Hercules dasdcat of the z/OS 2.1's AHELP dataset to list all of its directory entries (just their names) BEFORE you run your job, so we can see what members are supposed to be in that dataset:

          dasdcat -i Q:/TK3/Dasd/BLDDLB.cckd32 MVSSRC.BLD.AHELP/? > dasdcat-before.log

  1. Do a Hercules dasdpdsu of the z/OS 2.1's AHELP dataset to "unload" all of its members to an output folder (Windows directory) of your own choosing BEFORE you run your job, so we can see what each member looks like. This should verify that each of the dataset's directory entries are all intact/accurate:

          dasdpdsu Q:/TK3/Dasd/BLDDLB.cckd32 MVSSRC.BLD.AHELP ascii C:/ahelp/before > dasdpdsu-before.log

  1. Do a Hercules cckddiag dump of the z/OS 2.1's AHELP dataset's directory blocks (cylinder 31 head 0, Yes?) BEFORE you run your job, so we can see what "good" directory blocks should look like.

          cckddiag -t -x -a 31 0 Q:/TK3/Dasd/BLDDLB.cckd32 > cckddiag-before.log

  1. Then run your job stream that causes the corruption.

  2. Then repeat steps 1-3 afterwards, but specifying different output log files for each of them, and a different output Windows folder name for step 2 too of course.

  If I am understanding you correctly, then you should be able to point me to one of the AHELP dataset's directory blocks dumped by cckddiag in step 3, and tell me:

        "See this one right here, Fish?  It's WRONG!  THIS directory block has been corrupted!  Do you see it?"

You should be able to do that, correct?

Futhermore, we should also be able to see that the list of the members in the AHELP dataset displayed by dasdcat in step 1 has changed. That there are directory entries (member names) listed that weren't there before, and vice-versa: that some are missing. Yes?

And finally, we should also be able to see that the actual contents of some of the members has changed as well, as "unloaded" by dasdpdsu in step 2. Yes?

  I am currently working on the changes necessary to also report the "thread id" in all of Hercules's CCKD trace messages. Once I am done with that, I will then have you rebuild your Hercules so that you can do another CCW trace for me, but this time to a trace capture file that you can then send to me, which is much more debugging-friendly than the tracing we've been doing.

But don't bother with the traces yet. Wait until I make my changes first and you rebuild Hercules. I'll let you know when they're ready so you can do that. For now, just concentrate on doing the above for me please. I'd really appreciate it, Dave.   I really, really want to see this corruption for myself!

I trust you that it's actually happening.

But I want to verify it. For myself. With my own eyes.

I am finding it incredibly hard to believe that such a serious bug in Hercules -- one that causes dasd corruption! -- could remain undetected for all these years. I would think other people would be complaining left and right. I mean, that code has been exercised (executed) literally many billions of times! Perhaps even trillions!

So I want proof.  Hard proof.  Undeniable proof.

Once I have that, then the detailed tracing to a trace-file that I will eventually have you do for me, should provide me with enough information that I should then (hopefully!) be able to manually reproduce the problem at will on my own system.

Once I'm able to do that, once I'm able to reproduce the problem on my own system, then finding and correcting the bug should hopefully be fairly simple and straightforward.

But that of course remains to be seen.  ;-)

I really appreciate your patience with me, Dave. I promise you, we will get to the bottom of whatever the frick is going on.

Or I'll die trying. One or the other.  ;-)

Fish-Git commented 1 year ago

Not sure how to get a dump of a dasd image using any Hercules utility.

cckddiag is your best bet.  Refer to step 3 in my previous reply above.

davekreiss commented 1 year ago

So, I have run test as prescribed with additional information. After the first two sets of diagnostics, I shutdown the OS and then performed steps 1-3 again.

Contents of the above .zip file:

Let me know what else we need to do; I think we are getting a little closer to what's going on.

Fish-Git commented 1 year ago

FYI:  I'm seeing a possible hole (bug) in I/O queue processing that might explain things. Maybe. Am researching...

Fish-Git commented 1 year ago

(EDIT: the below described bug was determined to not exist. See my followup comment further below.)


Mark (@dasdman),  (and anyone/everyone else too!)

Please check me on this before I spend a lot of time and effort to try and create a test program to trigger it:

I believe I may have found a hole (bug) in Hercules's I/O request handling that could allow two separate I/O requests for the same device to execute simultaneously.

The condition under which it looks like it could occur is simply the guest issuing a subsequent SSCH before the previous one, already underway (i.e. still running/active) has completed, such that we then have two I/O requests executing simultaneously on the same device, which needless to say is not good given that processing an I/O request necessarily involves updating fields in DEVBLK, thereby causing the two I/Os to interfere with each other.

Here's how it looks like it can easily happen under the described conditions: *`()`**

SSCH is issued and begins processing, eventually reaching the point in channel.c where the device's "start" function is called to process the actual CCW request:

/*-------------------------------------------------------------------*/
/* B233 SSCH  - Start Subchannel                                 [S] */
/*-------------------------------------------------------------------*/
DEF_INST( start_subchannel )
{
...
    /* Start the channel program and set the condition code */
    regs->psw.cc = ARCH_DEP( startio )( regs, dev, &orb );

 

/*-------------------------------------------------------------------*/
/* START A CHANNEL PROGRAM                                           */
/* This function is called by the SIO and SSCH instructions          */
/*-------------------------------------------------------------------*/
/* Input                                                             */
/*      dev     -> Device control block                              */
/*      orb     -> Operation request block                           */
/* Output                                                            */
/*      The I/O parameters are stored in the device block, and a     */
/*      thread is created to execute the CCW chain asynchronously.   */
/*      The return value is the condition code for the SIO or        */
/*      SSCH instruction.                                            */
/* Note                                                              */
/*      For S/370 SIO, only the protect key and CCW address are      */
/*      valid, all other ORB parameters are set to zero.             */
/*-------------------------------------------------------------------*/
int
ARCH_DEP(startio) (REGS *regs, DEVBLK *dev, ORB *orb)
{
...
        /* Return condition code 2 if device is busy */
#if defined( OPTION_SHARED_DEVICES )
        if (unlikely((dev->busy && dev->shioactive == DEV_SYS_LOCAL)
            || dev->startpending))
#else
        if (unlikely((dev->busy)
            || dev->startpending))
#endif
        {
            RELEASE_DEVLOCK( dev );
...
            return 2;
        }
...
        /* Schedule the I/O for execution */
        rc = schedule_ioq((sysblk.arch_mode == ARCH_370_IDX) ? regs : NULL,
                          dev);

 

/*-------------------------------------------------------------------*/
/* Execute a queued I/O                                              */
/*-------------------------------------------------------------------*/
DLL_EXPORT void* device_thread( void* arg )
{
...
            while ((dev = sysblk.ioq) && !sysblk.shutdown)
            {
...
                dev->tid = thread_id();
...
                    /* Execute requested CCW chain */
                    call_execute_ccw_chain( sysblk.arch_mode, dev );
...
                dev->tid = 0;
            }
            // end while ((dev=sysblk.ioq)  && !sysblk.shutdown)

 

/*-------------------------------------------------------------------*/
/* EXECUTE A CHANNEL PROGRAM                                         */
/*-------------------------------------------------------------------*/
void*
ARCH_DEP(execute_ccw_chain) (void *arg)
{
...
    OBTAIN_DEVLOCK( dev );
...
    set_subchannel_busy(dev);
    dev->startpending = 0;
...
    /* Call the i/o start exit */
    if (dev->hnd->start)
    {
        RELEASE_DEVLOCK( dev );
        {
            (dev->hnd->start) (dev);
        }
        OBTAIN_DEVLOCK( dev );
    }
...

  And there you have it!

Don't see it?   Allow me to point it out to you...

The startio function is never detecting that the device is busy and is thus always queuing the new I/O request for processing, and the device_thread function is never checking to see if the device already has an active request still in progress before queuing the new request either, AND the execute_ccw_chain function is never checking if the device is already busy either, before blindly calling set_subchannel_busy to set the dev->busy flag and begin processing the request! (causing it to eventually reach the (dev->hnd->start)(dev) statement and call the device's CCW processing logic AGAIN, while it's still in the middle of processing (hasn't finished processing) its previous request yet!)

Oops!   :)

And it's all caused by this bug right here. This is the primary culprit (although the other players are just as guilty in my opinion):

#if defined( OPTION_SHARED_DEVICES )
        if (unlikely((dev->busy && dev->shioactive == DEV_SYS_LOCAL)
            || dev->startpending))
#else
        if (unlikely((dev->busy)
            || dev->startpending))
#endif

The OPTION_SHARED_DEVICES build option is always enabled by default in featall.h:

#if defined( OPTION_SHARED_DEVICES ) && defined( OPTION_NO_SHARED_DEVICES )
  #error Either OPTION_SHARED_DEVICES or OPTION_NO_SHARED_DEVICES must be specified, not both
#elif !defined( OPTION_SHARED_DEVICES ) && !defined( OPTION_NO_SHARED_DEVICES )
  // Neither is #defined.  Use default settings.
  #define OPTION_SHARED_DEVICES
  #define FBA_SHARED
#elif defined( OPTION_NO_SHARED_DEVICES )
  #undef OPTION_SHARED_DEVICES
  #undef FBA_SHARED
#elif defined( OPTION_SHARED_DEVICES )
// Leave FBA_SHARED alone, either #defined or #undefined, as desired.
#endif // OPTION_SHARED_DEVICES

Thus, the test for device busy in the startio function that actually gets built is:

        if (unlikely((dev->busy && dev->shioactive == DEV_SYS_LOCAL)
            || dev->startpending))

And due to the if statement using && to combine the two conditions, both conditions must of course be true in order for the code block that returns Condition Code 2 to be taken, and it never is. The condition will always be false whenever there is no shared device I/O active.

dev->shioactive is only ever set to DEV_SYS_LOCAL (0xffff) whenever a shared I/O is active on the device. If there isn't any shared I/O active on the device, then dev->shioactive is always going to be DEV_SYS_NONE (i.e. 0).

Like I said:  Oops!   :)

Anyway, that's what I'm seeing. I have not run any tests yet to see whether this condition can actually occur or not. I might just be seeing things. Which is why I am asking you (or anyone else!) to please confirm my suspicions. Or deny them. Whatever.

I would like to create a test program to try and trigger this condition in order to conclusively confirm my findings and verify that the bug does indeed exist (and I will probably get around to eventually doing that), but as I said in my opening statement, I would like confirmation of my findings first. I don't want to spend the time and effort to write such a program if I don't have to. I would like a second pair of eyes to confirm what I'm seeing first.

But if you agree with what I am seeing (if you are seeing the same thing), then I most definitely will take the time to create such a test program so we can add it to our suite of Hercules QA tests.

That's it.  Let me know if you're seeing the same thing as I'm seeing.

Thanks!


*`()** &nbsp;_The guest issuing two or more back-to-backSSCH` (Start Subchannel) requests without waiting for the previous one to complete beforehand._


(EDIT: the bug described in the above GitHub comment was determined to not exist. See my followup comment further below.)

davekreiss commented 1 year ago

I know I am getting into this without knowing much about Hercules internals but 1) Looking at one of the CCW trace I see in pairs 6,954 "HHC00396I 0:3004 start i/o" eventually followed by "HHC00396I 0:3004 end". That is there is no trace entry of a start I/O issued before the previous start I/O end is traced. In other words, at no time in trace is there two start I/O active at one time. Ok the time stamps don't have enough resolution to determine order and physical order of actual trace records can be recorded out of order if there are multiple threads tracing, so CCW trace is probably not reliable for verification that there are no concurrent start I/O occurring. 2) As long as the UCB is marked busy in z/OS subsequent I/O requests are queued in the operating system off the UCB awaiting completion of the current I/O of that devices UCB. So, if the I/O is still active in Hercules but the OS is told I/O has completed by Hercules when it hasn't the OS will start the next queued I/O. 3) In the trace there are no define extent for the range of tracks updated incorrectly. So those tracks should be protected and should not have had any I/O to them.

Fish-Git commented 1 year ago

2. As long as the UCB is marked busy in z/OS subsequent I/O requests are queued in the operating system off the UCB awaiting completion of the current I/O of that devices UCB. So, if the I/O is still active in Hercules but the OS is told I/O has completed by Hercules when it hasn't the OS will start the next queued I/O.

True. And I doubt that is happening in your case. So the unconfirmed bug that I believe I may have found in Hercules is likely not your problem. I readily admit that.

But it is a legitimate bug in Hercules that needs to be fixed IMO. It might not be your bug (the bug that is causing your corruption problem), I don't know. If an operating system does issue two back-to-back I/O requests (SSCH) to a device (which SHOULD cause the second one to complete with cc=2 busy if the first one is still in progress, but isn't), then the bug I described will occur. So it is a bug.

You need to keep in mind Dave that I'm only getting started on my bug hunt. During bug hunts, I tend to question everything. The bugs I find (if any) during these bug hunts may not necessarily have anything to do with the specific bug being hunted, but they are bugs nonetheless that need to be fixed when discovered.

Fish-Git commented 1 year ago

So, I have run test as prescribed with additional information. After the first two sets of diagnostics, I shutdown the OS and then performed steps 1-3 again.

Contents of the above .zip file:

  • RES201.txt is the OS configuration file.
  • prt201.txt contains diagnostic output written to SYSOUT=A and the system log.
  • hercules.log is the Hercules log.
  • Summary of events.docx is the sequence of events and diagnostics in time order.

You did things wrong. You can't run any of the utilities (dasdcat, dasdpdsu or cckddiag) while Hercules is still up and running and has the file opened for writing. Notice the errors that were reported by both dasdcat and dasdpdsu in the "before" and "after" reports:

HHC00364W 0:0000 CCKD file c:/r/BLDdlb.192: forcing check level 1
HHC00007I Previous message from function 'cckd_chkdsk' at cckdutil.c(1339)
HHC00368W 0:0000 CCKD file c:/r/BLDdlb.192: free space errors detected
HHC00007I Previous message from function 'cckd_chkdsk' at cckdutil.c(1774)
HHC00376W 0:0000 CCKD file c:/r/BLDdlb.192: free space not rebuilt, file opened read-only
HHC00007I Previous message from function 'cckd_chkdsk' at cckdutil.c(2526)

This is because Hercules was still up and running, and updated data was still sitting in Hercules's cache and hadn't been flushed out to the actual physical dasd yet. That only happens once Hercules is completely shutdown. What dasdcat and dasdpdsu both "saw" (and reported) was therefore NOT the dasd's actual true contents.

But the good news is yes I can see the damage/curruption.

Let me know what else we need to do; I think we are getting a little closer to what's going on.

I know we're getting close!

In fact, I think I may have already found the problem!  (I think)

I am going to be building a test version of Hercules with hopefully the fix in it that I want you to try for me. I'll let you know when it's ready. In addition to the fix, I want to add some debugging code to it too before sending it to you, just in case I'm wrong about the bug. Expect an email from me with the URL of the (hopefully!) fixed version and instructions on how to run it within the next day or so.

Fish-Git commented 1 year ago

P.S. The bug (if I'm right about it) has been in Hercules for about the past 5 years or so. It was first introduced (I think) in version 4.2.

So if you try version 4.1 and it works, but version 4.2 fails, then that's just one more tiny bit of evidence that what I found is very likely the problem.

Fish-Git commented 1 year ago

P.P.S. All releases of SDL Hercules Hyperion can be found here:

They're listed in reverse order, from newest to oldest, with the oldest (version 4.1) at the bottom.

Fish-Git commented 1 year ago

@davekreiss

I sent you an email. Did you get it?

davekreiss commented 1 year ago

Yes and I am in process of running tests between work and home interference. DaveOn Jul 13, 2023, at 4:30 PM, Fish-Git @.***> wrote: @davekreiss I sent you an email. Did you get it?

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

Fish-Git commented 1 year ago

I know we're getting close!

In fact, I think I may have already found the problem!  (I think)

P.S. The bug (if I'm right about it) has been in Hercules for about the past 5 years or so. It was first introduced (I think) in version 4.2.

CONFIRMED!

Bug found and squashed!  :))

Fixed by commit bd8362828d379105f7629c3b97df000dd17d1071.

Closing issue.

Fish-Git commented 1 year ago

FYI:

Regarding my earlier comment about a possible Hercules IOQ processing/scheduling bug...

I was wrong. Testing has revealed that all is well. There is no bug.

The channel was actually doing some things I hadn't realized the first time I looked at it.

I apologize to Mark (@dasdman) and the rest of the team for unintentionally Crying Wolf.

ALL IS WELL.   THERE IS NO BUG.

We return you now to your regularly scheduled program...