Hercules has crashed - continuation of #187

werner02815 commented 5 years ago

Note: this issue is closely related to issues #187 ( "Hercules has crashed") and #490 ( "HHC00310E and HHC00342E on CCKD64 disk").

I have now installed Hyperion 4.2 and formated 3390-54 as CCKD64 devices. When I run ICKDSF job to format these volumes in guest system (zOS1.13), hercules sometimes crashes. The crash does not occur every time. It mostly likely depends on running these format jobs in parallel.

The volume was created with following command:

U:\PSI_DUMP>C:\hercules\Hercules-4.2.0.9574\dasdinit64.exe -z     U:\PSI_DUMP\H5SM00     3390-54        H5SM00
HHC01413I Hercules utility dasdinit64 - 64-bit DASD image file creation program; version 4.2.0.9574-SDL-g5176dac2 (4.2.0.9574) 
HHC01414I (C) Copyright 1999-2019 by Roger Bowler, Jan Jaeger, and others
HHC01417I ** The SoftDevLabs version of Hercules **
HHC01415I Build date: Feb 24 2019 at 10:11:02
HHC00472I 0:0000 CKD64 file U:\PSI_DUMP\H5SM00: creating 3390 volume H5SM00: 65520 cyls, 15 trks/cyl, 56832 bytes/track 
HHC00471I 0:0000 CKD64 file U:\PSI_DUMP\H5SM00: 65520 cylinders successfully written
HHC02423I DASD operation completed

and is defined in my hercules configuration file as:

3100    3390    U:/PSI_DUMP/H5SM00 cu=3990-6

I have dumps from two separate new occurences and will upload them to your FTP site.

As I mentioned, this crash does not occur each time that I run ICKDSF.

Also, I powered off hercules and saved the resulting dasd image file after a successfull ICKDSF run, so I can upload that image too (along with the crash dumps) if you'd like to see it. Maybe you will be able to see a problem in it?

Please let me know what other information you might need, or whatever other tests I should perform, to help you with this bug.

Thanks!

werner02815 commented 5 years ago

Doc uploaded to your site with name "issue194.zip".

Fish-Git commented 5 years ago

Please let me know what other information you might need, or whatever other tests I should perform, to help you with this bug.

Have you tried running a level 3 "cckdcdsk64" on it yet? Does it report any problems?

Fish-Git commented 5 years ago

Doc uploaded to your site with name "issue194.zip".

Thank you. I will download it and look at it, and will let you know what I find (if anything).

Fish-Git commented 5 years ago

It mostly likely depends on running these format jobs in parallel.

What do you mean by that? How do you run: "these format jobs in parallel"?

Do you mean you create multiple different 3390-54 dasds, add them all to your Hercules configuration, start z/OS 1.13, and then submit multiple ICKDSF jobs, each one formatting a different dasd? (so that they're all running at the same time)

Is that what you mean?

I will give it a try to see what happens!

werner02815 commented 5 years ago

What do I mean with parallelism?

I use following procedure: I create a lot (48) 3390-54 CCKD64 devices with dasdinit64 job. After that I start hercules, zOS system and submit an ICKDSF job for each of these 48 DASD volumes. These jobs run in parallel (depending on performance of guest system). I see up to 20 jobs running in parallel. When I configure my host system to run these ICKDSF jobs serialized I have not seen this abend till now.

Of course running ICKDSF serialized would not be any issue, but I see this crash coming up also afterwards when using these DASD volumes and these processes are no longer customizable to serialize each I/O request ...

"cckdcdsk64" will be performed tomorrow when I have access to my system again. But I do not expect that this will show any meaningful result because when hercules crashes it will not write out any pending DASD update. So I only can perform this task immediately after dasdinit. Nevertheless I will update with result.

Fish-Git commented 5 years ago

After that I start hercules, zOS system and submit an ICKDSF job for each of these 48 DASD volumes.

What is the most efficient (easiest) way to do this? Create 48 separate (different) jobs, and then manually submit them all, one by one? Is there an easier way? For example, is it possible to submit these 48 jobs but keep them in the JES2 queue "on hold"? (i.e. submit them so they don't run immediately) And then later, do a TSO "m.5", "st fish*", and then use some prefix command ("r"?) ro release them all at once? Do you understand what I'm asking?

In other word, rather than "edit" each job (tso "3.4", "e") and enter the "submit" command to submit that one job, then PF3, "e" for the next job, "submit", PF3, "e", "submit", etc... Is there a more efficient way to submit them all on hold and then later, release them all at once?

Forgive me, but I'm not experienced with z/OS! I've learned a little bit here and there over the years, but I'm mostly a newbie!

Thanks for any assistance you can provide.

p.s. The "EXEC PGM=SYSCMD" steps in your ICKDSF job doesn't work for me. Where did "SYSCMD" come from? How do I get it?

Fish-Git commented 5 years ago

Doc uploaded to your site with name "issue194.zip".

Thank you. I will download it and look at it, and will let you know what I find (if anything).

FYI: they're all blowing up on the exact same statement that they were blowing up on before (i.e. on the same statement in the cckd_gc_l2 function as already identified in Issue 187 )

So the bug, whatever it is, is not CCKD64 related.

Rather, the bug (whatever it is) is inherent in Greg's CCKD logic (and was carried over into the new CCKD64 code; i.e. the same bug was copied from the original code into the new CCKD64 code).

I would like to try and recreate this crash via your many parallel ICKDSF jobs technique, but need your assistance. (See previous comment above). Thanks!

werner02815 commented 5 years ago

SYSCMD is a user writen program and not generally available.

What release of z/OS are you using? Depending on the zOS release you are using I can create a procedure to do it using only "basic zOS support". I simply create one large JCL stream with following cards in it:

//jobcard1
jcl for UCB1
//jobcard2 (use different name to allow parallel job execution under jes)
jcl for UCB2
..
//jobcard48
jcl for UCB48

and then submit this one job stream.

Fish-Git commented 5 years ago

SYSCMD is a user writen program and not generally available.

Ah. Okay. Not a problem. I can write a simple Hercules script to dynamically attach all of the new dasd all at once (after z/OS has already been IPLed without them), and they should all appear, by default, as being "offline" to z/OS.

Then I can submit one large jobstream as you suggested, and they should all run in parallel. Cool! I didn't know a jobstream could have multiple jobs in it. It seems so obvious now. I don't know why I didn't think of it. Sometimes I can be so stupid. :)

THANKS! I'll give it a try...

werner02815 commented 5 years ago

As expected, cckdcdsk64 does not show any errors:

grafik

It's really frustrating that I am unable to find a reliable way to recreate this problem! So I really expect you will not be able to do it on your system.

Any other possibility might be to provide more information in the log before the crash occurs? Maybe you can provide me with a special version (special debug build) designed to gather more data?

I have seen these crashes most often when running ICKDSF, but I have also seen it happen when copying data to a 3390-54. I have setup a job creating random data and filled up this dasd, but the crash is not recreatable at will. (Maybe it depends on data itself?).

Still thinking about other ways to try and recreate it ...

Fish-Git commented 5 years ago

As expected, cckdcdsk64 does not show any errors:

FYI: are you familiar with log files? Inserting a tiny IMAGE (screen shot) isn't as helpful as inserting the actual TEXT output of your command.

You can create a log file of your command output by simply redirecting the output of your command to a file of your choosing, by doing this:

foobar -a -b -c > my_foobar_logfile.txt 2>&1

In the above example, the string > my_foobar_logfile.txt causes all messages that the "foobar" command would normally display on the screen, to be written to the file my_foobar_logfile.txt instead.

The string 2>&1 causes all error messages to also be written to the same file too.

Used together (as you see in the above example), creates a text file (e.g. "my_foobar_logfile.txt") that you can then copy & paste into your GitHub reply comment.

Doing that is much more helpful than inserting a screen shot image!

Fish-Git commented 5 years ago

It's really frustrating that I am unable to find a reliable way to recreate this problem!

I understand your frustration and totally agree! Being unable to reproduce a problem you sometimes randomly have is very frustrating and makes it very difficult to learn what is causing it.

So I really expect you will not be able to do it on your system.

And you are 100% correct too! I created 48 (forty-eight) 3390-54 dasds (CCKD64) and submitted one large jobstream with 48 separate ICKDSF jobs in it, and they all ran to successful completion (and Hercules did not crash). I tried several times, with no luck.

I'm going to try again, but using regular CCKD dasd this time (instead of CCKD64) to see if that makes any difference.

In the mean time, I am also going to think about your suggestion (which was a good one!) of creating a special debug version of Hercules for you that gathers additional information. That might take me while though. I'm not sure yet what additional information I need to gather as I don't really know why it's crashing where it is!

But when it's ready I will let you know.

Hang in there. We'll figure it out eventually I'm sure.

Fish-Git commented 5 years ago

I simply create one large JCL stream with following cards in it:
//jobcard1
jcl for UCB1
//jobcard2 (use different name to allow parallel job execution under jes)
jcl for UCB2
..
//jobcard48
jcl for UCB48
and then submit this one job stream.

How are you submitting that JCL stream? Via the card reader? Or some other way? Because the way I did was to create a PDS member ADCD.LIB.JCL(FISH#194) with those 48 jobs in it, and then use the "submit" command to submit that member directly from the (TSO? ISPF?) editor.

How can I verify (how can I tell?) whether all the individual jobs are running serially one after the other or all in parallel at the same time? Is there some z/OS command I can use? (Remember I'm not experienced with z/OS! I'm a newbie!)

Thanks!

werner02815 commented 5 years ago

I am doing submit same way as you.

The easiest way to check whether (how much) jobs run in parallel is if you enter following command on your operating system console (or if there is none via HercGUI): D A,XXX* where XXX is common prefix of your jobnames. Please note you must use unique jobnames to allow parallel execution. How many jobs do you see?

If you only see one job we have to further analyse, as there are a lot of parameters which determine parallelism.

pjfarleyiii commented 5 years ago

The key to run the ICKDSF jobs in parallel is to use different JOB names so that they can run in different JES initiators. I did a test with 10 ICKDSF's running at the same time as follows.

I created 10 CCKD64 volumes with this (model) command:

dasdinit64 -bz2 -lfs USR600.cckd64 3390-54 USR600

Which created 10 dasd files with names like USR600.cckd64, USR601.cckd64, etc. up to USR609.cckd64.

Then I added the 10 new DASD devices to the config file with (model) lines like this:

0AB3    3390    dasd\USR600.cckd64 ro sf=shadow/zaus60_

Then I IPL'ed z/OS and varied the 10 new units offline. Then I ran 10 jobs like the following model job all at the same time, varying the job name last character from 0 to 9, the volume names from 0 to 9, and the UNIT parameters as per the new config units:

//IBMUSER0 JOB (ACCT),CLASS=A,MSGCLASS=H,NOTIFY=&SYSUID   
//FORMAT   EXEC PGM=ICKDSF,REGION=0M,PARM='NOREPLYU,FORCE'
//SYSPRINT DD  SYSOUT=*                                   
//SYSIN    DD  *                                          
  INIT UNIT(AB3) NVAL NVFY VOLID(USR600) PURGE -          
    VTOC(0,1,2939) INDEX(196,0)                           
  TRKFMT UNIT(AB3) NVFY CYLR(206,65519) ERASE             
//*

The 10 jobs all ran together for about the same elapsed time (148+ minutes) and all 10 completed successfully. No Hercules crash, and shutdown completed normally.

I ran this test with 4 emulated CP engines with 8G Hercules memory and it pegged all four CP's at 100% while the jobs were running on my Win10 Core i7-6700 4.00GHz machine. The PC ran at slightly over 70% CPU utilization for the duration of the test.

Let me know if there is any variation of this test you would like to see run.

HTH

Peter

werner02815 commented 5 years ago

Now run same procedure with exactly same parameters (dasdinit64 with parameters lfs and bz2 instead z).

Running ICKDSF with exactly same parameters for 1 volume succeeds (runtime 54 minutes). After this I started parallel ICKDSF job for 18 volumes and hercules crashed 2 minutes afterwards.

What I have seen that it is really absolutely random if and when it happens!

Fish-Git commented 5 years ago

Running ICKDSF with exactly same parameters for 1 volume succeeds (runtime 54 minutes). After this I started parallel ICKDSF job for 18 volumes and hercules crashed 2 minutes afterwards.

Thanks for the report.

I'm currently working changes to try and capture more information in the dump whenever the crash does occur. I will let you know when it's ready.

I'm also investigating how to provide finer control over when the garbage collector runs since that's obviously where the bug is.

Bottom line: I haven't given up on this bug yet! I'm still working on it! And I'm absolutely determined to find it!

Fish-Git commented 5 years ago

@werner02815: @pjfarleyiii:

In the mean time, I am also going to think about your suggestion (which was a good one!) of creating a special debug version of Hercules for you that gathers additional information. That might take me while though. I'm not sure yet what additional information I need to gather as I don't really know why it's crashing where it is!

But when it's ready I will let you know.

I'm currently working changes to try and capture more information in the dump whenever the crash does occur. I will let you know when it's ready.

It's ready:

Hercules-4.2.0.9594-SDL-g6197f091-x64.zip

Changes have also been committed to the repository too, in case either of you want to build it for yourself from source.

If a crash occurs during your testing, please save the resulting crash dump, compress it, and then upload it to my FTP site's incoming folder and send me an email notifying me (or better, just post a comment in this thread) and I will download and analyze it.

I added an extra cckd_trace statement just before the point where the crashes have been occurring, and, additionally, modified the crash dump logic to also save the cckd logic's internal trace table. Those two together should hopefully give me enough information to figure out what's going on. It still might not of course, but it will at least give me more information than we currently have.

Thanks.

I'm going to continue trying to recreate the crash myself. So far I have been unsuccessful. Every time I try it always works fine. No crash. Nevertheless I have not given up hope. I will keep trying. Maybe I will get lucky.

werner02815 commented 5 years ago

I was able to recreate on first try! Dump uploaded with name Hercules.dmp#194. Hopefully it helps. Thanks for your support.

Fish-Git commented 5 years ago

Dump uploaded with name Hercules.dmp#194.

Got it! Thanks. Will look at it tomorrow after I get some sleep.

Fish-Git commented 5 years ago

MAJOR BREAKTHROUGH!

I can now re-create the bug every single time! (i.e. whenever I want!) :-D

I am very, VERY close to finally nailing this sucker!

I found the bug weeks ago (possibly longer) but had trouble understanding it well enough to be able to prove that I found it (due to my inability to reproduce it due to my inability to fully understand all of the cckd code that was involved).

But after days and days and days of studying the code over and over and over again, I finally began to understand things well enough to be able to write a simple program that recreates the condition that causes the crash. I can now cause the crash whenever I want, and it's the same crash that's been plaguing us since forever, so I'm 100% certain it's "THE" bug.

So now my new task is to fix the bug, which may take me a while since it looks like it's going to involve a major rewrite of at least one function (and some minor fixes to a few others) since it doesn't look like it can be salvaged.(*)

I'll post more information later about the bugs I found after I code and test my fix (which I can now do with complete confidence).

For now, it's off to bed. I need to be well rested before I attempt to code the fix!

Zzzzzz.... =-o

(*) The buggy function (cckd_gc_l2) is just too messed up to try and fix. It's basically written completely wrong. Greg(**) must have been drunk the day he coded this particular function! %-P

() The bug is in original CCKD code. (i.e. it is _not**_ a new bug that I accidentally introduced with my new cckd64 code.) Rather, my new cckd64 code simply ended up copying Greg's bug into the new cckd64 logic.

And interestingly, the bug is unlikely to occur except with very large dasds, with at least 20,500 or more cylinders (i.e. 3390-27 or 3390-54), which is why the bug is so rare and why we've had so much trouble reproducing it and why it hasn't bitten us before now. Like I said, I'll provide more details later. I need to get some sleep.

werner02815 commented 5 years ago

Fantastic! You are really the BEST. Thanks for update and have a good sleep.

Fish-Git commented 5 years ago

WARNING: LONG/LENGTHY POST AHEAD

Sorry it has taken me so long to followup, but my originally envisioned fix didn't exactly goes as planned (to say the least!). Rather, it went in a completely different (but wholly justified!) direction. Allow me to explain.

First, an explanation of the mechanics behind the bug. Afterwards I'll explain why I decided not to bother trying to fix it and my justification for doing so.

(Note: in the text that follows I shall use the term "CCKD" to refer to the cckd handler code that maintains compressed dasds, and refers to both cckd logic as well as cckd64 logic.)

CCKD maintains an "internal free block" (ifb) chain, being a chain of where all of a dasd's free space is located. It is allocated in chunks of 1024 entries and the logical end of a chain is marked with a "next" index of -1. As CCKD runs, it uses the list to locate where an updated track can be safely written. When an entry is consumed, it is removed from the chain by linking the previous entry to the next entry, thereby effectively removing the consumed entry from the original chain (i.e. the links in the chain effectively "skip over" the entry that was consumed).

The CCKD "Level 2" (L2) table is a table that identifies where each track of a 256-track "group" is located. The "Level 1" (L1) table at the beginning of the dasd file points to where each L2 table is. (Refer to "Compressed DASD File Structure" for more information.) In order to access any given track, CCKD must first read the L2 table. Once the L2 table has been read, it can then read the track data (since it is only the L2 table that knows where a given track is located).

For reasons that are unknown, the original author, at some point, decided the best ("preferred") place for the L2 tables was as close to the beginning of the file as possible. (This is of course untrue, but I'll leave that for later.)

To this end, it maintains a "L2_bounds" value, being where the end of all L2 tables should ideally be, and sets its "L2ok" flag to either true or false depending on whether all of a file's L2 tables are located below this boundary or whether one or more are found to be located above this boundary. It is this "L2ok" flag that determines whether the buggy cckd_gc_l2 function needs to be called or not, and once cckd_gc_l2 successfully relocates all L2 tables to be within the "L2_bounds" area, it resets the flag back to true.

At least that is what it is supposed to be doing! (That is what its stated purpose is, according to the comments at the beginning of the function.)

Unfortunately however, its logic is not only nonsensical but fatally flawed as well.

As previously explained (and as documented on the "Compressed DASD File Structure" web page), the L1 table contains one entry for each L2 table (and each L2 table contains 256 entries, one for each of the 256 tracks in that "group").

So let's do some quick calculations for, say, a 3390-54 model dasd, which has 65,520 cylinders.

Each cylinder has 15 tracks, so that's a total of 982,800 tracks.

Each track "group" is 256 tracks (one L2 table), meaing a 3390-54 has a total of 3,840 track groups (or 3,840 L2 tables). Thus the L1 table, which contains one entry for each 256-track group, will have 3,840 entries (each entry pointing to the L2 table for that group).

But recall that the internal free block chain (ifb) only has 1024 entries, and is not directly indexable. Instead, you must chase the chain from the beginning, each entry providing the index of the next entry in the chain, with the logical end-of-chain marked by a "next" index of -1.

So now let's take a look at the problematic (buggy) code to see what it's doing:

https://github.com/SDL-Hercules-390/hyperion/blob/71747ea86b75dd570a6a3c7b8632a71d51a3f036/cckddasd64.c#L4560-L4568

As you can see, it is iterating through each of its L1 entries (num_L1tab) using 'i' for its index, and attempting to compare each L1 entry to a calculated file position based on the current position plus the length of the ifb internal free block chain entry (ifb[i].ifb_len).

Now, if you've been paying attention I hope you can see how fatally flawed that logic is! It's using the same index value 'i' for both the L1 table and for the ifb internal free block chain as well!

And how many L1 table entries are there for a 3390-54? 3,840!

And how many entries are there in the ifb internal free block chain? 1024!

(Oops!)

As soon as L1 entry# 1024 or greater is reached, it will end up indexing past the end of the allocated ifb chain, and BOOM! goes Hercules!

Now it's easy enough to fix the bug, which I believe is a simple matter of changing the L1 table indexing variable from i to j instead (to match how it's used elsewhere in the function), but then I got to thinking... (don't you hate it when that happens?) ;-)

Why the frick are we trying to move the L2 table to the beginning of the file?! (i.e. further away from where its associated track data is located?!) Doing so only makes CCKD code less efficient (slower, not faster) by increasing the size of the seeks it ends up having to do to access the two types of data (the L2 table itself and the associated track data which the L2 table points to). Wouldn't it make more sense to leave the L2 tables exactly where they currently are? (which is likely near where the associated track data is too.) As we all know, larger seeks result in slower disk accesses, and to be more efficient (i.e. to make disk access faster) you need to minimize the amount of seeking you are doing. Seeks to data that is close by is much faster than seeking to data that is many megabytes (or possibly gigabytes!) further away. For efficiency, the L2 tables should not be relocated, but rather left exactly where they are!

Which is why my "fix" was to not even bother trying to fix the cckd_gc_l2 function at all, but to instead simply deprecate it altogether. The proper "fix" in this case (IMO) is to simply remove (logically) the cckd_gc_l2 function altogether, and have it instead immediately return 0 (success) without actually doing anything at all.

Which is exactly the fix that I have just committed. Thus it is now impossible for the cckd_gc_l2 function to ever again crash since it now no longer logically exists! :)

As justification for such a fix (and to provide proof of my above claims regarding making CCKD access less efficient when L2 tables are moved to the beginning of the file instead of just leaving them where they currently are), I have also created a new "map" utility called cckdmap.exe that reports where each CCKD control block is located in a given cckd/cckd64 dasd or shadow file, and reports the average seek distance from each L2 table to its associated track data.

Additionally, I have added a new -c option to the convto64.exe utility (which was also modified to also accept cckd64 input) to force all L2 tables to be written to the beginning of the file just like the flawed cckd_gc_l2 was wanting to do.

Use convto64 with the -c option to create a cckd64 file with all of the L2 tables at the beginning of the file, and then use cckdmap to report its calculated average L2-to-track seek distance.

Then do a cckdmap on the original cckd64 dasd where the L2 tables are scattered throughout the file near where their track data is, and compare the two reported average L2-to-track seek distances. You will see that moving the L2 tables to the beginning of the file greatly increases the seek distance and thus makes CCKD overall less efficient (i.e. slower), not faster and more efficient.

So there you have it: the "fix" for this long standing bug (which has been plaguing us for years now) is to simply stop trying to do what was causing Hercules to crash in the first place!

Patient: "Doctor! Doctor! It hurts when I do this!"
Doctor: "Then don't do that!"

>;-)

Fish-Git commented 5 years ago

Closed (forever!) by commit 6e4c52cda8c1354a658263beb2aac8b343b01252.

Fish-Git commented 5 years ago

@werner02815: @pjfarleyiii:

The new Windows x64 pre-built with the above fix in it can be downloaded from here:

(deleted! it's bad!)

werner02815 commented 5 years ago

@fish: sorry to come back again: I have installed 9612 version and rerun my procedure. Unfortunately hercules crashes once again (2 tries, 2 crashes).

19:11:59.844 HHC01414I (C) Copyright 1999-2019 by Roger Bowler, Jan Jaeger, and others 19:11:59.844 HHC01417I The SoftDevLabs version of Hercules 19:11:59.844 HHC01415I Build date: Mar 23 2019 at 03:39:48

19:08:16.108 19:08:16.108 19:08:16.108 19:08:16.108 OOPS! 19:08:16.108 19:08:16.108 19:08:16.108 Hercules has crashed! 19:08:16.108 19:08:16.108 (you may or may not need to press ENTER if no 'oops!' dialog-box appears) 19:08:16.108 19:08:24.488 Creating crash dump "C:\hercules\Hercules-4.2.0.9612\Hercules.dmp"... 19:08:24.488 19:08:24.488 Please wait; this may take a few minutes... 19:08:24.488 19:08:24.489 (another message will appear when the dump is complete) 19:08:24.489 19:08:28.307 19:08:28.308 MiniDumpWriteDump failed! Error: 2147942699 19:08:28.308 19:08:28.319 fthread_create: MyCreateThread failed

zip file with both dumps will be uploaded in a few minutes.

Fish-Git commented 5 years ago

@werner02815:

@fish: sorry to come back again: I have installed 9612 version and rerun my procedure. Unfortunately hercules crashes once again (2 tries, 2 crashes).

zip file with both dumps will be uploaded in a few minutes.

Don't bother! I can easily reproduce the crash myself.

I obviously f**ked something up somewhere. <grumble>

I am looking into the problem right now, and will let you know when it's fixed. In the mean time I will delete the snapshot since it is bad.

Sorry about this! :(

Fish-Git commented 5 years ago

Fixed in commit 39ae7078aa252f72f162f7c798b252b230ea93b3.

Windows x64 pre-built download:

~~Hercules-4.2.0.9614-SDL-g39ae7078.bin.zip~~

Re-closing... (hopefully for good this time!)

Fish-Git commented 5 years ago

@werner02815, @pjfarleyiii:

Please try again. Thanks!

SDL-Hercules-390 / hyperion

Hercules has crashed - continuation of #187 #194