SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator
Other
240 stars 90 forks source link

z/VM 7.2 doesn't work with compressed DASDs #464

Closed giacobbis closed 2 years ago

giacobbis commented 2 years ago

Hi all,

This my environment: Windows 10, Hercules 4.4 pre-built for Windows: Hercules version 4.4.0.10631-SDL-gbf377f63 (4.4.0.10631).

With my great satisfaction I could run z/VM 7.2 under Hercules *`()`**, but ONLY with non-compressed DASDs. If I use compressed DASDs for Page and Spool (I tried both CCKD and CCKD64 formats), I get some nasty errors:

......IPL z/VM 7.2 ......
01:41:33 Start ((Warm|Force|COLD|CLEAN) (DRain) (DIsable)  (NODIRect)          
01:41:33       (NOAUTOlog)) or (SHUTDOWN)                                       
01:41:38 COLD DRAIN                                                             
01:41:38 NOW 01:41:38 EST SATURDAY 2022-01-01                                   
01:41:38 Change TOD clock (Yes|No)                                              
01:41:41 NO                                                                     
01:41:41 The directory on volume M01RES at address 3012 has been brought online.
01:41:41 HCPERP517I  DASD  3012 AN OPERATION WAS TERMINATED BECAUSE AN         
01:41:41 HCPERP517I  UNRECOGNIZED ERROR OCCURRED                               
01:41:41 HCPERP6303I SENSE = INVALID                                           
01:41:41 HCPERP6304I IRB = 00C04017 3FB3D008 00000000 00800000                 
01:41:41 HCPERP6305I USERID = SYSTEM                                           
01:41:41 HCPERP2216I CHANNEL PATH ID = 30                                      
01:41:41 HCPPAW415E Six continuous paging errors have occurred on DASD 3012 vol
.....
01:41:41 HCPPAW415E Six continuous paging errors have occurred on DASD 3012 volu
me M01RES.                                                                      
01:41:41 HCPPAW415E Six continuous paging errors have occurred on DASD 3012 volu
me M01RES.                                                                      
01:41:41 HCPPAW415E Six continuous paging errors have occurred on DASD 3012 volu
me M01RES.                                                                     

(NOTE: These errors appeared only on VM 3270 console. In hercules log, no errors at all.)

Also, z/VM 7.1 with compressed DASDs gave the same kind of errors, but I could 'cure' them with the PAGING63 IPL option.

If I IPL a z/VM 6.3 nucleus with the same DASDs, everything works OK.

I'm aware that starting from z/VM 7.1 there were some modifications in Page and Spool access. It looks to me as though the Hercules compressed DASD logic in some way doesn't support this new kind of access.

*`()`** In order to IPL z/VM 7.2, I had to enable the FACILITY bit 55.

Regards.

Paul

Fish-Git commented 2 years ago

Is your z/VM 7.2 ADCD? Or not?

Fish-Git commented 2 years ago

Never mind!

CONFIRMED!

I was able to recreate your problem!

It works just fine with UN-compressed dasd (e.g. CKD64), but fails with compressed (e.g. CCKD64) dasd!

Very weird! (and very concerning too!!)

I will begin looking into this right away, as it clearly indicates a problem in CCKD and/or CCKD64 logic!

(Yikes!)   8-o

Fish-Git commented 2 years ago

I was able to recreate your problem!

Well... I wasn't able to recreate your exact problem (I did not experience your z/VM errors, i.e. the HCPERP... and HCPPAW415E messages; I did not experience those), but I WAS able to recreate the general "doesn't work with compressed dasd images but works just fine with uncompressed dasd images" problem. That I was able to recreate.

In my case, the failure that occurs when compressed dasds are using is a very tight hard CPU loop (checking some bit somewhere) immediately after the "brought online" startup message. When regular (uncompressed) CKD64 dasds are used, no loop occurs and the system comes up just fine.

Maybe it's because mine is an ADCD distro? Is yours an ADCD distro too? I'm guessing it's not, and that's probably why our respective failures are different. But that's just a guess.

In any case, I'm going to be digging into this problem right away! Top priority!

giacobbis commented 2 years ago

Hi Fish,

Excuse me for my late reply. :-)

I confirm: my z/VM7.2 is an ADCD distribution.

The different behaviour you're experiencing is easily explained: If I COLD Start VM, I get all the weird I/O errors on VM console. If I WARM Start VM, I go in a tight loop.

Usually ADCD systems have WARM start as a default in System Config file, so you get a tight loop without having any chance to modify.

I'm glad you're looking into the problem, since it is very important for me to be able to use compressed DASD.

Thank you again!

(and BTW: Happy New Year too!)

Paul

Fish-Git commented 2 years ago

I confirm: my z/VM7.2 is an ADCD distribution.

The different behaviour you're experiencing is easily explained: If I COLD Start VM, I get all the weird I/O errors on VM console. If I WARM Start VM, I go in a tight loop.

Usually ADCD systems have WARM start as a default in System Config file, so you get a tight loop without having any chance to modify.

Okay, that makes sense. Thanks!

I'm glad you're looking into the problem, since it is very important for me to be able to use compressed DASD.

It's important for me too!   (And a lot of other Hercules users as well!)

And guess what?

I FOUND THE BUG!   :-D

It's late though, so I'll probably commit the fix sometime tomorrow.

giacobbis commented 2 years ago

Great!

I knew you're one of the smartest guys I ever met - and I met quite a few, having worked in IBM Italy as a System Engineer for about 20 years.

But anyhow your performance astonished me.

Take it easy.. I know there are 9 hours of difference in time zone between Seattle and Milan (Italy). Just a kind request, if possible.

About 10 years ago I was able to perform some mods to Hercules; I remember that I was the first to run z/VM 6.3 in Hercules 3.08 after modifying the code to support topology instructions. (and Roger Bowler was so kind to mention me as a contributor).

But now I'm not willing to deal with all the procedure to rebuild Hercules (Git clone and so on); so, please, please, if you can share with me the new (corrected) prebuilt version of Hercules for Windows, I'll be very grateful to you.:-)

Regards

Paul

Fish-Git commented 2 years ago

But now I'm not willing to deal with all the procedure to rebuild Hercules (Git clone and so on); so, please, please, if you can share with me the new (corrected) prebuilt version of Hercules for Windows, I'll be very grateful to you.:-)

Normally I hate handing out private builds to people, but this might be an exception. I'll see what I can do.

But in all honesty, this bug is IMHO serious enough to warrant a 4.4.1 "fix" release. That is to say, even though 4.4.0 was only officially released just 2.5 weeks ago, due to the seriousness of this particular bug, we might need to make a new official release of Hercules: version 4.4.1.

But I need to discuss it with my fellow developers first.

(How urgent is this fix to you? How long can you comfortably wait for it?)

Fish-Git commented 2 years ago

Okay, here's the bug and, further below, the patch that fixes it (which I have already committed):

In module ckddasd.c's 0xE7 ccw LOCATE RECORD logic, after seeking to the requested track (ckd_seek), it sets unitstat to CE+DE = normal status (line 4868) and then performs a search according to specified orientation, and then afterwards, checks the completion status (unitstat) to make sure it succeeded (line 4907).

For some unknown reason, between those two points in the logic flow, the unitstat changed from normal completion X'0C' = CE+DE, to X'00'(!), thereby causing the success/failure test on line 4907 to become true (i.e. "an error occurred"), thereby causing the critical logic to set data orientation (lines 4910 - 4924) to be skipped!  (Oops!)

https://github.com/SDL-Hercules-390/hyperion/blob/bf377f63ed4f943d107343c08195b6c2d66a7bf8/ckddasd.c#L4861-L4925

I managed to eventually track down the problem to a single "rogue" (erroneous/accidental/unintentional) statement in the cckd_read_track function that was added by the original author back on May 1, 2002:

Revision: 812a557bc2b2e4723f379290c02885b7a64678fa Author: Greg Smith Date: 5/1/2002 8:40:58 PM Message: 01 May 2002 3390-9 and large file support - Greg Smith

(Yep! This bug has been in Hercules for almost 20 years now!)

Anyway, the fix was to remove the rogue (erroneous/accidental/unintentional) statement that was resetting unitstat to zero in the cckd_read_track function:

https://github.com/SDL-Hercules-390/hyperion/blob/bf377f63ed4f943d107343c08195b6c2d66a7bf8/cckddasd.c#L864

and associated functions (the bug existed not only in our CCKD logic but also in our CFBA logic too, and was carried forward into our CCKD64/CFBA64 logic as well).

As soon as that is done, everything started working just fine, and z/VM 7.2 came right up, clean as a whistle!  :)

Here's the patch which very clearly shows the actual bug and how it was fixed (which, as I said, has already been committed):

From: "User Name" <user@example.com>
Date: Tue, 04 Jan 2022 09:54:07 -0800
Subject: [PATCH] Short description

Details of change, perhaps spanning...
...multiple...
...lines.

diff -r -a -x .git -x 'msvc.AMD64.*' -x 'msvc.dllmod.*' -x 'msvc.debug.*' -x '*.suo' -x '*.ncb' -x '*.user' -x '*.htm' -x WORK -x DICTS -x FILES -x 'allTests.*' -x '*.rej' -x '*.orig' -x AutoBuildCount.h -x '*.cmp*' -x '*.comp*' -Nu hyperion-1/cckddasd.c hyperion-0/cckddasd.c
--- hyperion-1/cckddasd.c   2021-03-06 11:03:35.570094100 -0800
+++ hyperion-0/cckddasd.c   2022-01-04 01:12:25.048317400 -0800
@@ -861,7 +861,6 @@

     /* read the new track */
     dev->bufupd = 0;
-    *unitstat = 0;
     cache = cckd_read_trk (dev, trk, 0, unitstat);
     if (cache < 0)
     {
@@ -1046,7 +1045,6 @@

     /* Read the new blkgrp */
     dev->bufupd = 0;
-    *unitstat = 0;
     cache = cckd_read_trk (dev, blkgrp, 0, unitstat);
     if (cache < 0)
     {
diff -r -a -x .git -x 'msvc.AMD64.*' -x 'msvc.dllmod.*' -x 'msvc.debug.*' -x '*.suo' -x '*.ncb' -x '*.user' -x '*.htm' -x WORK -x DICTS -x FILES -x 'allTests.*' -x '*.rej' -x '*.orig' -x AutoBuildCount.h -x '*.cmp*' -x '*.comp*' -Nu hyperion-1/cckddasd64.c hyperion-0/cckddasd64.c
--- hyperion-1/cckddasd64.c 2021-03-02 23:17:47.835962300 -0800
+++ hyperion-0/cckddasd64.c 2022-01-04 01:08:57.708353200 -0800
@@ -621,7 +621,6 @@

     /* read the new track */
     dev->bufupd = 0;
-    *unitstat = 0;
     cache = cckd64_read_trk (dev, trk, 0, unitstat);
     if (cache < 0)
     {
@@ -806,7 +805,6 @@

     /* Read the new blkgrp */
     dev->bufupd = 0;
-    *unitstat = 0;
     cache = cckd64_read_trk (dev, blkgrp, 0, unitstat);
     if (cache < 0)
     {

Here's the same thing as a downloadable patch file:

Fixed by commit 120c28b8559f822d8bbe01a46652359933b0a3bb.

Closing this issue as resolved.

Please let me know if I've missed something and it still doesn't work for you, and I'll gladly reopen it.

Thanks.

giacobbis commented 2 years ago

Hi Fish,

With great efforts I could build the Hyperion corrected version, and...

It works like a charm!

BTW, I didn't deal with Git, clone and all that crap; I simply downloaded the version 4.4 source and modified by hand the four 'guilty' instructions. Then a makefile and voilà: a perfectly working version.

Thanks for all.

Now I'm waiting for the future (but not yet scheduled) z/OS 2.5 ADCD....

Paul

Fish-Git commented 2 years ago

With great efforts I could build the Hyperion corrected version, and...

It works like a charm!

Great! Glad to hear it. Well done!

FYI: after virtually zero debate, we have all agreed there will be a new 4.4.1 hot fix release. It should hopefully be ready "very soon" (within days I'm guessing).

Now I'm waiting for the future (but not yet scheduled) z/OS 2.5 ADCD....

Aye. Me too.