SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator
Other
245 stars 92 forks source link

Multipath I/O config causes an IEHLIST I/O error on one of the paths #594

Closed SE20225 closed 9 months ago

SE20225 commented 1 year ago

Hercules version: 4.6.0.10941-SDL-g65c97fd6 Running on: T480S (Windows-6.2.9200 Intel(R) x64) LP=8, Cores=4, CPUs=1

My testing with optional paths has now progressed to 3350 with device 0140, volser WORK00, being the first 3350 in the config. It is actually a completely empty volume, with just a VTOC in the rest of cyl 0.

A lot of reading works OK on any of the paths, but the two available paths definitely work slightly differently.

The MVS involved is a TK4- regenned to include FEATURES=ACR and CRH. (i.e. support for multiprocessing and Channel Recovery Hardware, Connect and Disconnect Channels)

Snippet from Hercules configuration file:

0:0140  3350  dasd/work00.140
1:0140  3350  localhost::0140

Running IEHLIST LISTVTOC DUMP with only path via only CP 0 gives CC=0, everything works perfectly:

                              SYSTEMS SUPPORT UTILITIES---IEHLIST             
                CONTENTS OF VTOC ON VOL WORK00                                
 FORMAT 4 DSCB NO AVAIL/MAX DSCB /MAX DIRECT  NO AVAIL NEXT ALT    FORMAT 6   
           VI   DSCBS   PER TRK  BLK PER TRK  ALT TRK  TRK(C-H)    (C-H-R)    
           00    1361       47         36        150    555   0               
 FORMAT 5 DSCB     A = NUMBER OF TRKS IN ADDITION TO FULL CYLS IN THE EXTENT  
       TRK  FULL          TRK  FULL          TRK  FULL          TRK  FULL     
      ADDR  CYLS   A     ADDR  CYLS   A     ADDR  CYLS   A     ADDR  CYLS   A
        30   554    0                                                         
                 DSCB(C-H-R)   0   1   2                                      

 THERE ARE  554 EMPTY CYLINDERS PLUS     0 EMPTY TRACKS ON THIS VOLUME        
 THERE ARE  1361 BLANK DSCBS IN THE VTOC ON THIS VOLUME  

With both paths online, this is the typical result:

                              SYSTEMS SUPPORT UTILITIES---IEHLIST               
                CONTENTS OF VTOC ON VOL WORK00                                  
 FORMAT 4 DSCB NO AVAIL/MAX DSCB /MAX DIRECT  NO AVAIL NEXT ALT    FORMAT 6    L
           VI   DSCBS   PER TRK  BLK PER TRK  ALT TRK  TRK(C-H)    (C-H-R)     D
           00    1361       47         36        150    555   0                 
 FORMAT 5 DSCB     A = NUMBER OF TRKS IN ADDITION TO FULL CYLS IN THE EXTENT    
       TRK  FULL          TRK  FULL          TRK  FULL          TRK  FULL       
      ADDR  CYLS   A     ADDR  CYLS   A     ADDR  CYLS   A     ADDR  CYLS   A   
        30   554    0                                                           
                 DSCB(C-H-R)   0   1   2                                        

IEH108I REQUEST TERMINATED --- PERMANENT I/O ERROR WHILE READING DATA SET       

IEHLIST uses BSAM to read the VTOC (as a sequential file) and the CCW chain looks like this:

SET SECTOR
SEARCH ID
TIC BACK
TIC TO *+8
READ COUNT,MT  count=8
READ KEY and DATA,MT, count = x8C (that is 44+100, DSCB size)
READ SECTOR

After having read all DSCBs in a track, 47 decimal, it searches to record 2F, the last one, and checks on READ COUNT with a sense of 0004, File Protect.

The extent is the remainder of cyl 0, but the file mask (not actually traced, but taken from the DEB) is 58 so no multitrack operation is allowed, so the sense seems very much in order. And the code switches to next track and continues to read all-zero DSCBs. This when things go right.

I have taken a dump when the IEH108I message is to be issued and it shows:

0A4B20   04000000 00000000 00000000 00000000    910A4B30 7F000000 06000000 410A6194
0A4B40   000A4B90 0C400000 400A4B60 000A6520    00000000 00000000 00000000 0000022F
0A4B60   230A4BA0 40000001 310A4B5B 40000005    080A4B68 0000008C 080A4B80 70000001
0A4B80   920A4B98 40000008 8E0A6974 6000008C    220A4BA0 00000001 789CCDCE D5110273
0A4BA0   00000000 00000000 00580008 80995064    00000048 00000000 00000000 00000000
0A4BC0   00020000 00000000 00000000 00000000    00000000 00000000 00000000 00000000
0A4BE0   00000000 00000000 00000000 00000000    00000000 00000000 00000000 00D6B468

The IOB starts at A4B38. The error post code of 41 at A4B3C is caused by the incorrect length return in the CSW which is at A4B40. The CCW chain starts at A4B60. After positioning with SEARCH ID to the last record on the track, the result should have been a norecordfound but instead it appears that garbage is read instead of a reasonable count field at A4B98 and the keyanddata field at A6974. Well noreordfound or rather a switch to the next track as long as we remain inside the VTOX extent which is the res of cyl 0.

0A6960   40404040 40404040 40404040 40404040    40404040 1000B085 871CEEEE EED012FD
0A6980   37013FCC D0425241 22224544 167F2977    8F775852 1E3C25F0 54004F45 F054024F
0A69A0   65F09481 A70A78AA 82A71A78 AA83A706    786A82A7 16786A83 A70E78EA 82A71E78
0A69C0   EA83A701 781A82A7 11781A83 A709789A    82A71978 9A83A705 785A82A7 15785A83
0A69E0   A70D78DA 82A71D78 DA83A703 783A82A7    13783A83 A70B78BA 82A71B78 BA83A707
0A6A00   00000000 00000000 00000000 00000000    00000000 00000000 00000000 00000000
0A6A20   00000000 00000000 00000000 00000000    001E022A 00000000 00000000 00000000
0A6A40   00000000 00000000 00000000 00000000    00000000 00000000 00000000 00000000

At a superficial glance, this looks very much like the count field (garbage?) data.

The enclosed file MVS_LOG_2paths contains the Hercules CCW trace from a different test with the same end result, it only processed more tracks before running into the problem.

This time it was searching to 000000192F. So plenty of tracks were processed OK and then read a count field containing 789CCDCED5110273 and the following data field starting at A6974:

0A6960   40404040 40404040 40404040 40404040    40404040 1000B0C5 0F777BB8 3BB444FF
0A6980   4DC00F33 B4905490 88481191 C55FCADD    E31D9694 074F05F0 54044F25 F054064F
0A69A0   15F09481 A72A78AA 81A73A78 6A80A726    786A81A7 3678EA80 A72E78EA 81A73E78
0A69C0   1A80A721 781A81A7 31789A80 A729784A    E069069E E6E06901 9E96E069 059ED6E0
0A69E0   69039EB6 E069079E F6E0E900 9E8EE0E9    049ECEE0 E9029EAE E0E9069E EEE0E901
0A6A00   00000000 00000000 00000000 00000000    00000000 00000000 00000000 00000000

The 'garbage' data read looks similar, 'C' code for x86 or compressed tracks?

Trying IEHLIST with only path through CP 1 gives an ABEND913-20 issued during OPEN (TYPE=J) for the VTOC as a sequential dataset. Solid failure.

Searches for R1 in the first VTOC track, the FMT4 DSCB. Location is from the UCB. R0 seems reasonable then something which should not be in the VTOC and ends with no record found after two revolutions.

It seems as if Hercules always start at HA so why take another round when the 255/255/65535 end-of-track marker is found? Slight performance improvement possible!

Hercules CCW trace in file "MVS_LOG_CPU1Only.txt"

My guess is that somehow the two interfaces to the disk interfere or are not properly serialized causing the pointer to next record to become corrupt. I have therefore tried a more symmetrical setup like:

0101 3350 dasd/work00.140
0:0140 3350 localhost::0101
1:0140 3350 localhost::0101

where 0101 is a device which is not touched by MVS.

Everything I have tried so far works fine, so if the above story describes a defect, I have a fine bypass to allow my MP experimentation to continue. Maybe I should try the remaining DASD types as well.

It seems rather trivial to recreate my problem. Maybe there is more room in the track after the last DSCB for the other types?

I also happened to see issue #575, where there are also strange count fields.

Please advise

Anders Edlund, andersedlund@telia.com

Fish-Git commented 1 year ago

Before going any further, I need you to closely review the above editing I have performed on your original problem report to make sure it is correct.

Your original report was badly formatted, so I had to edit it so that it would display properly. I may have accidentally deleted some important information?

So please review the above problem description and make any needed changes to ensure the information presented is accurate. Thanks.

(You can edit any comment you make to GitHub by simply clicking the ... on the right side of the comment's title, and selecting "Edit" from the list of available actions.)

SE20225 commented 1 year ago

I added some text to clarify the pieces from the MVS dump rather than trying to edit these parts themselves.

Fish-Git commented 9 months ago

@SE20225

Same response here as for issue #593.

SE20225 commented 9 months ago

This has never reoccurred after I changed the config to:

0F40    3350  dasd/work00.140
0:0140  3350  localhost::0F40
1:0140  3350  localhost::0F40

where 0F40 is a device which is not used by the MVS system involved.