SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator

z/VM 7.2 IPL'ing as guest of itself CCW Command Rejects Aaron says "quick fix" #572

Closed zVMJedi closed 1 year ago

zVMJedi commented 1 year ago

Hello, Charles here.

Aaron said to open an Issue, so here it is.

He pretty well lays it out in the message thread about this issue, one last lousy little bit in byte 6 of the z/VM RDCBK Real Device Characteristics control block isn't being set correctly to handle PFX (CCW opcode X'E7'): byte 6 needs to be set with 'D2' instead of the 'D0' it contains otherwise.
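For readers not used to IBM bit numbering (bit 0 is the leftmost, most significant bit), here is a minimal Python sketch of the change being asked for. The names `RDC_BYTE6_PFX_BIT` and `enable_prefix` are hypothetical and do not come from Hercules or z/VM; the only facts used are that byte 6 currently holds X'D0' and needs to hold X'D2', which corresponds to turning on bit 6.

```python
# Hypothetical names, for illustration only.
# IBM numbers bits from the left: bit 0 = X'80', so bit 6 = X'02'.
RDC_BYTE6_PFX_BIT = 0x80 >> 6


def enable_prefix(rdc_byte6: int) -> int:
    """Return RDC byte 6 with bit 6 turned on, leaving the other bits alone."""
    return rdc_byte6 | RDC_BYTE6_PFX_BIT


print(f"{enable_prefix(0xD0):02X}")  # prints D2
```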

And I'm running Hyperion 4.5 on a Windows Server 2008 R2 Host, so no tricky Linux builds.

Thank you, and there is no urgency about this; I can stuff those Byte 6's with 'D2' from an EXEC for all my CP OWNED DASD. Maybe just slipstream it into 4.6 along with other fixes. I'm about to pull 4.6 and get it going.

Regards,

Charles Perkins

zVMJedi commented 1 year ago

Fish? There is NO urgency about this. I'm probably the only Member of our Community having this problem, that of wanting 7.2 to IPL under itself with the DEVNO DASD scheme, and it doesn't need to be fixed that "soon". Work on it as your time permits; this is no Sev1.

Fish-Git commented 1 year ago

DON'T BOTHER WITH THE PREVIOUSLY REQUESTED I/O CCW TRACE!

By changing my 2nd level directory entry to use MDISK instead of DEDICATE, I am now able to reproduce the problem, and am actively looking into it.

Besides, Hercules's CCW trace logic currently only traces the first 16 bytes of a given CCW, and for the E7 Prefix CCW we need to see 64 bytes. (I'm inserting temporary debugging code for that too)

Just hang loose for a while. I'll eventually figure out where things are going wrong. I just need a little time.

Thanks.

wrljet commented 1 year ago

Charles, while Fish looks into things, it might be an opportunity to get you building Hercules without nervousness. Also any trouble you encounter will help me/us improve the process for others.

I know this isn't really the place for the discussion, but...

Did you do this part of the pre-steps for Windows 7 / PowerShell 5.1?

Out of the box, a fresh Windows 10 installation will not allow you to run PowerShell scripts, for security reasons. We need to relax that. This step will only need to be performed once, the first time you use Hercules-Helper.

Open a PowerShell prompt "As Administrator", and run:

Set-ExecutionPolicy RemoteSigned

Answer Yes when prompted.

For Windows 7 and PowerShell 5.1, also run:

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12

Close the PowerShell window. The changes will go into effect when you next open PowerShell as a normal user.

Fish-Git commented 1 year ago

> Fish? There is NO urgency about this.

I fully understand that. I'm not trying to rush. I just hate bugs!   :)

And if there's a known bug in Hercules that needs to be fixed, when is the best time to fix it?

Answer: right now!

There's no reason to delay the fix to another time. Fixing it now is as good a time as any, right?

> Work on it as your time permits;

Which is exactly what I'm doing!   :)

> this is no Sev1.

I know that. See above.

zVMJedi commented 1 year ago

> Jedi, while Fish looks into things, it might be an opportunity to get you building Hercules without nervousness. Also any trouble you encounter will help me/us improve the process for others.
>
> I know this isn't really the place for the discussion, but...
>
> Did you do this part of the pre-steps for Windows 7 / PowerShell 5.1?
>
> Out of the box, a fresh Windows 10 installation will not allow you to run PowerShell scripts, for security reasons. We need to relax that. This step will only need to be performed once, the first time you use Hercules-Helper.
>
> Open a PowerShell prompt "As Administrator", and run:
>
> Set-ExecutionPolicy RemoteSigned
>
> Answer Yes when prompted.
>
> For Windows 7 and PowerShell 5.1, also run:
>
> [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
>
> Close the PowerShell window. The changes will go into effect when you next open PowerShell as a normal user.

Yes, all of that. I did notice PS would not take Tls12; I had to settle for Tls.

I installed ObREXX just fine. I was just having uncertainty about the command line options at build time, but you answered those with your example.

Feel free to e-mail me at < TOUGH LUCK, BOTS! > (yes, Earthlink still runs us Original Netcom'ers off a sub-domain mailbox!)

Fish-Git commented 1 year ago

> I know this isn't really the place for the discussion, but...

Here is fine!

wrljet commented 1 year ago

> I did notice PS would not take Tls12, I had to settle for Tls

Hmm, well, no idea why. But unless and until there's a problem, we'll ignore that.

I'm really not a PowerShell expert. As far from it as imaginable. And I never fiddled with it before Windows 10 and the whole Hercules-Helper project.

wrljet commented 1 year ago

I grabbed your email address. You might want to edit that out of the comment now, so the 'bots don't harvest it.

arfineman commented 1 year ago

I have a standalone OS that I run under Hercules and all tests on PFX ran fine and the RDC looks good as well. Format 0 Message 4 means 'Invalid Parameter'. So we need to see the 64 bytes of the Prefix parameter to see what CP sent. I doubt very much that CP would send an invalid parameter. Best regards,

zVMJedi commented 1 year ago

Correction to update - 1st level working fine except for the 'E7' Command Rejects whenever a UserID logs on. 2nd level needs further test.

zVMJedi commented 1 year ago

Second level with DEDICATE'd CP VOLS, comes up with the same 'E7' Command Reject that 1st level shows, but otherwise comes up and runs. With DEVNO definitions, now gives this:

19:09:55 Q DA                                                                   
19:09:55 DASD  0720 CP OWNED  VMCOM1   0                                        
19:09:55 DASD  0721 CP SYSTEM 720RL1   0                                        
19:09:55 DASD  1723 CP OWNED  M01RES   37                                       
19:09:55 DASD  1724 CP OWNED  M01S01   0                                        
19:09:55 DASD  1725 CP OWNED  M01P01   0                                        
19:10:13 I CMS                                                                  
19:10:13 HCPERP515I  DASD  1724 AN OPERATION WAS TERMINATED BECAUSE AN END     
19:10:13 HCPERP515I  OF CYLINDER OCCURRED                                      
19:10:13 HCPERP6300I SENSE DATA FORMAT = 00       MSG CODE = 00                
19:10:13 HCPERP6301I CHANNEL COMMAND WORD COMMAND CODE = N/A HPF               
19:10:13 HCPERP6302I SEEK ADDRESS =   0000009C000E                             
19:10:13 HCPERP6303I SENSE = 00200000 00FFFF00 00000000 00000000 00000000      
19:10:13 HCPERP6303I 00000000 00000080 00009C0E                                
19:10:13 HCPERP6304I IRB = 00C24017 1F6A5090 0E000000 00800000                 
19:10:13 HCPERP6305I USERID = SYSTEM                                           
19:10:13 HCPERP2216I CHANNEL PATH ID = 17                                      
19:10:13 HCPMCV1459E The virtual machine is placed in check-stop state due to a 
system malfunction with CPU 00.                                                 
19:10:58 B                                                                      
19:10:58 HCPCFF1455E CPU 00 is not started because it is in check-stop state.   
19:11:20 I CMS                                                                  
19:11:21 HCPERP515I  DASD  1724 AN OPERATION WAS TERMINATED BECAUSE AN END     
19:11:21 HCPERP515I  OF CYLINDER OCCURRED                                      
19:11:21 HCPERP6300I SENSE DATA FORMAT = 00       MSG CODE = 00                
19:11:21 HCPERP6301I CHANNEL COMMAND WORD COMMAND CODE = N/A HPF               
19:11:21 HCPERP6302I SEEK ADDRESS =   0000009C000E                             
19:11:21 HCPERP6303I SENSE = 00200000 00FFFF00 00000000 00000000 00000000      
19:11:21 HCPERP6303I 00000000 00000080 00009C0E                                
19:11:21 HCPERP6304I IRB = 00C24017 1F69D090 0E000000 00800000                 
19:11:21 HCPERP6305I USERID = SYSTEM                                           
19:11:21 HCPERP2216I CHANNEL PATH ID = 17                                      
19:11:21 HCPMCV1459E The virtual machine is placed in check-stop state due to a 
system malfunction with CPU 00.                                                 

That END OF CYLINDER and CPU 00 being put in check-stop are new, and it's no longer reporting the CCW code either.

6:36pm CDT went back to DEDICATED devices, 2nd level now comes up with same Command Rejects as first-level, but no complaining about END OF CYLINDER or CPU's in check-stop. IPL CMS and IPL 190 working again.

Fish-Git commented 1 year ago

UPDATE:

It now looks like it's going to take me a lot longer than originally anticipated.   :(

> I have a standalone OS that I run under Hercules and all tests on PFX ran fine and the RDC looks good as well.

Aaron?

Do you think I can get a copy of that standalone OS of yours? (along with instructions on how to run/use it?)

Or maybe you can do the following for me?

I'd like to extract all of the E7 Prefix CCW commands that your OS is issuing (that you are saying work just fine on Hercules) so that I can add them into a new Hercules standalone "runtest" Quality Assurance test.

Having a quick test program that tests various combinations of the E7 Prefix CCW sure would be handy and valuable. It would verify that our new/fixed E7 CCW handling code (that I'm eventually going to have to write) is working properly, so that problems such as the ones we're having now don't recur at some point in the future.

Can you help me out? Thanks!

Fish-Git commented 1 year ago

> Format 0 Message 4 means 'Invalid Parameter'. So we need to see the 64 bytes of the Prefix parameter to see what CP sent.

Here you go!

15:08:03.186 HHC01318I 0:0123 CHAN: test I/O: cc=0
15:08:03.187 HHC01334I 0:0123 CHAN: ORB: IntP:00EC5CE8 Key:0 LPM:80 Flags:0C200 ....FP....H. ........ CCW:5554D560
15:08:03.187 HHC01315I 0:0123 CHAN: ccw E7400041 5554D510=>01800000 00000000 00000000 40C00000 00000000 01ED0004 01ED0000 00000000 00000000 00000000 00000000 3F00000D 01ED0004 00000000 00FF0000 000A0001 ............ {..................................................
15:08:03.187 +++ CR,04 @ ckddasd.c(4763)
15:08:03.187 HHC01315I 0:0123 CHAN: ccw E7400041 5554D510=>01800000 00000000 00000000 40C00000 00000000 01ED0004 01ED0000 00000000 00000000 00000000 00000000 3F00000D 01ED0004 00000000 00FF0000 000A0001 ............ {..................................................
15:08:03.187 HHC01312I 0:0123 CHAN: stat 0E00, count 0000
15:08:03.187 HHC01313I 0:0123 CHAN: sense 80000000 00FFFF04 00000000 00000000 00000000 00000000 00000080 00011800
15:08:03.187 HHC01314I 0:0123 CHAN: sense CMDREJ
15:08:03.187 HHC00806I Processor CP01: I/O interrupt code 0001000C parm 00EC5CE8 id 00000000
15:08:03.188 HHC01317I 0:0123 CHAN: scsw 00C24017, stat 0E00, count 0000, ccw 5554D568
  1. It's a slight coding anomaly (bug) as to why the CCW is traced twice, and isn't really worth fixing at this point.
  2. The +++ CR,04 @ ckddasd.c(4763) message is my debug message that identifies precisely where in the Hercules code things are going wrong.  (It's failing to check for CKDOPER_EXTOP == X'3F')
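As a cross-check of the "Format 0 Message 4" reading, the sense bytes from the trace above can be decoded with a few lines of Python. This is only a sketch: the function name `decode_sense` is made up, and it assumes that sense byte 0 bit 0 (X'80') is Command Reject and that byte 7 carries the format in its high nibble and the message code in its low nibble, which is consistent with the CMDREJ line in the trace.

```python
def decode_sense(sense_hex: str) -> dict:
    """Pick the command-reject bit and the format/message nibbles out of ECKD sense data."""
    sense = bytes.fromhex(sense_hex)  # fromhex() ignores the spaces between words
    return {
        "command_reject": bool(sense[0] & 0x80),  # byte 0, bit 0
        "format": sense[7] >> 4,                  # byte 7, high nibble (assumed layout)
        "message": sense[7] & 0x0F,               # byte 7, low nibble (assumed layout)
    }


# Sense bytes copied from the HHC01313I line in the trace.
trace_sense = "80000000 00FFFF04 00000000 00000000 00000000 00000000 00000080 00011800"
print(decode_sense(trace_sense))
```

With the trace data this reports a command reject with format 0 and message 4, matching what Aaron described as 'Invalid Parameter'.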

Here's the breakdown of the E7 CCW data:

E7 = 01800000 00000000 00000000 40C00000
     00000000 01ED0004 01ED0000 00000000
     00000000 00000000 00000000 3F00000D
     01ED0004 00000000 00FF0000 000A0001

0        4        8        12
01800000 00000000 00000000 40C00000

01 = Locate Record Extended
80 = Define Extent field valid

16       20       24       28
00000000 01ED0004 01ED0000 00000000

32       36       40       44
00000000 00000000 00000000 3F00000D

Define Extent:

  Mask byte:                    40
  Global Attributes:            C0
  Blocksize (bytes):            0000
  (ignored):                    0000
  Global attributes Additional: 00
  Global attributes Extended:   00
  Beginning of Extent Address:  01ED0004
  End of Extent Address:        01ED0000                <---- (Yep!)----<<<
  System Time Stamp:            00000000 00000000
  (ignored):                    00000000 00000000

48       52       56       60
01ED0004 00000000 00FF0000 000A0001

Locate Record Extended:

  Operation byte:            3F
  Auxiliary byte:            00
  (reserved):                00
  Count:                     0D
  Seek address:              01ED0004
  Search argument:           0000000000
  Sector number:             FF
  Transfer length factor:    0000
  (reserved):                00
  Extended operation byte:   0A
  Extended parameter length: 0001

  Extended Parameter:        xx  (unknown!)
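The same breakdown can be reproduced mechanically. The Python sketch below slices the 64 traced bytes using the offsets shown above (Define Extent field at bytes 12-43, Locate Record Extended field at bytes 44-63). The function and key names are invented for illustration and have nothing to do with the names used in ckddasd.c.

```python
import struct

# The 64 bytes of E7 Prefix data exactly as traced above.
PREFIX_DATA = bytes.fromhex(
    "01800000 00000000 00000000 40C00000"
    "00000000 01ED0004 01ED0000 00000000"
    "00000000 00000000 00000000 3F00000D"
    "01ED0004 00000000 00FF0000 000A0001"
)


def cchh(raw: bytes) -> tuple:
    """Split a 4-byte CCHH track address into (cylinder, head)."""
    return struct.unpack(">HH", raw)


def parse_prefix(data: bytes) -> dict:
    """Slice the Prefix data into the fields listed in the breakdown (hypothetical names)."""
    de = data[12:44]   # Define Extent field
    lre = data[44:64]  # Locate Record Extended field
    return {
        "format": data[0],
        "validity": data[1],
        "de_mask": de[0],
        "de_global_attrs": de[1],
        "de_begin": cchh(de[8:12]),
        "de_end": cchh(de[12:16]),
        "lre_operation": lre[0],
        "lre_count": lre[3],
        "lre_seek": cchh(lre[4:8]),
        "lre_sector": lre[13],
        "lre_ext_operation": lre[17],
        "lre_ext_param_len": struct.unpack(">H", lre[18:20])[0],
    }


for name, value in parse_prefix(PREFIX_DATA).items():
    print(f"{name:18} {value}")
```

Note that `de_end` comes out as cylinder X'01ED' head 0, which is lower than `de_begin` at cylinder X'01ED' head 4.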

> I doubt very much that CP would send an invalid parameter.

I doubt it too.

It sure seems to be okay/valid. Would you agree?

arfineman commented 1 year ago

Hi Fish, This is a valid locate record extended operation. I know Hercules doesn't support that. But it bothers me to see why zVM is sending it and it wasn't before. Let me check a few things and get back to you. Best regards,

arfineman commented 1 year ago

Fish, Are you 100% certain about the beginning and end of extent? The parameter shows the begin is higher than end and that would make it invalid. Locate Record Extended operations are described in SA22-1025-00 that was recently posted. Best regards,

Fish-Git commented 1 year ago

> This is a valid locate record extended operation.

Yep.   (kind of, sort of; see my next reply further below)

> I know Hercules doesn't support that.

Didn't, not doesn't, would be more accurate.

You are correct that the official 4.6 release doesn't support it, but we're (I'm!) in the process of trying to add support for it for Charles so he can IPL his second level system under z/VM. That's what I've been working on for the past 2+ days.

You said it was going to be a "quick fix". You said all we had to do was turn on RDC byte 6 bit 6 -- the "Prefix CCW is supported and enabled" bit. So that's what I did.

But then we discovered the bark was apparently worse than the bite. The so called "quick fix" did more harm than good. It caused an avalanche of problems which I have been diligently working on to try and get fixed these past two days.

> But it bothers me to see why zVM is sending it and it wasn't before.

Because I enabled it in RDC byte 6 bit 6 per your instruction! Duh!   :)

Note: I have since rescinded (backed out) the change/commit that enabled it until I can manage to figure out what's going on.

Fish-Git commented 1 year ago

> Are you 100% certain about the beginning and end of extent? The parameter shows the begin is higher than end and that would make it invalid. Locate Record Extended operations are described in SA22-1025-00 that was recently posted.

Yes.   I am ABSOLUTELY 100% CERTAIN the E7 Prefix CCW data that you see that I posted is indeed what z/VM is issuing.

And yes, I too noticed that the end of extent was lower than the beginning of extent, which ends up causing a File Protect error when the LRE seek address is validated against it.

But then once I added the missing support for the E7 Field Validity and Auxiliary bytes (bytes 1 and 3, respectively), the problem went away due to z/VM always setting the X'80' Field Validity bit ("Define Extent Field Valid"; see page 4-43 of SA22-1025-00). Once I checked for that bit being on and bypassed the LRE seek address validation as a result, the FP (File Protect) errors went away.

Unfortunately though, that introduced yet another problem: z/VM now just "hangs" (gets stuck) in the middle of its startup/initialization. That's where I'm at now.

There's still some more code I need to add, but I'm getting tired and need a break.

IN ANY CASE, because it's still not working, and because setting the RDC bit to enable E7 Prefix CCW support clearly is not working correctly, I have BACKED OUT (reverted) my previous commit that enabled it.

I am able to reproduce Charles's problem on my own system now, so I can continue working on this issue at my leisure. I'm sure I'll eventually get it working. It's just going to take me a while, and as Charles said, there's absolutely no rush anyway.

If you have any suggestions regarding (or have any information about) how best to handle the bogus Define Extent End-of-Extent value that z/VM is issuing, please let me know! I think it might be what's possibly causing Hercules's track advancement logic to run away (continue advancing forever). Thanks.

arfineman commented 1 year ago

Hi Fish, I personally think that turning the bit on added more functionality and allowed the second level system to IPL, and the I/O error on the LRE operation is being retried and caused no harm. If I were to choose between not being able to IPL my second level system and IPLing with an I/O error that recovers, I'll certainly choose the latter. I did mention from the start that Prefix should be more associated with LRE than LR. So, until LRE support is added, the PFX support should be limited to what it currently is.
Best regards,

arfineman commented 1 year ago

See below from page 4-13 of SA22-1025-00. If End of Extent is smaller than Beginning of Extent, the command should be immediately rejected:

Beginning of Extent Address – Bytes 8 through 11

Bytes 8 through 11 contain the address (CCHH) of the first track in the extent. (“Define Extent Command” on page 3-3 describes an extent.) CCHH must be a valid track address for the logical volume.

If (CCHH) is not a valid track address, the command is rejected with unit check status. The sense data contains command reject with format 0, message 4. See Appendix A, "Device Characteristics" on page A-1 for valid track address ranges.

End of Extent Address – Bytes 12 through 15

Bytes 12 through 15 contain the address (CCHH) of the last track in the extent. (CCHH) must be equal to or greater than the (CCHH) value specified by the “beginning of extent address” in bytes 8 through 11. The track address must be valid for the access authorization and logical volume type. (See “Beginning of Extent Address – Bytes 8 through 11” for the valid requirements.)

If (CCHH) is not a valid track address, or the (CCHH) is less than the value in bytes 8–11, the command is rejected with unit check status. The sense data contains command reject with format 0, message 4. See Appendix A, "Device Characteristics" on page A-1 for valid track address ranges.
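The rule quoted above is easy to express in code. The following Python sketch checks only the ordering of the two addresses; the valid-track-range check is left out because it needs the device geometry. The names are hypothetical, and whether this rule should also apply to the Define Extent field embedded in a Prefix CCW is exactly what is being debated in this thread.

```python
# (format, message) pair for "command reject, format 0, message 4" as quoted above.
CMDREJ_INVALID_PARAMETER = (0, 4)


def check_extent(begin: tuple, end: tuple):
    """Return None if the extent ordering is acceptable, else the (format, message) to report.

    begin and end are (cylinder, head) tuples; tuples compare cylinder first, then head.
    """
    if end < begin:
        return CMDREJ_INVALID_PARAMETER
    return None


# Values from the traced Prefix data: begin = 01ED0004, end = 01ED0000.
print(check_extent((0x01ED, 0x0004), (0x01ED, 0x0000)))
# A one-track extent, where end equals begin, passes the ordering check.
print(check_extent((0x01ED, 0x0004), (0x01ED, 0x0004)))
```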

arfineman commented 1 year ago

Here is my recommendation in regard to tracing the bogus define extent from the second level system:

arfineman commented 1 year ago

My recommendation for full Prefix command support. Perhaps this is for future, because this requires full LRE support and I think there is already an enhancement request for that.

Byte 0 is '00' (Format 0):

  • Byte 1 bit 0 is off: Define Extent not valid. Treat as NOOP CCW and move on with life (for now until HPAV support).
  • Byte 1 bit 0 is on: Define Extent is valid. Process bytes 12-27 as Define Extent CCW. Set flag DX received for chaining requirements.

Byte 0 is '01' (Format 1):

  • Byte 1 bit 0 is off: Define Extent valid must be on. Command reject.
  • Byte 1 bit 0 is on: Define Extent is valid. Process bytes 12-27 as Define Extent CCW. Process bytes 44-63(+) as Locate Record Extended CCW. Set flag LRE received for chaining requirements.

Byte 0 is '02' (Format 2):

  • Process bytes 62-xx as Perform Subsystem Function CCW.

Anything else: command reject.
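To make the proposal above easier to discuss, here it is as a small Python decision function. This is only a sketch of the recommendation as written in this comment, with made-up action names; it is not how Hercules handles the E7 CCW, and the Format 0 "treat as NOOP" branch in particular is questioned later in the thread.

```python
def plan_prefix(data: bytes) -> list:
    """Return the list of (hypothetically named) processing steps for a Prefix CCW."""
    fmt = data[0]                    # byte 0: format
    de_valid = bool(data[1] & 0x80)  # byte 1 bit 0: Define Extent field valid

    if fmt == 0x00:
        return ["define_extent"] if de_valid else ["noop"]
    if fmt == 0x01:
        if not de_valid:
            return ["command_reject"]
        return ["define_extent", "locate_record_extended"]
    if fmt == 0x02:
        return ["perform_subsystem_function"]
    return ["command_reject"]


# Format 1 with the Define Extent valid bit on, as in the traced E7 data (01 80 ...).
print(plan_prefix(bytes([0x01, 0x80])))
```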

Fish-Git commented 1 year ago

> If I were to choose between not being able to IPL my second level system and IPLing with an I/O error that recovers, I'll certainly choose the latter.

I choose being able to IPL both first level or second level without any I/O errors, recoverable or otherwise.

Until then, I prefer to prevent any first level I/O errors regardless of whether they're recoverable or not.

Being able to IPL z/VM second level is not a priority.

Ensuring an I/O error free native (i.e. first level) z/VM IPL however, is.

> I did mention from the start that Prefix should be more associated with LRE than LR.

I agree. But most of the Prefix logic is already in Herc's LR code, so that's what I have to fix. (I didn't write Herc's Prefix support. Someone else did. Bob Wood did I believe. I'm just the poor sucker that has to deal with it!)

> So, until LRE support is added, the PFX support should be limited to what it currently is.

Eh? We already support LRE! It's CCW opcode X'4B'.

Support for X'E7' Prefix however, as I explained, was, for reasons unknown, stuffed into Herc's X'47' LR code, so for now, it's going to stay where it is.

Fish-Git commented 1 year ago

> See below from page 4-13 of SA22-1025-00. If End of Extent is smaller than Beginning of Extent, the command should be immediately rejected:

That's for the X'63' Define Extent CCW, not the X'E7' Prefix CCW.   (And Herc is already enforcing that in DE.)

According to page 4-43 that describes the Prefix command's Format, Validity and Auxiliary bytes however, since the X'80' "Define Extent Field Valid" bit is on, all Define Extent values (bytes 12-43) should be presumed to be valid.

(I'm presuming the fact that bit 4 of byte 3 (Auxiliary byte) is off ("Check all parameters in the Define Extent and Locate Record Extended fields") should be interpreted as "Check only the Locate Record Extended parameters" instead, due to the previously mentioned X'80' Field Validity bit being on.)

I admit I could be wrong about that though. Details such as this are unfortunately not specifically documented anywhere, so one is forced to draw logical conclusions from the information that is available.

Besides, like I said, I find it highly unlikely that IBM would code z/VM to issue invalid Prefix CCW commands!

Fish-Git commented 1 year ago

> Here is my recommendation in regard to tracing the bogus define extent from the second level system:
>
>   • Logon to the second level userid.
>   • Issue SET DIALDROP OFF
>   • Issue CP DEF GRAF 020
>
> (snipped)
>
>   • RECEIVE the reader trace file and we can go thru it.
>   • That's when I'll believe zVM sent wrong extent info!

I'll do it, but only out of curiosity so we can see what the second level z/VM's I/O looks like, which IMO is immaterial.

To me, the only important thing is what I/O the first level host is issuing on behalf of its second level guest's I/O request. That information we already have, and unfortunately(?) it has what appears to be a bogus Define Extent end-of-extent field (which, as I explained in my previous reply above, I'm presuming should be accepted due to bit 0 of byte 1 being set).

arfineman commented 1 year ago

I misunderstood. I thought the I/O error was occurring on the second level system, not first. The field validity bits are meant to be interpreted as 'Present' not 'Valid'. As I previously posted, if the Define Extent valid bit is off (possible for format 0 only) means a Define Extent CCW will follow in the chain.
Best regards,

arfineman commented 1 year ago

"Logon to the second level system" should have been "Logon to second level userid", because the original wording could be interpreted as "logon to a userid on the second level system". My apologies.

(Fish edit: I have made the requested correction to your post. FYI: since it was your post/reply, you could have edited (fixed) it yourself you know!)

Fish-Git commented 1 year ago

> Perhaps this is for future, because this requires full LRE support and I think there is already an enhancement request for that.

AFAIK, Hercules already has full LRE support. (And I cannot find any open or closed GitHub Issue related to Locate Record Extended.)

> Byte 0 is '00' Format 0: Byte 1 bit 0 is off: Define Extent not valid. Treat as NOOP CCW and move on with life (for now until HPAV support).

That doesn't make sense. Why in the world would one issue a basic Prefix command -- which requires Define Extent parameters -- and then ask that it be ignored?

I suspect you're interpreting the byte 1 bit 0 flag incorrectly. IMO, the bit simply means whether the DE information should be validated or not (i.e. presumed to be correct or not), not that it should be completely ignored! I believe it should still always be processed (i.e. used).

Fish-Git commented 1 year ago

> The field validity bits are meant to be interpreted as 'Present' not 'Valid'.

Are you 100% sure about that? Because that doesn't make any sense to me. Why even provide that data in the command if it's not even needed? Why not just leave all of those bytes zero/uninitialized? I question your interpretation!

> if the Define Extent valid bit is off (possible for format 0 only) means a Define Extent CCW will follow in the chain.

Huh?! That's nuts! Why bother even having the Prefix CCW in the chain at all then, if it's just going to be immediately followed with a Define Extent?! Sheesh! Just start your chain with Define Extent! That's the silliest thing I ever heard!

arfineman commented 1 year ago

I'm confused (Nothing unusual, it's an everyday thing).

If Hercules has full LRE support then what is this? Missing: DASD "Write Full Track" and "Write Track Data" support #85

> Sheesh! Just start your chain with Define Extent! That's the silliest thing I ever heard!

What if you are zVM and a guest just sent you a channel program to execute, and you are about to execute it, but the device is busy? Normally you say Damn! I got to queue this (and believe me device queues pile up very quickly).

But luckily you are on a controller that has the HPF feature. You grab a free HyperPAV Alias, put the base device's CCA in the Prefix CCW, chain the Prefix CCW to the original CCW chain, and bingo :) No wait!

Fish-Git commented 1 year ago

> If Hercules has full LRE support then what is this? Missing: DASD "Write Full Track" and "Write Track Data" support #85

Hmmm... I missed that. Sorry. That GitHub Issue does indeed mention that Hercules needs LRE support added to it.

WHY it says that I do not know. That GitHub Issue was created, according to GitHub, on Mar 25, 2018, and LRE support has existed in Hercules since November 2007:

Revision: a6bdc97304d578f952131d98b10d358b773c13b9
Author: Greg Smith <gsmith>
Date: 11/20/2007 4:31:38 PM
Message: LRE support (try #1)
git-svn-id: file:///home/jj/hercules.svn/trunk@4476 956126f8-22a0-4046-8f4a-272fa8102e63
----
Modified: CHANGES
Modified: ckddasd.c
Modified: hstructs.h

Revision: 3137fedccefcdec0466b1a3452f27056b77c1ba4
Author: Greg Smith <gsmith>
Date: 11/25/2007 11:30:21 AM
Message: fix LRE length check, thanks Fish!! - Greg
git-svn-id: file:///home/jj/hercules.svn/trunk@4490 956126f8-22a0-4046-8f4a-272fa8102e63
----
Modified: ckddasd.c

Revision: 115c0bd97f9d850f115dac41dc3565d1325d06f2
Author: Fish (David B. Trout) <fish@infidels.org>
Date: 1/10/2013 6:42:34 AM
Message: Fix 3990-3 CKD CU command reject for 0x4B LRE CCW
----
Modified: ckddasd.c

Revision: 5f8118fd842cffa35310e7245e6afdf89b2a6b21
Author: John P. Hartmann <jphartmann@gmail.com>
Date: 2/13/2013 2:12:58 AM
Message: Set EOT mark also for LRE RA
----
Modified: ckddasd.c

I will edit that issue to remove that erroneous/misleading information. The request was specifically only for the Write Full Track (WFT = X'95') and Write Track Data (WTD = X'A5') opcodes anyway. I have no clue how LRE snuck into there.

My apologies for the confusion. I'll get that fixed right away.

zVMJedi commented 1 year ago

> I misunderstood. I thought the I/O error was occurring on the second level system, not first. The field validity bits are meant to be interpreted as 'Present' not 'Valid'. As I previously posted, if the Define Extent valid bit is off (possible for format 0 only) means a Define Extent CCW will follow in the chain. Best regards,

You're correct, to a point. The I/O error first appeared on my 2nd Level 7.2 system as I attempted to IPL it on a well-behaved 1st-Level 7.2 on a Hyperion 4.5 Host with the 2nd-level Guest 7.2 using DEVNO-defined DASD.

That produced the Command Reject that kept it from even getting the CP Directory online.

So, I reported the problem.

You had me try the "stuff D2 into Byte 6 of the RDCBK" trick, effectively setting Bit 6, which worked against my SYSRES IPL volume, but then more Command Rejects appeared against the next volume which was my SPOOL volume, probably as UserID's were asking for the CMS Saved System.

Fish went into action, to replicate the problem on his system(s).

He couldn't replicate it at first, because he was using DEDICATE'd DASD definitions. As soon as I pointed that out, he got the error, also.

He wrote the 'first fix'.

The 'first fix' now produces the 'E7' Command Rejects on 1st-level, which seem to be harmless and those same harmless 'E7' Rejects on 2nd-level as long as 2nd-level uses DEDICATE'd DASD.

But the moment DEVNO is tried again for the 2nd-Level 7.2, both levels then give I/O errors, with the 2nd-level again being most severe, including producing the 'END OF CYLINDER' Reject, which appears to cause not only the CMS Saved System to malfunction at IPL (i cms), but also even when IPL'd from its 'home' (the RECOMP'd cylinders on the 190 MDISK (i 190)).

Fish-Git commented 1 year ago

> You're correct, to a point. The I/O error first appeared on my 2nd Level 7.2 system ...

Yep. A most excellent and accurate summary of events so far. Thank you, Charles.

I would only add that the 'first fix' that I implemented (only in the 'develop' branch of the repository) was the "just ensure byte 6 bit 6 of the RDC was always initialized to '1'" quick fix, that was later backed out since it seemed to trigger unexpected (and unwanted!) side effects.

That's where we stand today.

I'm still hard at work (but I'm taking my time now) trying to get everything working smoothly for both first and second level, regardless of whether MDISK or DEDICATE is used.

I'll figure it out eventually.   :)

arfineman commented 1 year ago

> The request was specifically only for the Write Full Track (WFT = X'95') and Write Track Data (WTD = X'A5') opcodes anyway. I have no clue how LRE snuck into there.

Probably because these CCWs need to be in an LRE domain with specific operations that are not supported by LRE. WFT (x'95') needs to be in an LRE domain with the Write Trackset operation and WTD (x'A5') needs to be in an LRE domain with the Update Write Trackset operation. Neither operation is currently supported by LRE. Best regards,

Fish-Git commented 1 year ago

> Probably because ... WFT (x'95') needs to be in a LRE domain with Write Trackset operation and WTD (x'A5') needs to be in a LRE domain with Update Write Trackset operation. Neither operation is currently supported by LRE.

Ah. I see. That makes sense then. Thanks.

But the point is, support for LRE does already exist. It may not have all that's needed in order to properly implement the two CCWs in question, but that's just part of the chore. There's no call for saying LRE isn't supported when it actually is!

Fish-Git commented 1 year ago

Fix committed: 587608c5bf5ba1dff63baec781a10ef64f8e34ba.

Please do a git pull and rebuild.

Basically what I did was break out the existing Define Extent, Locate Record Extended and Perform Subsystem Function CCW processing logic, and move them into separate callable subroutines, and then simply code a new very simple and straightforward E7 Prefix CCW processing case that just calls into one or more of the previously mentioned separate functions, et voilà! Instant E7 Prefix CCW support!  :)

Of course the actual implementation involved a bit more than just that (the devil is always in the details!), but that was essentially it. It was actually a fairly simple and straightforward and easy fix for once, and it works great! I tested z/OS 2.5c native, z/OS 2.5c under z/VM 7.3, z/VM 7.2 native, and z/VM 7.2 as a second level guest under first level z/VM 7.2, with the second level guest's dasd defined both as dedicated dasd devices AS WELL AS with them defined as full pack minidisks. Everything worked just fine. Everything ran squeaky clean.

IBM DOC BUG?

Interestingly, I discovered a serious omission in IBM's documentation (*) for the E7 Prefix and 4B Locate Record Extended CCWs in manual SA22-1025-00 "S/390 Internal Disk Subsystem - Reference Guide (Multiprise 3000)":

On page 4-28 where it documents the Locate Record Extended's "Operation Byte" (byte 0), it presents a little table for valid Operation Code bit values for bits 2-7:

| Value | Operation Code |
|---|---|
| 00 0001 | Write Data (01) |
| 00 0011 | Format Write (03) |
| 00 1011 | Write Track (0B) |
| 00 1100 | Read Tracks (0C) |
| 01 0110 | Read (16) |
| 11 1111 | Extended Operation (3F) |

The only problem is, it fails to mention that the bit value X'16' (Read) happens to be invalid when used in a E7 Prefix CCW! It is only valid when used in a normal 4B Locate Record Extended CCW!  (Oops!)

When issuing a E7 Prefix CCW, the Locate Record Extended field's Operation Byte value (byte 44) must use X'06' for the Read operation instead, not X'16' like normal.
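To make the bit positions concrete, here is a small sketch that decodes the Operation Code from the table and fetches it from byte 44 of Prefix parameter data. The function and macro names are hypothetical, and it assumes IBM bit numbering (bit 0 is the most significant), so "bits 2-7" are the low six bits of the byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LRE_OP_MASK        0x3F  /* bits 2-7 (bit 0 is the most significant) */
#define PFX_LRE_OP_OFFSET  44    /* Operation Byte position in Prefix data   */

/* Operation name per the table on page 4-28; NULL if not listed there. */
const char *lre_operation_name(uint8_t operation_byte)
{
    switch (operation_byte & LRE_OP_MASK) {
    case 0x01: return "Write Data";
    case 0x03: return "Format Write";
    case 0x0B: return "Write Track";
    case 0x0C: return "Read Tracks";
    case 0x16: return "Read";
    case 0x3F: return "Extended Operation";
    default:   return NULL;
    }
}

/* Fetch the Operation Code carried inside E7 Prefix parameter data. */
bool prefix_lre_operation(const uint8_t *parm, size_t len, uint8_t *op_out)
{
    if (parm == NULL || op_out == NULL || len <= PFX_LRE_OP_OFFSET)
        return false;              /* too short to contain byte 44 */
    *op_out = parm[PFX_LRE_OP_OFFSET] & LRE_OP_MASK;
    return true;
}
```

Note that `lre_operation_name` returns NULL for X'06', which is exactly the gap in the documented table: the value the operating systems actually send in a Prefix CCW is not listed.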

I discovered this simple fact during testing. z/OS and z/VM were both always issuing their E7 Prefix CCWs with X'06' in byte 44, and until I added the special code needed to accommodate that fact, they each would get a fatal Command Reject I/O error during IPL and go immediately into a disabled wait.

My first attempt at a fix just naively changed the #define from X'16' to X'06', but then all 4B Locate Record Extended I/Os began failing. So I had to tweak the code to check for X'16' only for the 4B CCW opcode, or X'06' for the E7 CCW opcode. Once I did that, then things magically started working across the board, in all test scenarios.  :)
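The shape of that tweak can be sketched as follows. This is not the committed code: the macro and function names are hypothetical, and it assumes the comparison is made on the low six bits of the Operation Byte.

```c
#include <stdbool.h>
#include <stdint.h>

#define CCW_LRE          0x4B  /* Locate Record Extended */
#define CCW_PREFIX       0xE7  /* Prefix                 */

#define LRE_OP_MASK      0x3F  /* low six bits hold the Operation Code */
#define LRE_OP_READ      0x16  /* Read, as sent in a 4B LRE CCW        */
#define PFX_LRE_OP_READ  0x06  /* Read, as sent in an E7 Prefix CCW    */

/* True if the Operation Byte means "Read" for the CCW that carries it. */
bool is_read_operation(uint8_t ccw_opcode, uint8_t operation_byte)
{
    uint8_t op = operation_byte & LRE_OP_MASK;

    switch (ccw_opcode) {
    case CCW_LRE:    return op == LRE_OP_READ;
    case CCW_PREFIX: return op == PFX_LRE_OP_READ;
    default:         return false;  /* not an LRE-carrying CCW */
    }
}
```

Keying the expected value on the CCW opcode is what avoids the failure mode of the first attempt, where changing a single shared constant fixed E7 but broke 4B.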

So, almost an "IBM DOC BUG", but not quite. More of a serious omission than anything.

That's it. Please test thoroughly and report back if you experience any problems. Until then I'm going to mark this issue as closed (resolved).

Thanks for an interesting week! I had fun on this one!  :)


(*)  To be fair to IBM though, the SA22-1025-00 manual is suffixed with "-00", meaning it's a first edition of that particular manual. So it's possible I suppose that they corrected the omission in a later edition and I just haven't been able to find it anywhere.

But until one happens to turn up somewhere, I'm going to grab my chance while I still have it, and call it a bona fide IBM DOC BUG, thereby allowing me to rightly redeem my well deserved minor claim to fame. ;-)

zVMJedi commented 1 year ago

Thank you! Ok, with the reminder that just a week ago(!) I was a mere GitHub Voyeur, so when you say 'do a git pull and rebuild', I'm already lost.

Since Bill Lewis was kind enough to supply the command line he used after he kindly did the "first fix" rebuild for me, is the following command going to do it for me?

    .\hercules-buildall.ps1 -BuildDir C:\xfer\hercules-develop -VS2017 -GitBranch develop

where of course I supply my own desired output directory instead of his C:\xfer\hercules-develop.

Or do I maybe need to use the -GitCommit switch with that "commit number", 587608c?

Fish-Git commented 1 year ago

I'm sure Bill will probably jump in here to provide a definitive answer for you, but I believe, yes, that's all you have to do: simply issue that same command just as you did before (but using your output directory as you said), and I believe it should automatically do the pull (refresh) for you and rebuild you the new updated revision.

Bill? True?

p.s. The actual (full) commit number is "587608c5bf5ba1dff63baec781a10ef64f8e34ba", not "587608c". GitHub just likes to shorten what it perceives as an actual git commit hash number into a much shorter helpful(?) link, that when clicked, shows you the actual commit details. But this is git shit that needn't really concern you. I only mention it for the record.
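In other words, the short form is simply a leading prefix of the full hash. A tiny sketch of that relationship (the function name is hypothetical, and real git additionally requires the prefix to be unambiguous within the repository, which this does not check):

```c
#include <stdbool.h>
#include <string.h>

/* True if `abbrev` is a non-empty leading prefix of `full`. */
bool is_abbrev_of(const char *abbrev, const char *full)
{
    size_t n = strlen(abbrev);
    return n > 0 && n <= strlen(full) && strncmp(abbrev, full, n) == 0;
}
```

So `is_abbrev_of("587608c", "587608c5bf5ba1dff63baec781a10ef64f8e34ba")` holds, and both spellings refer to the same commit.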

zVMJedi commented 1 year ago

As you can see, Houston we have a problem. I have no idea why it is complaining about my internet connection, it's the one I'm communicating on right now. Any suggestions appreciated. Thank you.

    NuGet provider is required to continue
    PowerShellGet requires NuGet provider version '2.8.5.201' or newer to interact with NuGet-based repositories.
    The NuGet provider must be available in 'C:\Program Files\PackageManagement\ProviderAssemblies' or
    'C:\Users\MrPerkins\AppData\Local\PackageManagement\ProviderAssemblies'.
    You can also install the NuGet provider by running 'Install-PackageProvider -Name NuGet -MinimumVersion 2.8.5.201 -Force'.
    Do you want PowerShellGet to install and import the NuGet provider now?
    [Y] Yes [N] No [S] Suspend [?] Help (default is "Y"): Y
    WARNING: Unable to download from URI 'https://go.microsoft.com/fwlink/?LinkID=627338&clcid=0x409' to ''.
    WARNING: Unable to download the list of available providers. Check your internet connection.
    PackageManagement\Install-PackageProvider : No match was found for the specified search criteria for the provider 'NuGet'.
    The package provider requires 'PackageManagement' and 'Provider' tags. Please check if the specified package has the tags.
    At C:\Program Files\WindowsPowerShell\Modules\PowerShellGet\1.0.0.1\PSModule.psm1:7405 char:21

    PackageManagement\Import-PackageProvider : No match was found for the specified search criteria and provider name 'NuGet'.
    Try 'Get-PackageProvider -ListAvailable' to see if the provider exists on the system.
    At C:\Program Files\WindowsPowerShell\Modules\PowerShellGet\1.0.0.1\PSModule.psm1:7411 char:21

    WARNING: Unable to download from URI 'https://go.microsoft.com/fwlink/?LinkID=627338&clcid=0x409' to ''.
    WARNING: Unable to download the list of available providers. Check your internet connection.
    PackageManagement\Get-PackageProvider : Unable to find package provider 'NuGet'. It may not be imported yet.
    Try 'Get-PackageProvider -ListAvailable'.
    At C:\Program Files\WindowsPowerShell\Modules\PowerShellGet\1.0.0.1\PSModule.psm1:7415 char:30

    Transcript stopped, output file is C:\Hercules_Helper_Windows\hercules-helper-windows-master\hercules-helper-20230621_20-34-05.log
    Install-Module : NuGet provider is required to interact with NuGet-based repositories.
    Please ensure that '2.8.5.201' or newer version of NuGet provider is installed.
    At C:\Hercules_Helper_Windows\hercules-helper-windows-master\hercules-buildall.ps1:368 char:9

    PS C:\Hercules_Helper_Windows\hercules-helper-windows-master>

Is there some other place to obtain this 'NuGet provider' ?

Fish-Git commented 1 year ago

@wrljet Bill? HELP!  :(

zVMJedi commented 1 year ago

Meh, it's late; no rush. Probably some PEBCAK on my part that I missed in Bill's tutorial here on the Helper.

UPDATE:  Yep! PEBCAK!

I had to update my Powershell from 2.0 to 5.1, and after update and restart, I FORGOT these two PS commands:

    Set-ExecutionPolicy RemoteSigned

and:

    [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12

Now it seems to be cruising along.

wrljet commented 1 year ago

(sigh) I was already half asleep in front of the TV.

Never heard of PEBCAK before. Looks like some sort of 370 assembler macro. :)

Always send the full log. It'll have a filename of this form, with a timestamp: hercules-helper-20230621_20-34-05.log

It's best to zip it first, so weird characters don't get changed by helpful web interfaces.

In case your own build gives up, I've built it with VS2022 and uploaded it here:

Bill

zVMJedi commented 1 year ago

PEBCAK = (P)roblem (E)xists (B)etween (C)hair (A)nd (K)eyboard

wrljet commented 1 year ago

Thank you! Ok, with the reminder that just a week ago(!) I was a mere GitHub Voyeur, so when you say 'do a git pull and rebuild', I'm already lost.

Since Bill Lewis was kind enough to supply the command line he used after he kindly did the "first fix" rebuild for me, is the following command going to do it for me?

    .\hercules-buildall.ps1 -BuildDir C:\xfer\hercules-develop -VS2017 -GitBranch develop

where of course I supply my own desired output directory instead of his C:\xfer\hercules-develop.

Or do I maybe need to use the -GitCommit switch with that "commit number", 587608c?

-GitBranch would pull the latest from that branch.

-GitCommit is more intended to "checkout" an earlier commit. For example if newer commits are known to be faulty.

The script will, however, not overwrite an existing repo on its own.

Once you use it the first time, and everything is primed, it is far easier to git pull and rebuild "by hand".

I'll write that up here for you tomorrow. It's very simple.

Bill

wrljet commented 1 year ago

PEBCAK = (P)roblem (E)xists (B)etween (C)hair (A)nd (K)eyboard

Still sounds like an ASM macro.

zVMJedi commented 1 year ago

My build was successful, and my 2nd-level z/VM 7.2 is cruising along as Member 1 of a 2nd-level SSI!

23:03:06 Q SSI                                                                  
23:03:06 SSI Name: ZVM72SSI                                                     
23:03:06 SSI Mode: Stable                                                       
23:03:06 Cross-System Timeouts: Enabled                                         
23:03:06 SSI Persistent Data Record (PDR) device: VMCOM1 on 0720                
23:03:06 SLOT SYSTEMID STATE     PDR HEARTBEAT       RECEIVED HEARTBEAT         
23:03:06    1 ZVMJEDI1 Joined    06/21/23   23:02:45 06/21/23   23:02:45        
23:03:06    2 ZVMJEDI2 Down (not IPLed)                                         
23:03:06    3 ZVMJEDI3 Down (not IPLed)                                         
23:03:06    4 ZVMJEDI4 Down (not IPLed)                                         
23:03:20 USER DSC   LOGOFF AS  AUTOLOG1 USERS = 8                               

Now for Member(s) 2, 3, and 4!  (but starting tomorrow of course)

Thank you Fish! Thank you Bill! Thank you Aaron! Thank you everybody else that chipped in an idea or a suggestion!

Fish-Git commented 1 year ago

My build was successful, and my 2nd-level z/VM 7.2 is cruising along as Member 1 of a 2nd-level SSI!

Great to hear, Charles! Thanks for the confirmation.

Take care and have fun.  :)

Fish-Git commented 1 year ago

The script will, however, not overwrite an existing repo on its own.

Couldn't it automatically detect that the repo already exists, and if so, simply do a git pull?

Once you use it the first time, and everything is primed, it is far easier to git pull and rebuild "by hand".

I agree 100% with that! The hard part is getting everything set up, and Hercules Helper does that beautifully. From then on pretty much anyone -- even those inexperienced with building Hercules for themselves -- should be able to handle doing a simple git pull and makefile.bat ....

Peter-J-Jansen commented 1 year ago

Congratulations Fish for fixing this tough problem! Yet again another super effort by you tackling this absolutely non-trivial issue!

I have come across a perhaps related problem when trying to install z/VM SSI 2nd level (Wait 902) which I'll re-try and let you know if it was caused by this same issue. (Re-trying this is not so easy as it only occurs at the end of a 9-hour long install upon the first automatic 2nd level re-IPL ...)

Cheers,

Peter

zVMJedi commented 1 year ago

Congratulations Fish for fixing this tough problem! Yet again another super effort by you tackling this absolutely non-trivial issue!

I have come across a perhaps related problem when trying to install z/VM SSI 2nd level (Wait 902) which I'll re-try and let you know if it was caused by this same issue. (Re-trying this is not so easy as it only occurs at the end of a 9-hour long install upon the first automatic 2nd level re-IPL ...)

Cheers,

Peter

Peter! That's what first revealed this problem for me: I got that same Wait 902 during my 2nd-level install attempt. This should fix it. Regards, Charles

wrljet commented 1 year ago

The script will, however, not overwrite an existing repo on its own.

Couldn't it automatically detect that the repo already exists, and if so, simply do a git pull?

-ForceClone switch will overwrite.