SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator
Other
240 stars 90 forks source link

Hercules 4.4.1 crashes after OSA failure #489

Closed cfdonatucci closed 2 years ago

cfdonatucci commented 2 years ago

NOTE:  GitHub Issue #458 (Hercules crash after resume from suspend) is also closely related to this issue.


Hi guys,

I had to reinstall my Windows 10 after a problem. After reinstalling Hercules, I'm now having an odd problem not happening before, regarding my external connection to use other Windows applications on the same PC. I reinstalled all Hercules software.

I also ran TTTest64.exe successfully. I have DHCP default conf in my PC. I always used CTCI connection but fails with no error message. So I tried with OSA, and now I get a dump.

z/OS 2.4 starts ok and the OSA is installed and activated ok.

I can start TSO sessions from any PC in my net, I can use CICSPLEX very well using cicsexpl. When I attempt to use a zosexpl session with RSED, the connection crashed and Hercules as well.

Dump available:

I'd appreciate any help!

Regards, Carlos

OSA Adapter

0400.3    QETH  chpid F0 iface D8-5E-D3-81-FE-1D  ipaddr 192.168.1.115  netmask 255.255.0.0

; -------------------------------------------------------------------
; Device to support zPDT External   to z/OS Connections
; -------------------------------------------------------------------
  DEVICE PORTA MPCIPA
  LINK OSA1 IPAQENET  PORTA
  HOME 192.168.1.115  OSA1
; -------------------------------------------------------------------
; Routes to support zPDT External   to z/OS Connections
; -------------------------------------------------------------------
  BEGINRoutes
  ; Destination        SubnetMask    FirstHop      LinkName    Size
  ROUTE 192.168.1.0    255.255.255.0    =          OSA1   MTU 1500
  ; Destination                      First Hop     LinkName    Size
  ROUTE DEFAULT                      192.168.1.1   OSA1   MTU 1500
  ENDRoutes
; -------------------------------------------------------------------
; Start  to support zPDT External   to z/OS Connections
; -------------------------------------------------------------------
  START PORTA
 BROWSE    ADCD.Z24B.VTAMLST(OSATRL2) - 01.01       Line 0000000000
 Command ===>                                                  Scrol
********************************* Top of Data **********************
OSATRL1 VBUILD TYPE=TRL
OSATRL1E TRLE LNCTL=MPC,READ=(0400),WRITE=(0401),DATAPATH=(0402),
               PORTNAME=PORTA,
               MPCLEVEL=QDIO
OSATRL2E TRLE LNCTL=MPC,READ=(0404),WRITE=(0405),DATAPATH=(0406),
               PORTNAME=PORTB,
               MPCLEVEL=QDIO

Hercules log

12:32:36.422 00000788 HHC03800I 0:0401 QETH: Adapter mode set to Layer 3
12:32:36.422 00000788 HHC04100I TunTap64.dll version ** UNPAID TRIAL COPY **  3.7.0.5409 initiated
12:32:36.423 00000788 HHC00901I 0:0401 QETH: Interface tun0, type TUN opened
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using MAC address 02:00:5e:a3:be:84
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using IP address 192.168.1.115
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using subnet mask 255.255.0.0
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using MTU 1500
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using drive MAC address 96:7a:59:e5:d2:bf
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using drive IP address fe80::967a:59ff:fee5:d2bf
12:32:36.499 00000788 HHC03805I 0:0401 QETH: tun0: Register guest IP address 192.168.1.115
----
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using MAC address 02:00:5e:a3:be:84
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using IP address 192.168.1.115
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using subnet mask 255.255.0.0
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using MTU 1500
--
12:52:49.462 00000788 HHC00822S PROCESSOR CP01 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00822S PROCESSOR CP03 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00822S PROCESSOR IP06 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00822S PROCESSOR IP07 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00823S You have 45 seconds to attach a debugger before crash dump will be taken!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(554)
---
12:53:35.492 00000788                       ***************
12:53:35.492 00000788                       *    OOPS!    *
12:53:35.492 00000788                       ***************
12:53:35.492 00000788                     Hercules has crashed!
12:53:35.492 00000788 (you may or may not need to press ENTER if no 'oops!' dialog-box appears)
12:53:40.404 00000788 Creating crash dump "C:\hercules4.4.1\Hercules.dmp"...
12:53:40.404 00000788 Please wait; this may take a few minutes...
12:53:40.404 00000788 (another message will appear when the dump is complete)
12:53:44.470 00000788 Dump "C:\hercules4.4.1\Hercules.dmp" created.
12:53:44.470 00000788 Please forward the dump to the Hercules team for analysis.
cfdonatucci commented 2 years ago

HOLD THE PRESS!   .......   RSED worked!

  image

cfdonatucci commented 2 years ago

I think it was after setting "Large Send Offload" and "Jumbo Frame" to "Disable". I'm going to check other stuff now, but this is great!

Fish-Git commented 2 years ago

HOLD THE PRESS!   .......   RSED worked!

Fantastic!!   :-D

I think it was after setting "Large Send Offload" and "Jumbo Frame" to "Disable".

More than likely.

I'm going to check other stuff now, but this is great!

Yep!!   :))

Fish-Git commented 2 years ago

Ian said:

Why not use a static IP address?

I agree, Ian. I also prefer using statically assigned IP Addresses too. But if the user's host adapter's IP address is DHCP assigned, the IP address might not be stable. It might change (whereas the MAC address of the adapter never changes).

In my experience however, even if an adapter is DHCP assigned an IP Address, the IP Address that the DHCP server chooses is usually the same one each time, since when the DHCP assigned IP Address's DHCP lease expires, the host usually just asks for it to be renewed (which is usually always granted).

For some odd reason with Carlos however *`()`**, a brand new IP address is being assigned, so a static IP address can't be used in his case.


*`()** _It's probably something Microsoft has changed in Windows. It started with Windows 8, but ever since Windows 10 (and especially with Windows 11), the quality/reliability of Windows has deteriorated quite noticeably IMHO._  :(`

That's why I've personally chosen to stick with Windows 7, which IMHO is the last best Windows operating system ever produced.

cfdonatucci commented 2 years ago

I tried zCEE connection and it worked, and now I am able to deploy services and APIs to Mainframe. Bottom Line: all I need in my LAB is working beautifully! Amazing. Needless to say, you guys are geniuses. I really appreciate so much all your help and patience.

Take Care! Carlos

image

Fish-Git commented 2 years ago

I've changed the status (label) for this GitHub Issue to both "Bug" as well as "Invalid (user error)" too, since, even though the root problem was the result of user error, I believe there might also be a bug in Hercules's "watchdog" thread causing it to mistakenly believe a processor is hung (or still hung) when in actuality it's not actually hung, but rather might just be taking an unusually long time to perform its error recovery.

Anyway, it's something I want to look into.

Fish-Git commented 2 years ago

Carlos (@cfdonatucci),

Would it be possible for you to do a favor for me please? I am having trouble determining where in Hercules it is hanging and crashing, because I am unable to reproduce your problem on my own system. I would therefore very much appreciate if you would run a test for me please.

You will need to temporarily break things on your system by temporarily re-enabling "Large Send Offload" and "Jumbo Frame" by setting them to "Enable".

Once you have set "Large Send Offload" and "Jumbo Frame" to "Enabled", then please start Hercules and perform the same test you did before, but with the following small change:

It is very important to save the Hercules log file! That is the only thing I need!

I do not need the crash dump. When Hercules crashes and it asks you if you want to create a crash dump, you can answer "No". I do not need the crash dump. I only need the Hercules log file.

Afterwards, you can fix things by re-disabling "Large Send Offload" and "Jumbo Frame" by setting them back to "Disable" again.

Thanks.

cfdonatucci commented 2 years ago

Hi sorry i didn't aswer. I was at the office today. Just a question, to enable jumbo frame I have 4088 or 9014 bytes. I don't remember the default value... should I set 9014?

Fish-Git commented 2 years ago

... to enable jumbo frame I have 4088 or 9014 bytes. I don't remember the default value... should I set 9014?

I don't know. It probably doesn't matter. But based solely on the following information, I would probably set it to 9014:

cfdonatucci commented 2 years ago

I set jumbo to 9014 and disabled Large Send Offload, ran txoff and started RSED. To my surprise it didn't fail. I think it should be greater than 1500

cfdonatucci commented 2 years ago

But i cannot set it higher than 9014 image

Fish-Git commented 2 years ago

... and disabled Large Send Offload, ran txoff and started RSED. To my surprise it didn't fail.

You were supposed to ENABLE Large Send Offload for my test.

Disabling it fixes the problem (prevents the crash).

Enabling Large Send Offload (which is the normal default setting) is what causes the crash.

I need you to ENABLE it, so the problem/crash occurs.

Once you run my test, then you can disable it again so it doesn't crash.

cfdonatucci commented 2 years ago

That's what I did, I enabled it... when I ran the test, I had Large Send Offload enabled and Jumbo set to 9014. The connection succeeded and everything worked. After a while the connection crashed. I see errors in the Hercules console related to QETH adapter, but Hercules didn't crash like before. I'll send the log to you.

cfdonatucci commented 2 years ago

Log attached:

Fish-Git commented 2 years ago

After a while the connection crashed.

What does that mean? "The connection crashed"?

... but Hercules didn't crash like before.

Then the test was no good. We will have to try again.

I am trying to determine the cause of the original Hercules crashing, which you claimed earlier "could be consistently reproduced":

Very simple test:

  • Firewall completely disabled.
  • netsh interface ip delete arpcache executed
  • Just OSA device pointing to 192.168.1.112
  • zOS 2.4 IPLed. Keep in mind that RSED doesn't start with the IPL.
  • OSA started
  • ping 192.168.1.112 ok.
  • Once IPL is finished TXOFF executed.
  • RSED started.
  • connection from zosexplorer -> "remote" system, started
  • in explorer at the bottom line I see message "IZE0100I Connected to 192.168.1.112:4035, fetching children of My Home 100%", meaning the connection was established and it was working.
  • OSA adapter fails.
  • Hercules crashes.

This test can be consistently reproduced.

  • I don't have duplicate MACs
  • I don't have firewall active
  • I used IPs in device definition
  • I cleaned the ARP cache

Documents attached:

  • text capture
  • shark format capture
  • Hercules Log
  • hercules dump

Have a nice weekend.

If Hercules does not crash, then the test is no good. I want to recreate the original conditions that was causing your Hercules to crash, but with my debugging commands active (ptt io qeth 50000, t+400, etc...).

Thanks.

Fish-Git commented 2 years ago

Log attached:

Not helpful.   :(

I need you to recreate the original Hercules crash problem you were experiencing. That is what I am trying to resolve. The "ptt io qeth 50000" and other commands will provide additional information that will help me determine where (and thus hopefully why) Hercules was hanging/crashing.

Thanks.

cfdonatucci commented 2 years ago

Okay, now, I set all parameters for this to work, meaning the parameters we believed resolved the problem. The connection is established. So I did nothing for a while. At some point the adapter started giving "22 Invalid argument" error, and the connection failed. I think we should go back to the network trace.

22:03:22.827 00000AF0 HHC00901I 0:0401 QETH: Interface tun0, type TUN opened
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using MAC address 02:00:5e:a3:be:84
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using IP address 192.168.1.112
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using subnet mask 255.255.255.0
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using MTU 1500
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using drive MAC address 96:7a:59:e5:d2:bf
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using drive IP address fe80::967a:59ff:fee5:d2bf
22:03:22.924 00000AF0 HHC03805I 0:0401 QETH: tun0: Register guest IP address 192.168.1.112
22:03:22.925 00000AF0 HHC03805I 0:0401 QETH: tun0: Register guest IP address 10.1.10.1
22:03:34.693 00000AF0 HHC00801I Processor CP00: Operation exception code 0001 ilc 4
22:03:34.694 00000AF0 HHC02324I CP00: PSW=0704100080000000 000000001FB179D6 INST=B2AF0000     ????? ,                      
.....
23:38:53.653 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:53.653 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:53.663 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:53.663 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:54.127 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:54.127 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:54.923 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:54.923 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:55.922 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
Fish-Git commented 2 years ago

I would prefer that we recreate the Hercules crash that you originally reported. That is what I am most interested in: preventing Hercules from crashing.

Since we are having trouble accomplishing that however, we might be able to determine a likely cause using your current testing procedure. So let's do this:

Do the same thing you did above (that caused the 22 Invalid argument errors), and then enter the Hercules ptt command whenever it occurs. That will display the ptt trace table which might provide a clue to what's going on:

That should display all of the trace entries in the ptt trace table. THAT is what I need to see.

Save the Hercules logfile and post it here.

Thanks.

cfdonatucci commented 2 years ago

Hi. Test done. Hercules didn't crashed and didn't fail, This is the summary:

At this point everything worked perfectly. I couldn't reproduce the failure. I waited for a while, and then went to lunch. When I returned the PC was sleeping. When it was awoken, the connection with zosexplorer failed and the invalid argument error message was issued. I then executed ptt, and after a little while I stopped Hercules.

So now it seems I cannot reproduce what was previously a consistently reproducible error.

Regards.

Fish-Git commented 2 years ago

Hi. Test done. Hercules didn't crashed and didn't fail, This is the summary:

<snip>

Regards.

Thanks. Unfortunately however, the test is no good. You didn't do it correctly. You entered the ptt command too soon: you entered it before the 22 Invalid argument error messages were issued, so the ptt results did not show anything unusual.

I need you to wait until the 22 Invalid argument error(s) is/are issued, and then issue the ptt command.

I've also been thinking: if you previously added a Windows Firewall rule, maybe you need to disable it to reproduce the original problem?

Please try again, being careful to issue the ptt command after the HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument error(s) begin occurring.

Thanks.

cfdonatucci commented 2 years ago

I'm going to do the test now. When I enabled the firewall rules the problem was not resolved. The problem was resolved (hold the press, remember?) when I disabled jumbo and Large Send Offload. I disabled the firewall rules. Let's see what happens.

Fish-Git commented 2 years ago

When I enabled the firewall rules the problem was not resolved. The problem was resolved (hold the press, remember?) when I disabled jumbo and Large Send Offload.

What I remember is, originally, you did not have any special Firewall rules defined. Your Windows Firewall was enabled (which is the default for Windows).

I then recommended that you define a new Firewall rule that would allow any/all packets through (effectively disabling your firewall). I don't recall whether you actually added (defined) such a rule to your Windows Firewall or not, but I do remember that you disabled the Windows Firewall completely, since that was the easiest way to eliminate it as a culprit.

It was sometime after that test (which still failed) that Ian Shorter (@mcisho) noticed your Wireshark trace showed packets trying to be sent that were too large for CTCI-WIN (WinPCap) / Hercules to handle. It was then I recommended disabling Large Send Offload, which, yes, did indeed resolve your problem.

So what I am trying to do is to go back to the way you originally had your system configured that was causing Hercules to crash. If we can recreate your original crash again, then we can see where things might be going wrong. That is what the ptt io qeth 50000 command should do: immediately before Hercules crashes, Hercules will automatically issue the ptt command automatically, so we can see what Hercules was doing.

If we cannot recreate the original crash, then we should try to recreate the "Error writing to device tun0: 22 Invalid argument" error instead (with ptt io qeth 50000 active), and then manually enter the ptt command after the "22 Invalid argument" error(s) start to occur.

Thanks.

cfdonatucci commented 2 years ago

I really don't think the firewall is the issue. I ran tests with the firewall completely disabled and they failed. So far I couldn't make it fail. I'm going to run a last test. I even changed RSED mode from 64 bits to 31 and go back, and nothing.

cfdonatucci commented 2 years ago

I was unable to recreate the original error. To recreate "22 invalid argument", I IPLed Hercules as you said. Nothing happened. So I sent the PC to sleep. After a few minutes I recalled it and the error occurred immediately. After few "22" messages, I executed ptt. I hope this helps. Let me know.

Fish-Git commented 2 years ago

I really don't think the firewall is the issue.

I do.

When you reported your original problem, the Firewall was enabled, and because it was, it was not allowing any incoming connections. That was part of the original problem. Your zosexplorer client was trying to connect to RSED, but Windows wasn't allowing it because it wasn't allowing any outside packets through the firweall. It was only when we disabled the firewall that we then made some progress. Think. How was your Windows system configured when you first reported your original problem? That is the same configuration I would like to try to recreate.

But forget about that right now. Let's move on.

So I sent the PC to sleep. After a few minutes I recalled it and the error occurred immediately. After few "22" messages, I executed ptt. I hope this helps. Let me know.

Yes, that helped. Thank you.

I would now like you to perform the same test, but this time with the following Hercules commands active:

This should provide additional (hopefully more detailed) debugging information when the "22" errors occur.

What I suspect is happening is, your network adapter (NIC) is being turned off when your PC goes to sleep, but is not being fully powered on after it wakes up. This is a known problem in Windows:

"When deploying Windows 7 or Windows Server 2008 R2, you may want to disable the following network adapter power management setting on some computers:"

Allow the computer to turn off this device to save power

(emphasis mine)

And:

"Windows never turns off the network card due to inactivity. When this setting is checked (enabled), Windows puts the network card to sleep, and when it resumes, it puts it back to D0. When this setting isn't checked (disabled), Windows completely halts the device, and on resume reinitializes it."

(emphasis mine)

I also found the following web page as well:

"By trial and error and experimentation, I have determined that if I uncheck that box, the issue disappears."

(and, at the very bottom of the following page):

Configuring Power Management (Windows Operating System Only)

"For image streaming applications that run for an extended period of time, we recommend that you change the Windows power plan settings to ensure the computer and NIC never go to sleep. If you do not change the power plan settings, the computer or NIC may go to sleep when there is no activity from the keyboard or mouse for certain amount of time and the image stream may be interrupted."

(and):

To configure the power settings for the NIC

  1. In the Windows Control Panel, click Network and Internet.
  2. Click Network and Sharing Center.
  3. In the left-hand panel, click Change adapter settings.
  4. Right-click the NIC and then click Properties.
  5. Click Configure.
  6. Configure the power management settings for the NIC to ensure the computer does not turn off the device to save power.
  7. Click OK. Repeat steps 4-7 for each NIC on your computer.

(emphasis mine)

Bottom line: everywhere I look, disabling (unchecking) the "Allow the computer to turn off this device to save power" option fixes the problem.

But I would like to see another test like you just did, but with tt32 debug enabled as well. I'm hoping maybe it might confirm that the reason for the "22 invalid argument" errors is because the adapter isn't fully powered on after waking up from a sleep.

And if that is true, there is nothing Hercules nor CTCI-WIN can do to fix or workaround the problem. It is just another "User Error" (i.e. not having your system configured properly).

Thanks.


MORE INFO:

nic-properties

nic-configure

nic-power

cfdonatucci commented 2 years ago

Hi I ran the test and got your trace:

Afterwards, I unchecked that box and restarted Hercules and z/OS. After a while, I put the PC to sleep. When it restarted, I didn't get any 22 errors, but the adapter stopped working. Apparently the suspend introduces some problem. Anyway, I set my PC to never sleeps, so I think we are okay unless you want another test.

Fish-Git commented 2 years ago

Hi I ran the test and got your trace:

Unfortunately, your .zip file is completely empty. It contains .... nothing. No files at all.

cfdonatucci commented 2 years ago

Try this one:

Fish-Git commented 2 years ago

Try this one:

Wow. I wasn't expecting that. Not very helpful at all, but not your fault of course.

It looks like Windows's re-initialization of the network adapter upon resume after suspension ends up killing Hercules's guest's virtual network adapter. I'm guessing that CTCI-WIN's WinPCap sniffing/injection hook on Windows's real physical adapter ends up being invalidated/removed/rendered inaccessible resulting in a permanent loss of connectivity between Hercules (CTCI-WIN/WinPCap) and the adapter. Most unfortunate.

So it looks like setting one's PC to never sleep (which you have already done) is the only solution available to us. There doesn't appear to be any way for us (Hercules/CTCI-WIN/WinPCap) to recover from this.

Nevertheless, since my ultimate goal for this issue was not to try to overcome that problem, but rather to try to prevent Hercules from crashing as a result of the situation, I have added some additional code to Hercules (commit 96b606adad02296f2e11a8252ea9c9d7ccc1d6f8) to try and better detect when a resume from suspend (awake from sleep or re-power-on after hibernation) has occurred in order to prevent our watchdog thread from erroneously detecting what it believes is a malfunctioning CPU.

Hopefully, this new code, along with our previous attempt to address the same issue (commit 4b439ae673d776ecaa35ed0aa87bb73f940e904e), will, together, prevent Hercules from crashing upon resume from suspend.

<me: fingers crossed>

As I believe we have done all that we can with this issue, I am going to close it at this time. I would like to thank you, Carlos, for your incredible patience. Take care.

p.s. to Mark G. (@dasdman): I apologize for not having implemented your earlier suggestion sooner. At the time, I didn't think it was necessary!

cfdonatucci commented 2 years ago

I'm glad this was useful. It was great for me to contribute with you guys, I really admire what you do. Let me know any way I can do anything in the future, tests or whatever. I'm not at your level but I'll try to do my best. See you! Carlos