Closed cfdonatucci closed 2 years ago
HOLD THE PRESS! ....... RSED worked!
I think it was after setting "Large Send Offload" and "Jumbo Frame" to "Disable". I'm going to check other stuff now, but this is great!
HOLD THE PRESS! ....... RSED worked!
Fantastic!! :-D
I think it was after setting "Large Send Offload" and "Jumbo Frame" to "Disable".
More than likely.
I'm going to check other stuff now, but this is great!
Yep!! :))
Ian said:
Why not use a static IP address?
I agree, Ian. I also prefer using statically assigned IP Addresses too. But if the user's host adapter's IP address is DHCP assigned, the IP address might not be stable. It might change (whereas the MAC address of the adapter never changes).
In my experience however, even if an adapter is DHCP assigned an IP Address, the IP Address that the DHCP server chooses is usually the same one each time, since when the DHCP assigned IP Address's DHCP lease expires, the host usually just asks for it to be renewed (which is usually always granted).
For some odd reason with Carlos however *`()`**, a brand new IP address is being assigned, so a static IP address can't be used in his case.
*`()** _It's probably something Microsoft has changed in Windows. It started with Windows 8, but ever since Windows 10 (and especially with Windows 11), the quality/reliability of Windows has deteriorated quite noticeably IMHO._
:(`
That's why I've personally chosen to stick with Windows 7, which IMHO is the last best Windows operating system ever produced.
I tried zCEE connection and it worked, and now I am able to deploy services and APIs to Mainframe. Bottom Line: all I need in my LAB is working beautifully! Amazing. Needless to say, you guys are geniuses. I really appreciate so much all your help and patience.
Take Care! Carlos
I've changed the status (label) for this GitHub Issue to both "Bug" as well as "Invalid (user error)" too, since, even though the root problem was the result of user error, I believe there might also be a bug in Hercules's "watchdog" thread causing it to mistakenly believe a processor is hung (or still hung) when in actuality it's not actually hung, but rather might just be taking an unusually long time to perform its error recovery.
Anyway, it's something I want to look into.
Carlos (@cfdonatucci),
Would it be possible for you to do a favor for me please? I am having trouble determining where in Hercules it is hanging and crashing, because I am unable to reproduce your problem on my own system. I would therefore very much appreciate if you would run a test for me please.
You will need to temporarily break things on your system by temporarily re-enabling "Large Send Offload" and "Jumbo Frame" by setting them to "Enable".
Once you have set "Large Send Offload" and "Jumbo Frame" to "Enabled", then please start Hercules and perform the same test you did before, but with the following small change:
ptt io qeth 50000
t+400
t+401
t+402
Then connect from zosexplorer like before. Hercules will hopefully crash just like it did before, but this time some additional information should be displayed in the Hercules log.
It is very important to save the Hercules log file! That is the only thing I need!
I do not need the crash dump. When Hercules crashes and it asks you if you want to create a crash dump, you can answer "No". I do not need the crash dump. I only need the Hercules log file.
Afterwards, you can fix things by re-disabling "Large Send Offload" and "Jumbo Frame" by setting them back to "Disable" again.
Thanks.
Hi sorry i didn't aswer. I was at the office today. Just a question, to enable jumbo frame I have 4088 or 9014 bytes. I don't remember the default value... should I set 9014?
... to enable jumbo frame I have 4088 or 9014 bytes. I don't remember the default value... should I set 9014?
I don't know. It probably doesn't matter. But based solely on the following information, I would probably set it to 9014:
I set jumbo to 9014 and disabled Large Send Offload, ran txoff and started RSED. To my surprise it didn't fail. I think it should be greater than 1500
But i cannot set it higher than 9014
... and disabled Large Send Offload, ran txoff and started RSED. To my surprise it didn't fail.
You were supposed to ENABLE Large Send Offload for my test.
Disabling it fixes the problem (prevents the crash).
Enabling Large Send Offload (which is the normal default setting) is what causes the crash.
I need you to ENABLE it, so the problem/crash occurs.
Once you run my test, then you can disable it again so it doesn't crash.
That's what I did, I enabled it... when I ran the test, I had Large Send Offload enabled and Jumbo set to 9014. The connection succeeded and everything worked. After a while the connection crashed. I see errors in the Hercules console related to QETH adapter, but Hercules didn't crash like before. I'll send the log to you.
Log attached:
After a while the connection crashed.
What does that mean? "The connection crashed"?
... but Hercules didn't crash like before.
Then the test was no good. We will have to try again.
I am trying to determine the cause of the original Hercules crashing, which you claimed earlier "could be consistently reproduced":
Very simple test:
- Firewall completely disabled.
- netsh interface ip delete arpcache executed
- Just OSA device pointing to 192.168.1.112
- zOS 2.4 IPLed. Keep in mind that RSED doesn't start with the IPL.
- OSA started
- ping 192.168.1.112 ok.
- Once IPL is finished TXOFF executed.
- RSED started.
- connection from zosexplorer -> "remote" system, started
- in explorer at the bottom line I see message "IZE0100I Connected to 192.168.1.112:4035, fetching children of My Home 100%", meaning the connection was established and it was working.
- OSA adapter fails.
- Hercules crashes.
This test can be consistently reproduced.
- I don't have duplicate MACs
- I don't have firewall active
- I used IPs in device definition
- I cleaned the ARP cache
Documents attached:
- text capture
- shark format capture
- Hercules Log
- hercules dump
Have a nice weekend.
If Hercules does not crash, then the test is no good. I want to recreate the original conditions that was causing your Hercules to crash, but with my debugging commands active (ptt io qeth 50000, t+400, etc...).
Thanks.
Log attached:
Not helpful. :(
I need you to recreate the original Hercules crash problem you were experiencing. That is what I am trying to resolve. The "ptt io qeth 50000" and other commands will provide additional information that will help me determine where (and thus hopefully why) Hercules was hanging/crashing.
Thanks.
Okay, now, I set all parameters for this to work, meaning the parameters we believed resolved the problem. The connection is established. So I did nothing for a while. At some point the adapter started giving "22 Invalid argument" error, and the connection failed. I think we should go back to the network trace.
22:03:22.827 00000AF0 HHC00901I 0:0401 QETH: Interface tun0, type TUN opened 22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using MAC address 02:00:5e:a3:be:84 22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using IP address 192.168.1.112 22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using subnet mask 255.255.255.0 22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using MTU 1500 22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using drive MAC address 96:7a:59:e5:d2:bf 22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using drive IP address fe80::967a:59ff:fee5:d2bf 22:03:22.924 00000AF0 HHC03805I 0:0401 QETH: tun0: Register guest IP address 192.168.1.112 22:03:22.925 00000AF0 HHC03805I 0:0401 QETH: tun0: Register guest IP address 10.1.10.1 22:03:34.693 00000AF0 HHC00801I Processor CP00: Operation exception code 0001 ilc 4 22:03:34.694 00000AF0 HHC02324I CP00: PSW=0704100080000000 000000001FB179D6 INST=B2AF0000 ????? , ..... 23:38:53.653 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument 23:38:53.653 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796) 23:38:53.663 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument 23:38:53.663 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796) 23:38:54.127 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument 23:38:54.127 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796) 23:38:54.923 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument 23:38:54.923 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796) 23:38:55.922 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
I would prefer that we recreate the Hercules crash that you originally reported. That is what I am most interested in: preventing Hercules from crashing.
Since we are having trouble accomplishing that however, we might be able to determine a likely cause using your current testing procedure. So let's do this:
Do the same thing you did above (that caused the 22 Invalid argument
errors), and then enter the Hercules ptt
command whenever it occurs. That will display the ptt trace table which might provide a clue to what's going on:
ptt io qeth 50000
s txoff
s rsed
HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
errors to begin occurringptt
to dump the trace table (i.e. just "ptt" by itself, without any arguments)That should display all of the trace entries in the ptt trace table. THAT is what I need to see.
Save the Hercules logfile and post it here.
Thanks.
Hi. Test done. Hercules didn't crashed and didn't fail, This is the summary:
At this point everything worked perfectly. I couldn't reproduce the failure. I waited for a while, and then went to lunch. When I returned the PC was sleeping. When it was awoken, the connection with zosexplorer failed and the invalid argument error message was issued. I then executed ptt
, and after a little while I stopped Hercules.
So now it seems I cannot reproduce what was previously a consistently reproducible error.
Regards.
Hi. Test done. Hercules didn't crashed and didn't fail, This is the summary:
<snip>
Regards.
Thanks. Unfortunately however, the test is no good. You didn't do it correctly. You entered the ptt
command too soon: you entered it before the 22 Invalid argument
error messages were issued, so the ptt results did not show anything unusual.
I need you to wait until the 22 Invalid argument
error(s) is/are issued, and then issue the ptt
command.
I've also been thinking: if you previously added a Windows Firewall rule, maybe you need to disable it to reproduce the original problem?
Please try again, being careful to issue the ptt
command after the HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
error(s) begin occurring.
Thanks.
I'm going to do the test now. When I enabled the firewall rules the problem was not resolved. The problem was resolved (hold the press, remember?) when I disabled jumbo and Large Send Offload. I disabled the firewall rules. Let's see what happens.
When I enabled the firewall rules the problem was not resolved. The problem was resolved (hold the press, remember?) when I disabled jumbo and Large Send Offload.
What I remember is, originally, you did not have any special Firewall rules defined. Your Windows Firewall was enabled (which is the default for Windows).
I then recommended that you define a new Firewall rule that would allow any/all packets through (effectively disabling your firewall). I don't recall whether you actually added (defined) such a rule to your Windows Firewall or not, but I do remember that you disabled the Windows Firewall completely, since that was the easiest way to eliminate it as a culprit.
It was sometime after that test (which still failed) that Ian Shorter (@mcisho) noticed your Wireshark trace showed packets trying to be sent that were too large for CTCI-WIN (WinPCap) / Hercules to handle. It was then I recommended disabling Large Send Offload, which, yes, did indeed resolve your problem.
So what I am trying to do is to go back to the way you originally had your system configured that was causing Hercules to crash. If we can recreate your original crash again, then we can see where things might be going wrong. That is what the ptt io qeth 50000
command should do: immediately before Hercules crashes, Hercules will automatically issue the ptt
command automatically, so we can see what Hercules was doing.
If we cannot recreate the original crash, then we should try to recreate the "Error writing to device tun0: 22 Invalid argument" error instead (with ptt io qeth 50000
active), and then manually enter the ptt
command after the "22 Invalid argument" error(s) start to occur.
Thanks.
I really don't think the firewall is the issue. I ran tests with the firewall completely disabled and they failed. So far I couldn't make it fail. I'm going to run a last test. I even changed RSED mode from 64 bits to 31 and go back, and nothing.
I was unable to recreate the original error. To recreate "22 invalid argument", I IPLed Hercules as you said. Nothing happened. So I sent the PC to sleep. After a few minutes I recalled it and the error occurred immediately. After few "22" messages, I executed ptt
. I hope this helps. Let me know.
I really don't think the firewall is the issue.
I do.
When you reported your original problem, the Firewall was enabled, and because it was, it was not allowing any incoming connections. That was part of the original problem. Your zosexplorer client was trying to connect to RSED, but Windows wasn't allowing it because it wasn't allowing any outside packets through the firweall. It was only when we disabled the firewall that we then made some progress. Think. How was your Windows system configured when you first reported your original problem? That is the same configuration I would like to try to recreate.
But forget about that right now. Let's move on.
So I sent the PC to sleep. After a few minutes I recalled it and the error occurred immediately. After few "22" messages, I executed
ptt
. I hope this helps. Let me know.
Yes, that helped. Thank you.
I would now like you to perform the same test, but this time with the following Hercules commands active:
This should provide additional (hopefully more detailed) debugging information when the "22" errors occur.
What I suspect is happening is, your network adapter (NIC) is being turned off when your PC goes to sleep, but is not being fully powered on after it wakes up. This is a known problem in Windows:
"When deploying Windows 7 or Windows Server 2008 R2, you may want to disable the following network adapter power management setting on some computers:"
Allow the computer to turn off this device to save power
(emphasis mine)
And:
"Windows never turns off the network card due to inactivity. When this setting is checked (enabled), Windows puts the network card to sleep, and when it resumes, it puts it back to D0. When this setting isn't checked (disabled), Windows completely halts the device, and on resume reinitializes it."
(emphasis mine)
I also found the following web page as well:
"By trial and error and experimentation, I have determined that if I uncheck that box, the issue disappears."
(and, at the very bottom of the following page):
Configuring Power Management (Windows Operating System Only)
"For image streaming applications that run for an extended period of time, we recommend that you change the Windows power plan settings to ensure the computer and NIC never go to sleep. If you do not change the power plan settings, the computer or NIC may go to sleep when there is no activity from the keyboard or mouse for certain amount of time and the image stream may be interrupted."
(and):
To configure the power settings for the NIC
- In the Windows Control Panel, click Network and Internet.
- Click Network and Sharing Center.
- In the left-hand panel, click Change adapter settings.
- Right-click the NIC and then click Properties.
- Click Configure.
- Configure the power management settings for the NIC to ensure the computer does not turn off the device to save power.
- Click OK. Repeat steps 4-7 for each NIC on your computer.
(emphasis mine)
Bottom line: everywhere I look, disabling (unchecking) the "Allow the computer to turn off this device to save power" option fixes the problem.
But I would like to see another test like you just did, but with tt32 debug
enabled as well. I'm hoping maybe it might confirm that the reason for the "22 invalid argument" errors is because the adapter isn't fully powered on after waking up from a sleep.
And if that is true, there is nothing Hercules nor CTCI-WIN can do to fix or workaround the problem. It is just another "User Error" (i.e. not having your system configured properly).
Thanks.
MORE INFO:
Hi I ran the test and got your trace:
Afterwards, I unchecked that box and restarted Hercules and z/OS. After a while, I put the PC to sleep. When it restarted, I didn't get any 22 errors, but the adapter stopped working. Apparently the suspend introduces some problem. Anyway, I set my PC to never sleeps, so I think we are okay unless you want another test.
Hi I ran the test and got your trace:
Unfortunately, your .zip file is completely empty. It contains .... nothing. No files at all.
Try this one:
Try this one:
Wow. I wasn't expecting that. Not very helpful at all, but not your fault of course.
It looks like Windows's re-initialization of the network adapter upon resume after suspension ends up killing Hercules's guest's virtual network adapter. I'm guessing that CTCI-WIN's WinPCap sniffing/injection hook on Windows's real physical adapter ends up being invalidated/removed/rendered inaccessible resulting in a permanent loss of connectivity between Hercules (CTCI-WIN/WinPCap) and the adapter. Most unfortunate.
So it looks like setting one's PC to never sleep (which you have already done) is the only solution available to us. There doesn't appear to be any way for us (Hercules/CTCI-WIN/WinPCap) to recover from this.
Nevertheless, since my ultimate goal for this issue was not to try to overcome that problem, but rather to try to prevent Hercules from crashing as a result of the situation, I have added some additional code to Hercules (commit 96b606adad02296f2e11a8252ea9c9d7ccc1d6f8) to try and better detect when a resume from suspend (awake from sleep or re-power-on after hibernation) has occurred in order to prevent our watchdog thread from erroneously detecting what it believes is a malfunctioning CPU.
Hopefully, this new code, along with our previous attempt to address the same issue (commit 4b439ae673d776ecaa35ed0aa87bb73f940e904e), will, together, prevent Hercules from crashing upon resume from suspend.
<me: fingers crossed>
As I believe we have done all that we can with this issue, I am going to close it at this time. I would like to thank you, Carlos, for your incredible patience. Take care.
p.s. to Mark G. (@dasdman): I apologize for not having implemented your earlier suggestion sooner. At the time, I didn't think it was necessary!
I'm glad this was useful. It was great for me to contribute with you guys, I really admire what you do. Let me know any way I can do anything in the future, tests or whatever. I'm not at your level but I'll try to do my best. See you! Carlos
NOTE: GitHub Issue #458 (Hercules crash after resume from suspend) is also closely related to this issue.
Hi guys,
I had to reinstall my Windows 10 after a problem. After reinstalling Hercules, I'm now having an odd problem not happening before, regarding my external connection to use other Windows applications on the same PC. I reinstalled all Hercules software.
I also ran TTTest64.exe successfully. I have DHCP default conf in my PC. I always used CTCI connection but fails with no error message. So I tried with OSA, and now I get a dump.
z/OS 2.4 starts ok and the OSA is installed and activated ok.
I can start TSO sessions from any PC in my net, I can use CICSPLEX very well using
cicsexpl
. When I attempt to use azosexpl
session with RSED, the connection crashed and Hercules as well.Dump available:
I'd appreciate any help!
Regards, Carlos
OSA Adapter
Hercules log