SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator

Hercules 4.4.1 crashes after OSA failure #489

Closed cfdonatucci closed 2 years ago

cfdonatucci commented 2 years ago

NOTE:  GitHub Issue #458 (Hercules crash after resume from suspend) is also closely related to this issue.


Hi guys,

I had to reinstall Windows 10 after a problem. After reinstalling Hercules, I'm now having an odd problem that did not happen before, with the external connection I use from other Windows applications on the same PC. I reinstalled all Hercules software.

I also ran TTTest64.exe successfully. My PC uses the default DHCP configuration. I always used a CTCI connection, but it now fails with no error message. So I tried OSA instead, and now I get a dump.

z/OS 2.4 starts ok and the OSA is installed and activated ok.

I can start TSO sessions from any PC on my network, and I can use CICSPLEX very well using cicsexpl. When I attempt to use a zosexpl session with RSED, the connection crashes and so does Hercules.

Dump available:

I'd appreciate any help!

Regards, Carlos

OSA Adapter

0400.3    QETH  chpid F0 iface D8-5E-D3-81-FE-1D  ipaddr 192.168.1.115  netmask 255.255.0.0

; -------------------------------------------------------------------
; Device to support zPDT External   to z/OS Connections
; -------------------------------------------------------------------
  DEVICE PORTA MPCIPA
  LINK OSA1 IPAQENET  PORTA
  HOME 192.168.1.115  OSA1
; -------------------------------------------------------------------
; Routes to support zPDT External   to z/OS Connections
; -------------------------------------------------------------------
  BEGINRoutes
  ; Destination        SubnetMask    FirstHop      LinkName    Size
  ROUTE 192.168.1.0    255.255.255.0    =          OSA1   MTU 1500
  ; Destination                      First Hop     LinkName    Size
  ROUTE DEFAULT                      192.168.1.1   OSA1   MTU 1500
  ENDRoutes
; -------------------------------------------------------------------
; Start  to support zPDT External   to z/OS Connections
; -------------------------------------------------------------------
  START PORTA
 BROWSE    ADCD.Z24B.VTAMLST(OSATRL2) - 01.01       Line 0000000000
 Command ===>                                                  Scrol
********************************* Top of Data **********************
OSATRL1 VBUILD TYPE=TRL
OSATRL1E TRLE LNCTL=MPC,READ=(0400),WRITE=(0401),DATAPATH=(0402),
               PORTNAME=PORTA,
               MPCLEVEL=QDIO
OSATRL2E TRLE LNCTL=MPC,READ=(0404),WRITE=(0405),DATAPATH=(0406),
               PORTNAME=PORTB,
               MPCLEVEL=QDIO

Hercules log

12:32:36.422 00000788 HHC03800I 0:0401 QETH: Adapter mode set to Layer 3
12:32:36.422 00000788 HHC04100I TunTap64.dll version ** UNPAID TRIAL COPY **  3.7.0.5409 initiated
12:32:36.423 00000788 HHC00901I 0:0401 QETH: Interface tun0, type TUN opened
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using MAC address 02:00:5e:a3:be:84
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using IP address 192.168.1.115
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using subnet mask 255.255.0.0
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using MTU 1500
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using drive MAC address 96:7a:59:e5:d2:bf
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using drive IP address fe80::967a:59ff:fee5:d2bf
12:32:36.499 00000788 HHC03805I 0:0401 QETH: tun0: Register guest IP address 192.168.1.115
----
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using MAC address 02:00:5e:a3:be:84
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using IP address 192.168.1.115
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using subnet mask 255.255.0.0
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using MTU 1500
--
12:52:49.462 00000788 HHC00822S PROCESSOR CP01 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00822S PROCESSOR CP03 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00822S PROCESSOR IP06 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00822S PROCESSOR IP07 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00823S You have 45 seconds to attach a debugger before crash dump will be taken!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(554)
---
12:53:35.492 00000788                       ***************
12:53:35.492 00000788                       *    OOPS!    *
12:53:35.492 00000788                       ***************
12:53:35.492 00000788                     Hercules has crashed!
12:53:35.492 00000788 (you may or may not need to press ENTER if no 'oops!' dialog-box appears)
12:53:40.404 00000788 Creating crash dump "C:\hercules4.4.1\Hercules.dmp"...
12:53:40.404 00000788 Please wait; this may take a few minutes...
12:53:40.404 00000788 (another message will appear when the dump is complete)
12:53:44.470 00000788 Dump "C:\hercules4.4.1\Hercules.dmp" created.
12:53:44.470 00000788 Please forward the dump to the Hercules team for analysis.
Fish-Git commented 2 years ago

A couple of things right off the bat:

  1. I would prefer seeing a complete Hercules configuration file rather than just a single device statement. There could be other valuable information in there that might help explain things.

  2. I would prefer to see a complete Hercules log file (from beginning to end) rather than the small extract you provided. There could be valuable information in the log that might help explain what's going on (or what happened).

  3. You said "I also ran TTTest64.exe successfully", but simply starting TTTest64 and then not doing anything is not enough. After starting TTTest64 you need to actually perform a test! More specifically, a "Multi-directional Ping Test", as explained in the CTCI-WIN documentation (Help file). You also failed to show us your TTTest64 test output. It contains valuable information about your host networking setup.

  4. On your Hercules device statement, you specified netmask 255.255.0.0, but in your z/OS TCP config you specified 255.255.255.0.
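
The mismatch in item 4 can be made concrete with a small sketch. It uses Python's standard `ipaddress` module and the two values quoted from the report; the helper name `same_subnet` is only illustrative and is not part of Hercules or z/OS.

```python
import ipaddress

# Values copied from the report: the Hercules QETH device statement
# and the z/OS ROUTE entry for the same guest address.
hercules_side = ipaddress.ip_interface("192.168.1.115/255.255.0.0")
zos_side = ipaddress.ip_interface("192.168.1.115/255.255.255.0")


def same_subnet(a, b):
    """Return True when both interfaces describe the same network."""
    return a.network == b.network


print(hercules_side.network)  # network implied by the Hercules statement
print(zos_side.network)       # network implied by the z/OS route
print(same_subnet(hercules_side, zos_side))
```

With these inputs the two sides disagree about which addresses are on-link, which is why the masks should be made to match before testing further.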

Most of these things are mentioned in our "SUBMITTING PROBLEM REPORTS" document. Please review it and then submit all needed additional information. Thanks.

In the meantime I'll take a peek at your dump to see whether anything jumps out at me.

Fish-Git commented 2 years ago

Dump was worthless.

I need to see the complete Hercules log file, configuration file, and TTTest64 output.

Explaining in more detail what you were doing (or trying to do) when the problem occurred might help too.

p.s. Did you disable Unconstrained Transactions (i.e. run TXOFF) before doing whatever it was you were attempting to do, like you did previously?

cfdonatucci commented 2 years ago

Hi, thank you for your answer. The netmask was fixed and everything redone. It failed again.

I've attached the entire log, second dump, TTTest64 report and entire config file:

I also ran TXOFF before doing whatever I do. The only thing not clear to me is when TXON should be executed.

I'm testing with OSA because it failed with CTCI as well. In general Hercules works pretty fine. The issue only occurs when I want to access PC applications to exploit mainframe facilities.

Activities:

I'm using all products included in Hercules, but for example I've seen that WinPCap is not supported and Npcap is recommended. Don't know if that may be related.

Let me know anything else you need.

Regards Carlos

Fish-Git commented 2 years ago

I'm going to need some help reproducing this. I have almost zero z/OS skills.

I do have z/OS 2.4B, and it seems to IPL and run just fine, but for all of my z/OS Hercules IPL tests, I always use loadparm WSM (WS = CLPA and Warm start of JES2. Base z/OS system functions i.e. no CICS, DB2, IMS, etc. M=Verbose IPL Messages), not CI.

When I just now tried IPLing my system (which I believe is an ADCD system) using loadparm CIM (CI = CLPA and Warm start of JES2. Loads CICS 5.3 and 5.2 libraries. Starts CICS 5.3, z/OSMF, and IBM Developer for z Systems. M=Verbose IPL Messages), it's asking me:

    IGGN505A SPECIFY UNIT FOR DFH540.CICS.SDFHLPA ON B4C541 OR CANCEL

which I don't know how to respond to.   :(

I also don't know what cicsexpl is, nor what a zosexpl session with RSED is either.   :(

My system is configured to use OSA and is configured almost identically as yours.

Can you explain to me (in simple terms please! I'm not a z/OS person!) what I need to do to reproduce your problem?

Thanks.


p.s. I suspect something might be wrong with your local network configuration. Your TTTest64 report does not look right. Your first ping of www.linux.org, using a tun interface, reported a ping response time of "time=16ms", which is quite reasonable. But then you closed your tun interface and tried the same thing again using a tap interface instead, and this time your ping responses were all "time=<1ms"!! I am seriously doubting you can ping www.linux.org from your Windows system in less than 1ms!!
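
A rough way to spot this kind of result automatically is to scan the ping output for replies reported as below one millisecond. The sketch below assumes the usual Windows `ping` reply format; the sample text and the address `203.0.113.10` (a documentation-only address) are hypothetical stand-ins, since the actual TTTest64 report is not reproduced here.

```python
import re

# Matches Windows-style reply lines such as
#   "Reply from 192.168.1.115: bytes=32 time<1ms TTL=64"
# and also tolerates the "time=<1ms" spelling.
REPLY = re.compile(r"Reply from (\S+): bytes=\d+ time(=<|=|<)(\d+)ms TTL=\d+")


def sub_millisecond_replies(ping_output):
    """Return the source address of every reply reported as under 1 ms."""
    hits = []
    for addr, op, ms in REPLY.findall(ping_output):
        if "<" in op and int(ms) <= 1:
            hits.append(addr)
    return hits


# Hypothetical sample: one plausible internet reply, one suspicious one.
sample = (
    "Reply from 203.0.113.10: bytes=32 time=16ms TTL=52\n"
    "Reply from 203.0.113.10: bytes=32 time<1ms TTL=64\n"
)
print(sub_millisecond_replies(sample))
```

A sub-millisecond reply from a host on the public internet usually suggests that something local answered the ping rather than the remote host, which is the concern raised above.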

Can you provide some more details regarding your local networking? Thanks.

Fish-Git commented 2 years ago

> Can you provide some more details regarding your local networking? Thanks.

And it might be a good idea to try your TTTest64 Ping Test again, but this time without using a tun interface beforehand. Do the test right away using only a tap defined interface.

It might not hurt to clear your ARP cache beforehand too, or even IPL your Windows system.

You said you had to reinstall Windows 10, and as I recall, Windows 10 did have some type of bug in its networking handling at some point in the distant past that was fixed with an update. Did you (re-)install all of your Windows/Security updates after you reinstalled Windows? (and before doing your Hercules test?)

cfdonatucci commented 2 years ago

Hi, Regarding the message, it is just because the CICS 54 DFHLPA may be in the lpalist, but the lib doesn't exist. In such cases you can reply 0,cancel. All console messages at that stage must be replied to with 0,something. Sorry if this is obvious.

zosexplorer is an IBM eclipse application working as a framework for other applications included as plugins, like cics explorer, git interfaces, zos connect enterprise edition, zosexplorer, dbb for zDevOps and others. I start my ADCD z/OS 2.4 with CI as well.

To reproduce my problem you could use zosexplorer. You need two started tasks: jmon and rsed, which I believe are started by default and define a connection to the default eclipse zos explorer port, port 4035.

With my previous Windows I used to have a fixed IP address 192.168.1.110 with a CTCI definition. Unfortunately I didn't keep record of that configuration. So I tried to do the same and now I have this problem. This worked in my other Windows. So I switched to an OSA definition, hoping it would help, but it didn't.

My local networking has DHCP, no other configuration. I erased the ARP cache and re-ran TTTest64 Ping Test:

Fish-Git commented 2 years ago

> Regarding the message, it is just because the CICS 54 DFHLPA maybe in the lpalist, but the lib doesn't exist. In such cases you can reply 0,cancel.

Thanks. That seems to have worked. I had to reply to it twice though. After the first reply, the same(?) message appeared again a minute or so later. After I replied to the second message, the system finally finished IPLing and is now running normally(?).

> To reproduce my problem you could use zosexplorer.

Which I know nothing about.  :(

> You need two started tasks: jmon and rsed...

Which I don't know how to do.  :(

> ...which I believe are started by default...

Good!!   :)

> ...and define a connection to the default eclipse zos explorer port, port 4035.

  1. How can I tell whether zosexplorer (i.e. jmon and rsed??) is finished and ready for work? After IPLing my system, all processors are running at nearly 100%. How do I know when everything is ready for me to begin my testing? How long (approximately) do I have to wait before trying my test?

  2. What do I do with port 4035? Do I connect a terminal to it? What type of terminal? 3270? Or simple command line?

  3. Once connected(?), THEN what do I do? Do I simply enter some type of command? What command do I enter? What do I need to try doing to reproduce the crash?

Sorry for the stupid questions, but as I explained, I know almost nothing about z/OS!  :(

Thanks.

Fish-Git commented 2 years ago

> 1. How can I tell whether zosexplorer (i.e. jmon and rsed??) is finished and ready for work?

FYI: When I press PF10 on my master console (to issue the D A,L command), I do see both JMON and RSED in the list of active tasks. Does that mean they're both ready? Does that mean I can do my test? (whatever that test is! I don't even know what I'm supposed to be doing!). Thanks.

cfdonatucci commented 2 years ago

one question, do you have a TSO session or just the console?

Fish-Git commented 2 years ago

> one question, do you have a TSO session or just the console?

TSO session too. I should tell you I DO know how to do a little teensy tiny bit. I know how to logon. I know how to use ISPF 6 to issue 'ping' and other commands. I know how to browse dataset/edit dataset members, submit jobs (although I do NOT know JCL!), look at printouts, etc. And I know how to cleanly shut the system down. But nothing beyond that.

cfdonatucci commented 2 years ago

ok, I can guide you to download zos explorer from IBM site, how to configure a connection to Mainframe and how it's invoked if you are willing to do it. I'll have to take a series of screenshots and send them to you. Please confirm if you want. additionally, do you have TXOFF/ON in your zos?

Fish-Git commented 2 years ago

> ...if you are willing to do it.

Yes, certainly!

> I'll have to take a series of screenshots and send them to you.

That's fine. My email address is fish at either softdevlabs.com or infidels.org.

> additionally, do you have TXOFF/ON in your zos?

Not yet, no. But I do have a copy of Jürgen's TXONOFF job stream, so as far as I know all I have to do is run it, and then it'll be on my system. Yes? And then all I need to do is enter the command 'txoff' or 'txon' from ISPF 6 to disable or enable unconstrained transactions, yes?

cfdonatucci commented 2 years ago

yes to both... i'm writing the instructions. I'll send them as soon as i can. Thank you very much.

Fish-Git commented 2 years ago

CORRECTION:   I just found my notes regarding TXONOFF:

--------------------------------------------------------------------------------

                    Disable/Enable UNconstrained transactions

Upload Jürgen's TXONOFF to IBMUSER adcd.lib.jcl and run it.

(Be careful during upload! The job stream is CASE SENSITIVE!)

Then to activate, either re-IPL, or issue console command "F LLA,REFRESH"

Then to switch unconstrained transactions OFF/ON
(FOR ALL SUBSEQUENTLY STARTED jobs!), simply do:

  "S TXOFF" at the console   (or "TXOFF" from a shell prompt)
  "S TXON"  at the console   (or "TXON"  from a shell prompt)

Example:

  alias cls=clear

  export PATH=$PATH:$JAVA_HOME/bin

  TXOFF

  time java com.ibm.jvm.format.TraceFormat test.trc
  time java com.ibm.jvm.format.TraceFormat test.trc > test.trc.fmt 2>&1

--------------------------------------------------------------------------------

Is that correct?

Fish-Git commented 2 years ago

> i'm writing the instructions. I'll send them as soon as i can.

Thanks! Standing by...

Fish-Git commented 2 years ago

FYI: TXONOFF is now on my system, and entering s txoff from the master console resulted in:

   - 17.24.51           s txoff
   - 17.24.51           IRR812I PROFILE * (G) IN THE STARTED CLASS WAS USED
   -         TO START TXOFF WITH JOBNAME TXOFF.
   - 17.24.51 STC00442  $HASP373 TXOFF    STARTED
   - 17.24.52 STC00442  IEF404I TXOFF - ENDED - TIME=17.24.52

So I think we're good to go.

Standing by for further instructions via email...

cfdonatucci commented 2 years ago

mail sent, let me know if you got it.

Fish-Git commented 2 years ago

Nope. Not yet.  :(

What email address did you send it to?

cfdonatucci commented 2 years ago

fish@softdevlabs.com

cfdonatucci commented 2 years ago

Hi, I'll attach the file here. Regards.

cfdonatucci commented 2 years ago

Hi I did two things:

  1. Started with CTCI support with debug. I reproduced the error, but saw nothing related to CTCI.
  2. Started with OSA support with debug. When I tried to connect to RSED, I got this:
12:07:23.003 0000302C HHC03991D 0:0401 QETH: RRH_TYPE_ULP: PUK_TYPE_DISABLE (ULP_DISABLE): Request
12:07:23.003 0000302C HHC03981D 0:0401 QETH: TH : +0000< 00E00000 0000001B 00000014 00000055  ...............U  .\..............
12:07:23.003 0000302C HHC03981D 0:0401 QETH: TH : +0010< 10000001                             ....              ....
12:07:23.003 0000302C HHC03981D 0:0401 QETH: RRH: +0000< 00000000 417E0001 00000004 00000003  ....A~..........  .....=..........
12:07:23.003 0000302C HHC03981D 0:0401 QETH: RRH: +0010< 00240015 00001505 D8C5E3F3 00000000  .$..............  ........QET3....
12:07:23.003 0000302C HHC03981D 0:0401 QETH: RRH: +0020< 00000000                             ....              ....
12:07:23.003 0000302C HHC03981D 0:0401 QETH: PH : +0000< 01000015 00000040                    .......@          .......
12:07:23.003 0000302C HHC03981D 0:0401 QETH: PUK: +0000< 000C4103 00090000 00000000           ..A.........      ............
12:07:23.003 0000302C HHC03981D 0:0401 QETH: PUS: +0000< 00090403 05000101 16                 .........         .........
12:07:23.003 0000302C HHC03997I 0:0401 QETH: tun0: not using MAC address 02:00:5e:a3:be:84
12:07:23.003 0000302C HHC03997I 0:0401 QETH: tun0: not using IP address 192.168.1.115
12:07:23.003 0000302C HHC03997I 0:0401 QETH: tun0: not using subnet mask 255.255.255.0
12:07:23.003 0000302C HHC03997I 0:0401 QETH: tun0: not using MTU 1500
12:07:23.012 0000302C HHC03991D 0:0402 QETH: Halting data device

Entire log attached.

I hope this helps.

Fish-Git commented 2 years ago

> What email address did you send it to?
>
> fish@softdevlabs.com

Weird. I never received it.

> Hi, I'll attach the file here. Regards.

Thanks. I downloaded it and tried it yesterday after making the following changes:

  1. I ran IBM Explorer for z/OS from another system on my local network: a Windows 7 x64 VMware virtual machine. The fact that it was a virtual machine shouldn't make any difference (as far as my Windows 7 host was concerned, it was another system on the local network).

  2. I added a new Inbound rule to my Windows Firewall to let just TCP port 4035 through. That didn't seem to work, so I changed it to Protocol = Any instead, which of course removed the port number restriction, effectively disabling my firewall entirely (i.e. letting anyone connect to anything from anywhere), and that did work:

[Screenshots: firewall1, firewall2, firewall3]

As you can see below, I was able to connect and things worked just fine (although I didn't know what the heck I was doing! I'm not familiar with IBM Explorer for z/OS!):

[Screenshot: zosexplorer]

I was however able to view files and printouts! Pretty cool!

One thing I did notice was that sometimes the connection would fail on the first attempt. But if I tried again it would work the second time around. I'm not sure what that means, if anything.

I also do not use DHCP. All of my systems are hard coded with their own uniquely assigned IP addresses, so it wasn't exactly a fair test. My initial goal wasn't to try and exactly reproduce your problem, but rather just to see if I could get it to work, and I succeeded in that endeavor.

I also have Checksum Offloading overridden in CTCI-WIN too:

Checksum offloading = OVERRIDDEN

(Refer to "Disable CTCI-WIN's default Checksum Offload behavior" in the "Common Problems" section of the CTCI-WIN Help file)

And finally, I do not have IPv6 enabled on my adapter either (whereas you do). Despite all the hype, I've personally never found much use for IPv6. If you have a lot of internet devices maybe you have a need, I don't know. But for me, living without IPv6 is not a problem.

I would also note that at no time during my initial attempts (when my connection attempts would fail and RSED would crash due to a TXF restricted instruction failure) did my system crash. Hercules remained up and running just fine.

If I get time I will MAYBE try to configure my system to use DHCP and also try running IBM Explorer for z/OS on the same system that Hercules is running on, just to see whether that makes any difference or not. I'm doubting it will, but it might be worth a shot.

Personally I think your local Windows network is borked. Your second TTTest64 report is still showing "time=<1ms" for your pings to www.linux.org, which is virtually impossible.

Some things to try/check:

That's all for now.

I'll continue trying to reproduce your crash but so far I haven't had any luck.

cfdonatucci commented 2 years ago

Hi it's great you made it work. I'll take a look at your info as soon as I have time as I'm in between jobs. I really appreciate your help. Take care.

cfdonatucci commented 2 years ago

Is it possible for you to send a couple of screenshots of your fixed IP definitions on Windows? tks

Fish-Git commented 2 years ago

Here you go:

[Four screenshots attached]

Fish-Git commented 2 years ago

What happens when you try to "ping www.linux.org" from z/OS? (i.e. ISPF function 6) or any other IP address? Does "HOMETEST" complete successfully? Did you configure your z/OS "NSINTERADDR" DNS server values in member ADCD.Z24B.TCPPARMS(TCPDATA)?

I still think it's a problem with your Windows host's networking configuration. Since what started this whole mess was your having to reinstall your Windows 10 "after a problem" (what was the problem by the way??), Windows may have installed a default/generic device driver for your networking adapter during the install. Have you checked with your manufacturer (Realtek?) to see if there's a newer version?

Yes, I'm grasping at straws here! I admit it. But your TTTest64 ping test results keep setting off alarm bells for me!

I admit however, that no matter how screwed up your host networking is, it shouldn't be causing Hercules to crash! Hercules should, ideally, never crash.

I don't think I ever asked you: is this problem (Hercules crashing) reliably reproducible? Does it happen every single time?

Fish-Git commented 2 years ago

Trying to launch IBM Explorer for z/OS from my Windows 7 x64 host system, I'm getting the following error dialog:

---------------------------
Zosexplorer
---------------------------
Java was started but returned exit code=13
-Dorg.eclipse.swt.accessibility.UseIA2=false
-Djava.util.Arrays.useLegacyMergeSort=true
-XX:MaxPermSize=256m
-Djava.class.path=C:\Users\Fish\Downloads\#489\IBM Explorer for zOS\\plugins/org.eclipse.equinox.launcher_1.5.0.v20180512-1130.jar
-os win32
-ws win32
-arch x86_64
-showsplash
-launcher C:\Users\Fish\Downloads\#489\IBM Explorer for zOS\zosexplorer.exe
-name Zosexplorer
--launcher.library C:\Users\Fish\Downloads\#489\IBM Explorer for zOS\\plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.700.v20180518-1200\eclipse_1705.dll
-startup C:\Users\Fish\Downloads\#489\IBM Explorer for zOS\\plugins/org.eclipse.equinox.launcher_1.5.0.v20180512-1130.jar
--launcher.overrideVmargs
-showlocation
-pluginCustomization plugin_customization.ini
-vm C:\Users\Fish\Downloads\#489\IBM Explorer for zOS\jre\bin\j9vm\jvm.dll
-vmargs
-Dorg.eclipse.swt.accessibility.UseIA2=false
-Djava.util.Arrays.useLegacyMergeSort=true
-XX:MaxPermSize=256m
-Djava.class.path=C:\Users\Fish\Downloads\#489\IBM Explorer for zOS\\plugins/org.eclipse.equinox.launcher_1.5.0.v20180512-1130.jar 
---------------------------
OK   
---------------------------

According to "https://stackoverflow.com/questions/11461607/cant-start-eclipse-java-was-started-but-returned-exit-code-13" it's because I don't have a 64-bit version of JDK installed.

I've had MANY/MUCH PROBLEMS with Java in the past on my system, so I am NOT going to try installing java to try to fix the problem. Sorry!  :(

Having a stable host system for Hercules development is of paramount importance to me and I don't trust Oracle at all. Every damn time I try to do something with java it invariably ALWAYS causes me MUCH GRIEF, causing me to spend a LOT of time and effort straightening out (fixing/undoing) the damage that Java/Oracle has done to my system!

So it looks like I'm not going to be able to reproduce your environment.  :(

We'll just have to try and figure out what's wrong with your Windows 10 networking.

Fish-Git commented 2 years ago

Additional info:

C:\Program Files (x86)\MegaRAID Storage Manager\JRE\bin> java -version
java version "1.8.0-ea"
Java(TM) SE Runtime Environment (build 1.8.0-ea-b88)
Java HotSpot(TM) Client VM (build 25.0-b30, mixed mode)

I'm not sure what "mixed mode" means, but the fact that it's in the "Program Files (x86)" directory tells me it's a 32-bit version of java.
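
Rather than inferring bitness from the install directory, the machine type can be read straight from the executable or DLL. The following is a minimal sketch, assuming the standard PE layout (the offset of the PE header is stored at 0x3C, and the COFF `Machine` field follows the `PE\0\0` signature); the function name `pe_machine` is illustrative only.

```python
import struct

# COFF Machine values for the two cases of interest here.
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x64 (64-bit)"}


def pe_machine(data):
    """Return the machine type recorded in a PE image's COFF header."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE image")
    # The 4-byte little-endian offset of the PE header lives at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The Machine field is the first 2 bytes after the signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, hex(machine))


# Possible usage (path taken from the error dialog above):
#   with open(r"C:\Users\Fish\Downloads\#489\IBM Explorer for zOS"
#             r"\jre\bin\j9vm\jvm.dll", "rb") as f:
#       print(pe_machine(f.read(4096)))
```

Reading the first 4096 bytes is an assumption that the PE header sits near the start of the file, which is typical but not guaranteed. Comparing the result for `zosexplorer.exe` and for the `jvm.dll` it tries to load would show whether the two actually match.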

I suppose I could download the 32-bit version of IBM Explorer for z/OS and try that. That might work. Let me think about that...

Fish-Git commented 2 years ago

> I suppose I could download the 32-bit version of IBM Explorer for z/OS and try that. That might work. Let me think about that...

The 32-bit version doesn't work for me either. It gets the same error:

    Java was started but returned exit code=13

So we're just going to have to debug this issue on your system instead. I'm unable to reproduce it on mine. :(

cfdonatucci commented 2 years ago

Yes, I think that's better because there is a lot of stuff to consider. Anyway, I ran a lot of tests trying to use a fixed IP, but at some point nothing worked. So I uninstalled everything, verified that almost all entries in the registry were gone, reinstalled, and am now trying to make it work with DHCP. My router's IP range goes from 192.168.1.30 to 192.168.1.63, so IPs around 100 are not used. I defined two devices in z/OS, one CTCI and one OSA, and both started okay. I ran all tests with the CTCI one using IP 192.168.1.112.
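
As a quick sanity check of the addressing described above, the sketch below tests whether the two guest addresses fall inside the router's range. It assumes that range is the router's DHCP pool, which the post implies but does not state outright; `in_dhcp_pool` is an illustrative helper name.

```python
import ipaddress

# Router range as stated in the post (assumed to be the DHCP pool).
POOL_START = ipaddress.ip_address("192.168.1.30")
POOL_END = ipaddress.ip_address("192.168.1.63")


def in_dhcp_pool(addr):
    """Return True if addr lies inside the router's address range."""
    ip = ipaddress.ip_address(addr)
    return POOL_START <= ip <= POOL_END


# The CTCI and OSA guest addresses used in this thread.
for guest in ("192.168.1.112", "192.168.1.115"):
    print(guest, "inside pool" if in_dhcp_pool(guest) else "outside pool")
```

Both guest addresses sit outside the range, so under that assumption the router should not hand either of them to another device.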

Once z/OS started, I did this.

  1. HomeTEST:

    EZA0619I Running IBM MVS TCP/IP CS V2R4 TCP/IP Configuration Tester         
    EZA0621I The FTP configuration parameter file used will be "TCPIP.FTP.DATA".  
    EZA0602I TCP Host Name is: S0W1                                               
    EZA0605I Using Host Tables to Resolve S0W1                                           
    EZA0611I The following IP addresses correspond to TCP Host Name: S0W1                
    EZA0612I 192.168.1.112                                                                                        
    EZA0614I The following IP addresses are the HOME IP addresses defined in PROFILE.TCPIP:   
    EZA0615I 10.1.10.1                                                                   
    EZA0615I 192.168.1.112                                                               
    EZA0615I 192.168.1.115                                                               
    EZA0615I 10.1.10.1                                                                   
    EZA0615I 127.0.0.1                                                                  
    EZA0618I All IP addresses for S0W1 are in the HOME list!                             
    EZA0622I Hometest was successful - all Tests Passed! 
  2. netstat HOME:

    EZZ2350I MVS TCP/IP NETSTAT CS V2R4       TCPIP Name: TCPIP           16:53:22 
    EZZ2700I Home address list:                                                    
    EZZ2701I Address          Link             Flg                                 
    EZZ2702I -------          ----             ---                                 
    EZZ2703I 192.168.1.112    ETH1             P                                   
    EZZ2703I 192.168.1.115    OSA1                                                 
    EZZ2703I 10.1.10.1        EZASAMEMVS                                           
    EZZ2703I 127.0.0.1        LOOPBACK                                             
    EZZ2704I Address          Interface        Flg                                 
    EZZ2704I -------          ---------        ---                                 
    EZZ2703I 10.1.10.1        EZAZCX 
  3. PINGS:

    
    Pinging 192.168.1.112 with 32 bytes of data:
    Reply from 192.168.1.112: bytes=32 time=2ms TTL=64
    Reply from 192.168.1.112: bytes=32 time=1ms TTL=64
    Reply from 192.168.1.112: bytes=32 time=2ms TTL=64
    Reply from 192.168.1.112: bytes=32 time=1ms TTL=64

    Ping statistics for 192.168.1.112:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 1ms, Maximum = 2ms, Average = 1ms

    C:\Users\Carlos>ping 192.168.1.115

    Pinging 192.168.1.115 with 32 bytes of data:
    Reply from 192.168.1.115: bytes=32 time<1ms TTL=64
    Reply from 192.168.1.115: bytes=32 time=3ms TTL=64
    Reply from 192.168.1.115: bytes=32 time=1ms TTL=64
    Reply from 192.168.1.115: bytes=32 time=2ms TTL=64

    Ping statistics for 192.168.1.115:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 0ms, Maximum = 3ms, Average = 1ms


These PINGS were executed using TSO option 6:

CS V2R4: Pinging host linux.org (172.67.148.63)
Ping #1 response took 0.006 seconds.

CS V2R4: Pinging host github.com (140.82.112.4)
Ping #1 response took 0.170 seconds.


4. I was able to start a TSO session using CTCI connection.

5. TXOFF executed and RSED started.

6. At this point I attempted a connection from z/OS Explorer. The connection is established as you can see, but at this point the adapter failed. This failure can be consistently reproduced.

Options: CONN TCP TCPIP STACK TITLES ( CLI RSED*
EZZ2350I MVS TCP/IP NETSTAT CS V2R4 TCPIP Name: TCPIP 17:01:39
EZZ2585I User Id Conn Local Socket Foreign Socket State
EZZ2586I ------- ---- ------------ -------------- -----
EZZ2587I RSED 00000042 0.0.0.0..4035 0.0.0.0..0 Liste
EZZ2587I RSED1 0000006D 192.168.1.112..9308 192.168.1.36..50679 Estab


and this error is issued:

11.01.15 STC09959 +FEK115E write() failed. reason=(EDC5140I Broken pipe.)


> _EDC5140I Broken pipe._
> _**Explanation:** A write was attempted on a pipe or FIFO for which there was no process to read the data. This message is equivalent to the POSIX.1 EPIPE errno._
> _**System action:** The request fails. The application continues to run._
> _**Programmer response:** Refer to z/OS XL C/C++ Runtime Library Reference for the function being attempted for the specific reason for failure._

7. I stopped and restarted Hercules and z/OS, then tried with z/OS Connect.
Using zCEE I could acquire a connection as well. I could do some tasks with services and APIs:

Options: CONN TCP TCPIP STACK TITLES ( CLI RSED ZC
EZZ2350I MVS TCP/IP NETSTAT CS V2R4 TCPIP Name: TCPIP 18:09:29
EZZ2585I User Id Conn Local Socket Foreign Socket State
EZZ2586I ------- ---- ------------ -------------- -----
EZZ2587I ZCEESRV1 00000066 0.0.0.0..9001 0.0.0.0..0 Liste
EZZ2587I ZCEESRV1 00000065 0.0.0.0..9002 0.0.0.0..0 Liste
EZZ2587I ZCEESRV1 00000064 0.0.0.0..9000 0.0.0.0..0 Liste
EZZ2587I ZCEESRV1 00000067 0.0.0.0..9003 0.0.0.0..0 Liste
EZZ2587I ZCEESRV1 000000AC 192.168.1.112..9002 192.168.1.36..53681 Estab
EZZ2587I ZCEESRV1 00000063 0.0.0.0..9004 0.0.0.0..0 Liste
EZZ2587I ZCEESRV1 00000047 127.0.0.1..1025 0.0.0.0..0 Liste


8. When I tried to deploy a service using that connection, the adapter crashed again.
When Hercules is stopped, it hangs:

18:17:52.348 0000180C HHC00417I 0:0AA8 CKD file d:/ZOS240/dasd/S1C521: cache hits 460, misses 152, waits 0
18:18:11.513 * "Hercules" forcibly terminated by user request
18:18:11.513 Hercules: kill 0000180C
18:18:11.513 kill 0x0000180C (Hercules)



_**Summary:**_

It's a strange failure, because the connections are established but something makes them fail. I had to reinstall Windows because I upgraded my motherboard, CPU and memory. Hercules ran very slowly on my old hardware; now it performs reasonably well. I also installed Windows from a different ISO file.
Fish-Git commented 2 years ago

> and this error is issued:
>
> 11.01.15 STC09959  +FEK115E write() failed. reason=(EDC5140I Broken pipe.)

I had this same error before updating my Windows Firewall. The clue is:

> EDC5140I Broken pipe. Explanation: A write was attempted on a pipe or FIFO for which there was no process to read the data. This message is equivalent to the POSIX.1 EPIPE errno.

Try temporarily disabling your Windows Firewall, or adding a rule like I did further above. (And don't forget to disable the rule afterwards when you're no longer using it or no longer need it, and/or re-enable the Windows Firewall again if you chose to completely disable it, so that your network security isn't left in an exposed state! This is just a temporary test after all! Normally you shouldn't be disabling the Firewall, but we need to temporarily do so to determine whether or not it's the cause.)

Fish-Git commented 2 years ago

P.S. Also, if you haven't done so already, you should add a "Ping" rule to your Windows Firewall as well, as explained in the "Add a "Ping" rule to Windows Firewall" topic of the "Common Problems" chapter of the CTCI-WIN Help file.

Fish-Git commented 2 years ago

You might also need to do a network trace (e.g. Wireshark) to find out what's actually going on. That should let us know whether the packets are actually getting sent or not. If they are but the recipient isn't receiving them, it's more than likely the firewall.

You might need to add a custom (specific) Windows Firewall rule to let through all packets from z/OS (i.e. from IP addresses 192.168.1.112 and 192.168.1.115).

cfdonatucci commented 2 years ago

Hi, I did some additional testing, both with the firewall completely disabled and with new rules (which I assume I defined properly). It didn't work. So I installed Wireshark and ran a test capturing 192.168.1.112.

First, I logged into an IP terminal, which was okay, and saw the trace. Then I logged off. That part is not in the attached file.

Then I started a session from z/OS Explorer using Cicsplex manager, and it worked perfectly!

Then I did the usual stuff with RSED, and I got the error.

In the trace you'll see a TLS error:

10255   896.578269  192.168.1.36    192.168.1.112   TLSv1.2 61  Alert (Level: Fatal, Description: Unexpected Message)

then several:

10272   897.015414  192.168.1.112   192.168.1.36    TCP 54  [TCP Dup ACK 10269#1] 4035 → 54492 [ACK] Seq=1 Ack=28 Win=131040 Len=0

and finally, many:

10386   904.250068  192.168.1.112   192.168.1.36    TCP 1514    [TCP Spurious Retransmission] 9308 → 54493 [PSH, ACK] Seq=17 Ack=653 Win=130400 Len=1460.

Port 54493 was the one I was connected with:

EZZ2350I MVS TCP/IP NETSTAT CS V2R4       TCPIP Name: TCPIP           13:40:37
EZZ2585I User Id  Conn     Local Socket           Foreign Socket         State
EZZ2586I -------  ----     ------------           --------------         -----
EZZ2587I RSED     00000044 0.0.0.0..4035          0.0.0.0..0             Liste
EZZ2587I RSED9    0000005B 192.168.1.112..9308    192.168.1.36..54493    Estab

So I have to see why TLS is involved here, because I thought I disabled it.

I hope this helps.

Please let me know if you want some special print of the trace.

Also... when I installed Wireshark, I was notified about this:

Should I install Npcap?

cfdonatucci commented 2 years ago

Hi

I officially installed Wireshark and rebooted the PC. (I was using the portable version before.)

I started the .112 trace before starting Hercules. At the beginning of the trace, I see that the same IP, .112, has two MAC addresses. I don't know how that can be possible.

After that the same error occurred:

7   153.340184  02:00:5e:a8:01:73   Broadcast   ARP 42  ARP Announcement for 192.168.1.112 (duplicate use of 192.168.1.112 detected!)
8   153.340261  02:00:5e:a8:01:70   02:00:5e:a8:01:73   ARP 42  Gratuitous ARP for 192.168.1.112 (Reply)
9   158.600785  02:00:5e:a8:01:73   02:00:5e:a8:01:73   ARP 42  Gratuitous ARP for 192.168.1.112 (Reply) (duplicate use of 192.168.1.112 detected!)
10  189.193929  AskeyCom_79:c8:10   Broadcast   ARP 60  Who has 192.168.1.112? Tell 192.168.1.1
11  189.194015  02:00:5e:a8:01:70   AskeyCom_79:c8:10   ARP 42  192.168.1.112 is at 02:00:5e:a8:01:70
12  189.194112  02:00:5e:a8:01:73   AskeyCom_79:c8:10   ARP 42  192.168.1.112 is at 02:00:5e:a8:01:73

New trace:

Bye.

Fish-Git commented 2 years ago

> and ran a test capturing 192.168.1.112.

You should have captured both 192.168.1.112 and 192.168.1.115. Earlier you said:

> I defined two devices in z/OS: one CTCI and one OSA.

And we can see the following in your HOMETEST:

EZA0614I The following IP addresses are the HOME IP addresses defined in PROFILE.TCPIP:   
EZA0615I 10.1.10.1                                                                   
EZA0615I 192.168.1.112                                                               
EZA0615I 192.168.1.115

So it looks like you've defined both IP addresses to your z/OS guest. I'm presuming one of them was assigned to the CTCI device and one was assigned to the OSA device.

> Then I started a session from z/OS Explorer using Cicsplex manager, and it worked perfectly!

Fantastic!

> Then I did the usual stuff with RSED...

Wait... WHAT?!

You "did the usual stuff"? What does that mean? Do you mean you started (ran) TXOFF and then started RSED afterwards? Is that what you mean? Did you cancel/kill the existing RSED beforehand? Because if you didn't, then you would end up with TWO running RSED instances, which might explain the "TCP Dup ACK" and "TCP Spurious Retransmission" errors you're seeing in your Wireshark trace.

Did you disconnect from your previous session before you "did the usual stuff with RSED"? That might explain things as well.

As far as I know, you can (should) only have ONE and only one running instance of RSED (unless each instance is listening for connections on a completely different port of course). Having multiple server instances each listening for connections on the same server port is a recipe for disaster.

> Should I install Npcap?

No.

Well... technically... you can if you want to. But if you do, any problems you might have with CTCI-WIN and/or Hercules networking in general are your own to resolve. Using Npcap instead of WinPCap is unsupported by CTCI-WIN. It might work, or it might not. I don't know. I've never tried it and am not interested in trying it. WinPCap works fine.

Conclusion:

Based on the fact that you were able to successfully establish a z/OS Explorer session, it sounds to me like your Windows Firewall was the culprit all along. Which makes sense. Hercules (specifically, z/OS's RSED) was trying to communicate with your z/OS Explorer client, and Windows Firewall wasn't letting anything through. Thus the broken pipe write failures. As soon as you disabled(?) (i.e. "fixed") your Windows Firewall issue, things started working.

NOW your only problem is having two different IP addresses assigned to your z/OS guest because you have two different adapters/interfaces defined: one CTCI and one OSA.

My suggestion would be to choose one and drop the other. Personally, I prefer OSA. While I'm sure CTCI or even LCS would also work just fine, OSA is more modern from a z/OS point of view and thus the device/protocol that z/OS more than likely "prefers", so that's what I'd go with: OSA.

But the choice is yours.

p.s. I personally don't recall having to mess with TLS or SSL at all, so I'm not sure what you're referring to having to do? I didn't have to change/configure anything. I just started IBM Explorer for z/OS, connected, and VOILA! I was up and running.

Fish-Git commented 2 years ago

> At the beginning of the trace, I see the same IP .112 has two macs. Don't know how that can be possible.

It's probably because you have two networking interfaces defined in your z/OS guest (one, a CTCI device, the other, an OSA device), and (I'm presuming), you've more than likely defined both of them with the same IP address. You need to either get rid of one of them or else assign it a different IP address.
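The duplicate-address condition can be spotted mechanically. The following Python sketch groups ARP "is at" replies by IP address and reports any IP claimed by more than one MAC. It assumes input in the style of Wireshark's summary text; the sample lines are the two ARP replies from the trace in this thread, and the function name `find_duplicate_ips` is hypothetical.

```python
from collections import defaultdict

# The two ARP replies from the trace in this thread (Wireshark summary text).
ARP_LINES = [
    "11  189.194015  02:00:5e:a8:01:70  AskeyCom_79:c8:10  ARP  42  "
    "192.168.1.112 is at 02:00:5e:a8:01:70",
    "12  189.194112  02:00:5e:a8:01:73  AskeyCom_79:c8:10  ARP  42  "
    "192.168.1.112 is at 02:00:5e:a8:01:73",
]


def find_duplicate_ips(lines):
    """Return {ip: [macs]} for every IP that more than one MAC claims."""
    claims = defaultdict(set)
    for line in lines:
        if " is at " not in line:
            continue  # not an ARP reply summary
        left, mac = line.rsplit(" is at ", 1)
        ip = left.split()[-1]  # the token just before "is at"
        claims[ip].add(mac.strip().lower())
    return {ip: sorted(macs) for ip, macs in claims.items() if len(macs) > 1}


if __name__ == "__main__":
    for ip, macs in find_duplicate_ips(ARP_LINES).items():
        print(f"{ip} is claimed by {', '.join(macs)}")
```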

Fish-Git commented 2 years ago
11  189.194015  02:00:5e:a8:01:70  AskeyCom_79:c8:10  ARP  42  192.168.1.112 is at 02:00:5e:a8:01:70
12  189.194112  02:00:5e:a8:01:73  AskeyCom_79:c8:10  ARP  42  192.168.1.112 is at 02:00:5e:a8:01:73

As you can see from your Wireshark trace, the two conflicting MAC addresses are "02:00:5e:a8:01:70" and "02:00:5e:a8:01:73". Those are MAC addresses that are automatically generated by Hercules, and correspond to IP addresses 192.168.1.112 and 192.168.1.115 (x'70' = 112 and x'73' = 115):

> If not specified then one will be internally generated in the range 02:00:5E:80:00:00 - 02:00:5E:FF:FF:FF using the low order 23 bits of the IPv4 address. For example, if the ipv4 address is 10.1.2.3 the generated MAC address will be 02:00:5E:81:02:03.
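The rule quoted above can be checked with a few lines of Python. This is a sketch of one reading of that rule, not code taken from Hercules: it assumes the generated address is the fixed prefix 02:00:5E followed by a 24-bit value whose high bit is set and whose remaining 23 bits are the low-order bits of the IPv4 address. That reading reproduces both the documented 10.1.2.3 example and the two MAC addresses in the trace. The function name `generated_mac` is made up for illustration.

```python
import ipaddress


def generated_mac(ipv4: str) -> str:
    """Derive a MAC from an IPv4 address under the assumption stated above."""
    addr = int(ipaddress.IPv4Address(ipv4))
    low23 = addr & 0x7FFFFF      # low-order 23 bits of the IPv4 address
    suffix = 0x800000 | low23    # keeps the result in 80:00:00 - FF:FF:FF
    octets = [0x02, 0x00, 0x5E,
              (suffix >> 16) & 0xFF, (suffix >> 8) & 0xFF, suffix & 0xFF]
    return ":".join(f"{octet:02X}" for octet in octets)


if __name__ == "__main__":
    for ip in ("10.1.2.3", "192.168.1.112", "192.168.1.115"):
        print(ip, "->", generated_mac(ip))
```

Under this assumption, 192.168.1.112 maps to 02:00:5E:A8:01:70 and 192.168.1.115 maps to 02:00:5E:A8:01:73, matching the two addresses seen answering for .112 in the trace.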

Methinks you need to empty your ARP cache (and/or delete one or both of the entries for .112 and .115), as well as fix your z/OS guest's networking device and IP address assignments.

Once you do that (along with your existing Windows Firewall fix which you've already done), things should start working just fine for you.

AS FAR AS THE ORIGINAL HERCULES CRASH IS CONCERNED...

I'm going to have to presume it's simply a side effect of your Windows Firewall that unfortunately just happened, by coincidence, to impact Hercules. What more than likely happened was whatever packets needed to be sent/received by Hercules as part of z/OS's attempt to halt its networking adapter, got "eaten" by Windows Firewall, causing Hercules to either end up waiting forever for a response to one of its requests, or, for it to wait "too long" (i.e. longer than 20 seconds) for the response.

When that happens, Hercules's "watchdog" thread (whose interval is currently hard coded at 20 seconds) kicks in and notices one of Hercules's guest processors hasn't made any progress for the past 20 seconds (indicating something is very wrong somewhere (no instruction should ever take longer than 20 seconds to complete!)) and so forces a crash dump.

At least that's my working theory anyway.
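To make the watchdog idea concrete, here is a small Python sketch of a progress check with a 20-second timeout. It is only an illustration of the general technique described above, not Hercules's implementation: the class name, the use of an instruction counter as the progress signal, and the injectable clock are all assumptions made for this example.

```python
import time

WATCHDOG_SECONDS = 20.0  # interval mentioned above


class ProgressWatchdog:
    """Flags a stall when a progress counter stops changing for too long."""

    def __init__(self, timeout=WATCHDOG_SECONDS, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable so the logic can be tested
        self.last_count = None
        self.last_change = clock()

    def check(self, instruction_count: int) -> bool:
        """Return True if the counter has not changed for `timeout` seconds."""
        now = self.clock()
        if instruction_count != self.last_count:
            # Progress was made: remember the new value and restart the timer.
            self.last_count = instruction_count
            self.last_change = now
            return False
        return (now - self.last_change) >= self.timeout
```

A monitoring thread would call `check()` periodically with the processor's current counter and take action (in Hercules's case, forcing a crash dump) when it returns `True`.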

Do that, and you should be fine.

Hope that helps!

cfdonatucci commented 2 years ago

Very simple test:

This test can be consistently reproduced.

Documents attached:

Have a nice weekend.

mcisho commented 2 years ago

192.168.1.36 is sending Ethernet frames that are too large, with the IP packet containing a length of zero, see frames 7997 and 7999 in the last trace you provided. Check the network settings on 192.168.1.36.

cfdonatucci commented 2 years ago

OK, what should I test? Could you be more specific, please? Are you referring to these options? [image]

mcisho commented 2 years ago

> could you be more specific please?

No, I'm afraid I can't, I don't know anything about your machine(s), or your network. All I know is what the Wireshark trace on the Hercules host showed, i.e. that 192.168.1.36 is sending Ethernet frames that appear to be unusual.

All I can suggest is that you disable any offloads, and check the MTU that is in use.

Fish-Git commented 2 years ago

Ian said:

> 192.168.1.36 is sending Ethernet frames that are too large, with the IP packet containing a length of zero, see frames 7997 and 7999 in the last trace you provided. Check the network settings on 192.168.1.36.

Thank you for that, Ian! I have not had a chance to download or examine Carlos's latest postings yet.   (I just woke up!)   I will do so A.S.A.P., but it sounds like you may have already found the problem.

Carlos said:

> are you referring to these options?

Yes. Due to the way CTCI-WIN works, since your IBM Explorer for z/OS client is running on the same system that Hercules is running on, packets to/from your IBM Explorer client and your z/OS Hercules guest are being intercepted before they reach the actual physical Windows adapter (which is where the offloading actually occurs), resulting in Hercules receiving packets larger than it can handle. (Your Windows host thinks it is communicating with another physical system somewhere out there on your local network and so is purposely sending "Large" packets to be efficient, since it believes your adapter will properly "offload" them, i.e. split them into smaller packets.)

But because WinPCap (CTCI-WIN) intercepts them before they reach the physical adapter, the offloading is not happening and Hercules ends up receiving packets much larger than it can handle. This results in malformed packets being received by your z/OS guest (which is why it keeps trying to disable its OSA adapter as part of its error recovery).

Make sure your "Large Send Offload" and "Jumbo Frame" settings are set to "Disable". This is mentioned in the "Disable an adapter's Large Send Offload (LSO) option" section of the "Common Problems" chapter of the CTCI-WIN Help file.
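The symptom Ian described (frames that are too large, carrying an IP total length of zero) can be expressed as a simple check. The Python sketch below assumes a standard 14-byte Ethernet header and the 1500-byte MTU from the z/OS route definitions, which gives a maximum frame of 1514 bytes, the same size as the full-sized TCP frames in the earlier trace. The function names and the sample frame size in the usage lines are hypothetical; the actual sizes of frames 7997 and 7999 are not quoted in this thread.

```python
ETHERNET_HEADER_BYTES = 14  # destination MAC + source MAC + EtherType


def max_frame_bytes(mtu: int = 1500) -> int:
    """Largest captured frame (without FCS) that fits the given MTU."""
    return ETHERNET_HEADER_BYTES + mtu


def is_oversized(frame_bytes: int, mtu: int = 1500) -> bool:
    """True if the captured frame is larger than the MTU allows."""
    return frame_bytes > max_frame_bytes(mtu)


def looks_like_unsegmented_lso(frame_bytes: int, ip_total_length: int,
                               mtu: int = 1500) -> bool:
    """True for the pattern described above: oversized frame, IP length 0."""
    return is_oversized(frame_bytes, mtu) and ip_total_length == 0


if __name__ == "__main__":
    print(is_oversized(1514))                      # normal full-sized frame
    print(looks_like_unsegmented_lso(4000, 0))     # hypothetical large send
```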

I will download and examine your latest tests as soon as I've had my first cup of coffee.

Fish-Git commented 2 years ago

Can you post your current Hercules configuration file, please? Thanks.

cfdonatucci commented 2 years ago
Fish-Git commented 2 years ago

Thank you.

I notice you have switched to using Windows adapter 192.168.1.36, whereas before you were using 192.168.1.37. Can you post another TTTest64 report, please? Thanks.

cfdonatucci commented 2 years ago

My IP changes whenever I start my PC... now it is .37 again... that's why I wanted to use the MAC.

Fish-Git commented 2 years ago

> My IP changes whenever I start my PC... now it is .37 again... that's why I wanted to use the MAC.

Interesting! Usually when you use DHCP, the lease is simply renewed on the IP Address that was already previously assigned, and thus should be stable. I've never heard of a DHCP server assigning a brand new IP address before. I wonder why that's happening? Who's your DHCP server? Your router/gateway? 192.168.1.1? What manufacturer/model is it? (Not important. Just curious.)

Fish-Git commented 2 years ago

FYI: I noticed you've added NETDEV D8-5E-D3-81-FE-1D to your configuration file. Because you have, you now shouldn't need to specify any iface parameter on your OSA device statement. You should now be able to just use:

0400.3  OSA  chpid F0  ipaddr 192.168.1.112  netmask 255.255.255.0 

(When iface is not specified, it defaults to your NETDEV value)

mcisho commented 2 years ago

> My IP changes whenever I start my PC

Why not use a static IP address?