SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator

CTCE links fail using TSAF under VM #640

Closed jeff-snyder closed 4 months ago

jeff-snyder commented 5 months ago

Hi,

I've run into a couple of issues with CTCEs running with TSAF (Transparent Services Access Facility) on VM. Could someone take a look at these and help me find a way to get this working? Thanks!

Please note:

When using VM OSes more advanced than VM/SP 5, TSAF fails to open CTCE links. The same links work when a virtual CTCA is used.

When the link fails to come up, the following messages are generated (where "801" is the link address):

ATSL1Y795I Retry limit exceeded on unit 0801 SET_370_MODE
ATSL1Y708E An attempt to reset link 0801 has failed

OSes tested include:

I have included configurations as well as Hercules, TSAFVM and OPERATOR logs for 2 VM/ESA systems trying to establish a link in the attached zip file. The Hercules logs include CTC debugging information.

JS04 and JS01 are VM/ESA 2.4.0 systems, connected by a CTCE link. Address 801 on JS04 connects to address 804 on JS01.

Files in the zip file:


Additionally, when connecting two VM/SP 5 images, TSAF worked when I first played with it on a year-old version of Hercules. However, when I updated to the current dev-level code to submit this issue, the update broke the VM/SP 5 links. When attempting to bring up the link, TSAF reported the following errors (where 710 is the link address). These errors appear to be cosmetic, since the link came up on the old version of Hercules despite them:

ATSL1T797I Unit-check indicated on unit 710
ATSL1T794I Sense bits are '40'X

In the attached zip file, I have included Hercules configurations as well as Hercules, TSAFVM and OPERATOR logs for 2 VM/SP systems trying to establish a link, both in the old version of Hercules that worked and the new version that failed. The Hercules logs include CTCE debugging information.

JS02 and JS10 are VM/SP 5 systems, connected by a CTCE link. Address 710 on JS02 connects to address 702 on JS10.

Files in the zip file:

Peter-J-Jansen commented 5 months ago

Hi Jeff,

Interesting, I never tried VM's TSAF, but spent quite some effort before I got SSI to work with CTCE's. Am I correct to assume that even the newer z/VM's also still support TSAF? I would certainly like to make TSAF work as well, and actually already started investigating the logs you sent. The first pair, JS01 <-> JS04, already shows that the TSAF startup does not work, which in the past for me was always a big stumbling block, as just about every CTC link CCW program solves the startup problem differently.

Solving this TSAF - CTCE problem will take some time, also because I'll be offline soon until the second half of March. So please bear with me.

Cheers,

Peter

jeff-snyder commented 5 months ago

Thanks for looking at this, Peter.

I'll write up a doc about how I have TSAF set up between two z/VMs, hopefully that will be enough to get you started.

One thing to be aware of: the TSAF on SP5 is incompatible with the TSAF on the other versions, so don't compare those logs (JS02 and JS10) hoping to find a fix for ESA and z/VM. When I run SP5 as a guest of VM/ESA, the VCTC links come up, but the SP5 side immediately drops with a message:

VCTCA between VM/ESA and VM/SP 5

VM/ESA side:

add link 818                                                                    
21:35:31 ATSLLM724I Link 0818 added                                             
21:35:36 ATSLMN707I Link 0818 came up

VM/SP 5 side:

add link 700                                                                    
ATSLLM724I Link 700 added                                                       
ATSNHR602E Incompatible release or service level detected on lin                
ATSLLM713I Link 700 deleted

I don't know how much their CTC drivers differ, but given that SP5 is the only 370 mode OS in the mix, it could be a lot.

I'm also working to get a VM trace of the VCTC when TSAF comes up successfully with JS01 running as a guest of JS04. I should have more on that soon, I just need to do some research first. I've done a VM trace before, but it's been a while and I didn't really know what I was doing then...

Don't worry about the timing. I've got plenty to do until this gets going. Just make sure and let me know if there's anything I can do to help it along.

Jeff

jeff-snyder commented 5 months ago

Hey Peter,

Here's version 0.1 of the document on running TSAF over a VCTCA. I have a section in there for running it over a CTCE, but I'll flesh that out when we get further into this process and I can get past the first step without it failing.

FYI, this is rough and rambles a bit, but I'm hoping you'll be able to follow it and recreate what I've done. I figured it was more important to get it to you sooner and work on polishing it later.

Jeff

jeff-snyder commented 5 months ago

Hi @Peter-J-Jansen

I'm researching the VM I/O TRACE options and looking at what's available with an eye toward getting it as close to what Hercules provides, so it'll be easier to compare and contrast.

z/VM provides the ability to trace the following I/O "types":

SIO, SIOF, SSCH, RIO, RSCH

Use TRACE SIO, SIOF, SSCH, RIO, or RSCH to trace the SIO (START I/O), SIOF (START I/O FAST RELEASE), SSCH (START SUBCHANNEL), RIO (RESUME I/O), or RSCH (RESUME SUBCHANNEL) instructions for all input/output (I/O) devices or for a range of devices. Tracing is limited to instructions that do not produce an operation or privileged operation exception.

I was guessing that the SSCH would be most comparable to what you already have, but maybe it would be SIO? What do you think?

From what I've read so far, I believe I can trace multiple types. So, if you want SSCH and RSCH for example, that should be doable too. Or if you're like me, maybe you want to trace them all in the hopes you'll recognize something?  :)

Jeff

jeff-snyder commented 5 months ago

Hi @Peter-J-Jansen

After a bit more reading in the manuals, I found the "trace i/o ccw" command that looks like a good match for what we need. I created the attached trace and log files for TSAFVM on both the host and guest VMs.

Just a reminder, JS16 is the host VM. VCTC 807 in JS16's TSAFVM couples to VCTC 816 on the JS07 (guest) VM.

In the below attached zip, I have included the following files:

Jeff

Peter-J-Jansen commented 5 months ago

Interesting Jeff! The traces show how easy the TSAF startup CCW's apparently are using VCTC's: No OPeration (NOP), Set Extended Mode (SEM), REaD, Sense Adapter State (SAS) on one side only, and then just REaD's and WRiTes. So what's my CTCE code doing wrong?

I'll try to set up TSAF between two z/VM's, as per usual to first reproduce the problem, so I can start debugging for real.

But, as I said, I'm going to be offline for almost 2 weeks, for a little winter vacation over here. So please bear with me.

Cheers,

Peter

jeff-snyder commented 5 months ago

Great, I'm glad it's useful. Let me know if I can gather any more information or if I can help you recreate the environment for testing.

Don't worry about the downtime, it'll take me a day just to reorganize my To Do list and see which of the many things I have listed that I want to play with while you're out!

Jeff

Peter-J-Jansen commented 5 months ago

Hi Jeff,

This is just a short heads-up message, to confirm that I've been able to reproduce the problem between two z/VM 7.3 systems, and that effectively my debugging has started. So please bear with me.

Cheers,

Peter

jeff-snyder commented 5 months ago

Thanks. If there's any way I can help, just let me know.

Jeff

Peter-J-Jansen commented 5 months ago

Hi Jeff,

I think I'm getting close to where the problem lies. Between my two z/VM 7.3 systems I got the TSAF CTCE connection working by adding the additional keyword FICON as a trailing option on the CTCE configuration statement. This might not work on older VM's, but I'd like to know whether it does. The main difference -- and I hope that's all that is needed -- is that a FICON CTC starts in the so-called extended mode, whereas the other CTC emulations start in basic mode. If the FICON option fixes TSAF also on older VM's, then we could consider the problem fixed. If not, then I could add another option, say SEM (for Set Extended Mode, which is also the name of the CTC CCW command with that function).
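For anyone following along, the FICON keyword trails the CTCE device statement in the Hercules configuration file. A hypothetical sketch (the device number, address, and ports are made up, and the operand order shown is only my reading of the CTCE documentation; check the CTCE README for your build):

```
# CTCE device 0801: local port, remote host, remote port, FICON trailing
0801  CTCE  30880  192.168.1.101  30840  FICON
```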

I'm quite interested to see how that FICON option works for you.

Cheers,

Peter

jeff-snyder commented 5 months ago

Hi Peter,

I've got some quick results for you; I'll have a more detailed report with any-any tests later.

VM/SP5 - As feared, no joy. Same error messages as before:

ATSL1T797I Unit-check indicated on unit 710
ATSL1T794I Sense bits are '40'X

VM/ESA - moderate success:

I had to do a set rdevice xxx type ctca to get the CTCs online. TSAF still generated error messages about failing to reset the link, but the links connected and the collection formed:

set rdev 804 type ctca
14:36:12 HCPZRP6722I Characteristics of device 0804 were set as requested.
14:36:12 1 RDEV(s) specified; 1 RDEV(s) changed; 0 RDEV(s) created
Ready; T=0.01/0.01 14:36:12
vary on 804
14:36:23 0804 varied online
14:36:23 1 device(s) specified; 1 device(s) successfully varied online

<start TSAF...>

add link 804
14:40:42 ATSLLM724I Link 0804 added
14:40:42 ATSL1Y795I Retry limit exceeded on unit 0804 SET_370_MODE
14:40:42 ATSL1Y708E An attempt to reset link 0804 has failed
14:40:42 ATSL1T797I Unit-check indicated on unit 0804
14:40:42 ATSL1T794I Sense bits are '40'X
14:40:42 ATSL1Y795I Retry limit exceeded on unit 0804 SET_370_MODE
14:40:42 ATSL1Y708E An attempt to reset link 0804 has failed
14:40:42 ATSL1T797I Unit-check indicated on unit 0804
14:40:42 ATSL1T794I Sense bits are '40'X
14:40:47 ATSL1Y795I Retry limit exceeded on unit 0804 SET_370_MODE
14:40:47 ATSL1Y708E An attempt to reset link 0804 has failed
14:40:47 ATSL1T797I Unit-check indicated on unit 0804
14:40:47 ATSL1T794I Sense bits are '40'X
14:40:58 ATSLMN707I Link 0804 came up
14:40:58 ATSMDO546I Quiesce broadcasts issued by JS01
14:40:58 ATSMDO515I JOIN in progress for node JS04, agent is JS01
14:40:58 ATSMYC521I Collection is roughly synchronized
14:40:59 ATSMYC520I Synchronization is now NORMAL
14:41:00 ATSMJK513I Attempting JOIN with node JS04 as the agent
14:41:03 ATSMYC521I Collection is roughly synchronized
14:41:04 ATSMYC520I Synchronization is now NORMAL
q collect
JS01      JS04

z/VM 4.0 - z/VM 6.3 - almost total success!

The CTCAs came online automatically at IPL time. When I started the links in TSAF there was an error message about setting 370 mode, but then the links connected and the collection formed:

add link 816
14:50:12 ATSLLM724I Link 0816 added
14:50:12 ATSL1Y795I Retry limit exceeded on unit 0816 SET_370_MODE
14:50:12 ATSL1Y708E An attempt to reset link 0816 has failed
14:50:23 ATSLMN707I Link 0816 came up
14:50:23 ATSMDO546I Quiesce broadcasts issued by JS07
14:50:23 ATSMDO515I JOIN in progress for node JS16, agent is JS07
14:50:23 ATSMYC521I Collection is roughly synchronized
14:50:24 ATSMYC520I Synchronization is now NORMAL
q collect
JS07      JS16

Like I said, I'll do an any-any test tonight, just to make sure ESA can talk to z/VM and vice-versa, but I wanted to get you some quick results to show the progress you've made. To be honest, if there are no problems with those tests, I can live with the current results for VM/ESA and z/VM. VM/SP is more of an issue but less of a need, since I don't have any TSAF clients that can talk over the link yet.

Thanks! Jeff

jeff-snyder commented 5 months ago

Hi Peter,

Here are the any-any results for Windows. My target environment is the Pi and I'll be testing that today, but since you haven't made any code changes, I don't expect any differences.

(image: test results)

FYI, my test procedure is to define a full mesh of CTCE links, i.e. every system connected to every other. After bringing up all the Hercules images and confirming all the links were connected, I IPLed and brought up TSAF everywhere. Then, for each pair of systems, I started the TSAF link on one side as the receiver (first column) and then started the link on the other side as the sender (top row). Where a row and column meet is the result of that test.

There's obviously an issue with my JS01-JS07 pair, but since JS07 talks to JS04 (the other VM/ESA) and JS01 talks to JS16 (the other z/VM), I'm willing to bet there's a configuration issue or something on my end I need to chase down.

Please let me know if you've got any questions and I'll post the Pi test results as soon as I get them.

Jeff

jeff-snyder commented 4 months ago

Hi Peter,

OK, this was more fun than I expected... :)

I took this opportunity to get current on Hercules on my Pis, since I was running last October's code there until now. Once I did, I found that the VM/SP images would connect with FICON as well. After discovering that, I re-upgraded my Windows box to the same Hercules level as the Pis (listed below) and retested there, but it still fails.

If you'd like I can capture some CCW traces for this link on the Pi and then on Windows for comparison.

Here are the test results for all my images on Pis:

(image: test results)

Notes:

JS02-JS10 (the VM/SP 5s) connected to each other and stayed up for over 2 hours, then failed at the start of my "formal" testing. I had to shutdown the VMs, devinit and IPL to get them reconnected, and even then I had to add and delete links a couple times to get them to connect.

JS02 and JS10 connected to other peers and then dropped immediately due to TSAF version mismatches (marked "drop" above). I counted that a success, since the link had to connect for the software to detect the mismatch.

JS07 connected to JS01 during my initial testing. Thereafter, it failed until I did a shutdown, devinit, IPL on Wednesday.

JS10-JS16 On Tuesday, I was unable to get this to connect, even though JS02 (the other VM/SP) connects to JS16. On Wednesday, like JS07, it came right up after a shutdown, devinit, IPL.

On Wednesday, after some additional testing, I was able to get all four non-SP nodes in the same collection. My configuration is a hub and spoke with JS01 as the hub and the other three nodes connected to it. Prior to this, I had all 4 nodes connected in various configurations, but the hub and spoke hadn't been successful yet. In hindsight, I think I had it right once or twice and just wasn't patient enough to let it resolve the application issues, which I mistook for link problems.

After having the hub and spoke up for awhile, "something" (apparently unrelated to the links) happened on the Pi running JS01 (the hub) and JS04 (a spoke). I had to reboot the Pi. As an experiment, I left the other two nodes running, untouched. When I brought JS01 back up, they reconnected immediately and everything took off. Things were a little unstable with TSAF for a few minutes, but those were application issues (authentication failures) rather than problems with the links.

For now, I'm going to keep this scenario up and start passing some files back and forth over it, to see if that detects any issues.

The two SP nodes were connected in their own collection (after some effort, but I got there). Instead of the "shutdown, devinit, IPL" sequence, I tried varying the device offline on both ends and devinit. That worked to get the two SP5 images connected. Unfortunately, the TSAF collection was not stable. It looked like the nodes could connect but then couldn't talk over the link. The link timed out, reconnected, and tried to reform the collection, just to fail again. I shut down both systems, including restarting Hercules. Once I re-IPLed, I was able to reconnect them and everything was stable again.

I'll keep working trying to get a client for this scenario, but for now, I don't have any additional testing I can do to cause a load on the link.

I am getting a lot of HHC05079I messages in my Hercules log. Is there a way to suppress them? Or do they indicate a problem we should be looking at?

Thanks, Jeff

Peter-J-Jansen commented 4 months ago

Hi Jeff,

Thanks for your testing feedback! I think we may have things working with the following fix, also without using the FICON option on the CTCE configuration statement. The corrections are just these (the corrected ctcadpt.c is attached as ctcadpt.c.txt):

--- ctcadpt_47.c        2024-03-16 17:10:50.632114881 +0100
+++ SDL-hyperion/ctcadpt.c      2024-03-28 10:22:50.986839075 +0100
@@ -1874,12 +1874,13 @@

     // We merge a Unit Check in case the Y state is Not Ready.
     // But only when pUnitStat is still 0 or Unit Check or Busy (no Attn).
-    // SENSE bit 1 for Intervention Required will be set as well.
+    // sense byte 0 bit 1 (Intervention Required) will be set,
+    // and also bit 7 (Interface Disconnect / Operation Check).
     if( IS_CTCE_YNR( pDEVBLK -> ctceyState ) &&
         ( ( *pUnitStat & (~ ( CSW_BUSY | CSW_UC ) ) ) == 0 ) )
     {
         *pUnitStat |= CSW_UC;
-        pDEVBLK->sense[0] = SENSE_IR;
+        pDEVBLK->sense[0] = ( SENSE_IR | SENSE_OC );
     }

     // Produce a CTCE Trace logging if requested, noting that for the
@@ -1956,7 +1957,9 @@
 //  SetSIDInfo( dev, 0x3088, 0x61, ...          ); CISCO 7206 CLAW protocol ESCON connected
 //  SetSIDInfo( dev, 0x3088, 0x62, ...          ); OSA/D device
 //  But the orignal CTCX_init had this :
-    SetSIDInfo( dev, 0x3088, 0x08, 0x3088, 0x01 );
+//  SetSIDInfo( dev, 0x3088, 0x08, 0x3088, 0x01 );
+//  Which is what we used until we made the VM TSAF connection work as well which needed :
+    SetSIDInfo( dev, 0x3088, 0x08, 0x0000, 0x01 );    

     dev->numsense = 2;
     // A version 4 only feature ...
@@ -3082,10 +3085,11 @@
     CTCE_Info.state_y_prev       = pDEVBLK->ctceyState;

     // A system reset at power up initialisation must result in
-    // sense byte 0 bit 4 set showing Intervention Required.
+    // sense byte 0 bit 1 (Intervention Required) to be set,
+    // and also bit 7 (Interface Disconnect / Operation Check).
     if( pDEVBLK->ctce_system_reset )
     {
-        pDEVBLK->sense[0] = SENSE_IR;
+        pDEVBLK->sense[0] = ( SENSE_IR | SENSE_OC );
     }

     // Reset the y-command register to 0, clear any WEOF state,

The changes include setting, in two places, Sense byte 0 bit 7, the Operation Check. I had observed TSAF complaining about the sense byte being x'40', whereas your VCTC trace always showed x'41'. So now CTCE also produces x'41', assuming that the CTC resets are so-called "selective resets".

The more important change is the setting of the "Device Type" to x'0000' instead of x'3088' in the SetSIDInfo call. Initially I thought the TSAF CTC startup problem was related to CTC extended vs. basic mode, but that turned out NOT to be the case (after quite some extra testing, involving about a zillion re-builds and re-IPLs). I had discovered, many years ago, in SB10-7034-05 "ESCON and FICON Channel-to-Channel Reference" that the Device Type was always x'0000', hence that was what I used for FICON. All other, older SetSIDInfo calls in Hercules, all the way back to Spinhawk, had it at x'3088', so I didn't dare change that. Until now. To revert the current solution to its previous buggy behavior for TSAF, all one needs to do is change this back to the old, wrong x'3088'.

By the way, this TSAF experience is an extra for me, as I think it has the simplest and most straightforward CTC CCW program I've come across, as you can see from the Hercules logs I've included (EULERR_843.log and EULERS_812.log). These show only the HHC05079I messages, which are produced using the ctc debug on ... commands. Without that command (or the t+... command) no HHC05079I messages are produced at all.

On the z/VM OPERATOR console on one of the two systems I tested with (nodename EULER73S, whose link address is 0812; the 0800 link addresses are from earlier tests), I observe a delay of more than a minute before it successfully connects to the other side:

TSAFVM  : z/VM V7.3.0    2022-06-23 15:58
TSAFVM  : DMSACP723I D (193) R/O
TSAFVM  : 11:07:35 ATSCTL013I Trace area size is 40K
TSAFVM  : 11:07:35 ATSCAC006I TSAF link statistics and session accounting records will be generated
TSAFVM  : 11:07:35 ATSLLM702E Link unit address 0800 is not valid
TSAFVM  : 11:07:35 ATSLLM701E Driver rejected the new link 0800
TSAFVM  : 11:07:35 ATSLLM713I Link 0800 deleted
TSAFVM  : 11:07:35 ATSMRZ518I RESET: collection now has size 1
TSAFVM  : 11:07:35 ATSCST001I Initialization is complete. The service level is 0300.
TSAFVM  : 11:07:35 ATSL1Y795I Retry limit exceeded on unit 0843 SET_370_MODE
TSAFVM  : 11:07:35 ATSL1Y708E An attempt to reset link 0843 has failed
TSAFVM  : 11:07:35 ATSMRX520I Synchronization is now NORMAL
TSAFVM  : 11:08:49 ATSLMN707I Link 0843 came up
TSAFVM  : 11:09:15 ATSMDO546I Quiesce broadcasts issued by EULER73R
TSAFVM  : 11:09:15 ATSMJK513I Attempting JOIN with node EULER73R as the agent
TSAFVM  : 11:09:15 ATSMYC521I Collection is roughly synchronized
TSAFVM  : 11:09:16 ATSMYC520I Synchronization is now NORMAL
11:11:16 SEND TSAFVM Q LINK
TSAFVM  : Link: 0812
TSAFVM  :      Type:         CTCA
TSAFVM  :      Status:         up    Delay:            15
TSAFVM  :      Neighbor: EULER73R    Neighbor State: ESTABLISHED
TSAFVM  :  
11:11:22 SEND TSAFVM Q COLLECT
TSAFVM  : EULER73R  EULER73S

This can also be seen in the above Hercules log EULERR_843.log on line 34, and in EULERS_812.log on line 33.

I've tested numerous times, but I'll await your feedback on this before committing this ctcadpt.c on the Github development branch.

Cheers,

Peter

jeff-snyder commented 4 months ago

Hi Peter,

Here are my results from Thursday's initial testing.

My initial testing was more general, setting up my desired solution and seeing how close we are to having it work. A few things came up that I documented below to give you something to look at while I concentrate on specific, end-to-end tests for all nodes and platforms. If there's anything else specific you'd like me to test, please let me know.

I updated ctcadpt.c and rebuilt Hercules on all my platforms except Pi1 (it's not used in testing, but I kept it for comparison if needed later) and I removed FICON from all CTCE definitions.

I'm seeing a lot fewer HHC05079I messages, but I'm still seeing some, mostly WAITs. For example, I've attached JS01 hercules log.txt and JS02 hercules log.txt. It appears JS01 (the "hub") logs messages when talking to other VM levels (SP and z/VM) but not when talking to another VM/ESA (JS04). You can tell by the device numbers in the log. In the SP systems, I'm using devices in the 7nn range, in the non-SP systems the 8nn range, where "nn" is the number of the node being connected to (e.g. 804 on JS01 connects to 801 on JS04).

Just a reminder:

JS01, JS04 - VM/ESA 2.4.0
JS02, JS10 - VM/SP 5
JS07 - z/VM 4.4.0
JS16 - z/VM 6.3.0

General

I noticed that the SP systems only brought online CTCs at IPL that had something on the other end. I was able to vary them on though, without having to start anything else.

JS02 - only 701 came online at IPL. It was the only other running system at the time.
JS10 - all CTCs except 716 came online at IPL. I was starting the nodes in numeric sequence, so JS16 was not up yet.

I did not have to do any "set rdevices" for the CTCs to come online to any system.

Windows

I started each link on JS01 and then went to the other node and attempted to connect to it.

JS01-JS02, I could not get them connected at all.
JS01-JS04, came right up after an initial "short" wait; I hadn't thought to time it yet.
JS01-JS07, came up after 1:06.
JS01-JS16, came up after 0:47.

SP-SP test

JS02-JS10, no joy, tried various combinations with no connection.

Notes

The JS01-JS04 link was dropped when JS16 JOINed. Both ends of the link were DISABLED. I restarted both ends and got the JS04 end to a status of 'unknown' but the JS01 end stayed disabled. I deleted both ends from TSAF and detached the CTCs from TSAFVM. Then I varied the CTCs offline to VM/ESA. When I did a 'devinit 804' on JS01, Hercules froze and I was unable to get any response. VM was still up and active via the console, so I went to JS04 and detached that end of the link. Hercules on JS01 immediately "caught up" and was responsive again. I reattached the link on both ends and did another devinit on JS01, which worked normally this time.

I varied the CTC online to JS01 (the CTC came online automatically on JS04 after the devinit) and attached them to TSAFVM. When I added the links to TSAF, it came up after about a minute and JOINed the collection.

Two hours later, all the non-SP system links were still up and in the collection.

At that point, I shut down the Windows images to test on the Pis. When shutting down the systems in reverse sequence, I saw the following. I'm guessing the unit checks began when I shut down the remote Hercules. I didn't notice these until I got to JS01. So, the '40'X bit is still getting through somewhere, apparently related to the other end of the link going away.

TSAFVM  : 00:46:05 ATSLFR705I Link 0816 went down
TSAFVM  : 00:46:15 ATSMBN540I Node JS16 deleted from collection
TSAFVM  : 00:47:14 ATSLFR705I Link 0807 went down
TSAFVM  : 00:47:14 ATSL1T797I Unit-check indicated on unit 0816
TSAFVM  : 00:47:14 ATSL1T794I Sense bits are '40'X
TSAFVM  : 00:47:14 ATSL1Y795I Retry limit exceeded on unit 0816 SET_370_MODE
TSAFVM  : 00:47:14 ATSL1Y708E An attempt to reset link 0816 has failed
TSAFVM  : 00:47:19 ATSL1T797I Unit-check indicated on unit 0807
TSAFVM  : 00:47:19 ATSL1T794I Sense bits are '40'X
TSAFVM  : 00:47:19 ATSL1T797I Unit-check indicated on unit 0816
TSAFVM  : 00:47:19 ATSL1T794I Sense bits are '40'X
TSAFVM  : 00:47:22 ATSMBN540I Node JS07 deleted from collection
TSAFVM  : 00:48:22 ATSL1Y795I Retry limit exceeded on unit 0807 SET_370_MODE
TSAFVM  : 00:48:22 ATSL1Y708E An attempt to reset link 0807 has failed
TSAFVM  : 00:48:22 ATSL1Y795I Retry limit exceeded on unit 0816 SET_370_MODE
TSAFVM  : 00:48:22 ATSL1Y708E An attempt to reset link 0816 has failed
TSAFVM  : 00:48:22 ATSL1T797I Unit-check indicated on unit 0804
TSAFVM  : 00:48:22 ATSL1T794I Sense bits are '40'X
TSAFVM  : 00:48:22 ATSLFR705I Link 0804 went down
TSAFVM  : 00:48:22 ATSL1Y795I Retry limit exceeded on unit 0804 SET_370_MODE
TSAFVM  : 00:48:22 ATSL1Y708E An attempt to reset link 0804 has failed
TSAFVM  : 00:48:22 ATSL1T797I Unit-check indicated on unit 0807
TSAFVM  : 00:48:22 ATSL1T794I Sense bits are '40'X
TSAFVM  : 00:48:22 ATSL1T797I Unit-check indicated on unit 0816
TSAFVM  : 00:48:22 ATSL1T794I Sense bits are '40'X
TSAFVM  : 00:48:27 ATSL1Y795I Retry limit exceeded on unit 0807 SET_370_MODE
TSAFVM  : 00:48:27 ATSL1Y708E An attempt to reset link 0807 has failed
TSAFVM  : 00:48:27 ATSL1Y795I Retry limit exceeded on unit 0816 SET_370_MODE
TSAFVM  : 00:48:27 ATSL1Y708E An attempt to reset link 0816 has failed
TSAFVM  : 00:48:27 ATSL1T797I Unit-check indicated on unit 0804
TSAFVM  : 00:48:27 ATSL1T794I Sense bits are '40'X
TSAFVM  : 00:48:29 ATSMBN540I Node JS04 deleted from collection
TSAFVM  : 00:49:30 ATSL1Y795I Retry limit exceeded on unit 0804 SET_370_MODE
TSAFVM  : 00:49:30 ATSL1Y708E An attempt to reset link 0804 has failed
TSAFVM  : 00:49:30 ATSL1T797I Unit-check indicated on unit 0807
TSAFVM  : 00:49:30 ATSL1T794I Sense bits are '40'X
TSAFVM  : 00:49:30 ATSL1T797I Unit-check indicated on unit 0816
TSAFVM  : 00:49:30 ATSL1T794I Sense bits are '40'X
TSAFVM  : 00:49:35 ATSL1Y795I Retry limit exceeded on unit 0807 SET_370_MODE
TSAFVM  : 00:49:35 ATSL1Y708E An attempt to reset link 0807 has failed
TSAFVM  : 00:49:36 ATSL1Y795I Retry limit exceeded on unit 0816 SET_370_MODE
TSAFVM  : 00:49:36 ATSL1Y708E An attempt to reset link 0816 has failed
TSAFVM  : 00:49:36 ATSL1T797I Unit-check indicated on unit 0804
TSAFVM  : 00:49:36 ATSL1T794I Sense bits are '40'X

As verification, a minute after I shutdown TSAF on JS16 on the Pi on Friday morning, I saw the following at JS01:

TSAFVM  : 14:21:39 ATSLFR705I Link 0816 went down
TSAFVM  : 14:21:58 ATSMBN540I Node JS16 deleted from collection

Nothing changed when I shut down z/VM, but about a minute after shutting down Hercules, I started receiving:

TSAFVM  : 14:30:45 ATSL1T797I Unit-check indicated on unit 0816
TSAFVM  : 14:30:45 ATSL1T794I Sense bits are '40'X
TSAFVM  : 14:30:50 ATSL1Y795I Retry limit exceeded on unit 0816 SET_370_MODE
TSAFVM  : 14:30:50 ATSL1Y708E An attempt to reset link 0816 has failed

These messages repeat every minute. Deleting the link on JS01 stops them from recurring.

I did not see these messages when shutting down the SP images.

PIs

I started each link on JS01 and then went to the other node and attempted to connect to it.

JS01-JS02, after a couple minutes the link came up by itself and then dropped due to incompatible software levels.
JS01-JS04, came up after 1:03 and joined the collection.
JS01-JS07, came up after 0:59 and joined the collection.
JS01-JS16, came up after 1:14 and joined the collection.

SP-SP

The JS02-JS10 link eventually came up after bouncing the CTCs several times and waiting over 5:00. At one point TSAFVM on JS02 abended trying to manage storage (i.e. in DMSFREE), but I just bumped the VM size up and restarted it. I figured that was an application error and probably not related to the CTCs. I'm sure we're one PUT tape away from that being fixed... :)

On Friday, I stopped TSAF on JS10 and then waited to see how long it would take the other end to notice. After 10 minutes of nothing on JS02, I logged off the TSAFVM virtual machine. Then JS02 immediately noticed the link go down.

Notes

I left the PIs up overnight and when I checked them today, I found repetitions of the following on the consoles:

TSAFVM  : 03:44:03 ATSMYC520I Synchronization is now NORMAL
TSAFVM  : 05:08:29 ATSMDO546I Quiesce broadcasts issued by JS01
TSAFVM  : 05:08:34 ATSMBN540I Node JS07 deleted from collection
TSAFVM  : 05:08:34 ATSMDO515I JOIN in progress for node JS07, agent is JS01
TSAFVM  : 05:08:35 ATSMYC521I Collection is roughly synchronized
TSAFVM  : 05:08:38 ATSMYC520I Synchronization is now NORMAL
TSAFVM  : 06:50:51 ATSMDO546I Quiesce broadcasts issued by JS01
TSAFVM  : 06:50:59 ATSMBN540I Node JS16 deleted from collection
TSAFVM  : 06:50:59 ATSMDO515I JOIN in progress for node JS16, agent is JS01
TSAFVM  : 06:51:01 ATSMYC521I Collection is roughly synchronized
TSAFVM  : 06:51:04 ATSMYC520I Synchronization is now NORMAL
TSAFVM  : 08:13:49 ATSMDO546I Quiesce broadcasts issued by JS01
TSAFVM  : 08:13:57 ATSMBN540I Node JS16 deleted from collection
TSAFVM  : 08:13:57 ATSMDO515I JOIN in progress for node JS16, agent is JS01
TSAFVM  : 08:13:58 ATSMYC521I Collection is roughly synchronized
.
.
.

From looking at the Hercules log, some of these appear to correlate to HHC05079I WAIT messages. See the "JS01 hercules log.txt" above and JS01 OPERATOR TSAFVM LOG.txt files for examples.

Well, hopefully that will give you something to look at while I do my in-depth testing tonight. I'll send out a status message tomorrow with more updates.

Thanks, Jeff

jeff-snyder commented 4 months ago

Hi Peter,

Unfortunately, life got in the way last night and I wasn't able to do my testing. Just to get something done, I thought I'd test VTAM CTCE links to make sure what we were doing didn't adversely affect those connections, and I'll do my Windows testing today (Saturday).

So, last night I brought up 5 of the 6 nodes on VTAM (JS07 doesn't have VTAM installed). I didn't see any errors in the logs today when I checked them out. After confirming VTAM looked OK, I shut down all the systems. Interestingly, VTAM on JS01 (the last system to be shut down) hung on shutdown. I was unable to force it, or didn't wait long enough for the force to work; after a couple of minutes I tried to force it again, only to be told a force was already pending. Eventually, I just shut down with it still running. I've never run into this kind of error, so I don't think it's my configuration at fault.

After my TSAF testing, I brought VTAM and TSAF both up at the same time. Everything seems to coexist OK. I then shut the systems down, hoping to recreate the VTAM issue if there is one. VTAM came down cleanly, so I don't know what to make of it. Maybe it was a one-time thing I'll never see again.

My TSAF testing results:

image (Jeff: updated 3/31 to reflect the correct Hercules version used in testing.)

Notes:

I am still seeing HHC05079I messages, but not nearly as many as before. Interestingly, I see them occasionally for links that are not currently in use.

As an experiment before getting started, I started Hercules, IPLed JS01, and started TSAF, with no other links connected. After I activated link 802 (the "nonexistent" connection to JS02), I began receiving "sense bits" messages.

TSAFVM  : 13:49:38 ATSL1Y795I Retry limit exceeded on unit 0802 SET_370_MODE
TSAFVM  : 13:49:38 ATSL1Y708E An attempt to reset link 0802 has failed
TSAFVM  : 13:49:38 ATSL1T797I Unit-check indicated on unit 0802
TSAFVM  : 13:49:38 ATSL1T794I Sense bits are '40'X

These stopped as soon as I started Hercules for JS02. Then, I received HHC05079I messages on JS02 until I actually IPLed VM/SP.

At this point, I shut down everything and restarted my test so I could bring up and IPL all images before starting any links. I went through the systems one at a time, started Hercules, and IPLed the VM image. Once they were all up, I started testing the individual links.

A couple oddities to note...

When I started JS10 - JS01: I started 701 on JS10, then I started 810 on JS01. The link-up message was delayed 3 minutes, until the link was bounced on JS10.

 send tsafvm add link 810
Ready; T=0.01/0.01 15:11:51
TSAFVM  : 15:15:29 ATSLLM724I Link 0810 added
TSAFVM  : 15:15:29 ATSL1Y795I Retry limit exceeded on unit 0810 SET_370_MODE
TSAFVM  : 15:15:29 ATSL1Y708E An attempt to reset link 0810 has failed

Eventually, I bounced the connections on both ends, and still the link did not connect. After I gave up and shut down both links, I noticed the Hercules CTC status was "open busy" while all the others were "open" (even while in use).

This happened again with the JS10 - JS04 link:

JS10 - start tsaf, add link 704
JS04 - start tsaf, add link 810

The "link added" message was delayed on JS04:

 send tsafvm add    link 810
Ready; T=0.01/0.01 16:27:34
 send tsafvm q links
Ready; T=0.01/0.01 16:28:39
TSAFVM  : 16:28:58 ATSLLM724I Link 0810 added
TSAFVM  : 16:28:58 ATSL1Y795I Retry limit exceeded on unit 0810 SET_370_MODE
TSAFVM  : 16:28:58 ATSL1Y708E An attempt to reset link 0810 has failed
TSAFVM  : Link: 0810
TSAFVM  :      Type:         CTCA
TSAFVM  :      Status:    unknown    Delay:             0
TSAFVM  :      Neighbor:             Neighbor State: unknown
TSAFVM  :                                                                       

I ended up bouncing both ends of the link, trying to get it up, with no luck.

Summary

I have included a zip file with all the Hercules, OPERATOR, and TSAFVM logs. Unfortunately, the JS02 log got truncated for some reason I don't understand. If you need information specific to that system, let me know and I'll retest on it.

I would say we are better off with the code as modified. While we're not 100% there yet (especially for the SP systems), we are closer, in that we see fewer errors than before. I'll be doing more testing on the PIs over the next day, but unless something really surprising comes up, I would recommend going ahead with the commit and building any future enhancements from that base.

Thanks, Jeff

Peter-J-Jansen commented 4 months ago

Hi Jeff,

Thanks for your feedback and all your tests! A few minutes ago I committed a291e7e to the development branch, which also avoids the TSAF messages concerning sense '40'X.

As to the SP testing: did you also try without the ATTNDELAY 200 option on the CTCE configuration?

Cheers,

Peter

jeff-snyder commented 4 months ago

Peter,

Good news on the commit. I'll update my images soon, so my version doesn't show "-modified."

I had not done any testing with ATTNDELAY removed. I took it off today and retested the failing Windows links; they all now work. JS01->JS02 failed the first time I tried it, but when I went back afterwards to get a trace for you, it worked 3 times in a row. It has since also worked with tracing off.

image

Now, off to my PI testing... :)

Thanks, Jeff

jeff-snyder commented 4 months ago

Hi Peter,

Looks like good news on the PIs as well. I removed ATTNDELAY from the SP images and retested everything with the modified ctcadpt.c module, and everything worked.

image

I got some HHC05079I error messages, but nowhere near as many as before. Some of them were for links not currently in use, so they will go away when this goes into my "production" network. In almost 5 hours of Hercules uptime, the following HHC05079I message counts were logged:

After my any-any test, I linked the two SP systems in one collection and the other 4 systems in another. I also started VTAM on 4 of the images (JS01, JS02, JS04, JS16). I left all the images up for about 3 hours while I took time off for dinner. When I came back, they were all still connected, with no VTAM or TSAF error messages.

Thanks, Jeff

jeff-snyder commented 4 months ago

Peter,

After my any-any test, I linked the two SP systems in one collection and the other 4 systems in another. I also started VTAM on 4 of the images (JS01, JS02, JS04, JS16). I left all the images up for about 3 hours while I took time off for dinner. When I came back, they were all still connected, with no VTAM or TSAF error messages.

Unfortunately, this did not hold overnight for the non-SP collection. The SP5-SP5 collection stayed up all night and was fine when I shut it down this morning, logging only 1 HHC05079I message all night. However, none of the non-SP links survived the night. By morning, all the links had failed and each system's collection had reverted to only the local host.

So far, I'm not seeing an obvious correlation between any Hercules and OPERATOR log events.

There are a lot of HHC05079I messages, and some of them seem to be the obvious trigger for TSAF events, like this one.

Source          Time    MsgID       Text
JS01 Hercules.txt:  0:35:52 HHC05079I   0:0816 CTCE: <- 0:0801 #0039 cmd=WRT=01 xy=aW<-aa l=00C4 k=190C9688 Stat=80 CC=0 w=1,r=0 WAIT ATTN->RC=0 Busy_Waits=5
JS16 Operator.txt:  0:35:57 ATSMBN540I  Node JS04 deleted from collection
JS16 Operator.txt:  0:35:57 ATSMBA530E  Authentication check failed on message from node JS01, Message type: DB
JS16 Operator.txt:  0:35:57 ATSMBA530E  Authentication check failed on message from node JS01, Message type: DB
JS16 Operator.txt:  0:35:57 ATSMJK513I  Attempting JOIN with node JS01 as the agent
JS01 Operator.txt:  0:35:59 ATSMDO546I  Quiesce broadcasts issued by JS01
JS04 Operator.txt:  0:35:59 ATSMDO546I  Quiesce broadcasts issued by JS01
JS07 Operator.txt:  0:35:59 ATSMDO546I  Quiesce broadcasts issued by JS01
JS01 Operator.txt:  0:36:04 ATSMBN540I  Node JS16 deleted from collection
JS01 Operator.txt:  0:36:04 ATSMDO515I  JOIN in progress for node JS16, agent is JS01
JS04 Operator.txt:  0:36:04 ATSMBN540I  Node JS16 deleted from collection
JS04 Operator.txt:  0:36:04 ATSMDO515I  JOIN in progress for node JS16, agent is JS01
JS07 Operator.txt:  0:36:04 ATSMBN540I  Node JS16 deleted from collection
JS07 Operator.txt:  0:36:04 ATSMDO515I  JOIN in progress for node JS16, agent is JS01
JS01 Operator.txt:  0:36:05 ATSMYC521I  Collection is roughly synchronized
JS04 Operator.txt:  0:36:05 ATSMYC521I  Collection is roughly synchronized
JS07 Operator.txt:  0:36:05 ATSMYC521I  Collection is roughly synchronized
JS16 Operator.txt:  0:36:05 ATSMYC521I  Collection is roughly synchronized
JS01 Operator.txt:  0:36:09 ATSMYC520I  Synchronization is now NORMAL
JS04 Operator.txt:  0:36:09 ATSMYC520I  Synchronization is now NORMAL
JS07 Operator.txt:  0:36:09 ATSMYC520I  Synchronization is now NORMAL
JS16 Operator.txt:  0:36:09 ATSMYC520I  Synchronization is now NORMAL

Then there are other HHC05079Is that seemingly go unnoticed...

Source          Time    MsgID       Text
JS01 Hercules.txt:  0:41:23 HHC05079I   0:0816 CTCE: <- 0:0801 #00A5 cmd=WRT=01 xy=aW<-aa l=03FC k=8C393F0F Stat=80 CC=0 w=1,r=0 WAIT ATTN->RC=0 Busy_Waits=4
JS04 Operator.txt:  1:24:54 ATSMBN540I  Node JS16 deleted from collection

I.e. nothing logged for almost an hour after this HHC message.

And other TSAF events with no obvious causes.

Source          Time    MsgID       Text
JS01 Hercules.txt:  2:47:32 HHC05079I   0:0816 CTCE: <- 0:0801 #0051 cmd=WRT=01 xy=aW<-aa l=03FC k=D44616FB Stat=80 CC=0 w=1,r=0 WAIT ATTN->RC=0 Busy_Waits=5
JS07 Operator.txt:  3:33:17 ATSMBN540I  Node JS04 deleted from collection
JS07 Operator.txt:  3:33:17 ATSMBA530E  Authentication check failed on message from node JS01, Message type: DB
JS07 Operator.txt:  3:33:17 ATSMJK513I  Attempting JOIN with node JS01 as the agent
JS16 Operator.txt:  3:33:17 ATSMBN540I  Node JS04 deleted from collection

I.e. separated by time and system from the previous message.

By the way, I have not seen any HHC WAIT messages with a Busy_Waits value under 3. Is it possible they're just not being reported but are still impacting TSAF? Maybe that could be the explanation for some of the "untriggered" TSAF events.

In the attached zip file (Logs0401.zip), I have included Hercules and Operator logs for each system in the collection. I have trimmed the files down to data from midnight to 13:45 on Monday, April 1. The raw logs included data from my any-any tests and the system shutdowns that just added confusion. I have also included:

If I could find a way to recreate one of these events, I could easily capture a trace for you to look at, but at this point, unless you have a better idea, all I can think of is to start the trace and let it run for possibly hours, until the event recurred. Then we can trim the files to maybe the most recent 1/2 hour of activity for comparison. Is there anything else you might want to see in the trace, like the start up sequence for each link?

While you get a chance to review this and respond, I'll try a first trace attempt. I'll limit it to two systems, JS01 as the hub and JS16. I'll start that up now and see what I get in a few hours.

Thanks, Jeff

jeff-snyder commented 4 months ago

Hi Peter,

Well, it's been several hours and I haven't seen an error more significant than this synchronization message, though it has happened multiple times. Obviously something happened to make TSAF resynchronize, but I don't know what. No link errors have happened and neither peer has dropped from the collection.

I went ahead and grepped out a half hour that includes the event (from 00:20:00 to 00:49:00) in the Hercules trace log. Hopefully, that will be enough to give you an idea of what was going on at the time. If you need more data, just ask.

In the meantime, I'll leave this scenario running overnight and see if we catch anything more interesting. If we don't see something more impactful by tomorrow, I'll probably try a different "remote" node (e.g. switch JS16 out for JS07). I don't want to have to put all 4 systems back up and manage all that trace data if we don't have to.

JS01 - Operator log

 send tsafvm q collect
TSAFVM  : JS01      JS16
Ready; T=0.01/0.01 00:00:23
 send tsafvm q link
TSAFVM  : Link: 0816
TSAFVM  :      Type:         CTCA
TSAFVM  :      Status:         up    Delay:           109
TSAFVM  :      Neighbor: JS16        Neighbor State: ESTABLISHED
TSAFVM  :  
Ready; T=0.01/0.01 00:00:25
TSAFVM  : 00:37:36 ATSMYC520I Synchronization is now NORMAL

JS16 - Operator log

 send tsafvm q collect
TSAFVM  : JS01      JS16
Ready; T=0.01/0.01 00:00:33
 send tsafvm q link
TSAFVM  : Link: 0801
TSAFVM  :      Type:         CTCA
TSAFVM  :      Status:         up    Delay:            62
TSAFVM  :      Neighbor: JS01        Neighbor State: ESTABLISHED
TSAFVM  :  
Ready; T=0.01/0.01 00:00:36
TSAFVM  : 00:37:36 ATSMYC520I Synchronization is now NORMAL

As for the trace logs, they're over 1.5M each, so I have zipped them up: Traces.zip

Thanks, Jeff

Peter-J-Jansen commented 4 months ago

Hi Jeff,

Thanks again for your testing and feedback Jeff !

The HHC05079I messages only show Busy_Waits >= 3, and are caused when a CTC command is received that requires an ATTN interrupt to be raised, but raising it is not possible because the device is still in a Busy state. They are mostly harmless, unless after 20 Busy_Waits the CTCE gives up and does not signal an ATTN. Even then it is not always harmful, which makes me suspect that the logic deciding whether an ATTN is necessary may not be perfect. I just didn't find anything more appropriate or correct, other than checking if a CCW program is already in progress: if no CCW program is in progress, then the CTC spec states an ATTN is needed. It could also be timing related.
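
The Busy_Waits behavior described above can be sketched roughly like this. This is an illustrative simplification, not the actual ctcadpt.c code: `attn_needed`, `raise_attn`, `device_busy`, and `MAX_BUSY_WAITS` are invented names; only the retry limit of 20 and the "no CCW program in progress" test come from the comment above.

```c
#include <stdbool.h>

#define MAX_BUSY_WAITS 20   /* per the comment above: give up after 20 busy-waits */

/* An ATTN is deemed necessary only when no CCW program is already
 * in progress (the test taken from the CTC spec, per the comment). */
static bool attn_needed(bool ccw_program_in_progress)
{
    return !ccw_program_in_progress;
}

/* Model of the busy-wait loop: poll the device until it is no longer
 * busy.  Returns the number of busy-waits performed, or -1 when the
 * ATTN is abandoned after MAX_BUSY_WAITS attempts (Peter notes the
 * HHC05079I messages only appear when Busy_Waits reaches 3 or more). */
static int raise_attn(bool (*device_busy)(void))
{
    int busy_waits = 0;
    while (device_busy()) {
        if (++busy_waits > MAX_BUSY_WAITS)
            return -1;      /* gave up; not necessarily harmful */
    }
    return busy_waits;      /* device free: ATTN can be presented */
}
```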

I am not sure if these Busy_Wait cases are related to the TSAF link outages / synchronization issues you experienced. What I have encountered in all of my years of CTCE bug hunting is that on at least two occasions I eventually found out I had intermittent problems with network interface cards (NICs). In both cases it was with a NIC on a motherboard, which I fixed by purchasing an additional new NIC card. The last time it happened was 2 months ago, and I replaced that guilty NIC with a USB-plug 2.5 Gbit/sec NIC device. In both cases, inspecting the packet retransmission counters on the routers showed that the faulty NIC devices indeed had had problems. Certainly my first experience cost me a lot of time searching for a cause that did not apply at all. These not-solidly-broken, just-intermittent NIC problems are frustrating! Networking people always suggest checking the packet error counters whenever there is trouble.

Your TSAF query link status shows delays of 109 and 62 msecs, which seems a lot. Mine show around 15 msecs.

Concerning CTCE tracing: the most efficient trace for me is produced with ctc debug on <devnum> commands, as these mostly show only a single record per CTC CCW command sent or received. Any connection loss will show, though.

Cheers,

Peter

jeff-snyder commented 4 months ago

Hi Peter,

I'm ashamed to say, as a former network engineer, I had not considered the real physical hardware in my PD efforts until now. :) I was so tunnel-visioned on the CTCs presented to the OSes by Hercules, I hadn't even thought of the "real" network under them. That would help explain why there were no errors between JS01 and JS04, which both reside on one Pi, while errors were occurring between JS01-JS07 and JS01-JS16; JS07 and JS16 reside on a different Pi.

So, today, I've got two paths going.

In an effort to eliminate hardware as a contributor, I have updated my Windows box to the latest commit with the ctcadpt.c changes and I started all 4 non-SP images (1,4,7,16) there. I changed my configuration statements to use 127.0.0.1 as the IP address on these images, so this traffic should never even touch hardware. Once I got the images all up and the links started, I saw the link delay of 15 that you mentioned, as well. Debugging is on and I'll leave this running overnight to see if we catch anything.

On my Pis, I have updated Hercules to the latest commit on all of them. I also rebooted them, to reset the error counters. Some of them had been up almost two months, so I can't say how recent the drops they recorded are. Part of the problem may be performance-related as well. Running multiple VM/ESA or z/VM images on a Pi 4 may tax it to the point where it can't quite keep up with the network and fails intermittently. I'll try running only one VM image on each Pi. That'll only give me three nodes in the collection, but every one of them will be traversing a physical network link.

For now, I'll forego any real testing with SP. I'm working on a copy of VM/SP 6, to get TSAF working and test with that along with SP 5. TSAF on SP 5 generates a bunch of blank messages which may or may not indicate an issue. Maybe TSAF on SP 6 will be more informative.

Thanks, Jeff

jeff-snyder commented 4 months ago

Hi Peter,

Here are the logs from the 4-node collection on Windows. These images were all running on the same host, so there should be no physical network issues to deal with.

As you'll see from the logs, I started the first system (JS01, the hub) around 15:45 yesterday. By 16:00, all nodes were up and in the collection.

JS01 logged a "Timeliness check failed" message at 18:57 that affected the JS01-JS16 link and then apparently caused a waterfall of JOINs and collection synchronization that lasted a few minutes, until 19:06.

18:57:07 TSAFVM  : 18:57:07 ATSMBG531E Timeliness check failed on message from node JS16. Function=TD, Current time=0014991F, Origin
 time=0014925B, Performance time=001494DD

I'm assuming that indicates a drop or delay in receiving traffic from the CTC, but I'll let you decode the debug information to find out.

It looks like things were then stable until 07:59:54 this morning, when JS16, for some reason, logged a message indicating JS07 had been dropped from the collection, even though the hub, JS01, had not logged a similar message. JS16 then logged some authentication-check failures and finally a completion-time-expired message, at which point it reset and began attempting to rejoin the collection. It was never successful, even though it tried several times during the day. At 08:00, the other nodes began logging errors, and eventually all the nodes ended up disconnected, each in its own 1-node collection.

At approximately 14:20, I logged in and noticed all the links were down. Before shutting down the nodes, I tried restarting TSAF on JS01 and when that didn't recover the links, I restarted a couple of them individually. I had to recycle TSAF on JS16 to get it reconnected.

FYI, I looked up the ATSMRX520I message and confirmed it is non-impactful. It looks like it's generated when any node joins or leaves the collection, and then approximately every 2 hours (apparently based on "Sync period: 7200 seconds" from the "q status" display) during stable operations.

520I Synchronization is now NORMAL

Explanation: The logical clocks of the collection are synchronized. This action takes place periodically and at the completion of the join protocol. These messages will be sent to the operator consoles of all members of the collection, including the joiner.

System Action: The system adjusts the logical clocks to maintain synchronization.

Operator Response: None.

Please take a look and see if you find anything anomalous. I'll start collecting the logs for the PI-based collection. Thanks!

Logs 0403a.zip

Thanks, Jeff

jeff-snyder commented 4 months ago

Hi Peter,

Here are the logs from the 3-node, PI-based collection. Again, JS01 is the hub and this time there are two remote nodes, JS07 and JS16. I started bringing up the collection at 17:10 and it was completely up and synchronized by 17:16. Link delays were mostly 31 with one 15, so greatly improved over what we've seen in the past on this platform.

By 20:06 we had our first issue, with JS07 being deleted from the collection due to an authentication check failure. JS07 rejoined the collection and then 4 minutes later...

We have a new error message today!

ATSNRC601E Frame discarded. Hop-count limit reached.

601E Frame discarded. Hop-count limit reached.

Explanation: Upon receiving a frame to forward to another node, the routing group found that the frame exceeds the forwarding limit.

System Action: The system discards the frame and continues processing.

Operator Response: Contact your system programmer or service representative.

I'll be interested to hear your evaluation of what might be happening with this. Typically, I would expect to see hop-count limit issues with a much larger network. This is 3 nodes; you can't get any smaller and still have message routing rather than just talking to your peer.

This is certainly the problem of the day. We got many of these and the collection never really recovered until I logged on at about 16:30 today and restarted JS01 and the links.

As for the packet drops on the interface, all the PIs experienced them. Interestingly, all 3 of the PIs dropped 90 packets. This implies to me that the problem may be with my Netgear switch instead of the individual PIs. In each case, soon after I reloaded the PIs yesterday, they all had either 5 (1 PI) or 6 (2 PIs) drops. Today, after shutting down Hercules on them, they had either 95 (the same PI that had 5) or 96 drops. Fortunately, TCP is a "lossless" protocol and retransmits the dropped packets; unfortunately, the time it takes to do that may cause issues for CTCE and/or TSAF.

For example, interface eth0 on Pi3 (hosting JS01):

Before -

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether e4:5f:01:0c:95:7f  txqueuelen 1000  (Ethernet)
        RX packets 19275  bytes 2175617 (2.0 MiB)
        RX errors 0  dropped 6  overruns 0  frame 0
        TX packets 15184  bytes 14427497 (13.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

After -

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether e4:5f:01:0c:95:7f  txqueuelen 1000  (Ethernet)
        RX packets 1528427  bytes 128188191 (122.2 MiB)
        RX errors 0  dropped 96  overruns 0  frame 0
        TX packets 1151466  bytes 117738484 (112.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

shows it experienced 90 drops.
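
The before/after arithmetic above (96 - 6 = 90 drops during the test window) can be automated when comparing counter snapshots. The sketch below is illustrative only; `rx_dropped` and `drop_delta` are invented helper names, and the parser assumes ifconfig-style counter lines like the ones shown above.

```c
#include <string.h>
#include <stdlib.h>

/* Extract the value that follows "dropped" in an ifconfig counter line,
 * e.g. "RX errors 0  dropped 96  overruns 0  frame 0" yields 96.
 * Returns -1 if the field is not present. */
static long rx_dropped(const char *line)
{
    const char *p = strstr(line, "dropped");
    if (!p)
        return -1;
    /* strtol skips the whitespace between "dropped" and the number */
    return strtol(p + strlen("dropped"), NULL, 10);
}

/* Drops accumulated between two snapshots of the same counter. */
static long drop_delta(long before, long after)
{
    return after - before;
}
```

With the two snapshots from the text, `drop_delta(rx_dropped(before_line), rx_dropped(after_line))` reproduces the 90-drop figure.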

FYI, you'll see greatly increased TX bytes on JS16 and RX bytes on JS07. That was caused by me moving JS07 from PI4, which was originally hosting both JS07 and JS16, to PI2 so it could run on its own platform.

Please take a look when you get a chance and if you have any questions, give me a shout. Thanks!

Logs-0403b.zip

Thanks, Jeff

Fish-Git commented 4 months ago

JS01 logged a "Timeliness check failed" message at 18:57 that affected the JS01-JS16 link and then apparently caused a waterfall of JOINs and collection synchronization that lasted a few minutes, until 19:06.

18:57:07 TSAFVM  : 18:57:07 ATSMBG531E Timeliness check failed on message from node JS16. Function=TD, Current time=0014991F, Origin
 time=0014925B, Performance time=001494DD

I'm assuming that indicates a drop or delay in receiving traffic from the CTC, but I'll let you decode the debug information to find out.

Forgive me for treading into unfamiliar territory, but just out of curiosity, are the two nodes in question (JS01 and JS16) on different physical machines? Even though I know absolutely nothing about TSAF, I'm wondering if guest TOD clock synchronization might be an issue here?

Another thought (question for Peter): do your communications links disable Nagle? If not, might that be something else that could cause such problems? (i.e. a delay in timely transmission of short messages due to Nagle?)

Just some thoughts. Please ignore if unwarranted.

jeff-snyder commented 4 months ago

Hi Fish,

Forgive me for treading into unfamiliar territory, but just out of curiosity, are the two nodes in question (JS01 and JS16) on different physical machines? Even though I know absolutely nothing about TSAF, I'm wondering if guest TOD clock synchronization might be an issue here?

No, in this scenario, all the nodes in the collection were running on one system, my Windows box.

Thanks, Jeff

Fish-Git commented 4 months ago

No, in this scenario, all the nodes in the collection were running on one system, my Windows box.

Just out of curiosity, the next time you run your tests, issue the Hercules clocks command as soon after guest IPL as possible. I'd like to see how close (or how far apart) the TOD clocks are on each of the two systems.

Given that they're both running on the same physical system, I would expect that they should both be within a few microseconds of each other, but certainly within a millisecond of each other.

Note however that this might prove to be difficult (if not impossible!) to verify, given that you can't exactly issue both commands at the exact same time! I wish there was some way to do that, but I can't for the life of me think of one. (Batch file, maybe??)

Still, it might prove to be interesting??

(Or, ... it could be a complete waste of time too!)

jeff-snyder commented 4 months ago

Hi Fish,

It turns out that the best I could do manually with the clocks command was a couple seconds. However, TSAF does have a "query status" command that shows various information about the collection.

 00:32:54 send tsafvm q status
00:32:54 TSAFVM  : Local node id:                     JS04
00:32:54 TSAFVM  : Current collection time:           00013D7C
00:32:54 TSAFVM  : Collection ID:                     JS01
00:32:54 TSAFVM  : Last synchronous update time:      000120AC
00:32:54 TSAFVM  : Next scheduled sync time:          000F2CE7
00:32:54 TSAFVM  : Sync period:                           7200 seconds
00:32:54 TSAFVM  : Transmission Delay:                    0040
00:32:54 TSAFVM  : Maximum clock deviation:               0081
00:32:54 TSAFVM  : Worst extra clock deviation:       00000001
00:32:54 TSAFVM  : Atomic broadcast duration:         00000102
00:32:54 TSAFVM  : Latest Manual adjustment to duration: 00000000
00:32:54 TSAFVM  : Worst case diffusion hop count:          02
00:32:54 TSAFVM  : Current checksum:                  D9A654FC
00:32:54 TSAFVM  : Number of Nodes in collection:           04
00:32:54 TSAFVM  : Highest bit in use for signatures:       04
00:32:54 TSAFVM  : Current Node map:                  F0000000
00:32:54 TSAFVM  : 1)Processor node: JS01      CPU ID: FF01110170600000
00:32:54 TSAFVM  : 2)Processor node: JS04      CPU ID: FF01110470600000
00:32:54 TSAFVM  : 3)Processor node: JS07      CPU ID: FF01110770600000
00:32:54 TSAFVM  : 4)Processor node: JS16      CPU ID: FF01111628180000
00:32:54 TSAFVM  : 5)Processor node:           CPU ID: 4040404040404040
00:32:54 TSAFVM  : 6)Processor node:           CPU ID: 4040404040404040
00:32:54 TSAFVM  : 7)Processor node:           CPU ID: 4040404040404040
00:32:54 TSAFVM  : 8)Processor node:           CPU ID: 4040404040404040
00:32:54 TSAFVM  : Collection bitmap: F0000000  Base signature: 0000010400000000

Unfortunately, there are no units indicated for the clock deviation numbers, so I don't know if they are microseconds, milliseconds, or, God help us, seconds. :)

The IBM manual for present-day z/VM is not very helpful. Its description of the Query Status command includes:

The local node ID should match the node name of the local node corresponding to the processor identification in the SYSTEM NETID file. The collection ID should match one of the processor nodes that are displayed. The atomic broadcast duration estimates the delay before a global resource or gateway is known to all nodes in the TSAF collection. Other fields can be used to establish consistency between the data shown and the output from QUERY STATUS at other TSAF virtual machines in the collection.

Thanks, Jeff

Peter-J-Jansen commented 4 months ago

Hi Fish,

Concerning your Nagle question :

Another thought (question for Peter): do your communications links disable Nagle? If not, might that be something else that could cause such problems? (i.e. a delay in timely transmission of short messages due to Nagle?)

Yes, CTCE disables Nagle, and if disabling Nagle failed, message HHC05068E would be issued, which I've never seen happen.
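
For reference, disabling Nagle on a TCP socket is done with the standard TCP_NODELAY socket option. The following POSIX sketch is generic and not the actual Hercules ctcadpt.c code; the failure branch corresponds to the case where Hercules would issue its HHC05068E message.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_NODELAY */

/* Disable Nagle's algorithm on a TCP socket so that small frames
 * (such as CTC command/status exchanges) are sent immediately instead
 * of being coalesced.  Returns 0 on success, -1 on failure (the case
 * where Hercules would report HHC05068E). */
static int disable_nagle(int fd)
{
    int one = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                   &one, sizeof(one)) != 0) {
        perror("setsockopt(TCP_NODELAY)");
        return -1;
    }
    return 0;
}
```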

All my TSAF testing was using z/VM 7.3 on two hosts, a regular i7-6700K 4-core 32GB PC with Ubuntu 22.04 and an 8GB RPI4 with RPI OS "bullseye" (the official RPI Debian derivative). The send tsafvm q link commands only ever showed a delay of 15, which I presume to be 15 msecs. Longer tests of close to two days never showed failures, but that is with idling TSAFs, as I wouldn't know how to use TSAF effectively, so it just performs its syncs every 2 hours.

I suspect TSAF nowadays is just left as is, as I think z/VM's SSI more or less replaces TSAF. CTCE supports SSI quite reliably, but SSI uses a more sophisticated CTCE setup, with the recommended 2 CTCs per link, so that each direction (send / receive) gets its own dedicated CTC. My VTAM setup between 4 z/VM 7.2 SSI systems uses 4 CTCs per link via VTAM MPCs, i.e. 2 send/receive pairs for additional redundancy should a connection problem develop.

Nevertheless, I think the current CTCE fixes make VM's TSAF facility work fine, within the limitations of what may be temporary intermittent communication problems. CTCE, using TCP sockets for the underlying communication, cannot fully reproduce the original native channel-to-channel devices when it comes to error recovery from intermittent communication problems. That's a restriction we may have to live with.

Cheers,

Peter

Peter-J-Jansen commented 4 months ago

Some additional clock timing info on what I use: I run my own in-house stratum-1 NTP server, which I built myself on an RPI 2 with a GPS u-blox8 HAT receiving GPS signals from an extra outdoor antenna, so there is minimal clock difference between all my other boxes. The initial reason for this was my (successful) z/OS Base Sysplex experiments, where I thought this was going to be required (but which I now think was overkill).

Cheers,

Peter

jeff-snyder commented 4 months ago

Hi Peter,

Some good news, some remaining questions...

All my TSAF testing was using z/VM 7.3 on two hosts, a regular i7-6700K 4-core 32GB PC with Ubuntu 22.04, and a 8GB RPI4 with RPI OS "bullseye" (the official RPI Debian derivative).

It seems we're pretty similarly equipped. My Windows box is an i7-6700K 4-core @ 4.00 GHz with 48GB. Instead of Ubuntu, I'm running Windows 10 Pro.

The 3 PIs I've been testing with are all 8 GB PI 4's running bullseye.

Some additional clock timing info which I use. I run my own in-house stratum-1 NTP server which I built myself on an RPI 2 with an GPS u-blox8 HAT receiving GPS signals from an extra outdoor antenna, so there is minimal clock difference between all my other boxes. The initial reason for this was because of my (successful) z/OS Base Sysplex experiments, where I thought this was going to be required (but which I now think was overkill).

I have a PI 2 that I use for network services. It gets time from the Internet, and my other PIs get their time from it, so while it's not a stratum-1 source, it is at least common to all the PIs.

The send tsafvm q link commands only ever showed a delay of 15, which I presume to be 15 msecs. Longer running tests of close to two days never showed failures, but that is with idling TSAF's as I wouldn't know how to effectively use TSAF, so it just performs the sync's every 2 hours.

In my initial testing, I was getting much higher delay times; my main tests 2 days ago were a lot better, with most being 31 and one being 15. Last night, I was able to retest everything (see below) and got consistent 15s on all links.

I suspect TSAF nowadays is just left as is, as I think z/VM's SSI kind of replaces TSAF. CTCE supports SSI quite reliably, but SSI uses a more sophisticated CTCE setup, using the recommended 2 CTC's per link, so that each direction (send / receive) get's it's own dedicated CTC. My VTAM setup between 4 z/VM 7.2 SSI systems uses the 4 CTC's per link, using VTAM MPC's, i.e. 2 send/receive pairs for additional redundancy when a connection problem would develop.

I haven't looked into SSI, but it appears VM/ESA also supports it, since occasionally I'll mistype a "q collect" command for TSAF and get a CP response instead. If it supports file-sharing, it may be a good fit for my ESA and z VM requirements. Then I'll only have TSAF on the VM/SPs to deal with.

Nevertheless, I think the current CTCE fixes make VM's TSAF facility work fine, but within the limitations of what might be temporary intermittent communication problems. CTCE using TCP sockets supplying the basic communication cannot really implement the original native channel-to-channel devices when it comes to error recovery on intermittent communication problems. That's a restriction we may have to live with.

I would agree with that based on further PI testing I did last night, but I don't see how there would have been any network/communication errors on the collection that all ran on Windows.

Last night

Overnight, I restarted both the test scenarios I reported on 4/3. To put it briefly, they both ran very well.

A - the Windows scenario

The Windows-based 4-node collection came up cleanly and stayed up with no issues. I didn't make any changes to the environment or Hercules configurations, but I got much better results than last time. Honestly, I just thought I was going to play with the "clocks" command and once I got the Windows systems up to test it, I figured, "what the heck, let's see what happens overnight."

One HHC05079I message was logged:

HHC05079I 0:0801 CTCE: <- 0:0807 #0012 cmd=WRT=01 xy=aW<-aa l=03FC k=0DA008A5 Stat=80 CC=0 w=1,r=0 WAIT ATTN->RC=0 Busy_Waits=7

but it had no visible effect on TSAF and everything is still up and clean. I brought up SFSPOOL1 (the shared file server) today and I'll try passing some files around to see if that causes any issues.

B - the PI scenario

I moved the PIs to a Cisco 3560 switch. The 3560 is only 100 Mbps and it's very noisy, so this is not a long-term solution, but it proved to be a good test. The PIs ran all night with no drops or network errors of any kind and TSAF was stable. There was initially an issue with authentication errors that prevented me from bringing JS07 (z/VM 4.4) up with JS01 (VM/ESA 2.4) as the hub, but when I moved it to JS16 (z/VM 6.3), everything came up and stayed up all night. (Maybe there are some TSAF fixes in z/VM that came after VM/ESA?) Today, I dynamically moved JS07 back to JS01 (i.e. added the link to JS01, removed it from JS16), effectively making JS01 the hub again, and I'll see how that runs. So far, no problems. The link delay times are all 15. Initially, one of the links was 31, but after a short time that changed to 15 as well. As long as this stays stable, I'll bring up SFSPOOL1 here, too, and play with it to see what happens.

In this scenario, more HHC05079I messages were logged, but none had any visible effect on TSAF. JS01 logged 12, with one at 00:10:46 being seemingly unrelated to any TSAF messages and the rest appearing to be the result of me starting the link to JS07 at 13:44 to make JS01 the new hub. JS16 logged 3 messages. One occurred during the start up of the link to JS07, another coincides with a "Synchronization is now NORMAL" message and the third doesn't match any Operator messages.

Summary

Note, in neither scenario was any logging or debugging turned on. I was able to capture Hercules and Operator logs, but not the extended information that would have been provided by the debug command.

Logs0405a.zip Logs0405b.zip

Questions

I guess my remaining question for this testing is why was the Windows scenario successful this time and not last? Is it possible the debug command introduces delay or some other factor and that this throws off CTCE? I'm assuming you tested with debug on as well, so that seems unlikely or you'd have seen it too. Beyond that, I can't come up with anything.

Also, since the VM/ESA and z/VM situations seem to be getting resolved, are there additional tests you'd like to see with the VM/SP systems? I've got TSAF available on VM/SP 6 now, so the blank messages that SP5 produced will hopefully be correlatable to something visible on SP6. Also, I'll soon have a shared file system and another SP6 image, so we could concentrate on testing with them, if there's something remaining to look into.

Thanks, Jeff

jeff-snyder commented 4 months ago

Hi,

Windows I have restarted the Windows-based collection (JS01, JS04, JS07, JS16), with full debugging and VTAM running (VTAM is not on JS07) as well as TSAF. Hopefully, any issues will recur sometime this weekend and we'll catch them and be able to see what's going on. All links report a delay of 15.

PIs FYI, the PI-based collection (JS01, JS07, JS16) is still running, with no debugging and no apparent errors. All the links are up, with half of them reporting a delay of 15 and the rest 31. I moved some files back and forth with no issues. N.B. VTAM is also running in this scenario on nodes JS01 and JS16. The 3560 switch still reports all PI ports running clean, with no errors and no drops.

370-mode Barring any feedback, I'll hold off on any SP testing beyond my continuing efforts to get SFS working between multiple SP6s. Maybe after this weekend, if I experience problems while doing that, I'll start a test scenario for 370-mode VMs (SP5 and SP6).

SSI (ISFC) Also, based on some quick reading, it seems Single System Image (SSI) runs on top of the Inter-System Facility for Communications (ISFC), which is supported by VM/ESA (but, unfortunately, not by SP5, and by SP6 only with maintenance that I may not have). TSAF and ISFC can co-exist, or ISFC can be used exclusively to provide the services that TSAF provides.

Per https://www.vm.ibm.com/ssi/

z/VM V6.2 introduced the Single System Image (SSI) as a priced feature. With z/VM 7.1, SSI is now included as part of the base.

All SSI clusters use ISFC for intra-cluster communication and live guest relocation. ISFC uses CTC devices.

Based on this, I might try playing with ISFC on the VM/ESA and z/VM OSes next, since Peter reports good results using SSI on z/VM. If that works, I could then limit my TSAF efforts to VM/SP.
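For when I get to that, my understanding (hedged; I haven't tried it yet) is that ISFC links are managed with the CP ISLINK commands against a CTC device, along the lines of the sketch below. The device number is made up for illustration, and the exact operands should be checked against the CP command reference for the VM level in question.

```
ACTIVATE ISLINK 0910      (bring up an ISFC link over CTC device 0910)
QUERY ISLINK              (display ISFC link status)
DEACTIVE ISLINK 0910      (take the link down again)
```

If that works over CTCE the same way TSAF does, it would give the VM/ESA and z/VM systems their file sharing without a TSAF collection at all.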

Thanks, Jeff

jeff-snyder commented 4 months ago

Hi,

Here are summaries of my weekend testing.

> Windows I have restarted the Windows-based collection (JS01, JS04, JS07, JS16), with full debugging and VTAM running (VTAM is not on JS07) as well as TSAF. Hopefully, any issues will recur sometime this weekend and we'll catch them and be able to see what's going on. All links report a delay of 15.

The TSAF links on Windows did not even stay up for an hour before they started dropping and needed Operator intervention to reconnect. The TSAF VM even abended, requiring all the links to be restarted. After that, the links stayed up for approximately 10 hours before the collection fell apart. In addition, at one point overnight, there was an error on a VTAM CTC link, causing it to fail.

> PIs FYI, the PI-based collection (JS01, JS07, JS16) is still running, with no debugging and no apparent errors. All the links are up, with half of them reporting a delay of 15 and the rest 31. I moved some files back and forth with no issues. N.B. VTAM is also running in this scenario on nodes JS01 and JS16. The 3560 switch still reports all PI ports running clean, with no errors and no drops.

The PI-based tests stayed up longer, but still eventually failed. First, there were issues getting JS16 to join the collection on JS01, so I made JS16 the hub and had the others connect to it. It continued to run this way for over 12 hours before I added link 807 to JS01 and deleted it from JS16 to see if I could move the hub back to VM/ESA. The transition went smoothly, and TSAF stayed up in this configuration for about 36 hours before the collection fell apart and all 3 nodes became isolated. No network drops occurred during any of this testing.

Summary At this point, I have lost all faith in TSAF. Trying to get TSAF to work has cost me more in time than I anticipated the remote file access would save me. It has become pretty obvious that I can't expect to have TSAF just run in the background all the time and have other systems connect to and drop from it like I currently do with VTAM. In the future, I am going to look into a couple other alternatives, including ISFC over CTC and AVS via VTAM, as ways to provide remote file access. I will probably have to use TSAF for the VM/SP systems, but if I do, it will be for short-duration, point-in-time solutions, instead of an always-on hub.

FYI, I have ordered a 3560cx switch to resolve the packet drops. It's a fanless 3560 that has gigabit ports, so it should be a great replacement for my Netgear switch. Having an intelligent switch as the core of my "production" network will be nice, too.

Peter, thanks for all your help. I'm glad you were able to get an enhancement to the CTCE adapter driver out of all this, at least. If there is anything else you'd like to try, please let me know. If and when I try TSAF on the VM/SPs, if I run into any issues, I'll let you know.

Thanks, Jeff

Peter-J-Jansen commented 4 months ago

Hi Jeff,

Thanks! I understand my hopes for a perfectly stable TSAF-over-CTCE are not realized yet, but I nevertheless think the outcome of this issue was a positive one, in that TSAF connections are at least now possible. I will leave this issue open for a few more days; unless there are serious objections, I'd propose to close it after that.

Cheers,

Peter

jeff-snyder commented 4 months ago

Hi Peter,

This evening, I've been running TSAF between my 370-mode systems. I now have 2 VM/SP 5s and 2 VM/SP 6s all interconnected. Of the 6 links, only one had an issue: all of them came up easily, and only that one dropped; the others have remained stable. I have SFS running on the two VM/SP 6 systems and I can share files between them. As expected, the VM/SP 5s can see the resource names in TSAF but have no way to access them.

JS02+JS10 - VM/SP5
JS06+JS11 - VM/SP6

The "missing" link (i.e. from JS06 to JS10) is the one that failed. (A topology diagram was attached here.)

The only HHC05079I messages logged were on the failed link and they all appear to have been resets:

JS06:
2024-04-09 23:58:34 HHC05079I 0:0710 CTCE: <- 0:0706 #0077 cmd=RST=00 xy=AN<-wa l=0000 k=272B2476 Stat=0E      w=0,r=0 MATCH SENSE=4100 RESET
2024-04-09 23:58:34 HHC05079I 0:0710 CTCE: <- 0:0706 #0078 cmd=RST=00 xy=an<-an l=0000 k=272B2479              w=0,r=0 SENSE=4100 RESET
2024-04-09 23:58:53 HHC05079I 0:0710 CTCE: -> 0:0706 #0075 cmd=RST=00 xy=aa->Aa l=0000 k=273D0474              w=0,r=0 SENSE=4100 RESET
2024-04-09 23:58:53 HHC05079I 0:0710 CTCE: -> 0:0706 #0076 cmd=RST=00 xy=aa->aa l=0000 k=273D0477              w=0,r=0 SENSE=4100 RESET
2024-04-09 23:59:43 HHC05079I 0:0710 CTCE: <- 0:0706 #0079 cmd=RST=00 xy=aa<-aa l=0000 k=272B2478              w=0,r=0 SENSE=4100 RESET
2024-04-10 00:00:01 HHC05079I 0:0710 CTCE: -> 0:0706 #0077 cmd=RST=00 xy=aa->aa l=0000 k=273D0476              w=0,r=0 SENSE=4100 RESET
2024-04-10 00:00:51 HHC05079I 0:0710 CTCE: <- 0:0706 #007A cmd=RST=00 xy=aa<-aa l=0000 k=272B247B              w=0,r=0 SENSE=4100 RESET
2024-04-10 00:01:09 HHC05079I 0:0710 CTCE: -> 0:0706 #0078 cmd=RST=00 xy=aa->aa l=0000 k=273D0479              w=0,r=0 SENSE=4100 RESET

JS10:
2024-04-09 23:58:34 HHC05079I 0:0706 CTCE: -> 0:0710 #0077 cmd=RST=00 xy=wa->Na l=0000 k=272B2476              w=1,r=1 RESET
2024-04-09 23:58:34 HHC05079I 0:0706 CTCE: -> 0:0710 #0078 cmd=RST=00 xy=na->na l=0000 k=272B2479              w=1,r=1 RESET
2024-04-09 23:59:43 HHC05079I 0:0706 CTCE: -> 0:0710 #0079 cmd=RST=00 xy=na->na l=0000 k=272B2478              w=1,r=1 RESET
2024-04-10 00:00:51 HHC05079I 0:0706 CTCE: -> 0:0710 #007A cmd=RST=00 xy=na->na l=0000 k=272B247B              w=1,r=1 RESET

I did try to restart the link (well, "delete" and then "add" on both ends), with no success. I never got aggressive with varying it offline and doing devinits. Fortunately, I've got a multitude of paths around it!
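For the record, that restart attempt amounts to the TSAF link commands below, sent to the TSAF virtual machine on each end (710 and 706 being the link addresses from the logs above). This is a sketch of the procedure, assuming the standard TSAF DELETE LINK / ADD LINK / QUERY LINK operator commands, issued the same way as the "send tsafvm q link" commands mentioned earlier.

```
send tsafvm delete link 710      (on JS06: drop the failed link)
send tsafvm delete link 706      (on JS10: same, from the other end)
send tsafvm add link 710         (on JS06: re-add it)
send tsafvm add link 706         (on JS10)
send tsafvm q link               (verify link status and delay)
```

In this case the re-added link never came back, so the next escalation step would have been varying the device offline and doing a devinit, which I didn't pursue.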

I'm convinced I can use this for any "point-in-time" requirements. I'll leave it up overnight out of curiosity, but I expect it will probably fail in several hours, like the other tests did. As long as I don't need it for more than a few hours at a time though, I think I'm good.

Thanks for all your help with this! I'm happy with your plan to close this in a day or two. That'll give me a little leeway in case something happens with the VM/SPs I want to let you know about.

Jeff

jeff-snyder commented 4 months ago

Hi,

FYI, after about 22 hours, the two VM/SP 5 systems are the only ones left in the collection. All the links have failed except for one link between the 2 SP5 nodes.

I'm going to put my production network back up and just keep TSAF around for short-term use when needed. I'll close this issue now. Thanks!

Thanks, Jeff