SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator

A misconfigured CTCE causes Hercules to crash #646

Closed jeff-snyder closed 4 months ago

jeff-snyder commented 4 months ago

Hi Peter,

The two systems JS01 and JS10 are configured to talk to each other over a CTC. Unfortunately, a typo was made on the JS01 configuration and instead of 610 as the remote CUU, E20 was entered.

JS01
610 CTCE 30801 E20=Hercules 30810 # link to JS10/E20

JS10
610 CTCE 30810 610=Hercules 30801 ATTNDELAY 200 # link to JS01/610
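A mismatch like this can be caught before either system is started by cross-checking the two statements. The following is only a minimal sketch: it assumes the statement form quoted above (device, `CTCE`, local port, `remote-cuu=remote-host`, remote port, optional trailing options) and ignores every other CTCE syntax. The names `CtceLink`, `parse_ctce` and `check_pair` are made up for illustration and are not part of Hercules.

```python
import re
from typing import List, NamedTuple, Optional


class CtceLink(NamedTuple):
    system: str       # label for the Hercules image, e.g. "JS01"
    local_cuu: int    # device number on this side
    local_port: int   # port this side listens on
    remote_cuu: int   # device number expected on the other side
    remote_host: str
    remote_port: int  # port the other side listens on


# Matches only the statement form quoted in this issue (an assumption):
#   <cuu> CTCE <local port> <remote cuu>=<remote host> <remote port> [...]
CTCE_RE = re.compile(
    r"^\s*([0-9A-Fa-f]{1,4})\s+CTCE\s+(\d+)\s+([0-9A-Fa-f]{1,4})=(\S+)\s+(\d+)"
)


def parse_ctce(system: str, line: str) -> Optional[CtceLink]:
    """Parse one CTCE statement, or return None if it has another form."""
    m = CTCE_RE.match(line)
    if m is None:
        return None
    cuu, lport, rcuu, rhost, rport = m.groups()
    return CtceLink(system, int(cuu, 16), int(lport),
                    int(rcuu, 16), rhost, int(rport))


def check_pair(a: CtceLink, b: CtceLink) -> List[str]:
    """Report where each side's view of the other disagrees with it."""
    problems = []
    for near, far in ((a, b), (b, a)):
        if near.remote_port != far.local_port:
            problems.append(
                f"{near.system}: remote port {near.remote_port} != "
                f"{far.system} local port {far.local_port}")
        if near.remote_cuu != far.local_cuu:
            problems.append(
                f"{near.system}: remote CUU {near.remote_cuu:04X} != "
                f"{far.system} device {far.local_cuu:04X}")
    return problems


js01 = parse_ctce("JS01", "610 CTCE 30801 E20=Hercules 30810 # link to JS10/E20")
js10 = parse_ctce("JS10", "610 CTCE 30810 610=Hercules 30801 ATTNDELAY 200 # link to JS01/610")
for problem in check_pair(js01, js10):
    print(problem)
```

For the two statements above, the ports agree in both directions and only the remote CUU on the JS01 side is flagged.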

When I bring up Hercules for JS01 and IPL VM/ESA, there are no problems. A "devlist ctca" shows the link defined (among others).

2024-03-29 18:09:18 HHC01603I devlist ctca
2024-03-29 18:09:18 HHC02279I 0:0610 CTCE CTCE 30801/63504 !=! 0:0E20=192.168.1.32:30810/* IO[0] 

When I then bring up Hercules for JS10, it thinks it successfully connected to this link.

2024-03-29 18:10:48 HHC05063I 0:0610 CTCE: Awaiting inbound connection :30810 <- 0:0610=192.168.1.32:30801/*
2024-03-29 18:10:48 HHC05070I 0:0610 CTCE: Accepted inbound connection :30810 <- 0:0610=192.168.1.32:63830 (bufsize=62552,16)
2024-03-29 18:10:48 HHC05054I 0:0610 CTCE: Renewed outbound connection :63844 -> 0:0610=192.168.1.32:30801

The "devlist" for JS10 shows a connection.

2024-03-29 18:10:57 HHC01603I devlist ctca
2024-03-29 18:10:57 HHC02279I 0:0610 CTCE CTCE 30810/63844 <-> 0:0610=192.168.1.32:30801/63830 IO[0] open 

The log for JS01 shows it started an outbound connection, but there was never an inbound connection.

2024-03-29 18:10:48 HHC05054I 0:0610 CTCE: Started outbound connection :63830 -> 0:0E20=192.168.1.32:30810

The "devlist" confirms an incomplete link.

2024-03-29 18:11:17 HHC01603I devlist ctca
2024-03-29 18:11:17 HHC02279I 0:0610 CTCE CTCE 30801/63830 !=> 0?0E20=192.168.1.32:30810/* IO[2] open 

When I IPL JS10, errors ensue:

2024-03-29 18:11:56 HHC01603I ipl 1c0
2024-03-29 18:11:56 HHC05074E 0:0610 CTCE: Error writing to 0:0610=192.168.1.32:30801/63830: An established connection was aborted by the software in your host machine.
2024-03-29 18:11:56 HHC00007I Previous message from function 'CTCE_Send' at ctcadpt.c(2555)
2024-03-29 18:11:56 HHC05086I 0:0610 CTCE: Recovery is about to issue Hercules command: DEVINIT 0:0610
2024-03-29 18:12:31 HHC00822S PROCESSOR CP00 APPEARS TO BE HUNG!

and, eventually, a crash dump.
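When comparing the two logs side by side, it can help to pull out only the error and severe messages. The sketch below assumes the layout seen in the lines quoted above (date, time, an `HHCnnnnn` message number followed by one letter, then text) and treats that trailing letter as a severity code, which is an inference from these samples. The helper name `problem_messages` is hypothetical.

```python
import re

# Assumed layout, inferred from the log lines quoted in this issue:
#   <date> <time> HHCnnnnnX <text>   (X = one-letter severity, e.g. I, E, S)
MSG_RE = re.compile(r"^(\S+ \S+) (HHC\d{5})([A-Z]) (.*)$")


def problem_messages(log_text, severities=("E", "S")):
    """Return (timestamp, message id, text) for lines with the given severities."""
    hits = []
    for line in log_text.splitlines():
        m = MSG_RE.match(line.strip())
        if m and m.group(3) in severities:
            hits.append((m.group(1), m.group(2) + m.group(3), m.group(4)))
    return hits


sample = """\
2024-03-29 18:11:56 HHC01603I ipl 1c0
2024-03-29 18:11:56 HHC05074E 0:0610 CTCE: Error writing to 0:0610=192.168.1.32:30801/63830: An established connection was aborted by the software in your host machine.
2024-03-29 18:11:56 HHC00007I Previous message from function 'CTCE_Send' at ctcadpt.c(2555)
2024-03-29 18:11:56 HHC05086I 0:0610 CTCE: Recovery is about to issue Hercules command: DEVINIT 0:0610
2024-03-29 18:12:31 HHC00822S PROCESSOR CP00 APPEARS TO BE HUNG!
"""

for timestamp, msgid, text in problem_messages(sample):
    print(timestamp, msgid, text)
```

Run against the JS10 excerpt, this keeps the HHC05074E write error and the HHC00822S hang report and drops the informational lines.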

Note: this happened on Windows 10, running Hercules version 4.8.0.11129-SDL-DEV-g5517d322-modified. I retested with Hercules version 4.8.0.11129-SDL-DEV-g5517d322, i.e. without the changes to ctcadpt.c, and it still happens.

JS01 is VM/ESA 2.4 and JS10 is VM/SP 5.

Here are the associated log and config files.

Unfortunately, due to the 74 MB file size, I cannot upload the dump file. For now, I have put it on my Google Drive. Hopefully, you can get it from there, or we can find another way to get it to you.

Thanks for looking at this! Jeff

Peter-J-Jansen commented 4 months ago

Hi Jeff,

The CTCE recovery attempts are known not to always succeed. The DEVINIT 0:0610 attempt may well fail if, by then, the device is busy or has an interrupt pending. The crash in this case was probably triggered by the Hercules watchdog timer detecting HHC00822S PROCESSOR CP00 APPEARS TO BE HUNG!, so it was probably a case of Works As Designed ("WAD").

Some years ago considerable effort went into making the CTCE automatic recovery fail-safe, and progress was made, but I was unable to make it work in all cases. As this occurrence was triggered by an incorrect CTCE configuration, I'd suggest we close this issue. To help avoid CTCE configuration errors, I'd suggest not specifying any port numbers at all when the CTCE links are between different hosts, and restricting the configuration to device (CCUU) numbers only, e.g.:

0610 CTCE =Hercules

or, if one prefers to make the device number specific to each host:

0610 CTCE 0601=js01.hostname

0601 CTCE 0610=js10.hostname

Cheers,

Peter

jeff-snyder commented 4 months ago

Peter,

Some years ago considerable effort went into making the CTCE automatic recovery fail-safe, and progress was made, but I was unable to make it work in all cases. As this occurrence was triggered by an incorrect CTCE configuration, I'd suggest we close this issue.

I'm good with that. I have a workaround (i.e. fix your stupid configuration error!).

To help avoid CTCE configuration errors, I'd suggest not specifying any port numbers at all when the CTCE links are between different hosts, and restricting the configuration to device (CCUU) numbers only, e.g.:

Unfortunately, this doesn't work for me because I run multiple Hercules images on each host and I move them around, so I'm never sure which images will be running on which hosts. It's a good solution for people with fewer images or a more stable environment, though!

Thanks, Jeff