SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator

A random failure can occur when issuing the ARCHMODE and NUMCPU commands #542

Closed wrljet closed 1 year ago

wrljet commented 1 year ago

A random failure can occur when issuing the ARCHMODE and NUMCPU commands.

While running Hercules repeatedly from a shell script to track down a different issue, I started noticing these errors in the log, which later caused the IPL to fail.

HHC02389E CPUs must be offline or stopped

or

HHC02253E All CPU's must be stopped to switch architectures

A simple Hercules .cnf can be used when attempting to reproduce this problem:

ARCHLVL z/Arch
FACILITY DISABLE 050_CONSTR_TRANSACT
FACILITY DISABLE 073_TRANSACT_EXEC
NUMCPU 2
MAXCPU 2

Depending on the host system's CPU architecture, OS, etc., this problem may trigger quickly, perhaps one out of ten tries (NetBSD on x86_64 and Sun UltraSPARC), or it may never show up at all (ARM-based Raspberry Pi with Debian, and macOS on an Apple M1 CPU). Windows and Debian on x86_64 fail fairly regularly for me.

After endless fiddling I did manage to get it to stop in the Visual Studio debugger (Windows 10 VM, VS2019) and captured the two threads involved, which show the "what" of the issue:

Worker Thread   impl_thread hengine.dll!maxcpu_cmd

>   hengine.dll!maxcpu_cmd(int argc, char * * argv, char * cmdline) Line 3811   C
    hengine.dll!CallHercCmd(int argc, char * * argv, char * cmdline) Line 362   C
    hengine.dll!process_config(const char * cfg_name) Line 424  C
    hengine.dll!build_config(const char * hercules_cnf) Line 118    C
    hengine.dll!impl(int argc, char * * argv) Line 1340 C
    hercules.exe!main(int ac, char * * av) Line 305 C

Worker Thread   Processor CP01  hutil.dll!LeaveFT_MUTEX

    hutil.dll!LeaveFT_MUTEX(_tagFT_MUTEX * pFT_MUTEX) Line 292  C
    hutil.dll!fthread_mutex_unlock(_tagFTU_MUTEX * pFTUSER_MUTEX) Line 1459 C
    hutil.dll!hthread_release_lock(LOCK * plk, const char * release_loc) Line 545   C
    hengine.dll!Release_Interrupt_Lock(REGS * regs, const char * location) Line 450 C
>   hengine.dll!z900_run_cpu(int cpu, REGS * oldregs) Line 1996 C
    hengine.dll!cpu_thread(void * ptr) Line 2355    C
    hutil.dll!hthread_func(void * arg2) Line 1055   C
    hutil.dll!FTWin32ThreadFunc(void * pMyArgs) Line 809    C

Some of the relevant code:

cpu.c:1926
    memset(regs, 0, sizeof(REGS));

    if (cpu_init (cpu, regs, NULL))
        return NULL;

...

cpu.c:1991
    RELEASE_INTLOCK(regs);

    /* Establish longjmp destination for program check or
       RETURN_INTCHECK, or SIE_INTERCEPT, or longjmp, etc.
    */
    if (setjmp( regs->progjmp ) && sysblk.ipled)
    {

---

in cpu_init( )

cpu.c:
    if (!hostregs)
    {
        /* regs points to host regs */
        regs->cpustate = CPUSTATE_STOPPING;
        ON_IC_INTERRUPT(regs);
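
One possible reading of these excerpts (an assumption; this thread does not state the root cause) is that a newly created CPU thread is still in CPUSTATE_STOPPING when the configuration thread processes its next statement, so a check that requires all CPUs to be stopped can fail depending on timing. The sketch below illustrates that general kind of race with plain POSIX threads. It is not Hercules code and not the fix from the commit referenced in this issue; every `demo_` name is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; NOT the Hercules definitions. */
enum demo_cpustate { DEMO_STOPPING, DEMO_STOPPED };

struct demo_cpu {
    pthread_mutex_t    lock;     /* protects 'state' */
    pthread_cond_t     changed;  /* signalled whenever 'state' changes */
    enum demo_cpustate state;
};

static void demo_cpu_init(struct demo_cpu *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->changed, NULL);
    c->state = DEMO_STOPPING;    /* same initial state as the excerpt above */
}

/* "CPU" thread: finishes the STOPPING -> STOPPED transition at some
   later point that the creating thread does not control. */
static void *demo_cpu_thread(void *arg)
{
    struct demo_cpu *c = arg;

    pthread_mutex_lock(&c->lock);
    c->state = DEMO_STOPPED;
    pthread_cond_broadcast(&c->changed);
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* Racy check: reports the state at this instant only. Called right after
   thread creation, the answer depends on scheduling. */
static bool demo_is_stopped_now(struct demo_cpu *c)
{
    pthread_mutex_lock(&c->lock);
    bool stopped = (c->state == DEMO_STOPPED);
    pthread_mutex_unlock(&c->lock);
    return stopped;
}

/* Deterministic alternative: block until the transition has completed. */
static void demo_wait_until_stopped(struct demo_cpu *c)
{
    pthread_mutex_lock(&c->lock);
    while (c->state != DEMO_STOPPED)
        pthread_cond_wait(&c->changed, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

/* Creates one "CPU", waits for it to stop, then checks. Returns 1 on success. */
static int demo_run(void)
{
    struct demo_cpu cpu;
    pthread_t tid;

    demo_cpu_init(&cpu);
    int rc = pthread_create(&tid, NULL, demo_cpu_thread, &cpu);
    assert(rc == 0);
    (void)rc;

    /* Calling demo_is_stopped_now(&cpu) here could return either value. */
    demo_wait_until_stopped(&cpu);
    int ok = demo_is_stopped_now(&cpu) ? 1 : 0;

    pthread_join(tid, NULL);
    pthread_cond_destroy(&cpu.changed);
    pthread_mutex_destroy(&cpu.lock);
    return ok;
}
```

Calling `demo_run()` from a small `main` (built with `-pthread`) always returns 1 because the check happens after the wait. Removing the `demo_wait_until_stopped` call makes the result timing-dependent, which would be consistent with the host-dependent failure rates described above.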

This bug affects Hercules going back at least 2 years in the git commit history.

I have reported this bug to Fish privately and worked with him to help reproduce it. I have tested his proposed fix, which is forthcoming.

Bill

Fish-Git commented 1 year ago

Fixed by commit f83880b58baadc28de75686b5f7a3800efa57996.

Closing.

wrljet commented 1 year ago

Fish, excellent work!