SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator
Other
240 stars 90 forks source link

Hercules crash after resume from suspend #458

Closed Fish-Git closed 2 years ago

Fish-Git commented 2 years ago

NOTE:  GitHub Issue #489 (Hercules 4.4.1 crashes after OSA failure) is also closely related to this issue.

NOTE:  GitHub Issue #449 (Periodic "HHC00801I Processor CP00: Fixed-point-divide exception running VSE/ESA) is also closely related to this issue.


Problem reported by Dave Wade:

Dear Developers,

Hercules occasionally crashes after my laptop exits windows sleep mode. Is this something that could be investigated . It reports its version 4.3.0.10296.

As I seldom run anything other than VM/370 R6 I haven’t yet upgraded to the latest version.

Dave

He provided a crash dump:

Fish-Git commented 2 years ago

Analysis of the crash dump reveals:


Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_DUMPS\Hercules.dmp]
User Mini Dump File: Only registers, stack and portions of memory are available

WARNING: Minidump contains unknown stream type 0x15
WARNING: Minidump contains unknown stream type 0x16
Comment: 'HHC01413I Hercules version 4.3.0.10296-SDL-ga4db8213 (4.3.0.10296)'
Comment: 'HHC01414I (C) Copyright 1999-2020 by Roger Bowler, Jan Jaeger, and others'
Comment: 'HHC01417I ** The SoftDevLabs version of Hercules **'
Comment: 'HHC01415I Build date: Nov 27 2020 at 10:51:04'
Comment: 'HHC01417I Built with: Microsoft Visual Studio 2008 (MSVC 150030729 1)'
Comment: 'HHC01417I Build type: Windows MSVC AMD64 host architecture build'
Comment: 'HHC01417I Modes: S/370 ESA/390 z/Arch'
Comment: 'HHC01417I Max CPU Engines: 64'
Comment: 'HHC01417I Using   shared libraries'
Comment: 'HHC01417I Using   Fish threads Threading Model'
Comment: 'HHC01417I Using   Error-Checking Mutex Locking Model'
Comment: 'HHC01417I With    Shared Devices support'
Comment: 'HHC01417I With    Dynamic loading support'
Comment: 'HHC01417I With    External GUI support'
Comment: 'HHC01417I With    Partial TCP keepalive support'
Comment: 'HHC01417I With    IPV6 support'
Comment: 'HHC01417I With    HTTP Server support'
Comment: 'HHC01417I With    sqrtl support'
Comment: 'HHC01417I With    Signal handling'
Comment: 'HHC01417I With    Watchdog monitoring'
Comment: 'HHC01417I With    CCKD BZIP2 support'
Comment: 'HHC01417I With    HET BZIP2 support'
Comment: 'HHC01417I With    ZLIB support'
Comment: 'HHC01417I With    Regular Expressions support'
Comment: 'HHC01417I With    Object REXX support'
Comment: 'HHC01417I Without Regina REXX support'
Comment: 'HHC01417I With    Automatic Operator support'
Comment: 'HHC01417I Without National Language Support'
Comment: 'HHC01417I With    CCKD64 Support'
Comment: 'HHC01417I With    Transactional-Execution Facility support'
Comment: 'HHC01417I With    "Optimized" instructions'
Comment: 'HHC01417I Machine dependent assists: cmpxchg1 cmpxchg4 cmpxchg8 cmpxchg16 hatomics=msvcIntrinsics'
Comment: 'HHC01417I Running on: DESKTOP-OSERMAE (Windows-6.2.9200 Intel(R) x64) LP=4, Cores=2, CPUs=1'
Comment: 'HHC01417I Built with crypto external package version 1.0.0.31-gbf46d03'
Comment: 'HHC01417I Built with decNumber external package version 3.68.0.84-g94144c9'
Comment: 'HHC01417I Built with SoftFloat external package version 3.5.0.87-gb706677'
Comment: 'HHC01417I Built with telnet external package version 1.0.0.48-gb90a708'
Comment: 'HHC00018W Hercules is NOT running in elevated mode'
Comment: '10:12:32.299449 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2390 cache[27]'
Comment: '10:12:32.299596 0:0141 cckd_dasd_end(512): end i/o bufcur 2390 cache[27] waiters 0'
Comment: '10:12:32.299841 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6020 cache[89]'
Comment: '10:12:32.300034 0:034F cckd_dasd_end(512): end i/o bufcur 6020 cache[89] waiters 0'
Comment: '10:12:32.300264 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2390 cache[27]'
Comment: '10:12:32.300423 0:0141 cckd_dasd_end(512): end i/o bufcur 2390 cache[27] waiters 0'
Comment: '10:12:32.300578 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6020 cache[89]'
Comment: '10:12:32.300693 0:034F cckd_read_track(866): read  trk   6021 (asynchronous)'
Comment: '10:12:32.300702 0:034F cckd_read_trk(1194): 0 rdtrk     6021'
Comment: '10:12:32.301348 0:034F cckd_read_trk(1242): 0 rdtrk[150] 6021 cache hit buf 0000000004CA67A0:0000c80015'
Comment: '10:12:32.301391 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0'
Comment: '10:12:32.301674 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2390 cache[27]'
Comment: '10:12:32.301786 0:0141 cckd_read_track(866): read  trk   2391 (asynchronous)'
Comment: '10:12:32.301792 0:0141 cckd_read_trk(1194): 0 rdtrk     2391'
Comment: '10:12:32.302280 0:0141 cckd_read_trk(1242): 0 rdtrk[28] 2391 cache hit buf 0000000004CE6480:00004f0015'
Comment: '10:12:32.302685 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0'
Comment: '10:12:32.313596 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]'
Comment: '10:12:32.313759 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0'
Comment: '10:12:32.313940 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2391 cache[28]'
Comment: '10:12:32.314142 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0'
Comment: '10:12:32.314339 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]'
Comment: '10:12:32.314486 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0'
Comment: '10:12:32.314649 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2391 cache[28]'
Comment: '10:12:32.315347 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0'
Comment: '10:12:32.315475 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]'
Comment: '10:12:32.315744 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0'
Comment: '10:12:32.315972 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2391 cache[28]'
Comment: '10:12:32.316161 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0'
Comment: '10:12:32.316323 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]'
Comment: '10:12:32.316434 0:034F cckd_read_track(866): read  trk   6022 (asynchronous)'
Comment: '10:12:32.316439 0:034F cckd_read_trk(1194): 0 rdtrk     6022'
Comment: '10:12:32.316961 0:034F cckd_read_trk(1242): 0 rdtrk[120] 6022 cache hit buf 0000000004A9B8C0:0000c80016'
Comment: '10:12:32.317002 0:034F cckd_dasd_end(512): end i/o bufcur 6022 cache[120] waiters 0'
Comment: '10:12:32.322752 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11490 cache[180]'
Comment: '10:12:32.323013 0:07F0 cckd_dasd_end(512): end i/o bufcur 11490 cache[180] waiters 0'
Comment: '10:12:32.328969 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11490 cache[180]'
Comment: '10:12:32.329318 0:07F0 cckd_dasd_end(512): end i/o bufcur 11490 cache[180] waiters 0'
Comment: '10:12:32.351690 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11490 cache[180]'
Comment: '10:12:32.351845 0:07F0 cckd_read_track(866): read  trk   11491 (asynchronous)'
Comment: '10:12:32.351850 0:07F0 cckd_read_trk(1194): 0 rdtrk     11491'
Comment: '10:12:32.351859 0:07F0 cckd_read_trk(1242): 0 rdtrk[181] 11491 cache hit buf 0000000004B25690:00017f0001'
Comment: '10:12:32.351972 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:32.352374 0:07F0 cckd_read_trk(1194): 1 rdtrk     11492'
Comment: '10:12:32.352385 0:07F0 cckd_read_trk(1273): 1 rdtrk[182] 11492 cache miss'
Comment: '10:12:32.352391 0:07F0 cckd_read_trk(1302): 1 rdtrk[182] 11492 dropping 0241:30 from cache'
Comment: '10:12:32.352396 0:07F0 cckd_read_trk(1325): 1 rdtrk[182] 11492 buf 0000000004A60C20 len 19456'
Comment: '10:12:32.352401 0:07F0 cckd_read_trkimg(3333): trk[11492] read_trkimg'
Comment: '10:12:32.352817 0:07F0 cckd_read_l2ent(3250): file[1] l2[44,228] trk[11492] read_l2ent 0x13d08'
Comment: '10:12:32.353219 0:07F0 cckd_read_l2(3009): file[1] read_l2 44 active 1 44 144'
Comment: '10:12:32.353640 0:07F0 cckd_read_l2ent(3268): file[1] l2[44,228] trk[11492] read_l2ent 0x185c57 189 189'
Comment: '10:12:32.353647 0:07F0 cckd_read(624): file[1] fd[20] read, off 0x0000000000185c57 len 189'
Comment: '10:12:32.354346 0:07F0 cckd_read_trk(1366): 1 rdtrk[182] 11492 complete buf 0000000004A60C20:01017f0002'
Comment: '10:12:36.017050 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:36.017270 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:36.023011 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:36.023265 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:36.024658 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:36.024907 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:36.027040 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:36.027313 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:43.774786 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:43.774950 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:43.775210 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:43.775390 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: ''
Symbol search path is: C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_DUMPS;srv*C:\Windows\Symbols*http://msdl.microsoft.com/download/symbols
Executable search path is: C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_DUMPS
Windows 7 Version 19043 MP (4 procs) Free x64
Product: WinNt, suite: SingleUserTS
Machine Name:
Debug session time: Sat Dec 25 02:13:03.000 2021 (UTC - 8:00)
System Uptime: not available
Process Uptime: 2 days 0:49:02.000
.........................................................
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(8440.7bc8): Access violation - code c0000005 (first/second chance not available)
ntdll!ZwGetContextThread+0x14:
00007ffa`3c24eba4 c3              ret
0:003> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

Failed calling InternetOpenUrl, GLE=12029

FAULTING_IP: 
hengine!watchdog_thread+486 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\impl.c @ 586]
00000001`005db506 c6450000        mov     byte ptr [rbp],0

EXCEPTION_RECORD:  ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 00000001005db506 (hengine!watchdog_thread+0x0000000000000486)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000001
   Parameter[1]: 0000000000000000
Attempt to write to address 0000000000000000

DEFAULT_BUCKET_ID:  NULL_POINTER_WRITE

PROCESS_NAME:  hercules.exe

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_PARAMETER1:  0000000000000001

EXCEPTION_PARAMETER2:  0000000000000000

WRITE_ADDRESS:  0000000000000000 

FOLLOWUP_IP: 
hengine!watchdog_thread+486 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\impl.c @ 586]
00000001`005db506 c6450000        mov     byte ptr [rbp],0

MOD_LIST: 

NTGLOBALFLAG:  0

APPLICATION_VERIFIER_FLAGS:  0

FAULTING_THREAD:  0000000000007bc8

PRIMARY_PROBLEM_CLASS:  NULL_POINTER_WRITE

BUGCHECK_STR:  APPLICATION_FAULT_NULL_POINTER_WRITE

LAST_CONTROL_TRANSFER:  from 0000000000c60450 to 00000001005db506

STACK_TEXT:  
00000000`02dff500 00000000`00c60450 : 00000000`00dbf130 00000000`00000000 00000000`00000003 00000000`7055334a : hengine!watchdog_thread+0x486 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\impl.c @ 586]
00000000`02dff790 00000000`00c53c04 : 00000000`00dc03f0 00000000`00000000 00000000`00000000 00000000`00000000 : hutil!hthread_func+0x60 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\hthreads.c @ 1051]
00000000`02dff7c0 00000000`70552fdf : 00000000`00dc1950 00000000`00000000 00000000`00dc1950 00000000`00000000 : hutil!FTWin32ThreadFunc+0x44 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\fthreads.c @ 809]
00000000`02dff810 00000000`70553080 : 00000000`705dcad0 00000000`00dc1950 00000000`00000000 00000000`00000000 : msvcr90!_callthreadstartex+0x17 [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c @ 348]
00000000`02dff840 00007ffa`3bbc7034 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr90!_threadstartex+0x84 [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c @ 326]
00000000`02dff870 00007ffa`3c202651 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x14
00000000`02dff8a0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21

STACK_COMMAND:  ~3s; .ecxr ; kb

FAULTING_SOURCE_CODE:  
   582:             /* Give logger thread time to log messages */
   583:             SLEEP(1);
   584: 
   585:             /* Create the crash dump for offline analysis */
>  586:             CRASH();
   587:         }
   588:     }
   589:     while (!sysblk.shutdown);
   590: 
   591:     return NULL;

SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  hengine!watchdog_thread+486

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: hengine

IMAGE_NAME:  hengine.dll

DEBUG_FLR_IMAGE_TIMESTAMP:  5fc14ab7

FAILURE_BUCKET_ID:  NULL_POINTER_WRITE_c0000005_hengine.dll!watchdog_thread

BUCKET_ID:  X64_APPLICATION_FAULT_NULL_POINTER_WRITE_hengine!watchdog_thread+486

WATSON_STAGEONE_URL:  http://watson.microsoft.com/StageOne/hercules_exe/4_3_0_10296/5fc14ad2/hengine_dll/4_3_0_10296/5fc14ab7/c0000005/001db506.htm?Retriage=1

Followup: MachineOwner
---------

Which makes perfect sense!

The WatchDog thread checks for malfunctioning CPUs by checking to see if they've executed any instructions in the past 20 seconds, and when the host is later resumed after having been suspended, much longer than 20 seconds has passed! (thereby misleading the watchdog thread into believing the CPU is malfunctioning when in fact is likely wasn't!)

New code needs to be added to Hercules (at least for Windows anyway; not sure how to do it for non-Windows) to detect when a resume after a suspend has occurred, and compensate for that fact to prevent erroneous malfunctioning CPUs from being detected.

dasdman commented 2 years ago

Fish:

See https://stackoverflow.com/questions/7253714/get-notified-about-windows-is-going-to-sleep-waking-up-in-c for Windows; on Linux, DBUS service notification will most likely be needed (https://hg.mozilla.org/integration/autoland/rev/aa2f75bd6480 is Mozilla’s implementation).

Regards,

Mark L. Gaubatz

Fish-Git commented 2 years ago

Thanks anyway Mark, but I'm already on it. Having been a Windows programmer for 25+ years now, I DO know about these things after all.   :)

(Now the Linux side of things I could use some help with!)

Fish-Git commented 2 years ago

QUESTION FOR MY FELLOW HERCULES DEVELOPERS:

Before I begin testing, which fix technique do you prefer? Technique #1 or #2?

Technique #1 is host operating system dependent and would obviously require adding additional code to deal with the non-Windows case (only the Windows case has been coded), whereas technique #2 has the advantage of not being specific to any host os (i.e. should work equally well on both Windows and non-Windows systems) and is much simpler and more straightforward as well.

I realize it's obvious the second technique is clearly superior to the first as far as resolving the specific issue at hand goes, but I guess what I'm asking is, how important is it for us to be able to detect host system suspend/resume events in general?

The first technique has the advantage(?) of directly detecting such events, thereby allowing us the ability react differently to such events elsewhere in Hercules code too (and not just in the watchdog thread), whereas the second technique only fixes the one specific suspend/resume problem that we've discovered thus far (in the watchdog thread). There might well be other places in Hercules code where we need to compensate for suspend/resume that we just haven't discovered yet, and by using technique #1, such code would simply need to check the sysblk flags. The groundwork of actually directly detecting suspend/resume events would already have been laid.

Personally, I vote for technique #2.

Fish-Git commented 2 years ago

Fixed (hopefully!) by commit 4b439ae673d776ecaa35ed0aa87bb73f940e904e.

Closing.