SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator
Other
245 stars 92 forks source link

TXF problem with an Unconstrained TBEGIN transaction under SIE. #315

Closed Peter-J-Jansen closed 4 years ago

Peter-J-Jansen commented 4 years ago

z/OSMF in z/OS 2.4 under the latest SDL-Hyperion (version 4.3.9999.0-SDL-gbca5d05c (4.3.9999.0)) reproducibly fails in an Unconstrained TBEGIN transaction. The Hercules log shows:

16:16:45 HHC97703D TXF: CP03: Failed Outermost Uncons Transaction for TND 1 txfs=2 retry=0: TAC_UPGM (PGM Interruption (Unfiltered)), why = TXF_WHY_UNFILT_INT
16:16:45 HHC17709D TXF: CP03: TDB at 0x0000000000001800:
16:16:45 HHC17710D TXF: CP03: +0000  0100000000000001 0000000000000004  ................
16:16:45 HHC17710D TXF: CP03: +0010  0000000000000000 00000050130C7732  ...........&....
16:16:45 HHC17710D TXF: CP03: +0020  0000000000000000 0000000000000000  ................
16:16:45 HHC17710D TXF: CP03: +0030  0000000000000000 0000000000000000  ................
16:16:45 HHC17710D TXF: CP03: +0040  0000000000000000 0000000000000000  ................
16:16:45 HHC17710D TXF: CP03: +0050  0000000000000000 0000000000000000  ................
16:16:45 HHC17710D TXF: CP03: +0060  0000000000000000 0000000000000000  ................
16:16:45 HHC17710D TXF: CP03: +0070  0000000000000000 0000000000000000  ................
16:16:45 HHC17710D TXF: CP03: +0080  0000000000000000 00000000800E9BF0  ...............0
16:16:45 HHC17710D TXF: CP03: +0090  0000005010E2BBF0 000000000000000E  ...&.S.0........
16:16:45 HHC17710D TXF: CP03: +00A0  00000050D46FE1C0 0000000020829270  ...&M?.{.....bk.
16:16:45 HHC17710D TXF: CP03: +00B0  0000005013042D90 000000002059F300  ...&..........3.
16:16:45 HHC17710D TXF: CP03: +00C0  000000000000000B 00000000205D4200  .............)..
16:16:45 HHC17710D TXF: CP03: +00D0  00000000205D4200 00000000802057C8  .....).........H
16:16:45 HHC17710D TXF: CP03: +00E0  00000000BEF3D060 0000000020823900  .....3}-.....b..
16:16:45 HHC17710D TXF: CP03: +00F0  000000007B00B948 00000050130C76FA  ....#......&....
16:16:45 HHC90013E 'obtain_lock(&sysblk.txf_lock2[  3])' failed: rc=35: deadlock; tid=00007f90c1677700, loc=transact.c:673
16:16:45 HHC00007I Previous message from function 'loglock' at hthreads.c(104)
16:16:45 HHC90014I lock &sysblk.txf_lock2[  3] was obtained by thread 00007f90c1677700 at transact.c:673      

At the same time, the z/OS log shows:

20207 16:16:47.05 STC02040 00000090  JVMDUMP039I Processing dump event "gpf", detail "" at 2020/07/25 14:16:46 - please wait.
20207 16:16:47.15 STC02040 00000090  JVMDUMP032I JVM requested System dump using 'IZUSVR.JVM.IZUSVR1.D200725.T141646.X&DS' in response to an event
20207 16:16:47.38 STC02040 00000090  JVMDUMP032I JVM requested Java dump using '/S0W1/var/zosmf/data/logs/zosmfServer/javacore.
                                                 20200725.141646.16842779.0002.txt' in response to an even
20207 16:16:52.55 STC02040 00000090  JVMDUMP032I JVM requested Snap dump using '/S0W1/var/zosmf/data/logs/zosmfServer/Snap.
                                                 20200725.141646.16842779.0003.trc' in response to an event

The Javacore dump starts with:

0SECTION       TITLE subcomponent dump routine                                                                        
NULL           ===============================                                                                        
1TICHARSET     IBM-1047                                                                                               
1TISIGINFO     Dump Event "gpf" (00002000) received                                                                   
1TIDATETIME    Date: 2020/07/25 at 14:16:47:398                                                                       
1TINANOTIME    System nanotime: 1595686607401062045                                                                   
1TIFILENAME    Javacore filename:    /S0W1/var/zosmf/data/logs/zosmfServer/javacore.20200725.141646.16842779.0002.txt 
1TIREQFLAGS    Request Flags: 0x1 (exclusive)                                                                         
1TIPREPSTATE   Prep State: 0x100 (trace_disabled)                                                                     
1TIPREPINFO    Exclusive VM access not taken: data may not be consistent across javacore sections                     
NULL           ------------------------------------------------------------------------                               
0SECTION       GPINFO subcomponent dump routine                                                                       
NULL           ================================                                                                       
2XHOSLEVEL     OS Level         : z/OS 02.04.00                                                                       
2XHCPUS        Processors -                                                                                           
3XHCPUARCH       Architecture   : s390x                                                                               
3XHNUMCPUS       How Many       : 6                                                                                   
3XHNUMASUP       NUMA is either not supported or has been disabled by user                                            
NULL                                                                                                                  
1XHEXCPCODE    J9Generic_Signal_Number: 00000018                                                                      
1XHEXCPCODE    Signal_Number: 0000000B                                                                                
1XHEXCPCODE    Error_Value: 00000000                                                                                  
1XHEXCPCODE    Signal_Code: 000000FF                                                                                  
1XHEXCPCODE    Handler1: 000000500871F5C0                                                                             
1XHEXCPCODE    Handler2: 00000050088097C0                                                                             
NULL                                                                                                                  
NULL                                                                                                                  
1XHREGISTERS   Registers:                                                                                             
2XHREGISTER      gpr0: 0000000000000000                                                                               
2XHREGISTER      gpr1: 00000000800E9BF0                                                                               
2XHREGISTER      gpr2: 0000005010E2BBF0    

From looking at the code involving OPTION_TXF_SINGLE_THREAD the impression I get is that it does not (yet) really cater for Unconstrained transactions, and a quick test with a build without this option producing identically failing results seems to confirm it.

Thanks,

Peter

Fish-Git commented 4 years ago

Okay, I'm confused!

First of all, I can't find anywhere in Hercules where the following message is being issued:

HHC97703D TXF: CP03: Failed Outermost Uncons Transaction for TND 1 txfs=2 retry=0: TAC_UPGM (PGM Interruption (Unfiltered)), why = TXF_WHY_UNFILT_INT

The reported message number is HHC97703 but I can't find that message number anywhere in Hercules. I can certainly see message number HHC17703 being issued on line 1084 of transact.c of course, but I can't see anywhere where message number HHC97703 is being issued.

To further add to the confusion, where the frick is the text txfs=2 retry=0 coming from? I can't seem to find that text anywhere in all of Hercules either (neither txfs= nor retry=).   :(

I'm hoping that maybe this was a new/temporary debugging message that maybe you added yourself, Peter? (Please say yes! Please tell me that's true! Because if it's not true I think I might honestly be going insane!)

But the following lock messages are of course my greatest concern:

HHC90013E 'obtain_lock(&sysblk.txf_lock2[  3])' failed: rc=35: deadlock; tid=00007f90c1677700, loc=transact.c:673
HHC90014I lock &sysblk.txf_lock2[  3] was obtained by thread 00007f90c1677700 at transact.c:673

Line 673 of transact.c is where the lock in being obtained by the TBEGINC (Begin Constrained Transaction) instruction. Yet, the transaction that reportedly failed (according to the mysterious HHC97703D message) was an UNconstrained transaction!  (wtf?!)

Unless....................  (the penny finally drops!)

Ah-HA! Of course! It was a constrained transaction nested within an UNconstrained transaction! Doh! (me: smacks forehead) It seems so obvious now. I feel so stupid for not catching that right away.   :(

Anyway, I think the fix for this is likely (hopefully!) going to be fairly easy and obvious. Let me jump right into it and I'll get back to you. Thanks for reporting it.

Peter-J-Jansen commented 4 years ago

Hi Fish,

Thanks for your quick response!

HHC97703 is a modified version of HHC17703. The options good and bad on the txf command are good but both produce many loggings so I though to hardcode a version only logging what I consider focussed loggings, hence my HHC97703 modification:

--- /home/hercules/Qsync/Source/Originals/SDL-hyperion/transact.c   2020-07-24 16:49:03.000000000 +0200
+++ /home/hercules/SDL-hyperion/transact.c  2020-07-25 16:06:44.000000000 +0200
@@ -1070,7 +1070,9 @@
     /*  Log the failure if debugging enabled       */
     /*---------------------------------------------*/

-    if (TXF_TRACE( FAILURE, txf_contran ))
+    if (TXF_TRACE( FAILURE, txf_contran ) || regs->txf_caborts > 1
+        || ( txf_tac != TAC_EXT && txf_tac != TAC_IO && txf_tac != TAC_FETCH_CNF && txf_tac != TAC_STORE_CNF &&
+           ( regs->txf_why & ( ~( TXF_WHY_CSP_INSTR | TXF_WHY_DELAYED_ABORT | TXF_WHY_SIE_EXIT ) ) ) ) )
     {
         /* Report the reason WHY the transaction was aborted */
         char why[ 256 ] = {0};
@@ -1081,9 +1083,10 @@
             STRLCPY( why, " ?" );

         // "TXF: %s%02X: %sFailed %s %s Transaction for TND %d: %s (%s), why =%s"
-        WRMSG( HHC17703, "D", TXF_CPUAD( regs ), TXF_QSIE( regs ),
+#define        HHC97703 "TXF: %s%02X: %sFailed %s %s Transaction for TND %d txfs=%d retry=%d: %s (%s), why =%s"
+        WRMSG( HHC97703, "D", TXF_CPUAD( regs ), TXF_QSIE( regs ),
             txf_tnd > 1 ? "Nested" : "Outermost",
-            txf_contran ? "Cons" : "Uncons", txf_tnd,
+            txf_contran ? "Cons" : "Uncons", txf_tnd, sysblk.txf_transcpus + 1, regs->txf_caborts,
             tac2short( txf_tac ), tac2long( txf_tac ), why );

         /* If this is a delayed abort, log who detected/requested it */

The txfs and retry just add some additional info. The focus is on transaction aborts that should ideally never occur, i.e. Constrained transactions with more than 1 retry (cabort), or transactions aborted due to events that should never happen, hence the txf_tac and txf_why filtering.

I have some questions regarding the OPTION_TXF_SINGLE_THREAD option.

  1. Am I correct to assume that it would only apply to TBEGINC i.e. Constrained transactions?
  2. Am I correct to assume that the option applies to each and every CPU?
  3. Would the current TXF implementation even ever work without that option if z/OS (or any other O.S.) would actually schedule a CPU to execute a second (or n-th) transaction on a CPU which is already executing one, but for example interrupted due to a page fault?

I have even more questions, but that's for another topic.

Cheers,

Peter

Fish-Git commented 4 years ago

HHC97703 is a modified version of HHC17703.

Thought so.   :)

I have some questions regarding the OPTION_TXF_SINGLE_THREAD option.

  1. Am I correct to assume that it would only apply to TBEGINC i.e. Constrained transactions?

Yes.

2. Am I correct to assume that the option applies to each and every CPU?

Yes. That's what ......... er, ...... um, ...... (Oops!)

Boy that was stupid of me, wasn't it?   (sheesh!)

The whole point of the option (and obviously the lock!) was to prevent all other CPUs from executing an unconstrained transaction while any other CPU was executing theirs. But I can see now that it's not doing that! I incorrectly coded it such that each CPU has their own lock! THAT'S WRONG!  (DUH!)

There should be only ONE master TXF lock for ALL CPUs! That's the only way to prevent another CPU from trying to start their own unconstrained transaction after some other CPU has already started theirs! Duh!

(Dumb, Fish! Dumb, dumb, DUMB!! I can't believe I did that! That was really stupid!)   :confounded:

Sorry about that. I'll get that fixed right away. Thank you for so delicately bringing it to my attention.

3. Would the current TXF implementation even ever work without that option if z/OS (or any other O.S.) would actually schedule a CPU to execute a second (or n-th) transaction on a CPU which is already executing one, but for example interrupted due to a page fault?

Um, I'm not sure I understand your question.

You can have as many concurrent transactions as you want (with the upper limit being the number of CPUs you have of course), but you obviously can't have more than transaction running at the same time on the same CPU. You can only have ONE transaction active at any one time on any given CPU. When a transaction is already running on a given CPU and it attempts to start a "new" transaction before the first one completes, it's not actually a new transaction. It's just another level of the same transaction. A nested transaction. But it's only the outermost first transaction that matters. If anything goes wrong before it reaches its end, then that entire single one transaction (of which all inner transactions are simply a part of) is aborted (which effectively aborts all inner/nested transaction as well but it of course is only the outermost original transaction whose abort PSW matters).

Does that answer your question?

Anyway, I'm currently working on your bug. I was able to reproduce it, thankfully. It didn't actually occur until IZUSVR1 actually finished executing several minutes later, but it did occur, and it looked exactly like yours did.

Right now I'm adding some more debug code for when a transaction aborts. Right now we have the option of displaying the associated TDB whenever a transaction aborts, but it's formatted as a simple storage dump. I'm adding some new debug code that will produce a formatted display of the individual TDB fields instead (including displaying the instruction that the transaction aborted on, the ATIA).

I also want to create a new standalone runtest test to test executing a constrained transaction within an unconstrained transaction too, just like IZUSVR1 is doing, but without having to wait several minutes for that condition to occur.

(We need to start thinking about how we're going to TEST this dang thing folks! Using z/OS and z/VM like we are, while necessary (to ensure current IBM operating systems behave properly under Hercules), is not the proper/correct way to prove that TXF is functioning correctly!)

That's it. I'll post a followup comment once I have my fix ready.

Peter-J-Jansen commented 4 years ago
  1. Would the current TXF implementation even ever work without that option if z/OS (or any other O.S.) would actually schedule a CPU to execute a second (or n-th) transaction on a CPU which is already executing one, but for example interrupted due to a page fault?

Um, I'm not sure I understand your question.

Your answer to my 2nd question actually made this last 3rd question superfluous. Only now do I realize that any interruption during a transaction causes a transaction abort, so that, apart from a nested transaction, the same CPU can never be executing a different transaction in another thread simultaneously with a first one.

I agree fully with you that a standalone runtest for TXF (or even different runtests for various TXF scenario's) would be ideal to have. I have been thinking about this, but so far must admit this looks awfully complicated to me.

Fish-Git commented 4 years ago

STATUS UPDATE:  Still working on it! There's actually more to fixing this issue than initially meets the eye!

As far as the rc=35: deadlock TXF lock bug goes however (i.e. OPTION_TXF_SINGLE_THREAD), that was easily fixed, but the results were a bust (no go): it slows the system down a veritable crawl! Without it however, the system runs just fine. No slowdown and no problems or undesirable side effects from what I could see. In fact, without the unnecessary lock calls (that due to my stupid bug accomplished absolutely nothing anyway) the system no only runs faster, but much better as well! With OPTION_TXF_SINGLE_THREAD disabled (undefined), there were actually fewer constrained transaction failures:

14:19:05.697 +++ ctrans    =      707767
14:19:05.697 +++ 0 retries =      706770  (99.9%)
14:19:05.698 +++ 1 retries =         925  ( 0.1%)
14:19:05.698 +++ 2 retries =          56  ( 0.0%)
14:19:05.698 +++ 3 retries =          15  ( 0.0%)
14:19:05.698 +++ 4 retries =           1  ( 0.0%)
14:19:05.698 +++ 5 retries =           0  ( 0.0%)
14:19:05.698 +++ 6 retries =           0  ( 0.0%)
14:19:05.698 +++ 7 retries =           0  ( 0.0%)
14:19:05.698 +++ 8+retries =           0  ( 0.0%)

Keep an eye out for some new commits from me within the next few days.

Peter-J-Jansen commented 4 years ago

I think this problem is with a Constrained transaction after all.

After an initial PTT SIE TXF 1000 command, and HAO-triggering the PTT command when the problem occurs, this is what I found:

14:07:29 HHC90021I 14:07:29.621499 00007fe6fbfff700 Processor CP01  sie.c:870          SIE < sie_exit    0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621499 00007fe6fbfff700 Processor CP01  sie.c:877          SIE progjmp       0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621567 00007fe6fbfff700 Processor CP01  sie.c:305          SIE               0000000003518000 00000001ff0f3000 00000001ff0f3000
14:07:29 HHC90021I 14:07:29.621568 00007fe6fbfff700 Processor CP01  sie.c:345          SIE h,g,a         0000000000000001 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621569 00007fe6fbfff700 Processor CP01  sie.c:739          SIE a=1           0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621569 00007fe6fbfff700 Processor CP01  sie.c:861          SIE > run_sie     0000000000000002 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621569 00007fe6fbfff700 Processor CP01  sie.c:897          run_sie h,g,a     0000000000000001 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.621570 00007fe6fbfff700 Processor CP01  sie.c:919          run_sie setjmp    0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621570 00007fe6fbfff700 Processor CP01  sie.c:923          run_sie run...    0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622624 00007fe6fbfff700 Processor CP01  transact.c:650     TXF_TBEGINC       0000000000000002 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622624 00007fe6fbfff700 Processor CP01  transact.c:679     TXF TBEGINC       0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622633 00007fe6fbfff700 Processor CP01  cpu.c:462          TXF PIID          0000000000060204 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622634 00007fe6fbfff700 Processor CP01  cpu.c:470          TXF PROG?         0000000000000004 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.622635 00007fe6fbfff700 Processor CP01  transact.c:1478    TXF filt?         0000000000000004 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.622636 00007fe6fbfff700 Processor CP01  transact.c:1661    TXF unfilt!       0000000000000004 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.622636 00007fe6fbfff700 Processor CP01  transact.c:1040    TXF abort         0000000000000000 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.622655 00007fe6fbfff700 Processor CP01  transact.c:1396    *TXF abrtret      0000000000000000 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.622655 00007fe6fbfff700 Processor CP01  cpu.c:477          *TXF UPROG!       0000000000000004 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622655 00007fe6fbfff700 Processor CP01  sie.c:1180         run_sie !run      ffffffffffffffff 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622655 00007fe6fbfff700 Processor CP01  sie.c:1195         run_sie !run      ffffffffffffffff 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622656 00007fe6fbfff700 Processor CP01  sie.c:919          run_sie setjmp    0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622656 00007fe6fbfff700 Processor CP01  sie.c:923          run_sie run...    0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622658 00007fe6fbfff700 Processor CP01  sie.c:1180         run_sie !run      ffffffffffffffff 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622658 00007fe6fbfff700 Processor CP01  sie.c:1195         run_sie !run      ffffffffffffffff 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622658 00007fe6fbfff700 Processor CP01  sie.c:919          run_sie setjmp    0000000000000000 0000000000000000 0000000000000000

So it appears it is a TBEGINC transaction experiencing a protection exception (because of TXF PIID 0000000000060204) which I have not yet tracked down. Also, I don't know yet why the Hercules logging is for an Unconstrained transaction.

By the way, I also discovered that apparently only very early on in the z/OS 2.4 IPL process there are four (4) TBEGINC instructions requesting SIE interception. From then onwards, no TBEGINC interceptions are seen anymore. Has anyone else experienced the same?

And the hunt continues; no target to fix has been spotted yet.

Cheers,

Peter

Fish-Git commented 4 years ago

Also, I don't know yet why the Hercules logging is for an Unconstrained transaction.

That is definitely weird and definitely cause for concern! It looks like our regs handling isn't right with respect to TXF under SIE. I'll try to look into it once I'm finished with what I'm working on.

I also discovered that apparently only very early on in the z/OS 2.4 IPL process there are four (4) TBEGINC instructions requesting SIE interception. From then onwards, no TBEGINC interceptions are seen anymore. Has anyone else experienced the same?

Probably. I don't know. I haven't looked into it closely enough.

Fish-Git commented 4 years ago

Enhanced (finer grained) TXF tracing committed.

I'll try to start looking into your constrained 204 bug now (as well as your 2.3 under vm 6.3 loop too if I get a chance). I'll let you know if I find anything.

How's TXF otherwise working for you? Okay?

Fish-Git commented 4 years ago

Where are we at with this bug?

I was never able to reproduce it myself *`()`**, and the OPTION_TXF_SINGLE_THREAD bug has already been fixed, so is this GitHub Issue still valid? Does this unconstrained transaction failure still occur? Is it still reproducible? Or can this issue be closed now?

Thanks!


*`()`**  I was never able to cause an type of unconstrained transaction failure. I was able to reproduce the lock bug though, but as I said, it was quickly identified and fixed, so it no longer applies.

Fish-Git commented 4 years ago

Peter, (@Peter-J-Jansen)

Are you 100% certain this is a TXF related issue? I suspect it may not be.

As you may recall from an offline email from Jürgen from about 1.4 weeks ago, he originally believed his Java JVM failure was being caused by a TXF failure too, but when examined more closely it was discovered that the problem was in fact not a TXF failure, but rather a bug in the KLMD instruction:

This "J9Generic_Signal_Number=" message is one of those many undocumented messages that the Java VM issues. It seems that it is used to communicate either an interrupt code (in z/Arch meaning) or a signal (in original Java meaning). Which of both cases hold true seems to depend from the module issuing the message and that can be many. So, while we saw with all “real” 218s that exactly same message, the KLMD caused case was recognized as a segment translation exception (z/Arch IC byte 2 = 10), translated to Java signal 24 (x’18’), which is defined as “Quit signal from a terminal”, a signal that exists in the Java VM z/OS implementation only. So, in both situations the message displays x’18’, although they have two completely different meanings.

Apparently, Java issues a "J9Generic_Signal_Number=" message for many different unrelated reasons, so one should not presume that a "J9Generic_Signal_Number: 00000018" is a TXF failure. The "00000018" does NOT always mean that a transaction has failed.

Additionally, in your original report, you showed an unconstrained transaction failure occurring at 16:16:45, but the Javacore dump reported its "J9Generic_Signal_Number: 00000018" failure as having occurred at "1TIDATETIME Date: 2020/07/25 at 14:16:47:398" -- a full 2 seconds later!

I suspect you, like Jürgen, may be erroneously presuming the TXF failure was somehow related to your Java JVM failure, when in reality the two are completely unrelated to one another. Your Java failure is very likely being caused by something else entirely. I suspect that its actual cause may in fact be the same cause as Jürgen's JVM failure: the bug he found in the KIMD instruction!

So my question to you is again, are you absolutely sure this problem you reported is indeed a TXF failure? Because I suspect it is not a TXF failure.

I would like to close this issue as a "false alarm", but do not want to do so if we indeed still have a TXF bug lurking somewhere in SIE code. But I honestly suspect this is in fact not a TXF bug, and thus we can safely close this issue as a false alarm.

Do you agree? Yes? No?

Peter-J-Jansen commented 4 years ago

Hi Fish,

I think we still have one or more TXF issues, and not only under SIE or JVM related. With all the fixes that have been committed since the original opening of this issue, their frequency has been greatly reduced. And the detailed symptoms like described in this issue are now no longer accurate. No, I can not say I'm 100% certain (*), my believe is based on my observations that (1) z/OS 2.3 without TXF never produces the problems. and (2) in a number of cases one component or another (e.g. JES2) was awaiting a lock to be freed which never happened. I am still testing and experimenting to find out more. I'm sorry it takes so long. And I will allow for the possibility that indeed the problem may be elsewhere but perhaps just TXF-triggered.

Cheers,

Peter

P.S. (*): "Nothing is certain, and not even that."

Fish-Git commented 4 years ago

(order of quotes rearranged for clarity)

With all the fixes that have been committed since the original opening of this issue, their frequency has been greatly reduced. And the detailed symptoms like described in this issue are now no longer accurate. No, I can not say I'm 100% certain...

I think we still have one or more TXF issues, and not only under SIE or JVM related.

And I will allow for the possibility that indeed the problem may be elsewhere but perhaps just TXF-triggered.

Then may I suggest doing the same thing as suggested in my Issue #309 comment? Closing this issue as no longer accurate and referring the user to a new replacement issue# that more accurately describes the current still-unresolved problem from a current-state-of-Hercules point of view?

Just a thought.   :)

Fish-Git commented 4 years ago

P.S. (*): "Nothing is certain, and not even that."

(LOL!)   A true truism if I ever heard one!   :))

Peter-J-Jansen commented 4 years ago

As stated on Issue #309, I agree that we close both Issues #309 and #315. I will do more testing, especially also since I've seen commit 89ca882. (This commit has also raised other questions in my mind which I need to be able to answer first, but that's another matter.)

Cheers,

Peter