Closed Peter-J-Jansen closed 4 years ago
Okay, I'm confused!
First of all, I can't find anywhere in Hercules where the following message is being issued:
HHC97703D TXF: CP03: Failed Outermost Uncons Transaction for TND 1 txfs=2 retry=0: TAC_UPGM (PGM Interruption (Unfiltered)), why = TXF_WHY_UNFILT_INT
The reported message number is HHC97703
but I can't find that message number anywhere in Hercules. I can certainly see message number HHC17703
being issued on line 1084 of transact.c of course, but I can't see anywhere where message number HHC97703
is being issued.
To further add to the confusion, where the frick is the text txfs=2 retry=0
coming from? I can't seem to find that text anywhere in all of Hercules either (neither txfs=
nor retry=
). :(
I'm hoping that maybe this was a new/temporary debugging message that maybe you added yourself, Peter? (Please say yes! Please tell me that's true! Because if it's not true I think I might honestly be going insane!)
But the following lock messages are of course my greatest concern:
HHC90013E 'obtain_lock(&sysblk.txf_lock2[ 3])' failed: rc=35: deadlock; tid=00007f90c1677700, loc=transact.c:673 HHC90014I lock &sysblk.txf_lock2[ 3] was obtained by thread 00007f90c1677700 at transact.c:673
Line 673 of transact.c is where the lock in being obtained by the TBEGINC
(Begin Constrained Transaction) instruction. Yet, the transaction that reportedly failed (according to the mysterious HHC97703D
message) was an UNconstrained transaction! (wtf?!)
Unless.................... (the penny finally drops!)
Ah-HA! Of course! It was a constrained transaction nested within an UNconstrained transaction! Doh! (me: smacks forehead) It seems so obvious now. I feel so stupid for not catching that right away. :(
Anyway, I think the fix for this is likely (hopefully!) going to be fairly easy and obvious. Let me jump right into it and I'll get back to you. Thanks for reporting it.
Hi Fish,
Thanks for your quick response!
HHC97703 is a modified version of HHC17703. The options good
and bad
on the txf
command are good but both produce many loggings so I though to hardcode a version only logging what I consider focussed loggings, hence my HHC97703 modification:
--- /home/hercules/Qsync/Source/Originals/SDL-hyperion/transact.c 2020-07-24 16:49:03.000000000 +0200
+++ /home/hercules/SDL-hyperion/transact.c 2020-07-25 16:06:44.000000000 +0200
@@ -1070,7 +1070,9 @@
/* Log the failure if debugging enabled */
/*---------------------------------------------*/
- if (TXF_TRACE( FAILURE, txf_contran ))
+ if (TXF_TRACE( FAILURE, txf_contran ) || regs->txf_caborts > 1
+ || ( txf_tac != TAC_EXT && txf_tac != TAC_IO && txf_tac != TAC_FETCH_CNF && txf_tac != TAC_STORE_CNF &&
+ ( regs->txf_why & ( ~( TXF_WHY_CSP_INSTR | TXF_WHY_DELAYED_ABORT | TXF_WHY_SIE_EXIT ) ) ) ) )
{
/* Report the reason WHY the transaction was aborted */
char why[ 256 ] = {0};
@@ -1081,9 +1083,10 @@
STRLCPY( why, " ?" );
// "TXF: %s%02X: %sFailed %s %s Transaction for TND %d: %s (%s), why =%s"
- WRMSG( HHC17703, "D", TXF_CPUAD( regs ), TXF_QSIE( regs ),
+#define HHC97703 "TXF: %s%02X: %sFailed %s %s Transaction for TND %d txfs=%d retry=%d: %s (%s), why =%s"
+ WRMSG( HHC97703, "D", TXF_CPUAD( regs ), TXF_QSIE( regs ),
txf_tnd > 1 ? "Nested" : "Outermost",
- txf_contran ? "Cons" : "Uncons", txf_tnd,
+ txf_contran ? "Cons" : "Uncons", txf_tnd, sysblk.txf_transcpus + 1, regs->txf_caborts,
tac2short( txf_tac ), tac2long( txf_tac ), why );
/* If this is a delayed abort, log who detected/requested it */
The txfs
and retry
just add some additional info. The focus is on transaction aborts that should ideally never occur, i.e. Constrained transactions with more than 1 retry (cabort), or transactions aborted due to events that should never happen, hence the txf_tac
and txf_why
filtering.
I have some questions regarding the OPTION_TXF_SINGLE_THREAD
option.
TBEGINC
i.e. Constrained transactions?I have even more questions, but that's for another topic.
Cheers,
Peter
HHC97703 is a modified version of HHC17703.
Thought so. :)
I have some questions regarding the
OPTION_TXF_SINGLE_THREAD
option.
- Am I correct to assume that it would only apply to
TBEGINC
i.e. Constrained transactions?
Yes.
2. Am I correct to assume that the option applies to each and every CPU?
Yes. That's what ......... er, ...... um, ...... (Oops!)
Boy that was stupid of me, wasn't it? (sheesh!)
The whole point of the option (and obviously the lock!) was to prevent all other CPUs from executing an unconstrained transaction while any other CPU was executing theirs. But I can see now that it's not doing that! I incorrectly coded it such that each CPU has their own lock! THAT'S WRONG! (DUH!)
There should be only ONE master TXF lock for ALL CPUs! That's the only way to prevent another CPU from trying to start their own unconstrained transaction after some other CPU has already started theirs! Duh!
(Dumb, Fish! Dumb, dumb, DUMB!! I can't believe I did that! That was really stupid!) :confounded:
Sorry about that. I'll get that fixed right away. Thank you for so delicately bringing it to my attention.
3. Would the current TXF implementation even ever work without that option if z/OS (or any other O.S.) would actually schedule a CPU to execute a second (or n-th) transaction on a CPU which is already executing one, but for example interrupted due to a page fault?
Um, I'm not sure I understand your question.
You can have as many concurrent transactions as you want (with the upper limit being the number of CPUs you have of course), but you obviously can't have more than transaction running at the same time on the same CPU. You can only have ONE transaction active at any one time on any given CPU. When a transaction is already running on a given CPU and it attempts to start a "new" transaction before the first one completes, it's not actually a new transaction. It's just another level of the same transaction. A nested transaction. But it's only the outermost first transaction that matters. If anything goes wrong before it reaches its end, then that entire single one transaction (of which all inner transactions are simply a part of) is aborted (which effectively aborts all inner/nested transaction as well but it of course is only the outermost original transaction whose abort PSW matters).
Does that answer your question?
Anyway, I'm currently working on your bug. I was able to reproduce it, thankfully. It didn't actually occur until IZUSVR1
actually finished executing several minutes later, but it did occur, and it looked exactly like yours did.
Right now I'm adding some more debug code for when a transaction aborts. Right now we have the option of displaying the associated TDB whenever a transaction aborts, but it's formatted as a simple storage dump. I'm adding some new debug code that will produce a formatted display of the individual TDB fields instead (including displaying the instruction that the transaction aborted on, the ATIA
).
I also want to create a new standalone runtest
test to test executing a constrained transaction within an unconstrained transaction too, just like IZUSVR1
is doing, but without having to wait several minutes for that condition to occur.
(We need to start thinking about how we're going to TEST this dang thing folks! Using z/OS and z/VM like we are, while necessary (to ensure current IBM operating systems behave properly under Hercules), is not the proper/correct way to prove that TXF is functioning correctly!)
That's it. I'll post a followup comment once I have my fix ready.
- Would the current TXF implementation even ever work without that option if z/OS (or any other O.S.) would actually schedule a CPU to execute a second (or n-th) transaction on a CPU which is already executing one, but for example interrupted due to a page fault?
Um, I'm not sure I understand your question.
Your answer to my 2nd question actually made this last 3rd question superfluous. Only now do I realize that any interruption during a transaction causes a transaction abort, so that, apart from a nested transaction, the same CPU can never be executing a different transaction in another thread simultaneously with a first one.
I agree fully with you that a standalone runtest
for TXF (or even different runtests for various TXF scenario's) would be ideal to have. I have been thinking about this, but so far must admit this looks awfully complicated to me.
STATUS UPDATE: Still working on it! There's actually more to fixing this issue than initially meets the eye!
As far as the rc=35: deadlock
TXF lock bug goes however (i.e. OPTION_TXF_SINGLE_THREAD
), that was easily fixed, but the results were a bust (no go): it slows the system down a veritable crawl! Without it however, the system runs just fine. No slowdown and no problems or undesirable side effects from what I could see. In fact, without the unnecessary lock calls (that due to my stupid bug accomplished absolutely nothing anyway) the system no only runs faster, but much better as well! With OPTION_TXF_SINGLE_THREAD
disabled (undefined), there were actually fewer constrained transaction failures:
14:19:05.697 +++ ctrans = 707767
14:19:05.697 +++ 0 retries = 706770 (99.9%)
14:19:05.698 +++ 1 retries = 925 ( 0.1%)
14:19:05.698 +++ 2 retries = 56 ( 0.0%)
14:19:05.698 +++ 3 retries = 15 ( 0.0%)
14:19:05.698 +++ 4 retries = 1 ( 0.0%)
14:19:05.698 +++ 5 retries = 0 ( 0.0%)
14:19:05.698 +++ 6 retries = 0 ( 0.0%)
14:19:05.698 +++ 7 retries = 0 ( 0.0%)
14:19:05.698 +++ 8+retries = 0 ( 0.0%)
Keep an eye out for some new commits from me within the next few days.
I think this problem is with a Constrained transaction after all.
After an initial PTT SIE TXF 1000 command, and HAO-triggering the PTT command when the problem occurs, this is what I found:
14:07:29 HHC90021I 14:07:29.621499 00007fe6fbfff700 Processor CP01 sie.c:870 SIE < sie_exit 0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621499 00007fe6fbfff700 Processor CP01 sie.c:877 SIE progjmp 0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621567 00007fe6fbfff700 Processor CP01 sie.c:305 SIE 0000000003518000 00000001ff0f3000 00000001ff0f3000
14:07:29 HHC90021I 14:07:29.621568 00007fe6fbfff700 Processor CP01 sie.c:345 SIE h,g,a 0000000000000001 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621569 00007fe6fbfff700 Processor CP01 sie.c:739 SIE a=1 0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621569 00007fe6fbfff700 Processor CP01 sie.c:861 SIE > run_sie 0000000000000002 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621569 00007fe6fbfff700 Processor CP01 sie.c:897 run_sie h,g,a 0000000000000001 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.621570 00007fe6fbfff700 Processor CP01 sie.c:919 run_sie setjmp 0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.621570 00007fe6fbfff700 Processor CP01 sie.c:923 run_sie run... 0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622624 00007fe6fbfff700 Processor CP01 transact.c:650 TXF_TBEGINC 0000000000000002 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622624 00007fe6fbfff700 Processor CP01 transact.c:679 TXF TBEGINC 0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622633 00007fe6fbfff700 Processor CP01 cpu.c:462 TXF PIID 0000000000060204 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622634 00007fe6fbfff700 Processor CP01 cpu.c:470 TXF PROG? 0000000000000004 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.622635 00007fe6fbfff700 Processor CP01 transact.c:1478 TXF filt? 0000000000000004 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.622636 00007fe6fbfff700 Processor CP01 transact.c:1661 TXF unfilt! 0000000000000004 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.622636 00007fe6fbfff700 Processor CP01 transact.c:1040 TXF abort 0000000000000000 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.622655 00007fe6fbfff700 Processor CP01 transact.c:1396 *TXF abrtret 0000000000000000 0000000000000000 0000000000000001
14:07:29 HHC90021I 14:07:29.622655 00007fe6fbfff700 Processor CP01 cpu.c:477 *TXF UPROG! 0000000000000004 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622655 00007fe6fbfff700 Processor CP01 sie.c:1180 run_sie !run ffffffffffffffff 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622655 00007fe6fbfff700 Processor CP01 sie.c:1195 run_sie !run ffffffffffffffff 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622656 00007fe6fbfff700 Processor CP01 sie.c:919 run_sie setjmp 0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622656 00007fe6fbfff700 Processor CP01 sie.c:923 run_sie run... 0000000000000000 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622658 00007fe6fbfff700 Processor CP01 sie.c:1180 run_sie !run ffffffffffffffff 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622658 00007fe6fbfff700 Processor CP01 sie.c:1195 run_sie !run ffffffffffffffff 0000000000000000 0000000000000000
14:07:29 HHC90021I 14:07:29.622658 00007fe6fbfff700 Processor CP01 sie.c:919 run_sie setjmp 0000000000000000 0000000000000000 0000000000000000
So it appears it is a TBEGINC transaction experiencing a protection exception (because of TXF PIID 0000000000060204
) which I have not yet tracked down. Also, I don't know yet why the Hercules logging is for an Unconstrained transaction.
By the way, I also discovered that apparently only very early on in the z/OS 2.4 IPL process there are four (4) TBEGINC instructions requesting SIE interception. From then onwards, no TBEGINC interceptions are seen anymore. Has anyone else experienced the same?
And the hunt continues; no target to fix has been spotted yet.
Cheers,
Peter
Also, I don't know yet why the Hercules logging is for an Unconstrained transaction.
That is definitely weird and definitely cause for concern! It looks like our regs handling isn't right with respect to TXF under SIE. I'll try to look into it once I'm finished with what I'm working on.
I also discovered that apparently only very early on in the z/OS 2.4 IPL process there are four (4) TBEGINC instructions requesting SIE interception. From then onwards, no TBEGINC interceptions are seen anymore. Has anyone else experienced the same?
Probably. I don't know. I haven't looked into it closely enough.
Enhanced (finer grained) TXF tracing committed.
I'll try to start looking into your constrained 204 bug now (as well as your 2.3 under vm 6.3 loop too if I get a chance). I'll let you know if I find anything.
How's TXF otherwise working for you? Okay?
Where are we at with this bug?
I was never able to reproduce it myself *`()`**, and the OPTION_TXF_SINGLE_THREAD bug has already been fixed, so is this GitHub Issue still valid? Does this unconstrained transaction failure still occur? Is it still reproducible? Or can this issue be closed now?
Thanks!
*`()`** I was never able to cause an type of unconstrained transaction failure. I was able to reproduce the lock bug though, but as I said, it was quickly identified and fixed, so it no longer applies.
Peter, (@Peter-J-Jansen)
Are you 100% certain this is a TXF related issue? I suspect it may not be.
As you may recall from an offline email from Jürgen from about 1.4 weeks ago, he originally believed his Java JVM failure was being caused by a TXF failure too, but when examined more closely it was discovered that the problem was in fact not a TXF failure, but rather a bug in the KLMD instruction:
This "J9Generic_Signal_Number=" message is one of those many undocumented messages that the Java VM issues. It seems that it is used to communicate either an interrupt code (in z/Arch meaning) or a signal (in original Java meaning). Which of both cases hold true seems to depend from the module issuing the message and that can be many. So, while we saw with all “real” 218s that exactly same message, the KLMD caused case was recognized as a segment translation exception (z/Arch IC byte 2 = 10), translated to Java signal 24 (x’18’), which is defined as “Quit signal from a terminal”, a signal that exists in the Java VM z/OS implementation only. So, in both situations the message displays x’18’, although they have two completely different meanings.
Apparently, Java issues a "J9Generic_Signal_Number=" message for many different unrelated reasons, so one should not presume that a "J9Generic_Signal_Number: 00000018" is a TXF failure. The "00000018" does NOT always mean that a transaction has failed.
Additionally, in your original report, you showed an unconstrained transaction failure occurring at 16:16:45, but the Javacore dump reported its "J9Generic_Signal_Number: 00000018" failure as having occurred at "1TIDATETIME Date: 2020/07/25 at 14:16:47:398" -- a full 2 seconds later!
I suspect you, like Jürgen, may be erroneously presuming the TXF failure was somehow related to your Java JVM failure, when in reality the two are completely unrelated to one another. Your Java failure is very likely being caused by something else entirely. I suspect that its actual cause may in fact be the same cause as Jürgen's JVM failure: the bug he found in the KIMD instruction!
So my question to you is again, are you absolutely sure this problem you reported is indeed a TXF failure? Because I suspect it is not a TXF failure.
I would like to close this issue as a "false alarm", but do not want to do so if we indeed still have a TXF bug lurking somewhere in SIE code. But I honestly suspect this is in fact not a TXF bug, and thus we can safely close this issue as a false alarm.
Do you agree? Yes? No?
Hi Fish,
I think we still have one or more TXF issues, and not only under SIE or JVM related. With all the fixes that have been committed since the original opening of this issue, their frequency has been greatly reduced. And the detailed symptoms like described in this issue are now no longer accurate. No, I can not say I'm 100% certain (*), my believe is based on my observations that (1) z/OS 2.3 without TXF never produces the problems. and (2) in a number of cases one component or another (e.g. JES2) was awaiting a lock to be freed which never happened. I am still testing and experimenting to find out more. I'm sorry it takes so long. And I will allow for the possibility that indeed the problem may be elsewhere but perhaps just TXF-triggered.
Cheers,
Peter
P.S. (*): "Nothing is certain, and not even that."
(order of quotes rearranged for clarity)
With all the fixes that have been committed since the original opening of this issue, their frequency has been greatly reduced. And the detailed symptoms like described in this issue are now no longer accurate. No, I can not say I'm 100% certain...
I think we still have one or more TXF issues, and not only under SIE or JVM related.
And I will allow for the possibility that indeed the problem may be elsewhere but perhaps just TXF-triggered.
Then may I suggest doing the same thing as suggested in my Issue #309 comment? Closing this issue as no longer accurate and referring the user to a new replacement issue# that more accurately describes the current still-unresolved problem from a current-state-of-Hercules point of view?
Just a thought. :)
P.S. (*): "Nothing is certain, and not even that."
(LOL!) A true truism if I ever heard one! :))
As stated on Issue #309, I agree that we close both Issues #309 and #315. I will do more testing, especially also since I've seen commit 89ca882. (This commit has also raised other questions in my mind which I need to be able to answer first, but that's another matter.)
Cheers,
Peter
z/OSMF in z/OS 2.4 under the latest SDL-Hyperion (version 4.3.9999.0-SDL-gbca5d05c (4.3.9999.0)) reproducibly fails in an Unconstrained TBEGIN transaction. The Hercules log shows:
At the same time, the z/OS log shows:
The Javacore dump starts with:
From looking at the code involving OPTION_TXF_SINGLE_THREAD the impression I get is that it does not (yet) really cater for Unconstrained transactions, and a quick test with a build without this option producing identically failing results seems to confirm it.
Thanks,
Peter