SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator
Other
233 stars 87 forks source link

MAINLOCK insufficient for protecting non-assisted cmpxchg1/4/8/16 functions #247

Closed Peter-J-Jansen closed 4 years ago

Peter-J-Jansen commented 4 years ago

OBTAIN_MAINLOCK / RELEASE_MAINLOCK does not protect against main memory changes by regular instructions such as MVI or ST and all others.

In the case of non-assisted cmpxchg1/4/8/16 functions (all of which are surrounded by such MAINLOCK protection pairs), such as when building for Raspberry Pi (RPI) ARM processors, the last time this caused a problem was prior to my fix in May 2015 for the TS instruction which uses cmpxchg1. That fixed solved the problem when IPL-ing z/VM on an RPI 2.

With the RPI 4 with 4GB now being available, and a 64 bit (unofficial) Ubuntu 18.04.3, z/OS can now also be successfully IPL'd on it. However, after some time, z/OS freezes. (It can take up to a few hours for it to occur, but is (was) quite reproducible). I suspect the problem to be with MAINLOCK not sufficiently protecting non-assisted cmpxchg# instructions.

As C11 atomics are now readily available, I coded assisted cmpxchg1/4/8 functions, and the problem of z/OS freezing on the RPI 4 appears to have gone away. (The test has been running OK for about 24 hours right now.)

  QUESTION: Is my assumption, that MAINLOCK can only protect against other MAINLOCK protected main memory accesses, correct or not?  

If the answer to this question is positive, then I question the usefulness of MAINLOCK completely!

Currently, MAINLOCK protects about 9 "simple" compare-and-swap style instructions; TS is just one of them. With "simple" I mean that these all end up using a single cmpxchg# function, which in the case of hardware- (C11-) assist protects itself, also against all regular main memory access.

Additionally, 9 other instructions are currently MAINLOCK protected : 4 MVS assist and 1 LKPG instructions (these 5 are perhaps not used anymore), 2 "complex" compare-and-swap style instructions, CSST and PLO, and 2 instructions needing 16 byte atomic store and load (STPQ and LPQ).

I think we'd need to find a better solution so that we don't need MAINLOCK anymore. Provided hardware- (C11-) assists are available, I believe this can be safely done.

Feedback/Discussion is strongly encouraged!

Cheers,

Peter

Fish-Git commented 4 years ago

QUESTION: Is my assumption, that MAINLOCK can only protect against other MAINLOCK protected main memory accesses, correct or not?

Duh! :)

A lock, any lock, can only protect the specific resource it was meant (designed) to protect, and only when the lock is used before accessing that resource. This is standard lock usage. As asked, your question is quite rhetorical: the answer is of course YES (which I'm sure you already knew).

I think we'd need to find a better solution so that we don't need MAINLOCK anymore. Provided hardware- (C11-) assists are available, I believe this can be safely done.

THAT is the key: Provided hardware- (C11-) assists are available!

But what if hardware/C11 assists are not available? What then? If that is the case then it sounds like we would still need MAINLOCK. Yes?

Based on what you've written (and a brief (cursory) examination of the mentioned code), it sounds to me like we cannot safely remove MAINLOCK because hardware/C11 assists might not be available on all supported systems. Yes?

Now what we might be able to perhaps do is maybe devise a new COND_OBTAIN/RELEASE_MAINLOCK macro that does nothing whenever hardware/C11 assists are available, and use that instead in those few specific places where OBTAIN/RELEASE_MAINLOCK is currently used to wrap a cmpxchg call, such as the compare_and_swap instruction for example:

--- hyperion-1/general1.c   2019-08-07 03:23:15.530798900 -0700
+++ hyperion-0/general1.c   2019-09-16 03:48:49.722035400 -0700
@@ -1898,13 +1898,13 @@
     old = CSWAP32(regs->GR_L(r1));

     /* Obtain main-storage access lock */
-    OBTAIN_MAINLOCK(regs);
+    COND_OBTAIN_MAINLOCK(regs);

     /* Attempt to exchange the values */
     regs->psw.cc = cmpxchg4 (&old, CSWAP32(regs->GR_L(r3)), main2);

     /* Release main-storage access lock */
-    RELEASE_MAINLOCK(regs);
+    COND_RELEASE_MAINLOCK(regs);

     /* Perform serialization after completing operation */
     PERFORM_SERIALIZATION (regs);

This would eliminate the overhead of obtaining/releasing a lock (MAINLOCK) for those specific cases where doing so wasn't actually necessary (because the cmpxchg4 function itself, by virtue of it using hardware/C11 assists, would therefore eliminate the need for using such a lock), but only in such cases would it be safe. In other situations it might not be. In other cases we might still need to use MAINLOCK. I don't know. I haven't studied the code or thought about things enough yet.

But that's my initial feedback off the top of my head.

Anyone else want to jump in with their thoughts?

Fish-Git commented 4 years ago

I'm also curious what your proposal is for resolving the actual problem as described by the title of this GitHub Issue: "MAINLOCK insufficient for protecting non-assisted cmpxchg1/4/8/16 functions".

If MAINLOCK isn't good enough for protecting non-assisted cmpxchg1/4/8/16 functions, then what do you propose we do to fix that? How do you propose we fix things so that non-assisted cmpxchg1/4/8/16 functions are protected? That is to say, how do we reliably protect cmpxchg1/4/8/16 function calls when hardware/C11 assists are not available?

Right now we're using MAINLOCK which you're saying isn't good enough. But if using MAINLOCK in such situations isn't good enough, then what is? What's the fix? What do we do? How do we (how can we) reliably protect non-assisted cmpxchg1/4/8/16 calls?

Peter-J-Jansen commented 4 years ago

Dear Fish,

Thanks for your feedback.

In the meantime, I've been quite busy researching this some more, and the first thing I now have to do is to correct myself somewhat to be more precise. What I referred to as "compare-and-swap" style instructions should probably be more correctly referred to by using the same term the Principles of Operations (PoO) uses to refer to them with, which is "Interlocked-Update References" (IUR) type instructions.

The SA22-7832-11 Principles of Operation manual lists them on pg. 5-124, and it shows that they only protect against themselves. (I.e. they do not need to protect against all other instructions such as the MVI and the ST instructions that I initially mentioned.)

But (some of?) the Interlocked Access Facility (IAF) 1 and 2 instructions are mentioned, and none of these are currently MAINLOCK protected in Hercules! (The IAF instructions in Hercules currently rely solely on C11 atomics, i.e. macro H_ATOMIC_OP.)

The reason we've been getting away with this is that on our main processor architecture, amd64 (i.e. the current Intel's), all of these IUR instructions are assisted for both Windows (thanks to you) as well as for Linux, except for ... 2 instructions: CDSG (in esame.c) and CSST (in general1.c), both of which rely on cmpxchg16 which is currently unassisted.)

In order to add support for my newest platform, a 4GB RPI 4, I coded cmpxchg1/4/8 assists using GCC's C11 atomics which appear to work fine for both Linux and MacOS. As a result, z/OS 2.3 (as well as z/VM 6.3) run quite stable on Linux, MacOS and the RPI 4 as well.

Unfortunately however, cmpxchg16 poses problems for the 4GB RPI 4.

The GCC C11 atomics generate some library calls for cmpxchg16, for which I have to add the GCC option -latomic, and the resulting build does not work yet. For the amd64 platform however, I was able to code a cmpxchg16b based assist (thanks to your FIXME hint in machdep.h). For the RPI 4 platform I discovered that it's CPU is ARMv8, which does not support the casp (compare and swap) instruction. (Only the ARMv8.1 does.) So this cmpxchg16 will need a different approach and I haven't figured that out yet.

One instruction which is remarkably missing from the PoO list on pg. 5.124 is the Perform Locked Operation (PLO) instruction. The PLO instruction (in general2.c) is MAINLOCK protected, and this is just fine, as it only has to protect against other PLO's on other CPU's. This MAINLOCK should definitely stay! (I would however recommend renaming it to maybe MAINLOCK_PLO or MAINLOCK_UNCONDITIONAL instead.)

The STPQ and LPQ instructions are not listed on the PoO list on pg. 5.124, but these too should nevertheless (in my opinion) be implemented using newly-to-be-created atomic store16 and load16 functions since using MAINLOCK for these would not help at all.

If all of the above mentioned assists (recommended changes) were available, then the current MAINLOCK macro could be modified so that if assists were available for all of the cmpxchg1/4/8 (and cmpxchg16 too of course!) ,... and the MVS Assists are not in the build, ... and the LKPG instruction is not in the build, THEN ... under these conditions, the MAINLOCK macro could be coded to do effectively nothing, thus effectively becoming a conditional one. All existing use of MAINLOCK could remain as is. Only it's definitions (OBTAIN... and RELEASE... in hmacros.h) would need adjustment. The end result would avoid all but the PLO MAINLOCK, so we'd gain a wee bit of performance.

Theoretically however, the IAF instructions in the PoO list on pg. 5.124 would also need to be MAINLOCK protected as well, and I suspect it was probably one or more of these that likely caused my original RPI platform problems with the TS and some other compare-and-swap instructions prior to my assists for the ARM processor for the RPI.

What do you think Fish?

(And yes, I'm of course also very much interested what our other SDL Hercules team players think about all of this!)

Cheers,

Peter

Peter-J-Jansen commented 4 years ago

Fish, as a side note: how do you get these nice colors in your github replies?

Fish-Git commented 4 years ago

Fish, as a side note: how do you get these nice colors in your github replies?

You mean the diff? (patch?) Simple! Just specify "diff" (or "patch"?) as the language code in your fenced code block:

```diff (patch statements would go here) (patch statements would go here) (patch statements would go here) ```

```C (source code snippet would go here...) (source code snippet would go here...) (source code snippet would go here...) ```

etc...

GitHub supports many different languages. I'm not sure if there's any web page anywhere that lists all of the languages that it supports, but I only ever use 'C' and patch/diff.

NOTE: the above will be properly colored when you actually do it for real, but the above syntax examples aren't colored because I escaped the triple back ticks so you could see the actual syntax.

Ref:

  p.s. You are aware that any of us (SDL Team) can "edit" anyone else's posts, right? I edit people's posts all the time so that they're better formatted and thus easier to read.

Peter-J-Jansen commented 4 years ago

Thanks for the diff and C tips Fish!

I have a first version ready of an ARMv8 intrinsic for cmpchgx16, so I can start testing on my RPI 4 platform for an SDL-Hercules with effectively only the PLO instruction remaining MAINLOCK-ed. I'll keep you informed.

I'm also thinking about rigging up Hercules tests for these "compare-and-swap" instructions, but ... that is not going to be easy, as the test program would have to be multitasking between 2 CPU's. That may prove to above my bare metal HLASM abilities ...

Cheers,

Peter

Fish-Git commented 4 years ago

What I referred to as "compare-and-swap" style instructions should probably be more correctly referred to [as] "Interlocked-Update References" (IUR) type instructions.

Okay.

they only protect against themselves. (I.e. they do not need to protect against all other instructions such as the MVI and the ST instructions that I initially mentioned.

Okay.

But [...] the Interlocked Access Facility (IAF) 1 and 2 instructions are [also] mentioned, and none of these are currently MAINLOCK protected in Hercules! (The IAF instructions in Hercules currently rely solely on C11 atomics, i.e. macro H_ATOMIC_OP.)

Well that's a problem that we definitely need to fix!

The reason we've been getting away with this is [...] all of these IUR instructions are assisted for both Windows [...] as well as for Linux [...]

Okay.

... except for ... 2 instructions: CDSG (in esame.c) and CSST (in general1.c), both of which rely on cmpxchg16 which is currently unassisted.)

So we need to definitely code an assisted cmpxchg16 macro in machdep.h that works across the board (i.e. for both Linux and Windows and other platforms too). To do that we're probably going to need to add some code that checks for the availability of the x86_64 cmpxchg16b host instruction so we can use that in our assist. *`()`**

The GCC C11 atomics generate some library calls for cmpxchg16 ...

It's probably calling some function or performing some test to check to see if the cmpxchg16b instruction is available or not first before attempting to use it (for the likely reason as mentioned in my footnote further below).

(I would however recommend renaming it to maybe MAINLOCK_PLO or MAINLOCK_UNCONDITIONAL instead.)

Agreed.

Theoretically however, the IAF instructions in the PoO list on pg. 5.124 would also need to be MAINLOCK protected as well,

Absolutely! (based on what you've said thus far) As I said, that definitely needs to be fixed!

What do you think Fish?

I think you've got a good handle on things, Peter, and have complete trust in your ability to resolve this issue. If you need any help with anything please don't hesitate to ask.

(And yes, I'm of course also very much interested what our other SDL Hercules team players think about all of this!)

ME TOO!!

Guys? :)


*`()** I'm not sure how such works on other platforms, but on Windows, the documentation clearly states: _"If you run code that uses this intrinsic on hardware that does not support the cmpxchg16b instruction, the results are unpredictable."_, which I interpret as meaning your program will crash. That is to say, using the_InterlockedCompareExchange128compiler intrinsic simply reduces to acmpxchg16bx86_64 host instruction, so if the processor that Hercules is running on is one that doesn't support thecmpxchg16binstruction, then it will crash. Thus we're going to need to develop some type of "configure.ac" test that checks to see if the instruction is available and set a #define constant thatmachdep.hcan then check. (Not sure how we're going to do that on Windows since Windows doesn't do any type of pre-build "configure.ac" processing, but I'm sure we can figure something out.) And we're of course going to need to code a reliable method to accomplish the same thing for the case where it's not available. (That is to say, if thecmpxchg16binstruction is _not_ available, our newcmpxchg16macro inmachdep.h` is going to need to somehow reliably accomplish the same thing using other instructions.)

Peter-J-Jansen commented 4 years ago

I was able to code assists forcmpxchg1 / 4 / 8 for the case where C11 atomics are available, which applies to RPI 4's ARMv8-A and to Intel 86 processors. I was also able to code a cmpxchg16 assist too, for the Linux and MacOS platforms.

However, I did not succeed doing this for Windows MSVC. No matter what I tried I simply could not manage to get the MSVC _InterlockedCompareExchange128 intrinsic to work.

I was also able to produce a working cmpxchg16 Hercules runtest test, which I called CDSG. I verified on all of my non-Windows platforms (Ubuntu 18.04 on Intel and RPI 4, and MacOS on Intel), that cmpxchg16 works fine. I verified all cmpxchg16 implementations, including the non-assisted ones. This CDSG test is based on earlier excellent work for the CBUC test by other Herculeans (originally written by Ivan and made into a runtest test by Fish). Without this CBUC example I would not have been able to do this.

Of course I used my CDSG test to try debugging the MSVC _InterlockedCompareExchange128 intrinsic based cmpxchg16, but was ultimately unsuccessful. I am prepared to think that perhaps there is an MSVC bug involved here, but I dare not say that as I am too unfamiliar with MSVC intrinsics. I have left my attempt in machdep.h, but have disabled it between a #if 0 #endif pair. Perhaps Fish can rescue me here?

(What I did not try was to use MSVC inline assembly to code my own cmpxchg16b; I presume MSVC nowadays supports inline assembly, but I've never tried that.)

I also did not attempt rigging up a preprocessor variable like _HAVE_CMPXCHG16B, so trying this on some quite old processors which do not implement that instruction will fail.

I also modified the STPQ and LPQ instructions to use cmpxchg16. And I changed the OBTAIN_ / RELEASE_MAINLOCK around the PLO instruction to OBTAIN_ / RELEASE_MAINLOCK_UNCONDITIONAL, so that all other MAINLOCK usage can become conditional when all cmpxchg1 / 4 / 8 / 16 are assisted. This applies to all platforms except the MVSC platform.

Obviously I am worried about committing all of this, which is why I've not done that yet, and why I include the diff further below. Please let me know how you feel about this. Thanks in advance for your efforts!

Cheers,

Peter


The diff I propose to commit is:

diff -uZ --color /home/hercules/orig/esame.c /home/hercules/chng/esame.c
--- /home/hercules/orig/esame.c 2019-09-07 13:54:30.886180000 +0200
+++ /home/hercules/chng/esame.c 2019-09-24 15:12:55.848323678 +0200
@@ -1535,7 +1535,8 @@
 int     r1;                             /* Value of R field          */
 int     b2;                             /* Base of effective addr    */
 VADR    effective_addr2;                /* Effective address         */
-QWORD   qwork;                          /* Quadword work area        */
+QWORD   qwork = { 0 };                  /* Quadword work area        */
+BYTE   *main2;                          /* mainstor address          */

     RXY(inst, regs, r1, b2, effective_addr2);

@@ -1543,15 +1544,22 @@

     QW_CHECK(effective_addr2, regs);

-    /* Store regs in workarea */
-    STORE_DW(qwork, regs->GR_G(r1));
-    STORE_DW(qwork+8, regs->GR_G(r1+1));
+    /* Get operand mainstor address */
+    main2 = MADDR (effective_addr2, b2, regs, ACCTYPE_WRITE, regs->psw.pkey);

     /* Store R1 and R1+1 registers to second operand
        Provide storage consistancy by means of obtaining
        the main storage access lock */
     OBTAIN_MAINLOCK(regs);
-    ARCH_DEP(vstorec) ( qwork, 16-1, effective_addr2, b2, regs );
+
+    // The cmpxchg16 trick, by comparing against what is already there in main2,
+    // so that the writing of the main2 memory will alwys take place, seems obvious
+    // but violates the rule that these addresses MUST be different.  This does NOT
+    // work, we tried it.  The solution using cmpxchg16 is as follows (please note
+    // that the initial cmpxchg16 is also used for the LPQ instruction)
+    (void) cmpxchg16 ((U64 *)(qwork), (U64 *)(qwork+8), 0, 0, main2 );
+    while( cmpxchg16 ((U64 *)(qwork), (U64 *)(qwork+8), CSWAP64(regs->GR_G(r1)), CSWAP64(regs->GR_G(r1+1)), (U64 *)(main2)) )
+    ;
     RELEASE_MAINLOCK(regs);

 } /* end DEF_INST(store_pair_to_quadword) */
@@ -1567,7 +1575,8 @@
 int     r1;                             /* Value of R field          */
 int     b2;                             /* Base of effective addr    */
 VADR    effective_addr2;                /* Effective address         */
-QWORD   qwork;                          /* Quadword work area        */
+QWORD   qwork = { 0 };                  /* Quadword work area        */
+BYTE   *main2;                          /* mainstor address          */

     RXY(inst, regs, r1, b2, effective_addr2);

@@ -1575,11 +1584,18 @@

     QW_CHECK(effective_addr2, regs);

+    /* Get operand mainstor address */
+    main2 = MADDR (effective_addr2, b2, regs, ACCTYPE_READ, regs->psw.pkey);
+
     /* Load R1 and R1+1 registers contents from second operand
        Provide storage consistancy by means of obtaining
        the main storage access lock */
     OBTAIN_MAINLOCK(regs);
-    ARCH_DEP(vfetchc) ( qwork, 16-1, effective_addr2, b2, regs );
+
+    // We use the 2nd cmpxchg16 trick, which will write a zero qwork only
+    // if the main2 quadword is already zero, effectively a NO-OP.  As we
+    // have initialised qwork with zero, we always achieve what we need.
+    (void) cmpxchg16 ((U64 *)(qwork), (U64 *)(qwork+8), 0, 0, main2) ;
     RELEASE_MAINLOCK(regs);

     /* Load regs from workarea */
@@ -6680,9 +6696,14 @@
     /* Get byte mainstor address */
     dest = MADDR (effective_addr1, b1, regs, ACCTYPE_WRITE, regs->psw.pkey );

+    // MAINLOCK may be required depending on lacking cmpxchg assists.
+    OBTAIN_MAINLOCK(regs);
+
     /* AND byte with immediate operand, setting condition code */
     regs->psw.cc = (H_ATOMIC_OP(dest, i2, and, And, &) != 0);

+    RELEASE_MAINLOCK(regs);
+
     ITIMER_UPDATE(effective_addr1,0,regs);

 } /* end DEF_INST(and_immediate_y) */
@@ -7060,9 +7081,14 @@
     /* Get byte mainstor address */
     dest = MADDR (effective_addr1, b1, regs, ACCTYPE_WRITE, regs->psw.pkey );

+    // MAINLOCK may be required depending on lacking cmpxchg assists.
+    OBTAIN_MAINLOCK(regs);
+
     /* XOR byte with immediate operand, setting condition code */
     regs->psw.cc = (H_ATOMIC_OP(dest, i2, xor, Xor, ^) != 0);

+    RELEASE_MAINLOCK(regs);
+
     ITIMER_UPDATE(effective_addr1,0,regs);

 } /* end DEF_INST(exclusive_or_immediate_y) */
@@ -7435,9 +7461,14 @@
     /* Get byte mainstor address */
     dest = MADDR (effective_addr1, b1, regs, ACCTYPE_WRITE, regs->psw.pkey );

+    // MAINLOCK may be required depending on lacking cmpxchg assists.
+    OBTAIN_MAINLOCK(regs);
+
     /* OR byte with immediate operand, setting condition code */
     regs->psw.cc = (H_ATOMIC_OP(dest, i2, or, Or, |) != 0);

+    RELEASE_MAINLOCK(regs);
+
     ITIMER_UPDATE(effective_addr1,0,regs);

diff -uZ --color /home/hercules/orig/general1.c /home/hercules/chng/general1.c
--- /home/hercules/orig/general1.c  2019-09-07 13:54:30.915178700 +0200
+++ /home/hercules/chng/general1.c  2019-09-24 15:12:49.580429201 +0200
@@ -240,9 +240,14 @@
     /* Get byte mainstor address */
     dest = MADDR (effective_addr1, b1, regs, ACCTYPE_WRITE, regs->psw.pkey );

+    // MAINLOCK may be required depending on lacking cmpxchg assists.
+    OBTAIN_MAINLOCK(regs);
+
     /* AND byte with immediate operand, setting condition code */
     regs->psw.cc = (H_ATOMIC_OP(dest, i2, and, And, &) != 0);

+    RELEASE_MAINLOCK(regs);
+
     /* Update interval timer if necessary */
     ITIMER_UPDATE(effective_addr1, 0, regs);
 }
@@ -4163,9 +4168,14 @@
     /* Get byte mainstor address */
     dest = MADDR (effective_addr1, b1, regs, ACCTYPE_WRITE, regs->psw.pkey );

+    // MAINLOCK may be required depending on lacking cmpxchg assists.
+    OBTAIN_MAINLOCK(regs);
+
     /* XOR byte with immediate operand, setting condition code */
     regs->psw.cc = (H_ATOMIC_OP(dest, i2, xor, Xor, ^) != 0);

+    RELEASE_MAINLOCK(regs);
+
     ITIMER_UPDATE(effective_addr1, 0, regs);
 }

diff -uZ --color /home/hercules/orig/general2.c /home/hercules/chng/general2.c
--- /home/hercules/orig/general2.c  2019-09-09 13:48:36.206567934 +0200
+++ /home/hercules/chng/general2.c  2019-09-24 15:12:49.584429134 +0200
@@ -97,8 +97,14 @@
     /* Get byte mainstor address */
     dest = MADDR (effective_addr1, b1, regs, ACCTYPE_WRITE, regs->psw.pkey );

+    // MAINLOCK may be required depending on lacking cmpxchg assists.
+    OBTAIN_MAINLOCK(regs);
+
     /* OR byte with immediate operand, setting condition code */
     regs->psw.cc = (H_ATOMIC_OP(dest, i2, or, Or, |) != 0);
+
+    RELEASE_MAINLOCK(regs);
+
     ITIMER_UPDATE(effective_addr1, 0, regs);
 }

@@ -395,7 +401,7 @@
            in the configuration.  We simply use 1 lock which is the
            main storage access lock which is also used by CS, CDS
            and TS.                                               *JJ */
-        OBTAIN_MAINLOCK(regs);
+        OBTAIN_MAINLOCK_UNCONDITIONAL(regs);

         switch(regs->GR_L(0) & PLO_GPR0_FC)
         {
@@ -505,7 +511,7 @@
         }

         /* Release main-storage access lock */
-        RELEASE_MAINLOCK(regs);
+        RELEASE_MAINLOCK_UNCONDITIONAL(regs);

         if(regs->psw.cc && sysblk.cpus > 1)
         {
diff -uZ --color /home/hercules/orig/general3.c /home/hercules/chng/general3.c
--- /home/hercules/orig/general3.c  2019-09-07 13:54:30.918179700 +0200
+++ /home/hercules/chng/general3.c  2019-09-24 15:12:49.592428999 +0200
@@ -82,8 +82,14 @@
         /* Interlocked exchange if operand is on a fullword boundary */
         old = CSWAP32(n);
         new = CSWAP32(result);
+
+        // MAINLOCK may be required depending on lacking cmpxchg assists.
+        OBTAIN_MAINLOCK(regs);
+
         rc = cmpxchg4(&old, new, m1);

+        RELEASE_MAINLOCK(regs);
+
     } while (rc != 0);

     /* Set condition code in PSW */
@@ -145,8 +151,14 @@
         /* Interlocked exchange if operand is on doubleword boundary */
         old = CSWAP64(n);
         new = CSWAP64(result);
+
+        // MAINLOCK may be required depending on lacking cmpxchg assists.
+        OBTAIN_MAINLOCK(regs);
+
         rc = cmpxchg8(&old, new, m1);

+        RELEASE_MAINLOCK(regs);
+
     } while (rc != 0);

     /* Set condition code in PSW */
@@ -2686,8 +2698,14 @@
         /* Interlocked exchange to storage location */
         old = CSWAP32(v2);
         new = CSWAP32(result);
+
+        // MAINLOCK may be required depending on lacking cmpxchg assists.
+        OBTAIN_MAINLOCK(regs);
+
         rc = cmpxchg4 (&old, new, m2);

+        RELEASE_MAINLOCK(regs);
+
     } while (rc != 0);

     /* Load original storage operand value into R1 register */
@@ -2762,8 +2780,14 @@
         /* Interlocked exchange to storage location */
         old = CSWAP64(v2);
         new = CSWAP64(result);
+
+        // MAINLOCK may be required depending on lacking cmpxchg assists.
+        OBTAIN_MAINLOCK(regs);
+
         rc = cmpxchg8 (&old, new, m2);

+        RELEASE_MAINLOCK(regs);
+
     } while (rc != 0);

     /* Load original storage operand value into R1 register */
diff -uZ --color /home/hercules/orig/hmacros.h /home/hercules/chng/hmacros.h
--- /home/hercules/orig/hmacros.h   2019-09-07 13:54:30.970177200 +0200
+++ /home/hercules/chng/hmacros.h   2019-09-24 15:12:59.776257512 +0200
@@ -398,7 +398,7 @@
 /*      mainlock is only obtained by a CPU thread                    */
 /*-------------------------------------------------------------------*/

-#define OBTAIN_MAINLOCK(_regs) \
+#define OBTAIN_MAINLOCK_UNCONDITIONAL(_regs) \
  do { \
   if ((_regs)->hostregs->cpubit != (_regs)->sysblk->started_mask) { \
    obtain_lock(&(_regs)->sysblk->mainlock); \
@@ -406,7 +406,7 @@
   } \
  } while (0)

-#define RELEASE_MAINLOCK(_regs) \
+#define RELEASE_MAINLOCK_UNCONDITIONAL(_regs) \
  do { \
    if ((_regs)->sysblk->mainowner == (_regs)->hostregs->cpuad) { \
      (_regs)->sysblk->mainowner = LOCK_OWNER_NONE; \
@@ -414,6 +414,9 @@
    } \
  } while (0)

+#define  OBTAIN_MAINLOCK(_regs)  OBTAIN_MAINLOCK_UNCONDITIONAL((_regs))
+#define RELEASE_MAINLOCK(_regs) RELEASE_MAINLOCK_UNCONDITIONAL((_regs))
+
 /*-------------------------------------------------------------------*/
 /*      Obtain/Release crwlock                                       */
 /*      crwlock can be obtained by any thread                        */
diff -uZ --color /home/hercules/orig/machdep.h /home/hercules/chng/machdep.h
--- /home/hercules/orig/machdep.h   2019-09-07 13:54:31.000000000 +0200
+++ /home/hercules/chng/machdep.h   2019-09-24 15:14:21.438875607 +0200
@@ -133,6 +133,27 @@
         return cc;
     }

+    #if 0
+    #pragma intrinsic ( _InterlockedCompareExchange128 )
+    #define cmpxchg16(  x, y, z, r, s  ) cmpxchg16_x86( x, y, z, r, s )
+    static __inline__ int cmpxchg16_x86( U64 *old1, U64 *old2, U64 new1, U64 new2, volatile void *ptr)
+    {
+        // returns 0 == success, 1 otherwise
+    #if 1 /* debugging */
+        UNREFERENCED( old2 );
+        return ( _InterlockedCompareExchange128( ptr, new2, new1, old1 ) ? 0 : 1 );
+    #else /* debugging */
+        U64 tmp[2] = { *old1, *old2 };
+        int RC = _InterlockedCompareExchange128( ptr, new2, new1, old1 );
+
+        printf( "ICE128 : rc=%d, old1=%0I64X, old2=%0I64X, main1=%0I64X, main2=%0I64X, new1=%0I64X, new2=%0I64X\n",
+                RC, tmp[0], tmp[1], *old1, *old2, new1, new2);
+
+        return (RC ? 0 : 1 );
+    #endif /* debugging */
+    }
+    #endif
+
     #if defined( MSC_X86_32BIT )

       #define fetch_dw_noswap(_p) fetch_dw_x86_noswap((_p))
@@ -375,6 +396,20 @@
  return code;
 }

+#define cmpxchg16(x,y,z,r,s) cmpxchg16_amd64(x,y,z,r,s)
+static __inline__ int cmpxchg16_amd64(U64 *old1, U64 *old2, U64 new1, U64 new2, volatile void *ptr) {
+/* returns zero on success otherwise returns 1 */
+    BYTE code;
+    volatile __int128 *ptr_data=ptr;
+    __asm__ __volatile__ (
+        "lock;   cmpxchg16b %1\n\t"
+        "setnz   %b0\n\t"
+        : "=q" ( code ), "+m" ( *ptr_data ), "+a" ( *old1 ), "+d" ( *old2 )
+        : "c" ( new2 ), "b" ( new1 )
+        : "cc");
+    return (int)code;
+}
+
 #endif /* defined(_ext_amd64) */

 /*-------------------------------------------------------------------
@@ -482,6 +517,76 @@

 #endif /* defined(_ext_ppc) */

+/*-------------------------------------------------------------------
+ * ARM aarch64 (like the Raspberry Pi 4)
+ *-------------------------------------------------------------------*/
+#if defined(__aarch64__)
+
+#ifndef cmpxchg16
+  #define cmpxchg16(x,y,z,r,s) cmpxchg16_aarch64(x,y,z,r,s)
+static __inline__ int cmpxchg16_aarch64(U64 *old1, U64 *old2, U64 new1, U64 new2, volatile void *ptr) {
+/* returns zero on success otherwise returns 1 */
+    int result = 1;
+    U64 expected1 = *old1;
+    U64 expected2 = *old2;
+       __asm __volatile(
+           "ldaxp %[old1], %[old2], [%[ptr]]"
+               : [old1] "+r" (*old1), [old2] "+r" (*old2)
+               : [ptr] "r" (ptr));
+    if ( expected1 == *old1 && expected2 == *old2 )
+    {
+           __asm __volatile(
+               "stlxp %w[result], %[new1], %[new2], [%[ptr]]"
+                   : [result] "+r" (result)
+                   : [new1] "r" (new1), [new2] "r" (new2), [ptr] "r" (ptr)
+                   : "memory");
+    }
+    return result ;
+}
+#endif /* cmpxchg16 */
+
+#endif /* define(__aarch64__) */
+
+/*-------------------------------------------------------------------
+ * C11_ATOMICS_AVAILABLE
+ *-------------------------------------------------------------------*/
+#if defined( C11_ATOMICS_AVAILABLE )
+
+#if defined( cmpxchg1 ) && !defined( C11_ATOMICS_ASSISTS_NOT_PREFERRED )
+  #undef cmpxchg1
+#endif
+#ifndef cmpxchg1
+#define cmpxchg1(x,y,z) cmpxchg1_C11(x,y,z)
+static __inline__ BYTE cmpxchg1_C11(BYTE *old, BYTE new, volatile void *ptr) {
+/* returns zero on success otherwise returns 1 */
+            return __atomic_compare_exchange_n ((volatile BYTE *)ptr, old, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? 0 : 1;
+        }
+#endif
+
+#if defined( cmpxchg4 ) && !defined( C11_ATOMICS_ASSISTS_NOT_PREFERRED )
+  #undef cmpxchg4
+#endif
+#ifndef cmpxchg4
+#define cmpxchg4(x,y,z) cmpxchg4_C11(x,y,z)
+static __inline__ BYTE cmpxchg4_C11(U32 *old, U32 new, volatile void *ptr) {
+/* returns zero on success otherwise returns 1 */
+            return __atomic_compare_exchange_n ((volatile U32 *)ptr, old, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? 0 : 1;
+        }
+#endif
+
+#if defined( cmpxchg8 ) && !defined( C11_ATOMICS_ASSISTS_NOT_PREFERRED )
+  #undef cmpxchg8
+#endif
+#ifndef cmpxchg8
+#define cmpxchg8(x,y,z) cmpxchg8_C11(x,y,z)
+static __inline__ BYTE cmpxchg8_C11(U64 *old, U64 new, volatile void *ptr) {
+/* returns zero on success otherwise returns 1 */
+            return __atomic_compare_exchange_n ((volatile U64 *)ptr, old, new, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? 0 : 1;
+        }
+#endif /* cmpxchg8 */
+
+#endif /* defined( C11_ATOMICS_AVAILABLE ) */
+
 #endif // !defined( _MSVC_ )

 /*-------------------------------------------------------------------
@@ -512,6 +617,24 @@
 #endif

 /*-------------------------------------------------------------------
+ * OBTAIN/RELEASE_MAINLOCK is by default identical to
+ * OBTAIN/RELEASE_MAINLOCK_UNCONDTIONAL but can be nullified in case
+ * the required assists are present unless MAINLOCK_ALWAYS overriden.
+ *-------------------------------------------------------------------*/
+
+#if (! defined( MAINLOCK_ALWAYS )) \
+    && defined( H_ATOMIC_OP )      \
+    && defined( cmpxchg1 )         \
+    && defined( cmpxchg4 )         \
+    && defined( cmpxchg8 )         \
+    && defined( cmpxchg16 )
+  #undef  OBTAIN_MAINLOCK
+  #define OBTAIN_MAINLOCK(_regs) {}
+  #undef  RELEASE_MAINLOCK
+  #define RELEASE_MAINLOCK(_regs) {}
+#endif
+
+/*-------------------------------------------------------------------
  * fetch_hw_noswap and fetch_hw
  *-------------------------------------------------------------------*/
 #if !defined(fetch_hw_noswap)
Fish-Git commented 4 years ago

THANK YOU Peter!   Great work!

... I simply could not manage to get the MSVC _InterlockedCompareExchange128 intrinsic to work.

I am prepared to think that perhaps there is an MSVC bug involved here, but I dare not say that as I am too unfamiliar with MSVC intrinsics.

Perhaps Fish can rescue me here?

I'll certainly try!

There is a chance that the _InterlockedCompareExchange128 intrinsic has simply never been used by anyone before (or if so, not much), so your suspicion that there's a bug in MSVC is entirely reasonable. But let me give it a try myself first, and I'll let you know whether I can manage to get it to work or not.

To maybe expedite things a little bit, can you send me your CDSG test? That might help speed things up for me. Thanks.

And I changed the OBTAIN_ / RELEASE_MAINLOCK around the PLO instruction to OBTAIN_ / RELEASE_MAINLOCK_UNCONDITIONAL, so that all other MAINLOCK usage can become conditional when all cmpxchg1 / 4 / 8 / 16 are assisted.

I'm not seeing that in your patch. I just applied it, and the PLO instruction remains unpatched. Simple oversight perhaps?


p.s. IMO your patch (diff) is little too large to be pasting directly into a GitHub comment. You and others may of course feel differently, but IMO when I have a patch I want to share that's over, say, 60-75 lines or so (approximately), I zip it into a .zip file and then simply attach that .zip file to the comment instead. (Or rename it to .txt and attach that text file).

But as I said, that's just my own feelings on the matter. Do whatever you want! I'm not your (or anyone else's!) boss.

        ;-)

Fish-Git commented 4 years ago

I'm not seeing that in your patch. I just applied it, and the PLO instruction remains unpatched. Simple oversight perhaps?

NEVER MIND!!

I just noticed the PLO instruction is actually in general2.c, not plo.c!

The code in plo.c is just the helper routines that the PLO instruction function in general2.c calls into.  (Doh!)

Sorry about that.     %-þ

Peter-J-Jansen commented 4 years ago

The CDSG test files (the usual .core, .asm, .list and .tst) I did dare committing as they are a pure addition and also work without any source changes, also against a non-assisted cmpxchg16.

Thanks Fish, also for your feedback !

Cheers,

Peter

Peter-J-Jansen commented 4 years ago

This can now be closed as of commit e0052fd. Many thanks to Fish for all the help he provided.

Cheers,

Peter