Peter-J-Jansen commented 3 years ago

After apt-get install libtool, a build from scratch works fine, on Ubuntu 18.04 on a regular Intel PC, as well as on Ubuntu 20.04 (server + LXDE) on a Raspberry Pi 4 (8GB) using the official Raspberry Pi 64-bit download. IPL-ing z/OS 2.4 on the RPI4 works fine, but not so on the PC with Ubuntu 18.04 :

19:13:20 HHC00100I Thread id 00007efe73b56740, prio 5, name 'impl_thread' started
19:13:20 HHC00100I Thread id 00007efe71702700, prio 4, name 'logger_thread' started
19:13:20 HHC01413I Hercules version 4.4.9999.0-SDL-g38d6835b-modified (4.4.9999.0)
19:13:20 HHC01414I (C) Copyright 1999-2021 by Roger Bowler, Jan Jaeger, and others
19:13:20 HHC01417I ** The SoftDevLabs version of Hercules **
19:13:20 HHC01415I Build date: Jul  3 2021 at 18:52:53
19:13:20 HHC01417I Built with: GCC 7.5.0
19:13:20 HHC01417I Build type: GNU/Linux x86_64 host architecture build
19:13:20 HHC01417I Modes: S/370 ESA/390 z/Arch
19:13:20 HHC01417I Max CPU Engines: 16
19:13:20 HHC01417I Using   shared libraries
19:13:20 HHC01417I Using   setresuid() for setting privileges
19:13:20 HHC01417I Using   POSIX threads Threading Model
19:13:20 HHC01417I Using   Error-Checking Mutex Locking Model
19:13:20 HHC01417I With    Shared Devices support
19:13:20 HHC01417I With    Dynamic loading support
19:13:20 HHC01417I With    External GUI support
19:13:20 HHC01417I With    IPV6 support
19:13:20 HHC01417I With    HTTP Server support
19:13:20 HHC01417I With    sqrtl support
19:13:20 HHC01417I With    Signal handling
19:13:20 HHC01417I With    Watchdog monitoring
19:13:20 HHC01417I With    CCKD BZIP2 support
19:13:20 HHC01417I With    HET BZIP2 support
19:13:20 HHC01417I With    ZLIB support
19:13:20 HHC01417I With    Regular Expressions support
19:13:20 HHC01417I With    Object REXX support
19:13:20 HHC01417I Without Regina REXX support
19:13:20 HHC01417I With    Automatic Operator support
19:13:20 HHC01417I Without National Language Support
19:13:20 HHC01417I With    CCKD64 Support
19:13:20 HHC01417I With    Transactional-Execution Facility support
19:13:20 HHC01417I With    "Optimized" instructions
19:13:20 HHC01417I With    OPTION_SIE2BK_FLD_COPY
19:13:20 HHC01417I Machine dependent assists: cmpxchg1 cmpxchg4 cmpxchg8 cmpxchg16 hatomics=C11
19:13:20 HHC01417I Running on: pjjs12 (Linux-5.4.0-77-generic x86_64) MP=8
19:13:20 HHC01417I Built with crypto external package version 1.0.0.49-g837705e
19:13:20 HHC01417I Built with decNumber external package version 3.68.0.99-gda66509
19:13:20 HHC01417I Built with SoftFloat external package version 3.5.0.102-g42f2f99
19:13:20 HHC01417I Built with telnet external package version 1.0.0.59-g2aca101
19:13:20 HHC00018I Hercules is running in elevated mode
[...]
19:13:45 HHC01603I ipl a80 loadparm 0a8200m1
19:13:45 HHC00811I Processor IP04: architecture mode ESA/390
19:13:45 HHC00811I Processor CP02: architecture mode ESA/390
19:13:45 HHC00811I Processor CP00: architecture mode ESA/390
19:13:45 HHC00811I Processor IP03: architecture mode ESA/390
19:13:45 HHC00811I Processor CP01: architecture mode ESA/390
19:13:45 HHC00811I Processor IP05: architecture mode ESA/390
19:13:45 HHC00814I Processor CP00: SIGP Set architecture mode            (12) CP00, PARM 00000001: CC 0
19:13:45 HHC00811I Processor CP00: architecture mode z/Arch
19:13:45 HHC00107I Starting thread cckd_ra(), active=0, started=0, max=2
19:13:45 HHC00100I Thread id 00007efe60497700, prio 3, name 'cckd_ra thread 1' started
19:13:45 HHC00107I Starting thread cckd_ra() from cckd_ra(), active=1, started=1, max=2
19:13:45 HHC00100I Thread id 00007efe63fff700, prio 3, name 'cckd_ra thread 2' started
19:13:46 HHC00814I Processor CP00: SIGP Unassigned                       (14) CP00, PARM 0000000064000000: CC 1 status 00000002
19:13:49 HHC00006I SCLP console interface active
19:13:49 HHC00107I Starting thread cckd_writer(), active=0, started=0, max=2
19:13:49 HHC00100I Thread id 00007efe63dfd700, prio 1, name 'cckd_writer thread 1' started
19:13:49 HHC00107I Starting thread cckd_gcol(), active=0, started=0, max=1
19:13:49 HHC00100I Thread id 00007efe63cfc700, prio 1, name 'cckd_gcol' started
19:17:07 HHC01603I txf stats
19:17:07 HHC17730I Total CONSTRAINED Transactions =           6
19:17:07 HHC17731I Retries for ANY/ALL reason(s):
19:17:07 HHC17732I 0 retries =           4  (66.7%)
19:17:07 HHC17732I 1 retries =           0  ( 0.0%)
19:17:07 HHC17732I 2 retries =           0  ( 0.0%)
19:17:07 HHC17732I 3 retries =           0  ( 0.0%)
19:17:07 HHC17732I 4 retries =           0  ( 0.0%)
19:17:07 HHC17732I 5 retries =           0  ( 0.0%)
19:17:07 HHC17732I 6 retries =           0  ( 0.0%)
19:17:07 HHC17732I 7 retries =           0  ( 0.0%)
19:17:07 HHC17732I 8+retries =           1  (16.7%)
19:17:07 HHC17733I MAXIMUM   =          53
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC   2 External interruption
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC   4 PGM Interruption (Unfiltered)
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC   5 Machine-check Interruption
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC   6 I/O Interruption
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC   7 Fetch overflow
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC   8 Store overflow
19:17:07 HHC17734I     66099398  (1101656633.3%)  Retries due to TAC   9 Fetch conflict
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC  10 Store conflict
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC  11 Restricted instruction
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC  12 PGM Interruption (Filtered)
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC  13 Nesting Depth exceeded
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC  14 Cache (fetch related)
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC  15 Cache (store related)
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC  16 Cache (other)
19:17:07 HHC17734I            0  ( 0.0%)  Retries due to TAC 255 Miscellaneous condition
19:17:07 HHC17735I            0  ( 0.0%)  Retries due to other TAC
19:17:32 HHC01603I txf stats
19:17:32 HHC17730I Total CONSTRAINED Transactions =           6
19:17:32 HHC17731I Retries for ANY/ALL reason(s):
19:17:32 HHC17732I 0 retries =           4  (66.7%)
19:17:32 HHC17732I 1 retries =           0  ( 0.0%)
19:17:32 HHC17732I 2 retries =           0  ( 0.0%)
19:17:32 HHC17732I 3 retries =           0  ( 0.0%)
19:17:32 HHC17732I 4 retries =           0  ( 0.0%)
19:17:32 HHC17732I 5 retries =           0  ( 0.0%)
19:17:32 HHC17732I 6 retries =           0  ( 0.0%)
19:17:32 HHC17732I 7 retries =           0  ( 0.0%)
19:17:32 HHC17732I 8+retries =           1  (16.7%)
19:17:32 HHC17733I MAXIMUM   =          53
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC   2 External interruption
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC   4 PGM Interruption (Unfiltered)
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC   5 Machine-check Interruption
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC   6 I/O Interruption
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC   7 Fetch overflow
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC   8 Store overflow
19:17:32 HHC17734I     75213921  (1253565350.0%)  Retries due to TAC   9 Fetch conflict
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC  10 Store conflict
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC  11 Restricted instruction
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC  12 PGM Interruption (Filtered)
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC  13 Nesting Depth exceeded
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC  14 Cache (fetch related)
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC  15 Cache (store related)
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC  16 Cache (other)
19:17:32 HHC17734I            0  ( 0.0%)  Retries due to TAC 255 Miscellaneous condition
19:17:32 HHC17735I            0  ( 0.0%)  Retries due to other TAC
19:17:52 HHC01603I quit
19:17:52 HHC01420I Begin Hercules shutdown

The workaround is to git checkout a6fb047bc1d13bfdffd57d45fb36d811a632412b prior to the ./configure and make etc.

Cheers,

Peter

Fish-Git commented 3 years ago

Peter (@Peter-J-Jansen),

Does this problem still exist in current git, or has it since been corrected? I am unable to recreate it myself, and would like to close this issue if possible.

Thanks.

Peter-J-Jansen commented 3 years ago

I will try this again and update this Issue accordingly.

(I've been quite busy working on circumventing the z/OS Enterprise Extender restriction on NAPT, which just an hour ago I was able to solve -- but that's another story, actually not even Hercules-only related.)

Cheers,

Peter

Juergen-Git commented 3 years ago

Hi Fish

I just gave the current state -- Hercules version 4.4.9999.0-SDL-g7bcdc7e6 (4.4.9999.0), Built with: GCC 7.5.0, Build type: GNU/Linux x86_64 host architecture build -- a quick whirl. Just after the end of the IPL sequence (roughly when terminals get enabled) a CP system failure code HTT001 occurs. The issue is reproducible, and it doesn't yet occur at a6fb047 which matches Peter's original observation. However, the operating system is of course different.

I don't know whether we are seeing the same problem here or a different one, but for the time being I think this should be handled in the same issue.

Cheers Jürgen

Fish-Git commented 3 years ago

I just gave the current state -- Hercules version 4.4.9999.0-SDL-g7bcdc7e6 (4.4.9999.0), Built with: GCC 7.5.0, Build type: GNU/Linux x86_64 host architecture build -- a quick whirl.

Which is the same version of GCC that Peter used.

Just after the end of the IPL sequence (roughly when terminals get enabled) a CP system failure code HTT001 occurs.

That's not good. :(

The issue is reproducible

But THAT is!

Being able to reproduce a problem helps tremendously with being able to track down the root cause of it.

So... How do I do that? Reproduce the problem? z/VM 7.1? Is that all it takes? Because I've tested with z/VM 7.1 probably a zillion times and haven't experienced any problems whatsoever. But then I'm using Windows and a version of Hercules built using MSVC too of course, and the problem -- whatever it is -- seems to be related to a specific version of GCC.

However, the operating system is of course different.

Which I doubt has anything to do with the problem, other than maybe that both the operating system you are using as well as the one that Peter used both use (install?) GCC 7.5.0 by default?

Have you tried installing or upgrading GCC to a different version? Or using clang instead? Does trying either one of those make a difference?

(grumbleihatefrickinggccgrumbleitscausedmemoregriefthananyotherproductinexistencegrumble)

Juergen-Git commented 3 years ago

So... How do I do that? Reproduce the problem? z/VM 7.1? Is that all it takes?

Exactly! That's all it takes. ;-)

And no, I didn't yet try using a different gcc version or clang: I don't think that this would be a good idea, unless we are really sure that this is a compiler problem. As you know, I have long suspected (and so have you, somewhat, I think) that we have a well-hidden DAT bug somewhere. And what is a HTT001? It is a translation exception! So, pursuing this might all of a sudden point us to this potential bug -- while trying to hide it by just using a different compiler might prevent us from analysing it.

Cheers Jürgen

Fish-Git commented 3 years ago

So... How do I do that? Reproduce the problem? z/VM 7.1? Is that all it takes?

Exactly! That's all it takes. ;-)

Well crap. :(

Then it looks like we have a gcc problem to deal with, because as I said in my reply, I've been running tests with z/VM 7.1 for a long time now and have never experienced even a single HTT001 that I can remember. :(

And no, I didn't yet try using a different gcc version or clang: I don't think that this would be a good idea, unless we are really sure that this is a compiler problem.

And precisely HOW, pray tell, can we possibly determine whether or not it might be a compiler problem, Jürgen? That's right! Try a different (newer or older) version of the compiler to see if the problem goes away! If it does, then I believe that would be rather strong evidence of a likely compiler bug! Wouldn't you agree? Either that or a combination of operating system and compiler bug? (i.e. the compiler generating code that just happens to trip a bug that exists only in specific versions of certain operating systems?)

I believe that the fact that I am unable to reproduce this HTT001 crash on Windows (whereas it is easily reproducible on *Nix using GCC 7.5.0) should by itself be strong evidence of a likely compiler bug IMHO. I mean, both are using the exact same source code! Yes?

So IMHO it would definitely be worthwhile to try either a different version of your existing compiler or a different compiler altogether. Wouldn't you agree?
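
One way to organize such a comparison is sketched below. It assumes that the autoconf-generated configure script accepts the standard CC=... argument and that out-of-tree build directories are used; the build-<compiler> directory names are made up for illustration. The function only prints the commands, so nothing gets built until they are run by hand:

```shell
# Dry run: print one out-of-tree build recipe per compiler name given.
# build-<compiler> is a hypothetical directory layout, not a Hercules convention.
plan_builds() {
    for cc in "$@"; do
        printf 'mkdir -p build-%s && (cd build-%s && ../configure CC=%s && make)\n' \
            "$cc" "$cc" "$cc"
    done
}

plan_builds gcc clang
```

The "Built with:" line (HHC01417I) in the Hercules startup log then confirms which compiler produced the binary under test.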

Juergen-Git commented 3 years ago

Surely it will be a helpful part in the puzzle to see whether different gcc versions or even clang produce different results. But those tests will not be conclusive as long as we don't know what precisely causes the problem on *Nix with whatever compilers that are exhibiting it.

So, to get that puzzle part, I'll do some tests with different compilers. But, regardless of the outcome, it will still be necessary to do an in depth analysis of the problem using those compilers that exhibit it...

Fish-Git commented 3 years ago

But those tests will not be conclusive as long as we don't know what precisely causes the problem on *Nix with whatever compilers that are exhibiting it.

I believe it would be 100% conclusive as to whether or not it was the compiler that was causing the problem or not.

Now whether we could thus conclude that the compiler contained a "bug" or not would be inconclusive, I agree. It might be that the compiler is simply behaving "differently" from the problematic compiler but not necessarily incorrectly. It might be behaving correctly but the different machine code that it generates trips a bug in Hercules code. I will agree that such is indeed a possibility.

A very unlikely possibility IMO, but a possibility nonetheless.

The only way to determine for sure would be to compare the machine code that each compiler version generates to see if any of the differences can explain the different behavior. Only then would we be able to place the blame where it belongs.

Bottom line, trying a different compiler (or a different version of the existing compiler) is not, IMHO, a waste of time. Especially given that, as I have been saying, the exact same source code compiled by a different compiler (MSVC in this case) does not exhibit the undesirable behavior (as well as the very suspicious IMO fact that those who are experiencing the undesirable behavior are both using the same compiler version as the other).

wrljet commented 3 years ago

Unrelated failure, but related to the blame-the-compiler-first discussion...

I have spent weeks, literally, running different versions of gcc and Clang, with different optimization switches, on three different CPUs (x86-64, and two ARMs) and different kernels, trying to track down a semi-random failure during MVS sysgen process. I have found it always eventually fails on every system, in similar, but still different ways. Sometimes it runs through correctly 30 times, but eventually it'll crap out.

Nothing points to the compiler for the thing I'm chasing.

Bill

Fish-Git commented 3 years ago

I have spent weeks, literally, running different versions of gcc and Clang, with different optimization switches, on three different CPUs (x86-64, and two ARMs) and different kernels, trying to track down a semi-random failure during MVS sysgen process.

Semi-random?

I have found it always eventually fails on every system, in similar, but still different ways.

Interesting... That would definitely seem to point the finger in the direction of Hercules. Or MVS? Have you been able to rule out MVS yet? Is it definitely a Hercules bug? What are the symptoms? Is it with address translation? Jürgen's HTT001 failure in z/VM points to a problem with address translation. Is your problem related to address translation too? Can you provide any specifics regarding it? Maybe we should open a new GitHub Issue for this?

Or maybe (and I hate to say this!) we need to backout (revert) all of my effort to use inline instead of static inline and go back to using ONLY static inlines? (since gcc doesn't appear to like them apparently!)

(I'm beginning to hate gcc more and more as time goes by!)

Fish-Git commented 3 years ago

Jürgen:

Here's a thought: maybe my educated guess regarding use of the SIE_RCPO0_SKAIP "SKA in progress" flag as the lock bit for SIE PGSTE table entry accessing is wrong? Have you tried building with the OPTION_USE_SKAIP_AS_LOCK #define in featall.h commented out? Does that resolve your HTT001 problem or not?

Or maybe it was my removal of the #ifdef in sie.c where the rcpo is being loaded that is causing the problem? But if that was true, then why would it work just fine for me on Windows?

(both introduced by commit 1538dc632d8f22476915d590fba6379c8b579a0f)

Whatever is going on is REALLY WEIRD and incredibly frustrating given that it seems to only occur for those using gcc!

If it does turn out to be a bug I accidentally introduced however, I sincerely apologize for it! I must be losing my touch. But I promise you I tested these changes six ways to Sunday! (But of course not on Linux. I only verified that Hercules built cleanly on Linux, but my Linux test virtual machine is not setup to actually run any real Hercules workload. I don't have things setup to run z/OS or z/VM or anything else. I've tried to do that in the past but only ran into a zillion problems. Linux doesn't seem to like me too much I'm afraid.)

(I hate this shit!) :(

Juergen-Git commented 3 years ago

Well, well, well...

15:29:02 HHC01414I (C) Copyright 1999-2021 by Roger Bowler, Jan Jaeger, and others
15:29:02 HHC01417I ** The SoftDevLabs version of Hercules **
15:29:02 HHC01415I Build date: Aug  5 2021 at 15:15:26
15:29:02 HHC01417I Built with: Clang 6.0.0 (tags/RELEASE_600/final)
15:29:02 HHC01417I Build type: GNU/Linux x86_64 host architecture build
. . .
15:33:16 HCPWRP959I IBMSYS1  SYSTEM TERMINATION IN PROGRESS ON 2021-08-05
15:33:16 HCPDMP908I SYSTEM FAILURE ON CPU 0000, CODE - HTT001

So, then, I'd propose to go for serious analysis now. This isn't a compiler problem (except maybe a problem of MSVC not producing this problem ;-)).

Fish-Git commented 3 years ago

Well, well, well...

15:29:02 HHC01417I Built with: Clang 6.0.0 (tags/RELEASE_600/final)
. . .
15:33:16 HCPDMP908I SYSTEM FAILURE ON CPU 0000, CODE - HTT001

Dang! :(

(hmmm...... who else can I blame other than Hercules? ..... hmmmm ...... think fish, think ...... Global warming? ...... I know! COVID! The new Delta variant! Yeah! THAT'S what's causing it!)

So, then, I'd propose to go for serious analysis now. This isn't a compiler problem (except maybe a problem of MSVC not producing this problem ;-)).

Yeah, yeah... It can't be Open Source's golden child gcc or its variant clang. It's got to be evil/stupid Microsoft...

(grumble)

:(

wrljet commented 3 years ago

Fish wrote in part:

(I hate this shit!) :(

You love it!

Peter-J-Jansen commented 3 years ago

Confirming Jürgen's test and answering Fish's question about 3 days ago: yes, the problem still exists for me as well, so I keep using the a6fb047 commit to enable me to continue working on my EE stuff (which actually interrupted my TXF Backout Method efforts ...).

But also confirming my own initial statement, on my RaspberryPi 4B 8GB using Ubuntu 20.04.2 the problem does not occur, similar to Fish's Windows system? But I don't know how this can help finding & fixing the problem.

Love ? Hate ? -- What a hobby of ours, right ? Keeps us off the streets and young and healthy. :-)

Cheers,

Peter

wrljet commented 3 years ago

Confirming Jürgen's test and answering Fish's question about 3 days ago: yes, the problem still exists for me as well, so I keep using the a6fb047 commit to enable me to continue working on my EE stuff (which actually interrupted my TXF Backout Method efforts ...).

If there is a known working version, git bisect can be used to home in on the trouble.

Fish-Git commented 3 years ago

(I hate this shit!) :(

You love it!

LOL! Not really, no. :))

I like finding other people's bugs, but not my own. (I hate making mistakes!)

Well, that's not strictly true either. I don't really like finding bugs at all, others' or my own, since I hate the existence of bugs themselves. The fact that one exists that needs to be found bothers me.

But if there exists a bug to be found, I do very much enjoy the sense of satisfaction that occurs once it is found and squashed. THAT I like. A lot.

But the actual hunt for the bug? (especially ones that are difficult to find?) THAT I definitely do not like!

And I especially do not like doing that for bugs that are caused by ME! That REALLY pisses me off!!

Fish-Git commented 3 years ago

So, then, I'd propose to go for serious analysis now. This isn't a compiler problem (except maybe a problem of MSVC not producing this problem ;-)).

Well, if it is MSVC not producing the problem, it is something MSVC has been apparently doing for the past 11 years, since z/VM works just fine for me when compiled with VS2008 as well as when compiled using VS2019. That's two completely different(*) compilers that both work just fine.

ON WINDOWS.

Now, not being experienced or knowledgeable with things Linux, I can't remember: does clang use gcc under the covers? Or is it a 100% completely different compiler?

If gcc and clang are both 100% completely different from one another, then the problem MUST be related to my use of "inlines" in my recent "skeys" commit.

And changing static inlines to ordinary inlines is what kicked off this whole frigging mess over 2 weeks ago.

Which is why I hate gcc so much. ):-<

(*) At least I'm presuming they're completely different (or at the very least, very different) from one another, having been enhanced and extended and updated over the past 11 years.

Juergen-Git commented 3 years ago

except maybe a problem of MSVC not producing this problem ;-)

Of course I meant that as a joke! I don't want to imply an MSVC problem just because there apparently isn't a problem with gcc.

Fact is, that we cannot conclude anything concerning the compilers: Neither that any of them has a bug, nor that it hasn't. Buggy compiler(s) are just one -- and not the most probable -- of the many possibilities that could lead to the issue at hand.

So, as always in cases where no highly probable cause becomes immediately visible, deep diving is required. I will go for it using my usual way: Tracing, with a particular eye on what's going on in the DAT tables. ;-)

Maybe others have ideas on how to find out easier?

(Note, Bill, that I'm well aware of git bisect. The problem however is, that I'm not convinced that we have a known working version, as there is a certain probability that we are seeing yet another symptom of an older potential DAT bug -- that's why I want to catch it analytically.)

Fish-Git commented 3 years ago

Fact is, that we cannot conclude anything concerning the compilers: Neither that any of them has a bug, nor that it hasn't. Buggy compiler(s) are just one -- and not the most probable -- of the many possibilities that could lead to the issue at hand.

That's where you and I disagree. :)

I feel that the fact that it works just fine on Windows is strong enough evidence to place the blame on the compiler being used.

...with a particular eye on what's going on in the DAT tables

... as there is a certain probability that we are seeing yet another symptom of an older potential DAT bug ...

Which I am seriously doubting exists. After all, if there did exist some type of DAT bug in Hercules, why would it only occur with gcc? It should also occur on Windows too. Since it doesn't, that tells me the problem is with the compiler generating incorrect code.

Now it may well be that Hercules is still the one to blame (in fact I consider this to be the most likely cause) and not gcc (the compiler) per se. We are likely still not specifying inline functions correctly (i.e. we're probably not specifying them "correctly" from gcc's point of view, i.e. we may not be specifying them in a "gcc proper" or "gcc compatible" manner), but if that was the case you would think that gcc would warn us about the incorrectness or incompatibility. But it's not. Instead, it's simply generating bad code.

Which is one of the things about gcc that annoys me most: its habit of warning us about things that are really nothing to worry about while at other times failing to warn us about things that we should be warned about.

(sigh) We'll figure it out eventually, and your deep diving into the problem via instruction tracing with a particular eye on anything to do with DAT (since a HTT001 seems to imply that's where the problem (bad code) is) sounds like the proper next step to be taken, so I thank you, Jürgen, for your willingness to take on this effort. I truly do, and apologize ahead of time for my changes to Hercules having triggered this whole mess. Of that I am truly sorry. :(

In the mean time my goal today is to try to get my two Linux VMware virtual machines (CentOS 6.10 and Neon 5.22) setup to be able to run z/VM (I won't have any networking of course but I'm hoping that won't be a problem) just to see how they behave. They both have completely different versions of gcc and clang installed, which are completely different from the ones you and Peter have installed on your systems, so it should be an interesting test I think.

More later as it occurs.

Juergen-Git commented 3 years ago

After a very first trace it becomes clear, that the translation exception is thrown correctly: The error occurs at an "LG R1,0(R3,R6)" instruction, in which R3 contains a high number in case of a failing run, while it contains 0 in case of a successful run. That high number causes the resulting address to point to a not currently addressable page, correctly causing a page-translation exception to get recognized.
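
To make the arithmetic concrete: in LG R1,0(R3,R6) the second-operand address is the sum of the displacement (0), the index register (R3) and the base register (R6). The register values below are invented for illustration and are not taken from the trace:

```shell
# effective_address DISP INDEX BASE
# Sums the three parts the way a D2(X2,B2) operand address is formed in
# 64-bit addressing mode. Values are kept below 2^63 so that they stay
# inside the shell's signed arithmetic.
effective_address() {
    printf '%016X\n' $(( $1 + $2 + $3 ))
}

r6=0x000000007F001000                          # hypothetical base register content
effective_address 0 0 "$r6"                    # good run: R3 = 0
effective_address 0 0x00000FFF00000000 "$r6"   # failing run: R3 holds a large value
```

With R3 = 0 the operand address is simply the content of R6; with a large R3 it lands far away from it, which is why the page-translation exception itself is legitimate.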

So, we are not searching for a DAT problem (lucky you, Fish) but for something else that causes various registers to contain unexpected and probably erroneous data.

Fish-Git commented 3 years ago

BREAKING NEWS!

I was successful in my attempt to setup both of my Linux virtual machines to try and run z/VM 7.1.

Here is the result of my attempt to IPL z/VM 7.1 on each system:

CentOS 6.10: gcc 4.4.7 clang 3.4.2: disabled wait 9022!

Neon 5.22: gcc 9.3.0 clang 10.0.0: disabled wait 9022!

Ref: https://www.ibm.com/docs/en/zvm/7.1?topic=messages-hcp9022w

So it should seem obvious to everyone by now that Hercules is behaving highly unusually and incorrectly when built on Linux, so the problem, whatever it is, is a Linux-only problem, and the only common denominator seems to be gcc and clang.

This sounds like a compiler bug to me folks, given that the compiler did not issue any notable warning or error messages, but nevertheless created an executable that clearly does not run correctly.

I haven't tried gcc yet, but I will get around to doing so eventually, just to see if the results are the same.

Before I try gcc however, I'm going to try checking out commit eb1cf2b6b1379ec64c69d2f93ec25d027f2313c3, which is the commit immediately before I started changing static inlines to extern inlines (which is what I believe may well be causing this unusual incorrect behavior).

Fish-Git commented 3 years ago

CentOS 6.10: gcc 4.4.7 clang 3.4.2: disabled wait 9022!

Neon 5.22: gcc 9.3.0 clang 10.0.0: disabled wait 9022!

Here are the config and resulting log files:

402-test.zip

Juergen-Git commented 3 years ago

Hi Fish

Note that you are now seeing the same error as I do: the 9022 isn't the primary error; the z/VM console will show that it is an HTT001 (and tracing shows that even the HTT001 isn't the primary error, see my previous comment).

Cheers Jürgen

Juergen-Git commented 3 years ago

and the only common denominator seems to be gcc and clang.

Well, I still don't fully believe in the "blame the compiler" theory. Remember, that Peter reported the problem not to happen when using gcc on a Raspberry Pi. Maybe, I'll give clang on MacOS (Intel) a go, to see how it behaves there...

Fish-Git commented 3 years ago

and the only common denominator seems to be gcc and clang.

Well, I still don't fully believe in the "blame the compiler" theory.

I do.

But then I admit to being somewhat prejudiced against gcc too.

Remember, that Peter reported the problem not to happen when using gcc on a Raspberry Pi.

Raspberry Pi is a different CPU architecture than x86-64. The bug in gcc is obviously related to x86 code generation.

Maybe, I'll give clang on MacOS (Intel) a go, to see how it behaves there...

I'll bet you one "HA! I told you so!" that it fails there too.

A better test might be MacOS (M1)? (i.e. different CPU architecture)

wrljet commented 3 years ago

Fish,

Are you testing with just your go-to VS2008, or using modern VS2017/2019 as well?

Bill

Fish-Git commented 3 years ago

Are you testing with just your go-to VS2008, or using modern VS2017/2019 as well?

Both.

And VS2019 works fine. See my earlier comment regarding this. (I no longer have VS2017.)

wrljet commented 3 years ago

Ah, OK, sorry, I missed that post.

Juergen-Git commented 3 years ago

Maybe, I'll give clang on MacOS (Intel) a go, to see how it behaves there...

I'll bet you one "HA! I told you so!" that it fails there too.

Back from an excursion to MacOS...

15:22:53 HHC01417I ** The SoftDevLabs version of Hercules **
15:22:53 HHC01415I Build date: Aug  9 2021 at 14:58:05
15:22:53 HHC01417I Built with: Apple Clang 12.0.5 (clang-1205.0.22.11)
15:22:53 HHC01417I Build type: Mac OS X x86_64 host architecture build
15:22:53 HHC01417I Running on: Juergens-Mac.local (Darwin-20.6.0 64-bit 64-bit ) LP=2, Cores=2, CPUs=1

... and cured: The Mac OS build seems to be severely broken, at the current level as well as at the a6fb047 reference level, which I didn't realize immediately as I blamed my stone-old Mavericks VM first. Only after upgrading to Big Sur I saw that the build is broken, at both levels, both due to libtool issues, but different ones. Well, I must say, I hate this. I only "fixed" it rudimentarily to get the builds done, to be able to test the issue at hand.

And yes, Fish, you win that bet: The reference level works nicely with z/VM 7.1, while the current level throws the well known HTT001.

A better test might be MacOS (M1)? (i.e. different CPU architecture)

Sadly, I don't have an M1 Mac (and there doesn't exist an M1 emulation supporting Mac OS currently, afaik). But, given that other ARM platforms do work with clang and gcc, I've no doubt that the M1 would work too. So, we can spare this test.

In total we now know:

It's not an Intel platform issue: Otherwise it wouldn't work on Windows too.
It's not a Linux issue: Otherwise it wouldn't work on ARM based Linuxes too.
It's most probably not a Mac OS issue: Same argument, though M1 not (yet) tested.
While it still can be a potential, Intel specific, gcc and clang bug, I still think this is very low probability: If both compilers exhibit that exact same issue across a wide range of versions and across two very different operating systems, it's almost impossible that such a bug goes undetected for years.

So, I still rule out a compiler bug. Sorry. :-(

I'll now straightforwardly go ahead to find the exact commit that makes the issue occur, which shouldn't be difficult. Maybe, we'll see light then.

My personal hypothesis is that some preprocessor stuff (#ifdefs and the like) got messed up -- but let's wait and see.

wrljet commented 3 years ago

The MacOS build should work well for you since my recent autoconf updates. Others have used it successfully on x86-64 and Apple M1 CPUs.

You'll need the Xcode command line tools and Homebrew, which can be installed with (if you don't have them):

xcode-select --install
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

You'll need some packages from Homebrew: autoconf automake cmake gsed (there may be others, but I think that is enough. I haven't run it from complete bare for a while)

An example configure is:

mkdir build
cd build

../configure --enable-optimization="-O2 -march=native" \
--enable-extpkgs=/Users/bill/herctest/extpkgs --prefix=/Users/bill/herctest/herc4x \
--enable-regina-rexx --disable-getoptwrapper

On an M1 CPU you'll need to also add --without-included-ltdl

Bill

Juergen-Git commented 3 years ago

Thanks for commenting, Bill!

Too bad, I'm a MacPorts user and I don't like to mix Homebrew and MacPorts. Generally, I think our build procedure should not limit the user to a specific package manager... but then, who am I, I can live with it. :-)

Cheers Jürgen

wrljet commented 3 years ago

Jürgen,

I never touched a Mac before this recent stuff. In fact I've yet to actually touch an Apple computer. Everything was done in VMs, and an ssh to the Mac Mini M1.

As for the state of autoconf affairs before these recent changes: there were parts in there 15 years old and older, which had no clue about MacOS (and many other things).

I will look up MacPorts and see what's needed to make it play nice!

Bill

Fish-Git commented 3 years ago

... While it still can be a potential, Intel specific, gcc and clang bug, I still think this is very low probability: If both compilers exhibit that exact same issue across a wide range of versions and across two very different operating systems, it's almost impossible that such a bug goes undetected for years.

Perhaps. Perhaps not. But consider that the part of gcc/clang that we're exercising is likely quite uncommon. I suspect most projects likely use static inlines like Hercules was originally doing (and still is doing too for a lot of functions!), and not the plain/extern inlines technique that Hercules only recently started using.

Then too, it could be some arcane/rare bug in gcc/clang's inline handling that Hercules just happens to be tripping that most other projects also using inlines aren't tripping. Something "unusual". Like maybe it's the inline functions in machdep.h that are inline assembler functions as opposed to ordinary C language functions? I.e. something highly unusual and unexpected that no other projects are doing? Some weird/unusual/unexpected bug that Hercules just happened to be the first to trip? <shrug>

Such highly unusual/rare bugs can easily exist for years, across many versions/releases, before eventually being exposed.
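
For readers less familiar with the distinction, here is a minimal sketch of the two inline styles mentioned above, assuming C99 inline semantics. The function names are made up for illustration and are not Hercules code.

```c
/* Style 1: static inline.
 * Every translation unit that includes this definition gets its own
 * private copy, so no external symbol is ever needed. */
static inline int add_static(int a, int b)
{
    return a + b;
}

/* Style 2: plain C99 inline, as it would typically appear in a header.
 * On its own this is only an "inline definition": the compiler may
 * inline it, or it may emit a call to an external add_plain symbol. */
inline int add_plain(int a, int b)
{
    return a + b;
}

/* In exactly ONE .c file, this declaration turns the inline definition
 * above into an external definition, i.e. it emits the real symbol that
 * non-inlined calls (and other modules) link against. */
extern inline int add_plain(int a, int b);
```

With the first style nothing needs exporting. With the second, one translation unit has to provide the external definition, and that symbol is what a shared library would export to its callers.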

So, I still rule out a compiler bug.

And I still don't. :)

(I'm starting to sound like a broken record!)

Sorry. :-(

No need to be sorry! You may well be right! But IMO the evidence pointing to a compiler bug is quite strong. Presumably valid C source code is compiled without complaint (and that I think is the important thing to keep in mind) but yet fails to execute properly. That to me is the classic definition of a compiler bug! Any situation that a compiler encounters that causes it to reach a point in its code where it's unable to determine the correct code that it should generate should IMHO cause either an error or, at the very least, a warning message to be issued.

But we're seeing neither.

Which to me is conclusive evidence of (not to mention the classic definition of) a compiler bug.

My personal hypothesis is that some preprocessor stuff (#ifdefs and the like) got messed up -- but let's wait and see.

Unlikely IMO. It does work just fine on Windows AND on non-Intel Linux too. The problem appears to only be specific to Intel Linux (or more correctly, Intel target architecture).

Trust me. It's a compiler bug. ;-)

_(But if it does turn out to be something stupid I've done, I might just need to retire from Hercules development due to my apparent dementia!)_

Fish-Git commented 3 years ago

Then too, it could be some arcane/rare bug in gcc/clang's inline handling that Hercules just happens to be tripping that most other projects also using inlines aren't tripping. Something "unusual". Like maybe it's the inline functions in machdep.h that are inline assembler functions as opposed to ordinary C language functions? I.e. something highly unusual and unexpected that no other projects are doing? Some weird/unusual/unexpected bug that Hercules just happened to be the first to trip?

Or maybe it's not gcc/clang but rather is libtool??

Some of our functions after all (those that I changed from static inline to just inline) are being exported because some non-engine DLLs (loadable modules or shared libraries I believe they're called in Linux parlance) are calling them too, such as vstorec, vfetchc, validate_operand, and get_dev_2K/4K_storage_key, or_dev_2K/4K_storage_key, or_storage_key_by_ptr. Maybe exporting/importing inline functions isn't supported by libtool? (or isn't being supported correctly?)

My point is, what we're currently doing is supposedly completely valid from gcc/clang's point of view (and/or from libtool's point of view too?) as evidenced by the lack of any error or warning messages, but yet incorrect code is obviously being generated.

And that, to me, means "compiler bug". (or else libtool bug?) (or both?)

Juergen-Git commented 3 years ago

Well, Fish, interesting theories... but reality is different: We have neither a compiler nor a libtool bug, the culprit is commit d3242d2. Exactly beginning with this one, the issue starts to occur. Given that the changes made there may well be sensitive to alignment issues, it is very well possible that singularities become visible in some environments but not in others.
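
As an illustration only of how alignment sensitivity can arise in general (this is an assumption about the mechanism, not an analysis of commit d3242d2): aligned SSE2 loads and stores expect addresses that are multiples of 16, so the same code can work or fail depending on where a buffer happens to land. A hypothetical helper to check that could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Hercules.
 * Returns 1 when ptr is a multiple of `alignment`, 0 otherwise.
 * `alignment` must be a power of two (e.g. 16 for SSE2 aligned stores). */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (uintptr_t)(alignment - 1)) == 0;
}
```

A debug build could check this before taking an aligned fast path, which would turn a silent misbehaviour into an immediate, reproducible failure.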

So, I'm assuming you will want to take over from here? ;-)

Cheers Jürgen

Fish-Git commented 3 years ago

...the culprit is commit d3242d2

!!!

Wow. I would never have suspected that commit. A 10 year old time bomb planted by Paul Gorlinsky that my innocent change a month ago ended up tripping over! Damn! :(

Okay, so I guess I have to say it now: "You were right and I was wrong!" It seems your instincts are sharper and more finely honed than mine. Or perhaps put more honestly: you're obviously not as prejudiced against gcc/Linux as I am!

I'll try to keep my personal prejudices under control from now on. I hate prejudice, and I'm extremely embarrassed and deeply ashamed that I failed to put a damper on it. I'll try harder from now on.

So, I'm assuming you will want to take over from here? ;-)

I wish I could but I'm afraid I can't. :(

I don't know (nor do I want to know!) Intel or gcc inline-assembler, so someone else is going to have to fix this particular bug. I'm afraid it's beyond my abilities. :(

Oh sure, I could easily revert my change to the TB (Test Block) instruction to go back to doing a memset instead of clear_page_4K, but that wouldn't be fixing the actual bug. Doing so would leave the time bomb in place, leaving ourselves open to accidentally triggering it again at some point in the future. The gcc SSE2 __clear_page function in hinline.h should be fixed, not control.c's TB instruction IMHO.

So who among our team knows gcc inline Intel assembler well enough to tackle this task? Anyone?

wrljet commented 3 years ago

Is z/OS 2.4 one of those things you need a legal license to run?

Fish-Git commented 3 years ago

Is z/OS 2.4 one of those things you need a legal license to run?

AFAIK, yes. Why? Is z/OS 2.4 impacted too? I thought it was only z/VM 7.1 that was impacted.

wrljet commented 3 years ago

I believe we should be able to fairly easily replace that horrible gcc assembly syntax with the SSE intrinsics (similar to the MSVC case).

From what I've read, gcc's inline assembly of the SSE instructions is buggy. :-)
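
A rough sketch of what an intrinsics-based page clear could look like, assuming SSE2 is available and falling back to memset otherwise. The name sketch_clear_page_4K is hypothetical; this is neither the actual __clear_page from hinline.h nor the eventual fix.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_setzero_si128, _mm_stream_si128 */
#endif

#define SKETCH_PAGE_4K 4096

/* Hypothetical sketch: zero one 4K page.
 * Uses non-temporal SSE2 stores when the compiler targets SSE2 and the
 * page is 16-byte aligned; otherwise falls back to plain memset. */
static void sketch_clear_page_4K(void *page)
{
#if defined(__SSE2__)
    if (((uintptr_t)page & 15) == 0) {
        const __m128i zero = _mm_setzero_si128();
        __m128i *p = (__m128i *)page;
        size_t i;

        /* 4096 / 16 = 256 stores of 16 bytes each */
        for (i = 0; i < SKETCH_PAGE_4K / sizeof(__m128i); i++)
            _mm_stream_si128(p + i, zero);

        /* Make the non-temporal stores globally visible before returning */
        _mm_sfence();
        return;
    }
#endif
    memset(page, 0, SKETCH_PAGE_4K);
}
```

Because the compiler sees the intrinsics as ordinary typed operations, it handles register allocation and clobbers itself, which is exactly the bookkeeping that hand-written inline assembler has to get right manually.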

wrljet commented 3 years ago

Is z/OS 2.4 one of those things you need a legal license to run?

AFAIK, yes. Why? Is z/OS 2.4 impacted too? I thought it was only z/VM 7.1 that was impacted.

zOS 2.4 is what is mentioned in the first msg in this issue.

Bill

Fish-Git commented 3 years ago

zOS 2.4 is what is mentioned in the first msg in this issue.

(DOH!) Yes. Of course it is. How silly of me.

:zany_face:

Juergen-Git commented 3 years ago

Hi Bill

I believe we should be able to fairly easily replace that horrible gcc assembly syntax with the SSE intrinsics (similar to the MSVC case).

From what I've read, gcc's inline assembly of the SSE instructions is buggy. :-)

Yes, I think so too. It doesn't make much sense to try to fix that assembler code, given that we have ready-made intrinsics for that nowadays.

Cheers Jürgen

Juergen-Git commented 3 years ago

Hi Fish

Okay, so I guess I have to say it now: "You were right and I was wrong!"

Never mind... I'd say, you owe me a beer, should I ever manage to come to Seattle. :-)

Cheers Jürgen

Fish-Git commented 3 years ago

Never mind... I'd say, you owe me a beer, should I ever manage to come to Seattle. :-)

Deal! :)

Juergen-Git commented 3 years ago

So then, I'll go for the _GCCSSE2 version of the __clear_page function now -- doesn't look like a big deal. ;-)

Juergen-Git commented 3 years ago

I've put a mitigation for the issue in place, so our users are no longer impacted. This buys me time to go after fixing __clear_page.

Peter-J-Jansen commented 3 years ago

Super Jürgen!

And sorry I haven't become more involved myself; I'm still entrenched in z/OS's Enterprise Extender over NAPT ...

Cheers,

Peter

Fish-Git commented 3 years ago

Closed by commit 77da714fe78537ef4480a636ecee80dee855a691.

wrljet commented 3 years ago

See my comment here:

https://github.com/SDL-Hercules-390/hyperion/commit/77da714fe78537ef4480a636ecee80dee855a691#commitcomment-54809918

SDL-Hercules-390 / hyperion

The most recent commit (38d6835b) on Ubuntu 18.04 causes an IPL problem. #402
