B-Lang-org / bsc

Bluespec Compiler (BSC)

Improve the mechanism for testing Bluespec Inc bug 1490 #659

Closed: quark17 closed this 4 months ago

quark17 commented 6 months ago

With GHC 9.8.1, the following failures occur in bsc.bugs/bluespec_inc/b1490/:

FAIL: module `' in `Bug1490MyUnion.bsv' should compile to Verilog
FAIL: module `' in `VsortWorkaround.bsv' should compile to Verilog

This directory is testing that certain examples will compile within a 256M heap. The two failures occur because, with GHC 9.8.1, they now exceed that heap and abort. Presumably something has changed in GHC 9.8.1 that causes more heap to be used. Unfortunately, I don't know how to debug this further, to decide whether this is a significant regression in GHC or something else.

I guess I can test whether it compiles with slightly more heap. I do see that MyUnion compiles with 260M and VSortWorkaround with 265M. So maybe the new GHC just uses slightly more heap and we just need to increase the limit a little (at least for newer GHC versions) -- and not that there's a significant change in the order of growth, say. However, note that the tests have a parameter that controls the size: the Bool test, for example, defines Ports, which can be increased to scale up the test (as has been done in the past, when it was increased from 16 to 21). So someone constructed these as a test of BSC's scaling/optimization; but it's unfortunately also tied to the performance of GHC (which changes from version to version). It would be nice if there were a better way to write these tests, or some guidance on what to look for when adjusting them.

However, another measurement we can do is to see, with an earlier GHC, how low the maximum heap can be decreased before the tests fail. Increasing the heap from 256M to 265M for GHC 9.8.1 is "insignificant" only if we assume that the earlier GHC was using nearly 256M. But that doesn't seem to be the case! With BSC built with an older GHC (I think 9.4.8 or 9.4.7), I can drop the -M flag down to the minimum (4K) without any change in the testsuite results! What's going on there?! With newer GHC, I can also decrease the flag to 4K and still only those two tests fail. Am I misunderstanding how the flag works, or are those tests really using less than 4K of heap? (This was on macOS 11. I also tested on an old Debian, with GHC 8.8.4, and lowering to 4K also worked fine there.)
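
For reference, the kind of invocation I'm talking about is roughly the following; the exact flags that the .exp files pass may differ, and this assumes a bsc binary that accepts RTS options on its command line (i.e. one linked with GHC's -rtsopts):

    # Compile one of the examples to Verilog with an explicit heap cap
    # (-M sets the maximum heap for GHC's runtime system)
    bsc -verilog Bug1490MyUnion.bsv +RTS -M256m -RTS

    # Probe how low the cap can go before the compile aborts
    bsc -verilog Bug1490MyUnion.bsv +RTS -M4k -RTS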

The fact that two tests which used to work fine with 4K now need more than 256M seems like a significant change! If that's true, is there some way to report this example to the GHC developers?

I also don't know if there's still a bug in BSC to be fixed here. Bug 1490 (in Bluespec's internal database before open sourcing) was "Evaluator blows up on certain for loops". There was a commit in 2008 claiming to fix it, then additional tests in 2009, but I see an email from 2010 claiming that bug 1490 still needs to be fixed; specifically, an attempt to fix Bug 1429 ("Incomplete optimization of selection from conditional expressions") made bug 1490 worse, so it was backed out "until bug 1490 is fixed". The code that was backed out can still be seen commented out in ITransform if you search for "bug 1490" or "bug 1429". The original code that fixed 1490 was in ITransform in the handling of PrimIf (in doIf, by adding a new improveBoolIf for boolean return types), but that code is gone. There were some big improvements to the handling of if-else trees (and undefined values) made in 2021 (some commits dating from 2018, but merged in 2021), so I'm not surprised if the current code looks a lot different -- also, those changes would have been developed with an eye on the b1490 tests, making sure not to blow up the heap usage. So probably there are no lingering known issues? Well, I guess the fact that VSortOriginal is known to exhaust the heap is a bug? (Unless we think the original example is written in a style that we expect to perform poorly.)

So what should we do?

We can add a check to the .exp file to look for GHC 9.8 and do something different. That "something different" could be to run with 256M (or 4K?) and confirm that it fails and then run with ~265M and confirm that it passes. (For other GHC versions, do we lower the test to 4K?) This would resolve the testsuite for now (while still also detecting if the heap usage goes back down at some point).

But I'd also like to know: is there some better way to write these tests? And is it worth submitting this to the GHC developers and, if so, is there a way to boil it down?

quark17 commented 4 months ago

I confirmed that this is still a problem with the recently released GHC 9.8.2. And I also see that it's an issue with 9.6 (both 9.6.2 and the latest, 9.6.4) -- I would guess that we just jumped to 9.8.1 and hadn't run the testsuite with 9.6. I opened an issue with GHC for it (issue 24487). If I use +RTS -s to dump statistics, I don't observe much of a difference between 9.4.8 and 9.8.2, nor is the runtime much different, so I'm actually not sure whether this is a behavior difference in BSC; maybe it really is more directly GHC's doing. (That could maybe be confirmed by dumping a trace from BSC of its evaluation steps and checking whether BSC is doing the same evaluation in both cases.)

quark17 commented 4 months ago

It appears that GHC 9.4 and 9.8 are using the same amounts of memory; it's just that GHC 9.4 was not properly handling the RTS -M flag. (As indicated on the GHC issue, there was a fix to the accounting of heap space that was introduced in 9.6.) On the GHC issue, there are graphs generated by using the RTS -S flag to trace heap usage and actual live bytes (which may be less than the heap size due to minimum allocatable block sizes, fragmentation, etc.). Those graphs are essentially the same for 9.4 and 9.8.

So GHC 9.8 is doing the right thing, and the two failing tests just need a little more heap space to compile. (This is in contrast to the VsortOriginal example, which exhausts a 400M heap, and when I gave it 500M it churned for longer than I was willing to wait.) So I've submitted PR #681 to remove the check of the GHC version and just slightly increase the heap given to the two newly failing tests.

I wonder if these tests need to be restructured, to somehow directly measure what is actually being tested. I don't immediately understand what the tests are testing, or I would propose something. The tests appear to have for-loops which BSC's evaluator unrolls, and if BSC is not careful in how it optimizes (and prunes branching paths), then the unrolling could expand uncontrollably. If these are testing BSC's ability to keep the unrolling under control (by proper optimizations in ITransform), then maybe a more direct way to measure that is by looking at a trace from the evaluator. I guess that heap size was deemed to be a good enough proxy for the size of the unrolling. And maybe that's OK, since the one "expected fail" uses significantly more heap. Although we might consider having GHC report the actual live bytes, as a better measure; see the sketch below. (Interestingly, for Bug1490MyUnion.bsv that seemed to be only 10M out of 256M!?) We could keep this GitHub issue open for that, but I'm coming around to thinking that the heap is a good enough proxy and that it may not be worth our time to keep looking at it. So I'm going to close it, but feel free to comment.
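
If we did want the testsuite to look at live data rather than just pass/fail against a heap cap, the RTS already exposes the numbers; something like the following could be used (again, the exact bsc flags may differ, and this assumes RTS options are accepted on the command line):

    # Summary statistics; the "maximum residency" line is the peak live data
    bsc -verilog Bug1490MyUnion.bsv +RTS -s -RTS

    # Per-GC trace; its live-bytes column is what the graphs on the GHC issue plot
    bsc -verilog Bug1490MyUnion.bsv +RTS -S -RTS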

quark17 commented 3 months ago

BSC is compiled with -with-rtsopts=-H256m -K10m -i1, which means that a minimum heap size of 256M is being set. It seems that GHC ignores the max-heap -M value if the minimum is larger. So any -M value less than 256M has no effect, which is why BSC compiled with GHC 9.4 doesn't have a heap overflow with lower values. And the change in how the current heap size is calculated (starting in GHC 9.6) explains why the examples started failing (because the old accounting stayed below 256M while the new accounting goes above it).
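
To spell out the interaction as I understand it, a rough sketch follows; the build command is simplified and the file names are only illustrative, not the actual BSC build:

    # Build-time RTS defaults baked into the binary (illustrative only)
    ghc -rtsopts "-with-rtsopts=-H256m -K10m -i1" Main.hs -o bsc

    # At run time, the baked-in -H256m acts as a floor, so a smaller -M cap
    # is effectively ignored (i.e. it behaves like -M256m):
    bsc -verilog Bug1490MyUnion.bsv +RTS -M4k -RTS
    # Under GHC 9.4's accounting these compiles never appear to exceed 256M,
    # so they pass; under 9.6 and later the corrected accounting goes above
    # 256M and the same compile aborts with a heap overflow.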