Macaulay2 / mathicgb

Compute (signature) Groebner bases using the fast datastructures from mathic.

Unittest hangs on powerpc #3

Open d-torrance opened 8 years ago

d-torrance commented 8 years ago

This is a continuation of [1]. I attempted to build the latest version of the Macaulay2 fork of mathicgb on a powerpc machine. However, unittest hangs when it gets to GB.small.

I tried debugging it several times with gdb. I'd let it hang for a few seconds, hit Ctrl+C, and then ask for a backtrace. Each time it appeared to be somewhere else in the code. I'm guessing maybe it got stuck in an infinite loop somewhere? I've pasted one log file at [2].

Note that I tried compiling both with and without tbb, but there was no change either way.

Any ideas?

Thanks!

[1] https://github.com/Macaulay2/M2/issues/339
[2] http://pastebin.com/zByR6umD

d-torrance commented 8 years ago

I got a notice that a Debian auto-build of mathicgb on the s390x architecture failed [1]. It failed at the same point as the above, so this may be the same issue. This may also be related to [2].

[1] https://buildd.debian.org/status/fetch.php?pkg=mathicgb&arch=s390x&ver=1.0~git20150904-2&stamp=1458562447&file=log
[2] https://github.com/Macaulay2/M2/issues/299

d-torrance commented 7 years ago

The latest version of mathicgb still fails to build for the big-endian architectures in Debian: mips [1], s390x [2], powerpc [3], ppc64 [4], and sparc64 [5]. The hppa build is still running, but I expect it will fail too.

Fortunately, the Debian build logs are much more informative than they were when I first reported this. For example, from [1], we have (only pasting the failing tests):

[----------] 11 tests from GB
[ RUN      ] GB.small
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] GB.small (124093 ms)
[ RUN      ] GB.liu_0_1
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] GB.liu_0_1 (61207 ms)
[ RUN      ] GB.weispfennig97_0_4
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] GB.weispfennig97_0_4 (103391 ms)
[ RUN      ] GB.weispfennig97_0_5
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] GB.weispfennig97_0_5 (103105 ms)
[ RUN      ] GB.gerdt93_0_1
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] GB.gerdt93_0_1 (126415 ms)
[ RUN      ] GB.gerdt93_0_2
[       OK ] GB.gerdt93_0_2 (152 ms)
[ RUN      ] GB.gerdt93_0_3
[       OK ] GB.gerdt93_0_3 (158 ms)
[ RUN      ] GB.gerdt93_0_4
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] GB.gerdt93_0_4 (126387 ms)
[ RUN      ] GB.gerdt93_0_5
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] GB.gerdt93_0_5 (126289 ms)
[ RUN      ] GB.gerdt93_0_6
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] GB.gerdt93_0_6 (126732 ms)
[ RUN      ] GB.gerdt93_0_7
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] GB.gerdt93_0_7 (126670 ms)
[----------] 11 tests from GB (1024599 ms total)

[----------] 5 tests from F4MatrixBuilder
[ RUN      ] F4MatrixBuilder.Empty
[       OK ] F4MatrixBuilder.Empty (0 ms)
[ RUN      ] F4MatrixBuilder.SPair
unknown file: Failure
C++ exception with description "Too many columns in QuadMatrix" thrown in the test body.
[  FAILED  ] F4MatrixBuilder.SPair (3744591 ms)
[ RUN      ] F4MatrixBuilder.OneByOne
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] F4MatrixBuilder.OneByOne (33647 ms)
[ RUN      ] F4MatrixBuilder.DirectReducers
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] F4MatrixBuilder.DirectReducers (34090 ms)
[ RUN      ] F4MatrixBuilder.IteratedReducer
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[  FAILED  ] F4MatrixBuilder.IteratedReducer (36443 ms)
[----------] 5 tests from F4MatrixBuilder (3848771 ms total)

[----------] 13 tests from Monoids/1, where TypeParam = mgb::MonoMonoid<int, false, true, true>
[ RUN      ] Monoids/1.VarCount
[       OK ] Monoids/1.VarCount (335 ms)
[ RUN      ] Monoids/1.MonoVector
[       OK ] Monoids/1.MonoVector (40 ms)
[ RUN      ] Monoids/1.MonoArena
[       OK ] Monoids/1.MonoArena (0 ms)
[ RUN      ] Monoids/1.ReadWriteMonoid
[       OK ] Monoids/1.ReadWriteMonoid (1 ms)
[ RUN      ] Monoids/1.MonoPool
[       OK ] Monoids/1.MonoPool (132 ms)
[ RUN      ] Monoids/1.setExponentAndComponent
[       OK ] Monoids/1.setExponentAndComponent (0 ms)
[ RUN      ] Monoids/1.MultiplyDivide
src/test/MonoMonoid.cpp:440: Failure
Value of: m.isProductOfHintTrue(a, b, c)
  Actual: false
Expected: true
src/test/MonoMonoid.cpp:440: Failure
Value of: m.isProductOfHintTrue(a, b, c)
  Actual: false
Expected: true
src/test/MonoMonoid.cpp:440: Failure
Value of: m.isProductOfHintTrue(a, b, c)
  Actual: false
Expected: true
[  FAILED  ] Monoids/1.MultiplyDivide, where TypeParam = mgb::MonoMonoid<int, false, true, true> (2 ms)
[ RUN      ] Monoids/1.LcmColon
[       OK ] Monoids/1.LcmColon (4 ms)
[ RUN      ] Monoids/1.Order
[       OK ] Monoids/1.Order (20 ms)
[ RUN      ] Monoids/1.RelativelyPrime
[       OK ] Monoids/1.RelativelyPrime (0 ms)
[ RUN      ] Monoids/1.SetExponents
[       OK ] Monoids/1.SetExponents (0 ms)
[ RUN      ] Monoids/1.HasAmpleCapacityTotalDegree
[       OK ] Monoids/1.HasAmpleCapacityTotalDegree (5 ms)
[ RUN      ] Monoids/1.CopyEqualConversion
[       OK ] Monoids/1.CopyEqualConversion (9 ms)
[----------] 13 tests from Monoids/1 (549 ms total)

[----------] 13 tests from Monoids/2, where TypeParam = mgb::MonoMonoid<int, false, false, true>
[ RUN      ] Monoids/2.VarCount
[       OK ] Monoids/2.VarCount (344 ms)
[ RUN      ] Monoids/2.MonoVector
[       OK ] Monoids/2.MonoVector (39 ms)
[ RUN      ] Monoids/2.MonoArena
[       OK ] Monoids/2.MonoArena (0 ms)
[ RUN      ] Monoids/2.ReadWriteMonoid
[       OK ] Monoids/2.ReadWriteMonoid (1 ms)
[ RUN      ] Monoids/2.MonoPool
[       OK ] Monoids/2.MonoPool (121 ms)
[ RUN      ] Monoids/2.setExponentAndComponent
[       OK ] Monoids/2.setExponentAndComponent (1 ms)
[ RUN      ] Monoids/2.MultiplyDivide
src/test/MonoMonoid.cpp:440: Failure
Value of: m.isProductOfHintTrue(a, b, c)
  Actual: false
Expected: true
src/test/MonoMonoid.cpp:440: Failure
Value of: m.isProductOfHintTrue(a, b, c)
  Actual: false
Expected: true
src/test/MonoMonoid.cpp:440: Failure
Value of: m.isProductOfHintTrue(a, b, c)
  Actual: false
Expected: true
[  FAILED  ] Monoids/2.MultiplyDivide, where TypeParam = mgb::MonoMonoid<int, false, false, true> (2 ms)
[ RUN      ] Monoids/2.LcmColon
[       OK ] Monoids/2.LcmColon (3 ms)
[ RUN      ] Monoids/2.Order
[       OK ] Monoids/2.Order (21 ms)
[ RUN      ] Monoids/2.RelativelyPrime
[       OK ] Monoids/2.RelativelyPrime (0 ms)
[ RUN      ] Monoids/2.SetExponents
[       OK ] Monoids/2.SetExponents (0 ms)
[ RUN      ] Monoids/2.HasAmpleCapacityTotalDegree
[       OK ] Monoids/2.HasAmpleCapacityTotalDegree (5 ms)
[ RUN      ] Monoids/2.CopyEqualConversion
[       OK ] Monoids/2.CopyEqualConversion (11 ms)
[----------] 13 tests from Monoids/2 (549 ms total)

[----------] Global test environment tear-down
[==========] 237 tests from 27 test cases ran. (4881705 ms total)
[  PASSED  ] 222 tests.
[  FAILED  ] 15 tests, listed below:
[  FAILED  ] GB.small
[  FAILED  ] GB.liu_0_1
[  FAILED  ] GB.weispfennig97_0_4
[  FAILED  ] GB.weispfennig97_0_5
[  FAILED  ] GB.gerdt93_0_1
[  FAILED  ] GB.gerdt93_0_4
[  FAILED  ] GB.gerdt93_0_5
[  FAILED  ] GB.gerdt93_0_6
[  FAILED  ] GB.gerdt93_0_7
[  FAILED  ] F4MatrixBuilder.SPair
[  FAILED  ] F4MatrixBuilder.OneByOne
[  FAILED  ] F4MatrixBuilder.DirectReducers
[  FAILED  ] F4MatrixBuilder.IteratedReducer
[  FAILED  ] Monoids/1.MultiplyDivide, where TypeParam = mgb::MonoMonoid<int, false, true, true>
[  FAILED  ] Monoids/2.MultiplyDivide, where TypeParam = mgb::MonoMonoid<int, false, false, true>

We had some similar issues in fflas-ffpack (linbox-team/fflas-ffpack#45). The problem as I understand it was that some integers were copied piece by piece, assuming little endian ordering. Perhaps something similar is going on here?
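To illustrate the kind of bug meant here, a minimal sketch follows. It is not code from mathicgb or fflas-ffpack; the function names are hypothetical. It lays out a 64-bit value in an explicitly chosen byte order, so it behaves the same on any host, and then reads "the low half" by taking the first four bytes, which is only correct for a little-endian layout.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Lay out a 64-bit value as bytes in an explicitly chosen byte order,
// so the example does not depend on the endianness of the host.
std::array<unsigned char, 8> toBytes(std::uint64_t v, bool bigEndian) {
  std::array<unsigned char, 8> out{};
  for (int i = 0; i < 8; ++i) {
    const int shift = bigEndian ? 8 * (7 - i) : 8 * i;
    out[i] = static_cast<unsigned char>((v >> shift) & 0xFF);
  }
  return out;
}

// Fragile: treats the first four bytes as the low half of the value.
// That only holds when the bytes are stored least significant first.
std::uint32_t lowHalfAssumingLittleEndian(
    const std::array<unsigned char, 8>& b) {
  std::uint32_t r = 0;
  for (int i = 0; i < 4; ++i)
    r |= static_cast<std::uint32_t>(b[i]) << (8 * i);
  return r;
}

// Portable: operates on the value itself, never on its bytes in memory.
std::uint32_t lowHalfPortable(std::uint64_t v) {
  return static_cast<std::uint32_t>(v & 0xFFFFFFFFu);
}

// Same value, two layouts: the byte-wise read loses the data on a
// big-endian layout, while the shift/mask version is unaffected.
void demoLowHalf() {
  const std::uint64_t v = 2;
  assert(lowHalfAssumingLittleEndian(toBytes(v, false)) == 2u);
  assert(lowHalfAssumingLittleEndian(toBytes(v, true)) == 0u);
  assert(lowHalfPortable(v) == 2u);
}
```

Shifts and masks on the value give the same answer on every architecture, which is why they are the usual replacement for byte-wise copies.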

[1] https://buildd.debian.org/status/fetch.php?pkg=mathicgb&arch=mips&ver=1.0%7Egit20170104-1&stamp=1484219557
[2] https://buildd.debian.org/status/fetch.php?pkg=mathicgb&arch=s390x&ver=1.0%7Egit20170104-1&stamp=1484222030
[3] https://buildd.debian.org/status/fetch.php?pkg=mathicgb&arch=powerpc&ver=1.0%7Egit20170104-1&stamp=1484222141
[4] https://buildd.debian.org/status/fetch.php?pkg=mathicgb&arch=ppc64&ver=1.0%7Egit20170104-1&stamp=1484216651
[5] https://buildd.debian.org/status/fetch.php?pkg=mathicgb&arch=sparc64&ver=1.0%7Egit20170104-1&stamp=1484214249

mikestillman commented 7 years ago

Thanks, this is helpful. There is no large integer code here, I don’t think, but I’ll review the code.


jamesjer commented 7 years ago

We're seeing something similar in Fedora. On big endian architectures, the GB.small test keeps allocating memory until the OOM killer kills it. Is there any chance that the Debian test, which seems better behaved, could be modified to catch the bad_alloc exception so we can see which allocation attempt failed? That might give a clue as to what is going wrong.

jamesjer commented 7 years ago

By inserting some print statements into the code, I've narrowed the problem down to src/mathicgb/F4MatrixBuilder2.cpp, in appendRow(), line 346. On big endian architectures, this test at line 385:

if (colPair.first == 0 || colPair.second == 0)

is always true, so control is always transferred back to the updateReader label. My guess, and it is only a guess, is that there are two bugs here. First, something is wrong with the computation of colPair.first and colPair.second on big endian architectures. Second, the runaway memory allocation happens because every jump to updateReader executes this line again:

ColReader colMap(mMap);

Either colMap doesn't need to be reconstructed every time, in which case the label should be moved down one line, or the value in colMap needs to be destroyed prior to the goto.
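One language rule is relevant to the second option: in standard C++, a goto that jumps back past the declaration of an initialized automatic variable destroys that variable before the jump lands. The sketch below shows this with a hypothetical Reader type that merely counts instances. It is not mathicgb's ColReader, and whether the rule explains or rules out the growth seen here depends on what ColReader's destructor actually releases, which this sketch does not address.

```cpp
#include <cassert>

// Hypothetical stand-in for a scoped reader object; it only counts how
// many instances are alive and how many were ever constructed.
struct Reader {
  static int live;
  static int constructed;
  Reader() { ++live; ++constructed; }
  ~Reader() { --live; }
  Reader(const Reader&) = delete;
  Reader& operator=(const Reader&) = delete;
};
int Reader::live = 0;
int Reader::constructed = 0;

// Retries by jumping back to a label placed above the declaration.
// Returns the largest number of readers that were alive at once.
int retryWithGoto(int attempts) {
  int maxLive = 0;
  int tries = 0;
retry:
  Reader reader;  // constructed again after every jump to the label
  if (Reader::live > maxLive) maxLive = Reader::live;
  if (++tries < attempts)
    goto retry;   // leaving the scope of `reader` runs its destructor
  return maxLive;
}

// Five attempts construct five readers, but never more than one at a time.
void demoRetry() {
  const int before = Reader::constructed;
  assert(retryWithGoto(5) == 1);
  assert(Reader::live == 0);
  assert(Reader::constructed - before == 5);
}
```

If the object does not need to be rebuilt on each retry, moving the label below the declaration avoids the repeated construction altogether.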

jamesjer commented 7 years ago

See https://github.com/Macaulay2/mathicgb/pull/9 for a fix for the endianness issue. I don't know how the memory leak in appendRow() should be fixed, so I'm leaving that one alone.