cburstedde / p4est

The "p4est" forest-of-octrees library
www.p4est.org/
GNU General Public License v2.0
254 stars 114 forks source link

Test failure in sc_test_sort #90

Closed carlodefalco closed 2 years ago

carlodefalco commented 3 years ago

Hi, running

make check

for the develop branch in macos 11.1 I get

PASS: test/sc_test_allgather
PASS: test/sc_test_arrays
PASS: test/sc_test_builtin
PASS: test/sc_test_darray_work
PASS: test/sc_test_dmatrix
PASS: test/sc_test_dmatrix_pool
PASS: test/sc_test_io_sink
PASS: test/sc_test_keyvalue
PASS: test/sc_test_node_comm
PASS: test/sc_test_notify
PASS: test/sc_test_reduce
PASS: test/sc_test_search
FAIL: test/sc_test_sort
PASS: test/sc_test_sortb
PASS: test/sc_test_version
PASS: test/sc_test_helpers
============================================================================
Testsuite summary for libsc 2.8.106-dc9e
============================================================================
# TOTAL: 16
# PASS:  15
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See ./test-suite.log
Please report to info@p4est.org
============================================================================

I'd like to check what went wrong, but there is no "./test-suite.log" in the build directory

cburstedde commented 3 years ago

Could you run the test by hand and look into the terminal output? Try different -np settings.

carlodefalco commented 3 years ago

Hi,

first of all I found that, unlike what the output of

make check 

says, the output of the test is not in "./test-suite.log" but rather in "sc/test/sc_test_sort.log" the error seems to be a segfault :

[libsc] This is libsc 2.8.106-dc9e
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node guglielmo exited on signal 11 (Segmentation fault: 11).
--------------------------------------------------------------------------
FAIL test/sc_test_sort (exit status: 139)

If I manually run the test I get the same segfault with 1 or 2 ranks I also wonder why running

make check

from the root directory only runs tests for sc and not for p4est ... I can force make to run p4est tests by typing them manually, e.g. :

$ make test/p8est_test_nodes.log
PASS: test/p8est_test_nodes

but

make check 

does not run them automatically

cburstedde commented 3 years ago

make check in the p4est build directory runs them for p4est and libsc. make check in the libsc build directory runs them for libsc only.

If you could compile --disable-shared and run the test through mpirun -np xx valgrind test/sc_test_sort... that would be helpful. Maybe you find an error or I'll look at it myself.

carlodefalco commented 3 years ago

make check in the p4est build directory runs them for p4est and libsc. make check in the libsc build directory runs them for libsc only.

that is the behaviour that I expected (and what I remember happened with previous versions) but it is not what I observe now.

If I type

make check

from the p4est build directory, only sc tests are executed. I can force p4est tests to run if I use a specific test as the make target, though.

E.g., typing

make test/p8est_test_nodes

in the p4est bulid directory will build the test/p8est_test_nodes executable and typing

make test/p8est_test_nodes.log

will run the test and produce the logfile. Maybe the tests are not run because of the failing test in sc? Here is the full output I get from "make check" in the p4est build dir :

$ make check
Making check in sc
/Applications/Xcode.app/Contents/Developer/usr/bin/make  test/sc_test_allgather test/sc_test_arrays test/sc_test_builtin test/sc_test_darray_work test/sc_test_dmatrix test/sc_test_dmatrix_pool test/sc_test_io_sink test/sc_test_keyvalue test/sc_test_node_comm test/sc_test_notify test/sc_test_reduce test/sc_test_search test/sc_test_sort test/sc_test_sortb test/sc_test_version test/sc_test_helpers
make[2]: `test/sc_test_allgather' is up to date.
make[2]: `test/sc_test_arrays' is up to date.
make[2]: `test/sc_test_builtin' is up to date.
make[2]: `test/sc_test_darray_work' is up to date.
make[2]: `test/sc_test_dmatrix' is up to date.
make[2]: `test/sc_test_dmatrix_pool' is up to date.
make[2]: `test/sc_test_io_sink' is up to date.
make[2]: `test/sc_test_keyvalue' is up to date.
make[2]: `test/sc_test_node_comm' is up to date.
make[2]: `test/sc_test_notify' is up to date.
make[2]: `test/sc_test_reduce' is up to date.
make[2]: `test/sc_test_search' is up to date.
make[2]: `test/sc_test_sort' is up to date.
make[2]: `test/sc_test_sortb' is up to date.
make[2]: `test/sc_test_version' is up to date.
make[2]: `test/sc_test_helpers' is up to date.
/Applications/Xcode.app/Contents/Developer/usr/bin/make  check-TESTS
PASS: test/sc_test_allgather
PASS: test/sc_test_arrays
PASS: test/sc_test_builtin
PASS: test/sc_test_darray_work
PASS: test/sc_test_dmatrix
PASS: test/sc_test_dmatrix_pool
PASS: test/sc_test_io_sink
PASS: test/sc_test_keyvalue
PASS: test/sc_test_node_comm
PASS: test/sc_test_notify
PASS: test/sc_test_reduce
PASS: test/sc_test_search
FAIL: test/sc_test_sort
PASS: test/sc_test_sortb
PASS: test/sc_test_version
PASS: test/sc_test_helpers
============================================================================
Testsuite summary for libsc 2.8.106-dc9e
============================================================================
# TOTAL: 16
# PASS:  15
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See ./test-suite.log
Please report to info@p4est.org
============================================================================
make[3]: *** [test-suite.log] Error 1
make[2]: *** [check-TESTS] Error 2
make[1]: *** [check-am] Error 2
make: *** [check-recursive] Error 1

If you could compile --disable-shared and run the test through mpirun -np xx valgrind test/sc_test_sort... that would be helpful. Maybe you find an error or I'll look at it myself.

valgrind does not work well on macos AFAIK, but I'll try to see if I can find out soemthing interesting with lldb

thanks, Carlo

carlodefalco commented 3 years ago

Maybe the tests are not run because of the failing test in sc?

I confirm that this is actually what is happening : indeed, if any test fails in sc, make check exits with an error and does not run tests from p4est, if I disable the failing test in Makefile.am the tests run and I get 62 PASS and no FAIL.

I'll comment again if I can find put more about the failing test in SC

cburstedde commented 3 years ago

Thanks for investigating, it is not unreasonable to stop testing if a subdirectory fails.

Looking forward to the actual test output by this program.

carlodefalco commented 3 years ago

I compiled with --disable-shared --enable-static -O0 -g and tried the same test again in the lldb debugger. Here is the stacktrace returned by lldb

$ glibtool --mode=execute lldb ./sc_test_sort
(lldb) target create "./sc_test_sort"
Current executable set to '/Users/carlo/Documents/_Programmi/Altrui/p4est-github/BUILD/sc/test/sc_test_sort' (x86_64).
(lldb) run
Process 81351 launched: '/Users/carlo/Documents/_Programmi/Altrui/p4est-github/BUILD/sc/test/sc_test_sort' (x86_64)
2021-01-08 19:04:39.576583+0100 orted[81356:4257279] [CL_INVALID_OPERATION] : OpenCL Error : Failed to retrieve device information! Invalid enumerated value!

2021-01-08 19:04:39.576878+0100 orted[81356:4257279] [CL_INVALID_OPERATION] : OpenCL Error : Failed to retrieve device information! Invalid enumerated value!

2021-01-08 19:04:39.576890+0100 orted[81356:4257279] [CL_INVALID_OPERATION] : OpenCL Error : Failed to retrieve device information! Invalid enumerated value!

[libsc] This is libsc 2.8.106-dc9e
Process 81351 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x7ffeefbff330)
    frame #0: 0x00007ffeefbff330
->  0x7ffeefbff330: xorb   %ah, %ch
    0x7ffeefbff332: adcb   %al, (%rcx)
    0x7ffeefbff334: addl   %eax, (%rax)
    0x7ffeefbff336: addb   %al, (%rax)
Target 0: (sc_test_sort) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x7ffeefbff330)
  * frame #0: 0x00007ffeefbff330
    frame #1: 0x00007fff202b7b32 libsystem_c.dylib`_qsort + 686
    frame #2: 0x000000010000de91 sc_test_sort`sc_psort_bitonic(pst=0x00007ffeefbff330, lo=0, hi=11, dir=1) at sc_sort.c:423:7
    frame #3: 0x000000010000dd9f sc_test_sort`sc_psort(mpicomm=0x000000010110e530, base=0x000000010371c810, nmemb=0x000000010371c180, size=8, compar=(sc_test_sort`sc_double_compare at sc.c:750)) at sc_sort.c:483:3
    frame #4: 0x00000001000048dd sc_test_sort`main(argc=1, argv=0x00007ffeefbff8e8) at test_sort.c:107:3
    frame #5: 0x00007fff203a2621 libdyld.dylib`start + 1
    frame #6: 0x00007fff203a2621 libdyld.dylib`start + 1

It looks like the crash is occurring here I can inspect values of variables at the time of the crash if you tell what you think may be useful

Hope this helps, c.

P.S.: I have no idea what the OpenCL errors are about ...

cburstedde commented 3 years ago

It runs fine with me. Good catch though: Maybe your qsort_r is dead, or we're using it incorrectly. Can you comment the check for it and test with the regular qsort function?

carlodefalco commented 3 years ago

it may be that qsort_r has a different signature on the mac, indeed man qsort_r on my system says

QSORT(3)                 BSD Library Functions Manual                 QSORT(3)

NAME
     heapsort, heapsort_b, mergesort, mergesort_b, qsort, qsort_b, qsort_r -- sort functions

SYNOPSIS
     #include <stdlib.h>

     int
     heapsort(void *base, size_t nel, size_t width, int (*compar)(const void *, const void *));

     int
     heapsort_b(void *base, size_t nel, size_t width, int (^compar)(const void *, const void *));

     int
     mergesort(void *base, size_t nel, size_t width, int (*compar)(const void *, const void *));

     int
     mergesort_b(void *base, size_t nel, size_t width, int (^compar)(const void *, const void *));

     void
     qsort(void *base, size_t nel, size_t width, int (*compar)(const void *, const void *));

     void
     qsort_b(void *base, size_t nel, size_t width, int (^compar)(const void *, const void *));

     void
     qsort_r(void *base, size_t nel, size_t width, void *thunk, int (*compar)(void *, const void *, const void *));

the last argument seems to have a type that is different from that passed here

carlodefalco commented 3 years ago

indeed the patch below fixes the problem for me

diff --git a/src/sc_sort.c b/src/sc_sort.c
index ed2b527c..e9b7050c 100644
--- a/src/sc_sort.c
+++ b/src/sc_sort.c
@@ -64,7 +64,7 @@ sc_icompare (const void *v1, const void *v2)

 /** Pass qsort-style comparison function as third argument */
 static int
-sc_compare_r (const void *v1, const void *v2, void *arg)
+sc_compare_r (void *arg, const void *v1, const void *v2)
 {
   sc_psort_t         *pst = (sc_psort_t *) arg;
   return pst->compar (v1, v2);
@@ -72,7 +72,7 @@ sc_compare_r (const void *v1, const void *v2, void *arg)

 /** Pass qsort-style comparison function as third argument */
 static int
-sc_icompare_r (const void *v1, const void *v2, void *arg)
+sc_icompare_r (void *arg, const void *v1, const void *v2)
 {
   sc_psort_t         *pst = (sc_psort_t *) arg;
   return pst->compar (v2, v1);
@@ -421,7 +421,7 @@ sc_psort_bitonic (sc_psort_t * pst, size_t lo, size_t hi, int dir)
              n, pst->size, dir ? sc_compare : sc_icompare);
 #else
       qsort_r (pst->my_base + (lo - pst->my_lo) * pst->size,
-               n, pst->size, dir ? sc_compare_r : sc_icompare_r, pst);
+               n, pst->size, pst, dir ? sc_compare_r : sc_icompare_r);

 #endif
     }

of course it cannot be applied unconditionally though, one would need to change the configure test in order to detect whether GNU or BSD version of qsort_r is available ...

cburstedde commented 3 years ago

Indeed, this is unfortunate. We can (a) go back to standard qsort for macs or (b) you might like to write a configure test that distinguishes the two behaviours. I'd be happy to integrate if you go for (b).

carlodefalco commented 3 years ago

or maybe (c) use a portable version of qsort_r as provided in the GNU Portability Library ?

It seems it is provided there in order to fix exactly the same problem as occurred here

carlodefalco commented 3 years ago

this project has a configure test that does what is needed manually.

That is an old archived version of the project, the new version just relies on gnulib for portability instead

carlodefalco commented 3 years ago

The patch below works for me, could you please test it?

From 639ceb688cfbfbf00a46c49bd3504bd346fed048 Mon Sep 17 00:00:00 2001
From: Carlo de Falco <carlo.defalco@polimi.it>
Date: Sat, 9 Jan 2021 07:24:39 +0100
Subject: [PATCH] Check whether qsort_r is GNU or BSD style

---
 configure.ac  | 25 +++++++++++++++++++++++++
 src/sc_sort.c | 18 +++++++++++++-----
 2 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/configure.ac b/configure.ac
index cd3bc34e..40ef3db1 100644
--- a/configure.ac
+++ b/configure.ac
@@ -85,6 +85,31 @@ AC_CHECK_FUNCS([strtol strtoll])
 AC_CHECK_FUNCS([fsync])
 AC_CHECK_FUNCS([qsort_r])

+# the order of qsort_r input parameters is reversed in BSD / macos
+if test x$ac_cv_func_qsort_r = xyes ; then
+  AC_MSG_CHECKING([whether qsort_r is BSD or GNU])
+  AC_RUN_IFELSE([AC_LANG_SOURCE([[
+  #define _GNU_SOURCE
+  #include <stdlib.h>
+  static int comparefun (const void *x, const void *x, void *z) {
+   return *(const double *)a - *(const double *)b;
+  }
+  int main (void)  {
+    double v[3] = { 1.0, 2.0, 0.0 };
+    qsort_r (v, 3, sizeof (double), comparefun, NULL);
+    return !(v[0] == 0.0 && v[1] == 1.0 && arr[2] == 2.0);
+  }]])],
+  [
+    AC_MSG_RESULT([GNU])
+  ],[
+    AC_DEFINE(HAVE_QSORT_R_BSD, 1, [Define to 1 if 'qsort_r' uses BSD style arguments])
+    AC_MSG_RESULT([BSD])
+  ],[
+    AC_MSG_RESULT([cannot test on a cross-compiler, assuming GNU])
+  ])
+fi
+
+
 echo "o---------------------------------------"
 echo "| Checking libraries"
 echo "o---------------------------------------"
diff --git a/src/sc_sort.c b/src/sc_sort.c
index ed2b527c..299b60c2 100644
--- a/src/sc_sort.c
+++ b/src/sc_sort.c
@@ -62,17 +62,25 @@ sc_icompare (const void *v1, const void *v2)

 #else

-/** Pass qsort-style comparison function as third argument */
+#ifdef SC_HAVE_QSORT_R_BSD
+#  define QSORT_R_PARAMS(P1, P2, P3, P4, P5)  (P1, P2, P3, P5, P4)
+#  define QSORT_R_CMP_PARAMS(P1, P2, P3)  (P3, P1, P2)
+#else
+#  define QSORT_R_PARAMS(P1, P2, P3, P4, P5)  (P1, P2, P3, P4, P5)
+#  define QSORT_R_CMP_PARAMS(P1, P2, P3)  (P1, P2, P3)
+#endif
+
+/** Pass qsort-style comparison function as additional argument */
 static int
-sc_compare_r (const void *v1, const void *v2, void *arg)
+sc_compare_r QSORT_R_CMP_PARAMS(const void *v1, const void *v2, void *arg)
 {
   sc_psort_t         *pst = (sc_psort_t *) arg;
   return pst->compar (v1, v2);
 }

-/** Pass qsort-style comparison function as third argument */
+/** Pass qsort-style comparison function as additional argument */
 static int
-sc_icompare_r (const void *v1, const void *v2, void *arg)
+sc_icompare_r  QSORT_R_CMP_PARAMS(const void *v1, const void *v2, void *arg)
 {
   sc_psort_t         *pst = (sc_psort_t *) arg;
   return pst->compar (v2, v1);
@@ -420,7 +428,7 @@ sc_psort_bitonic (sc_psort_t * pst, size_t lo, size_t hi, int dir)
       qsort (pst->my_base + (lo - pst->my_lo) * pst->size,
              n, pst->size, dir ? sc_compare : sc_icompare);
 #else
-      qsort_r (pst->my_base + (lo - pst->my_lo) * pst->size,
+      qsort_r  QSORT_R_PARAMS(pst->my_base + (lo - pst->my_lo) * pst->size,
                n, pst->size, dir ? sc_compare_r : sc_icompare_r, pst);

 #endif
-- 
2.29.2
cburstedde commented 3 years ago

Thanks! if you turn this into a pull request to the libsc repo, you'll be credited for contributing. I'd ask you to

cburstedde commented 3 years ago

It would also be nice to add a function sc_qsort_r_variant to sc_sort.h that checks the variant at run time without generating invalid accesses in either situation. This way we would be able to support cross-compile environments.

carlodefalco commented 3 years ago

It would also be nice to add a function sc_qsort_r_variant to sc_sort.h that checks the variant at run time without generating invalid accesses in either situation. This way we would be able to support cross-compile environments.

I am not aware of a way of checking the signature of a C function at runtime, how would you do this?

Alternatively one could include a portable implementation of qsort_r in the source code directly, then have an autoconf test to check whether qsort_r is available and works, otherwise build and used the portable version.

If cross-compiling, the portable qsort_r would be used.

That is what other projects do, relying on gnulib for the portable implementation.

If licensing issues prevent from using gnulib in sc/p4est because you need the code to be under BSD license, you could instead get an implementation of qsort_r from ... well, freeBSD

cburstedde commented 3 years ago

It would also be nice to add a function sc_qsort_r_variant to sc_sort.h that checks the variant at run time without generating invalid accesses in either situation. This way we would be able to support cross-compile environments.

I am not aware of a way of checking the signature of a C function at runtime, how would you do this?

I was just wondering if there is a form that compiles and runs for both the right and the wrong signature, such that the code can check the results.

Alternatively one could include a portable implementation of qsort_r in the source code directly, then have an autoconf test to check whether qsort_r is available and works, otherwise build and used the portable version.

If cross-compiling, the portable qsort_r would be used.

That is what other projects do, relying on gnulib for the portable implementation.

If licensing issues prevent from using gnulib in sc/p4est because you need the code to be under BSD license, you could instead get an implementation of qsort_r from ... well, freeBSD

I'd rather not go into the maintenance for it. Having a working test is fine.

cburstedde commented 3 years ago

If the qsort_r function may swap the callback and the data pointer, we can provide a data pointer that is such a callback. If the callback swaps the order of its arguments, we can use it to sort function pointers. This way we'd have a check that is well defined and compiles and runs for both implementations.

This is in addition to the configure check, which I still think should be first priority.

cburstedde commented 3 years ago

The issue is solved by a configure/cmake build check. Please check out the latest prev3-develop branches.

cburstedde commented 2 years ago

Does the test involving qsort_r now work ok? It should by what we have done. Please let us know.