NLAFET / StarNEig

A Task-based Library for Solving Dense Nonsymmetric Eigenvalue Problems
https://nlafet.github.io/StarNEig/

Build failure when _aligned_alloc is not present: Undefined symbols: "_aligned_alloc", referenced from: _alloc_matrix in common.c.o #1

Closed barracuda156 closed 1 year ago

barracuda156 commented 1 year ago

CMakeLists checks for the presence of aligned_alloc, but then nothing is done if it is not present. https://github.com/NLAFET/StarNEig/blob/d47ed4dfbcdaec52e44f0b02d14a6e0cde64d286/src/CMakeLists.txt#L544

Expectedly, the build fails then with:

Undefined symbols:
  "_aligned_alloc", referenced from:
      _alloc_matrix in common.c.o
ld: symbol(s) not found
collect2: error: ld returned 1 exit status
make[2]: *** [starneig-test] Error 1

Compiler recommends including stdlib.h, but it does not help:

/opt/local/var/macports/build/_opt_PPCRosettaPorts_math_StarNEig/StarNEig/work/StarNEig-0.1.8/test/common/common.c: In function 'alloc_matrix':
/opt/local/var/macports/build/_opt_PPCRosettaPorts_math_StarNEig/StarNEig/work/StarNEig-0.1.8/test/common/common.c:108:11: warning: implicit declaration of function 'aligned_alloc' [-Wimplicit-function-declaration]
  108 |     ptr = aligned_alloc(64, n*(*ld)*elemsize);
      |           ^~~~~~~~~~~~~
/opt/local/var/macports/build/_opt_PPCRosettaPorts_math_StarNEig/StarNEig/work/StarNEig-0.1.8/test/common/common.c:43:1: note: include '<stdlib.h>' or provide a declaration of 'aligned_alloc'
   42 | #include <stdio.h>
  +++ |+#include <stdlib.h>
   43 | #include <stdlib.h>
/opt/local/var/macports/build/_opt_PPCRosettaPorts_math_StarNEig/StarNEig/work/StarNEig-0.1.8/test/common/common.c:108:11: warning: incompatible implicit declaration of built-in function 'aligned_alloc' [-Wbuiltin-declaration-mismatch]
  108 |     ptr = aligned_alloc(64, n*(*ld)*elemsize);
      |           ^~~~~~~~~~~~~
/opt/local/var/macports/build/_opt_PPCRosettaPorts_math_StarNEig/StarNEig/work/StarNEig-0.1.8/test/common/common.c:108:11: note: include '<stdlib.h>' or provide a declaration of 'aligned_alloc'
mirkomyl commented 1 year ago

The very next line uses the configure_file command to write the configuration to ${CMAKE_CURRENT_BINARY_DIR}/starneig_config.h. That is, your_build_dir/src/starneig_config.h should contain either

#define ALIGNED_ALLOC_FOUND

or

/* #define ALIGNED_ALLOC_FOUND */

Since the linker is complaining about it, it must be the former. Could you confirm this?

It looks like you are using MacOS. Is this correct?

Assuming CMake has indeed detected aligned_alloc, something must have gone wrong during the linking phase. Perhaps MacOS and/or CMake handles linking somewhat differently from Linux. Unfortunately, I do not have access to a MacOS machine, so I cannot test this myself.

barracuda156 commented 1 year ago

@mirkomyl Thank you for responding!

I will check logs soon, but I can say that:

  1. Configure does not detect aligned_alloc (which is correct).
  2. MacOS does not have it until Darwin 16, I think. So yeah, it is not supported on the OS level.
  3. I did not see a fallback option in the source code (therefore this ticket).
  4. C11 is supposed to have it (judging from documentation), but it does not work on MacOS, regardless of -std= flags passed.
  5. posix_memalign may be used instead, I have made a patch, compilation succeeds, but tests fail with Bus error. So likely alignment is set wrong. I will look into that.
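A fallback along the lines of point 5 could look roughly like the sketch below. It is an illustration only, not the patch mentioned above and not code from StarNEig: portable_aligned_alloc and is_aligned are hypothetical names, and ALIGNED_ALLOC_FOUND is assumed to come from the generated starneig_config.h. Note that posix_memalign returns its result through an out-pointer and reports errors through its return value, so it is not a drop-in replacement for aligned_alloc.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* exposes posix_memalign in strict -std= modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of StarNEig): returns `size` bytes aligned
 * to `alignment`, or NULL on failure. Release the result with free().
 */
static void *portable_aligned_alloc(size_t alignment, size_t size)
{
    /* The alignment must be a power of two; posix_memalign additionally
     * requires it to be a multiple of sizeof(void *). */
    if (alignment < sizeof(void *) || (alignment & (alignment - 1)) != 0)
        return NULL;

#ifdef ALIGNED_ALLOC_FOUND
    /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);
#else
    void *ptr = NULL;
    /* posix_memalign returns an error code instead of setting errno. */
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;
    return ptr;
#endif
}

/* Returns 1 if `ptr` is aligned to `alignment` bytes, 0 otherwise. */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}
```

With such a helper, the call in alloc_matrix would become something like portable_aligned_alloc(64, n*(*ld)*elemsize), and is_aligned could be used to check the returned pointer while chasing the Bus error.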
mirkomyl commented 1 year ago

Could you check PR #2?

barracuda156 commented 1 year ago

I will try building from the master with that patch added in an hour or so, and update you.

barracuda156 commented 1 year ago

@mirkomyl On a side-note, on PowerPC (regardless of OS) -mtune=native should be used for optimizations, not -march=native or -mtune=generic. https://www.rowleydownload.co.uk/arm/documentation/gnu/gcc/RS_002f6000-and-PowerPC-Options.html https://github.com/mfem/mfem/issues/216

mirkomyl commented 1 year ago

> @mirkomyl On a side-note, on PowerPC (regardless of OS) -mtune=native should be used for optimizations, not -march=native or -mtune=generic. https://www.rowleydownload.co.uk/arm/documentation/gnu/gcc/RS_002f6000-and-PowerPC-Options.html mfem/mfem#216

In that case a user may disable the STARNEIG_ENABLE_OPTIMIZATION option [1] as done in the binary packages [2].

barracuda156 commented 1 year ago

> In that case a user may disable the STARNEIG_ENABLE_OPTIMIZATION option [1] as done in the binary packages [2].

Yes, I know; in fact, those flags are not enforced anyway. What I meant was rather to add a valid optimization option for PPC. I can make a PR for that if it is more convenient; it is an easy fix.

barracuda156 commented 1 year ago

@mirkomyl Update on tests. I have rebuilt v. 0.1.8 with your patch from the referenced PR. The build succeeds. Tests still fail, but a bit differently:

Start testing: Jan 15 21:02 MYT
----------------------------------------------------------
1/10 Testing: simple-hessenberg
1/10 Test: simple-hessenberg
Command: "/opt/local/var/macports/build/_opt_PPCRosettaPorts_math_StarNEig/StarNEig/work/build/starneig-test" "--experiment" "hessenberg" "--n" "5000" "--solver" "starneig-simple" "--keep-going"
Directory: /opt/local/var/macports/build/_opt_PPCRosettaPorts_math_StarNEig/StarNEig/work/build/test
"simple-hessenberg" start time: Jan 15 21:02 MYT
Output:
----------------------------------------------------------
TEST: --seed 1673787767 --experiment hessenberg --test-workers default --blas-threads default --lapack-threads default --scalapack-threads default --data-format pencil-local --init default --n 5000 --solver starneig-simple --cores default --gpus default --hooks hessenberg:normal residual:normal --residual-fail-threshold 10000 --residual-warn-threshold 500 --repeat 1 --warmup 0 --keep-going
THREADS: Using 0 StarPU worker threads during initialization and validation.
THREADS: Using 0 BLAS threads during initialization and validation.
THREADS: Using 0 BLAS threads in LAPACK solvers.
THREADS: Using 1 BLAS threads in ScaLAPACK solvers.
INIT...
PREPARE...
[starneig][fatal error] Something unexpected happened.
<end of output>
Test time =   1.36 sec
----------------------------------------------------------
Test Failed.

Should I try running via GDB?

P. S. Let me try to rebuild starpu also, I built it with gcc-4.2 initially, may not be the optimal choice.

barracuda156 commented 1 year ago

Just for the record: build_test_log.txt

mirkomyl commented 1 year ago

It appears that the test fails before the actual solver routine is called (PREPARE...), so something goes wrong during initialization. It is very difficult to say what is happening without seeing a backtrace. What worries me is the fact that the test program reports it is using zero workers etc., so perhaps this is a hwloc issue. You can see what the output should look like in the manual [1].

Regarding PPC and MacOS support in general, StarNEig is meant to be used in a Linux environment. I know it does work in Windows (WSL) without CUDA-support but PPC Macs are not within the target group. It would be nice if StarNEig worked in such an environment but I do not consider it a priority.

ADD: You may want to compile StarNEig with the STARNEIG_ENABLE_VERBOSE option enabled.

barracuda156 commented 1 year ago

@mirkomyl Thank you, I will try enabling verbose.

For PPC, I generally don’t expect anyone to make dedicated fixes, of course, since regardless of interest, the hardware is understandably scarce. However, it is rare for a needed fix to be genuinely specific to macOS on PPC (the ABI differs from ELF platforms, but that usually matters only for assembler code or alignment). As long as we consider C, C++ and Fortran, whatever works on Linux and BSD can usually work on macOS, including PPC versions. The exceptions are graphics- and web-related code that needs features missing from the SDK (which is specific to the macOS version rather than the architecture).

barracuda156 commented 1 year ago

> What worries me is the fact that the test program reports it is using zero workers etc., so perhaps this is a hwloc issue.

Looking at the code here https://github.com/open-mpi/hwloc/blob/master/hwloc/topology-darwin.c I will not be surprised if it is broken. Cache line sizes look wrong for PPC case etc.

barracuda156 commented 1 year ago

@mirkomyl I tried building against vecLibFort (interface of Accelerate) instead of OpenBLAS, but got linking error:

Undefined symbols:
  "_dgghd3_", referenced from:
      _starneig_GEP_SM_HessenbergTriangular in lapack.c.o
      _starneig_GEP_SM_HessenbergTriangular in lapack.c.o
ld: symbol(s) not found
collect2: error: ld returned 1 exit status
mirkomyl commented 1 year ago

It looks like vecLibFort is some type of lightweight wrapper for BLAS and LAPACK libraries. It is thus somewhat unclear which BLAS and LAPACK libraries you are using (note that OpenBLAS includes both BLAS and LAPACK). The DGGHD3 routine is a relatively new addition to LAPACK, so perhaps some older LAPACK versions do not have it.

barracuda156 commented 1 year ago

@mirkomyl vecLibFort is an interface to Apple's native Accelerate framework. It just makes Accelerate usable from Fortran.

mirkomyl commented 1 year ago

>> What worries me is the fact that the test program reports it is using zero workers etc., so perhaps this is a hwloc issue.

> Looking at the code here https://github.com/open-mpi/hwloc/blob/master/hwloc/topology-darwin.c I will not be surprised if it is broken. Cache line sizes look wrong for PPC case etc.

StarNEig uses hwloc as a ground truth when deciding how many CPU cores to use. Also, some tasks use it for memory allocations. Fully functional hwloc is thus a mandatory requirement.
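If hwloc is suspected of reporting a wrong core count, one quick sanity check is to ask the operating system directly and compare the result with what hwloc shows (for example via lstopo). The sketch below does not use hwloc at all; it relies on sysconf with _SC_NPROCESSORS_ONLN, which is a common extension on Linux, macOS and the BSDs rather than strict POSIX. The function name os_online_cpus is hypothetical.

```c
#include <unistd.h>

/*
 * Hypothetical diagnostic helper: number of CPUs the OS reports as online,
 * or -1 if the query is unavailable or fails. This is an independent
 * cross-check, not the mechanism StarNEig or hwloc uses.
 */
static long os_online_cpus(void)
{
#ifdef _SC_NPROCESSORS_ONLN
    return sysconf(_SC_NPROCESSORS_ONLN);
#else
    return -1;
#endif
}
```

A value of 1 or more here combined with zero workers in the test output would point towards the topology detection rather than the machine itself.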

barracuda156 commented 1 year ago

> StarNEig uses hwloc as a ground truth when deciding how many CPU cores to use. Also, some tasks use it for memory allocations. Fully functional hwloc is thus a mandatory requirement.

Thank you, I will take a closer look at it.

For DGGHD3, is it possible to provide an internal fallback? Generally speaking, Apple's own BLAS/LAPACK is more reliable, at least on older macOS.

mirkomyl commented 1 year ago

> For DGGHD3, is it possible to provide an internal fallback? Generally speaking, Apple's own BLAS/LAPACK is more reliable, at least on older macOS.

If this were a more common issue, then perhaps it could be included with the library in the same way the pdgghrd routine is included. However, StarNEig was developed as a part of a research project that promised to develop state-of-the-art numerical software for modern multi-node multi-core multi-GPU systems. It was thus built using the latest tools, and supporting older hardware and software was never a priority. If Apple's LAPACK library is really missing the DGGHD3 routine, then I would simply conclude it is too old.

barracuda156 commented 1 year ago

Well, I guess we could live with OpenBLAS then. Allow me some time to dig into hwloc thing, I will update in a while.

barracuda156 commented 1 year ago

@mirkomyl I had no chance to dig into hwloc code yet, but it actually passes all tests (10.6.8 Rosetta):

--->  Testing hwloc
Executing:  cd "/opt/local/var/macports/build/_opt_PPCRosettaPorts_devel_hwloc/hwloc/work/hwloc-2.8.0" && /usr/bin/make check 
Making check in include
make[1]: Nothing to be done for `check'.
Making check in hwloc
/usr/bin/make  
make[2]: Nothing to be done for `all'.
Making check in utils
Making check in hwloc
Making check in .
/usr/bin/make  check-TESTS
PASS: test-hwloc-annotate.sh
PASS: test-hwloc-calc.sh
PASS: test-hwloc-compress-dir.sh
PASS: test-hwloc-diffpatch.sh
PASS: test-hwloc-distrib.sh
PASS: test-hwloc-info.sh
PASS: test-build-custom-topology.sh
PASS: test-parsing-flags.sh
============================================================================
Testsuite summary for hwloc 2.8.0
============================================================================
# TOTAL: 8
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
Making check in lstopo
/usr/bin/make  check-TESTS
PASS: test-lstopo.sh
============================================================================
Testsuite summary for hwloc 2.8.0
============================================================================
# TOTAL: 1
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
make[2]: Nothing to be done for `check-am'.
Making check in tests
Making check in hwloc
Making check in .
/usr/bin/make  hwloc_api_version hwloc_list_components hwloc_bitmap hwloc_bitmap_string hwloc_bitmap_compare_inclusion hwloc_get_closest_objs hwloc_get_obj_covering_cpuset hwloc_get_cache_covering_cpuset hwloc_get_largest_objs_inside_cpuset hwloc_get_next_obj_covering_cpuset hwloc_get_obj_inside_cpuset hwloc_get_shared_cache_covering_obj hwloc_get_obj_below_array_by_type hwloc_get_obj_with_same_locality hwloc_bitmap_first_last_weight hwloc_bitmap_singlify hwloc_type_depth hwloc_type_sscanf hwloc_bind hwloc_get_last_cpu_location hwloc_get_area_memlocation hwloc_object_userdata hwloc_synthetic hwloc_backends hwloc_pci_backend hwloc_is_thissystem hwloc_distances hwloc_groups hwloc_insert_misc hwloc_topology_allow hwloc_topology_restrict hwloc_topology_dup hwloc_topology_diff hwloc_topology_abi hwloc_obj_infos hwloc_iodevs cpuset_nodeset memattrs memtiers cpukinds xmlbuffer gl           
  CC       hwloc_api_version.o
  CCLD     hwloc_api_version
  CC       hwloc_list_components.o
  CCLD     hwloc_list_components
  CC       hwloc_bitmap.o
  CCLD     hwloc_bitmap
  CC       hwloc_bitmap_string.o
  CCLD     hwloc_bitmap_string
  CC       hwloc_bitmap_compare_inclusion.o
  CCLD     hwloc_bitmap_compare_inclusion
  CC       hwloc_get_closest_objs.o
  CCLD     hwloc_get_closest_objs
  CC       hwloc_get_obj_covering_cpuset.o
  CCLD     hwloc_get_obj_covering_cpuset
  CC       hwloc_get_cache_covering_cpuset.o
  CCLD     hwloc_get_cache_covering_cpuset
  CC       hwloc_get_largest_objs_inside_cpuset.o
  CCLD     hwloc_get_largest_objs_inside_cpuset
  CC       hwloc_get_next_obj_covering_cpuset.o
  CCLD     hwloc_get_next_obj_covering_cpuset
  CC       hwloc_get_obj_inside_cpuset.o
  CCLD     hwloc_get_obj_inside_cpuset
  CC       hwloc_get_shared_cache_covering_obj.o
  CCLD     hwloc_get_shared_cache_covering_obj
  CC       hwloc_get_obj_below_array_by_type.o
  CCLD     hwloc_get_obj_below_array_by_type
  CC       hwloc_get_obj_with_same_locality.o
  CCLD     hwloc_get_obj_with_same_locality
  CC       hwloc_bitmap_first_last_weight.o
  CCLD     hwloc_bitmap_first_last_weight
  CC       hwloc_bitmap_singlify.o
  CCLD     hwloc_bitmap_singlify
  CC       hwloc_type_depth.o
  CCLD     hwloc_type_depth
  CC       hwloc_type_sscanf.o
  CCLD     hwloc_type_sscanf
  CC       hwloc_bind.o
  CCLD     hwloc_bind
  CC       hwloc_get_last_cpu_location.o
  CCLD     hwloc_get_last_cpu_location
  CC       hwloc_get_area_memlocation.o
  CCLD     hwloc_get_area_memlocation
  CC       hwloc_object_userdata.o
  CCLD     hwloc_object_userdata
  CC       hwloc_synthetic.o
  CCLD     hwloc_synthetic
  CC       hwloc_backends.o
  CCLD     hwloc_backends
  CC       hwloc_pci_backend.o
  CCLD     hwloc_pci_backend
  CC       hwloc_is_thissystem.o
  CCLD     hwloc_is_thissystem
  CC       hwloc_distances.o
  CCLD     hwloc_distances
  CC       hwloc_groups.o
  CCLD     hwloc_groups
  CC       hwloc_insert_misc.o
  CCLD     hwloc_insert_misc
  CC       hwloc_topology_allow.o
  CCLD     hwloc_topology_allow
  CC       hwloc_topology_restrict.o
  CCLD     hwloc_topology_restrict
  CC       hwloc_topology_dup.o
  CCLD     hwloc_topology_dup
  CC       hwloc_topology_diff.o
  CCLD     hwloc_topology_diff
  CC       hwloc_topology_abi.o
  CCLD     hwloc_topology_abi
  CC       hwloc_obj_infos.o
  CCLD     hwloc_obj_infos
  CC       hwloc_iodevs.o
  CCLD     hwloc_iodevs
  CC       cpuset_nodeset.o
  CCLD     cpuset_nodeset
  CC       memattrs.o
  CCLD     memattrs
  CC       memtiers.o
  CCLD     memtiers
  CC       cpukinds.o
  CCLD     cpukinds
  CC       xmlbuffer.o
  CCLD     xmlbuffer
  CC       gl.o
  CCLD     gl
/usr/bin/make  check-TESTS
PASS: hwloc_api_version
PASS: hwloc_list_components
PASS: hwloc_bitmap
PASS: hwloc_bitmap_string
PASS: hwloc_bitmap_compare_inclusion
PASS: hwloc_get_closest_objs
PASS: hwloc_get_obj_covering_cpuset
PASS: hwloc_get_cache_covering_cpuset
PASS: hwloc_get_largest_objs_inside_cpuset
PASS: hwloc_get_next_obj_covering_cpuset
PASS: hwloc_get_obj_inside_cpuset
PASS: hwloc_get_shared_cache_covering_obj
PASS: hwloc_get_obj_below_array_by_type
PASS: hwloc_get_obj_with_same_locality
PASS: hwloc_bitmap_first_last_weight
PASS: hwloc_bitmap_singlify
PASS: hwloc_type_depth
PASS: hwloc_type_sscanf
PASS: hwloc_bind
PASS: hwloc_get_last_cpu_location
PASS: hwloc_get_area_memlocation
PASS: hwloc_object_userdata
PASS: hwloc_synthetic
PASS: hwloc_backends
PASS: hwloc_pci_backend
PASS: hwloc_is_thissystem
PASS: hwloc_distances
PASS: hwloc_groups
PASS: hwloc_insert_misc
PASS: hwloc_topology_allow
PASS: hwloc_topology_restrict
PASS: hwloc_topology_dup
PASS: hwloc_topology_diff
PASS: hwloc_topology_abi
PASS: hwloc_obj_infos
PASS: hwloc_iodevs
PASS: cpuset_nodeset
PASS: memattrs
PASS: memtiers
PASS: cpukinds
PASS: xmlbuffer
PASS: gl
============================================================================
Testsuite summary for hwloc 2.8.0
============================================================================
# TOTAL: 42
# PASS:  42
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
Making check in ports
/usr/bin/make   \

make[4]: Nothing to be done for `all'.
Making check in xml
/usr/bin/make  check-TESTS
PASS: 8intel64-4n2t-memattrs.xml
PASS: 16amd64-8n2c-cpusets.xml
PASS: 16amd64-4distances.xml
PASS: 16amd64-4distances.console.output
PASS: 16em64t-4s2c2t.xml
PASS: 16em64t-4s2c2t-offlines.xml
PASS: 16em64t-4s2c2t.console.output
PASS: 16-2gr2gr2n2c+misc.xml
PASS: 16-2gr2gr2n2c+misc.console.output
PASS: 16intel64-manyVFs.xml
PASS: 16intel64-manyVFs.console.output
PASS: 16intel64-manyVFs.console.nocollapse.output
PASS: 24em64t-2n6c2t-pci.xml
PASS: 32em64t-2n8c2t-pci-noio.xml
PASS: 32em64t-2n8c2t-pci-normalio.xml
PASS: 32em64t-2n8c2t-pci-wholeio.xml
PASS: 64intel64-3g2n+2n-irregulargroups+pci.xml
PASS: 64intel64-3g2n+2n-irregulargroups+pci.console.output
PASS: 8intel64-fakeKNL-A2A-hybrid.rootattachednumas.xml
PASS: 64intel64-fakeKNL-SNC4-hybrid.xml
PASS: 96em64t-4n4d3ca2co-pci.xml
PASS: 192em64t-12gr2n8c2t.xml
PASS: 192em64t-24n8c2t.xml
PASS: power8gpudistances.xml
PASS: fakeheterodistances.xml
PASS: fakecpukinds.xml
PASS: 8em64t-2p2ca2co-nonodesets.v1tov2.xml
PASS: 8ia64-2n2s2c+1n.v1tov2.xml
PASS: 16amd64-4distances.v1tov2.xml
PASS: 16amd64-4distances.v2tov1.xml
PASS: 2intel64-1n2c-numaroot.v1tov2.xml
PASS: 28intel64-2p2g7c-CoDgroups.v1tov2.xml
PASS: 28intel64-2p2g7c-CoD.nogroups.v1tov2.xml
PASS: 8intel64-fakeKNL-A2A-hybrid.rootattachednumas.v1tov2.xml
PASS: 8intel64-fakeKNL-A2A-hybrid.rootattachednumas.v2tov1.xml
PASS: 64intel64-fakeKNL-SNC4-hybrid.v1tov2.xml
PASS: 64intel64-fakeKNL-SNC4-hybrid.v2tov1.xml
============================================================================
Testsuite summary for hwloc 2.8.0
============================================================================
# TOTAL: 37
# PASS:  37
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
make[2]: Nothing to be done for `check-am'.
Making check in contrib/systemd
make[1]: Nothing to be done for `check'.
Making check in contrib/completion
make[1]: Nothing to be done for `check'.
Making check in contrib/misc
/usr/bin/make  hwloc-tweak-osindex
  CC       hwloc-tweak-osindex.o
  CCLD     hwloc-tweak-osindex
Making check in contrib/hwloc-ps.www
make[1]: Nothing to be done for `check'.
Making check in doc
/usr/bin/make  check-recursive
Making check in examples
/usr/bin/make  hwloc-hello hwloc-hello-cpp cpuset+bitmap+cpubind nodeset+membind+policy get-knl-modes gpu sharedcaches
  CC       hwloc-hello.o
  CCLD     hwloc-hello
  CXX      hwloc-hello-cpp.o
  CXXLD    hwloc-hello-cpp
  CC       cpuset+bitmap+cpubind.o
  CCLD     cpuset+bitmap+cpubind
  CC       nodeset+membind+policy.o
  CCLD     nodeset+membind+policy
  CC       get-knl-modes.o
  CCLD     get-knl-modes
  CC       gpu.o
  CCLD     gpu
  CC       sharedcaches.o
  CCLD     sharedcaches
/usr/bin/make  check-TESTS
PASS: hwloc-hello
PASS: hwloc-hello-cpp
============================================================================
Testsuite summary for hwloc 2.8.0
============================================================================
# TOTAL: 2
# PASS:  2
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
make[3]: Nothing to be done for `check-am'.
make[1]: Nothing to be done for `check-am'.
barracuda156 commented 1 year ago

@mirkomyl By the way, there is one more related bug: src/mpi/distr_matrix.c includes malloc.h unconditionally, but it is a Linux-specific header. At minimum, it should not be included on macOS.
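A guarded include along these lines would avoid the problem. This is only a sketch: it assumes the file needs nothing beyond the allocation functions that stdlib.h already declares, which has not been checked against distr_matrix.c, and alloc_zeroed is a hypothetical function used purely to show that the allocation calls still compile.

```c
#include <stdlib.h> /* malloc, calloc and free are declared here everywhere */

/* <malloc.h> is not available on macOS, so pull it in only on Linux. */
#if defined(__linux__)
#include <malloc.h>
#endif

/* Hypothetical example: allocate a zero-initialized array of n doubles. */
static double *alloc_zeroed(size_t n)
{
    return calloc(n, sizeof(double));
}
```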

mirkomyl commented 1 year ago

Unless this begins to cause issues on Linux, fixing this is not a priority.

barracuda156 commented 1 year ago

> Unless this begins to cause issues on Linux, fixing this is not a priority.

Well, the wrong include is trivially fixed, but unless the tests are fixed, there is no real point.