ElmerCSC / elmerfem

Official git repository of Elmer FEM software
http://www.elmerfem.org
Other
1.2k stars 320 forks source link

Give option to exclude compiling internal UMFPACK in cmake #578

Open juhanikataja opened 1 month ago

juhanikataja commented 1 month ago

Elmer internal UMFPACK is not needed on supercomputing platforms and may cause problems with picky compiler stacks.

There should be a cmake build variable e.g. DISABLE_INTERNAL_UMFPACK which would disable compiling UMFPACK.

Changes would probably include:

mmuetzel commented 1 month ago

I hope you don't mind me asking which issues those are. Do these issues still occur if you build with a current version of UMFPACK from SuiteSparse and configure with -DEXTERNAL_UMFPACK=ON?

mmuetzel commented 1 month ago

The changes from #579 should allow building without UMFPACK.

juhanikataja commented 1 month ago

It's not actually clear what is causing the incompatibility. The passage about picky compilers was weird humor that has no place in github issues, sorry about that.

The reasoning about giving an option to disable internal umfpack even without external is that

Do we assume internally in some parts of elmer the existence of UMFPACK or is it just in the solvers?

Currently there is a crude workaround at 778d9a3b4cb1a079205564fb85c67d6a10ac5a7a where this issue stemmed from:

In file included from .../elmerfem/umfpack/src/umfpack/umf_analyze.c:29:
In file included from .../elmerfem/umfpack/src/umfpack/include/umf_internal.h:88:
In file included from .../elmerfem/umfpack/src/umfpack/../amd/amd_internal.h:72:
In file included from /usr/include/stdlib.h:394:
In file included from /usr/include/sys/types.h:29:
/usr/include/bits/types.h:155:34: error: typedef redefinition with different types ('struct __fsid_t' vs 'struct __fsid_t')
typedef struct { int __val[2]; } __fsid_t;
                                 ^
/usr/include/bits/types.h:155:26: note: previous definition is here
__STD_TYPE __FSID_T_TYPE __fsid_t;      /* Type of file system IDs.  */
juharu commented 1 month ago

There is an unconditional dependence to umfpack within VankaCreate.F90 in "CircutPrec()" , i guess this could be relatively easily replaced by something more general... Nothing else springs to mind atm...

mmuetzel commented 1 month ago

Thank you for clarifying.

True. UMFPACK (even the current upstream version) relies heavily on the respective BLAS and LAPACK implementations for parallelization. It isn't multi-threaded by itself if I understand correctly.

The compilation error you are seeing is odd. It looks like it is complaining about two conflicting type definitions in the same header?? That is odd indeed. Can you share that header file (/usr/include/bits/types.h)? Maybe some incompatible combination of preprocessor definitions are set?

juharu commented 1 month ago
* On a supercomputer, there is no point in having a non-parallel direct solver. 

This might also not always be so, even if you are running MPI parallel, each task might have something to solve partitionwise (as in "CircuitPrec()"), for example. Thinking about it, I perhaps would be reluctant to not to have an always available robust linear solver for elmer solver...

tzwinger commented 1 month ago

Am I mistaken to conclude that the conflicting type definition is something platform-specific and not able to be fixed in what comes to the umfpack included in Elmer? We experience the error on a HPE-Cray system with the Cray-CCE compiler set.

mmuetzel commented 1 month ago

I'm not sure I understand the compiler error message correctly. It looks like it is referring to a conflicting type definition. But the locations that the error message presents look like they are from the same file and even the same line in that file. However, it presents different content of that line. I honestly don't know what that means. Is this a compiler error? Or some issue in glibc(?) heeders (missing inclusion guards?)? Or something completely different?

It might be that the line that it presents looks different because it is pre-processed for one and the "raw" source code for the other?

Maybe for some reason that header is included twice (and the inclusion guard isn't working for some reason). Iiuc, headers are treated slightly differently when they are included via the system include path or via a user include path. Maybe, the header somehow ends up being included via both and the compiler doesn't "realize" it is the same??? If that is the case, compiling with -std=gnu11 (or whatever flag the Cray compiler needs for C11). C11 should allow multiple typedefs if they are compatible afaict.

Does the issue go away if you add that flag to CMAKE_C_FLAGS when configuring ElmerFEM?

juhanikataja commented 1 month ago

The problem appears to be in the combination of specific cray programming environment modules where -fopenmp flag looks like it is injecting typedef ... __fsid_t in the global scope.

Since we don't hopefully need fsid_t, I suggest we just use the workaround 778d9a3b4cb1a079205564fb85c67d6a10ac5a7a mentioned earlier in devel.

The following code reproduces the bug with cc -fopenmp bug.c when those modules are loaded:

typedef struct { int __val[2]; } __fsid_t;
typedef __fsid_t fsid_t;

int main(int argc, char** argv) {
  return 0;
}
mmuetzel commented 1 month ago

This issue seems to be caused by the fact that the build rules of UMFPACK (and AMD) are preprocessing the original sources with a different set of flags than when compiling. The situation with the Cray compiler might not be the only affected build toolchain. (Theoretically, doing that could result in a "successful" compilation but undefined (or unexpected) behavior on runtime.)

The changes proposed in #587 avoid the separate preprocessing step entirely by using CMake object libraries to build parts of the sources for the "actual" libraries with different sets of preprocessor flags.

Do you still see the issues from here with the changes from PR #587?