Open sblauth opened 3 months ago
Hi there, thanks for the bug report! Can you please point us to the flags that you are talking about, and check if they still exist in scotch 7? If they do, would be happy to add them to the builds here.
Thanks for the quick reply. I will try to point you to the flags that I believe are the ones causing the problems, but I am really no expert in compiling things - I usually let conda do the work for me (sorry about that).
So investigating the recipe directory in the 6.0.9 PR https://github.com/regro-cf-autotick-bot/scotch-feedstock/tree/4532f8f5ec7e4094d7df9f6e317c24eb01f1eaf7/recipe it seems to me that the build.sh file used to build scotch is using the compile flags defined in Makefile.inc. There, the flag DCOMMON_RANDOM_FIXED_SEED is set at https://github.com/regro-cf-autotick-bot/scotch-feedstock/blob/4532f8f5ec7e4094d7df9f6e317c24eb01f1eaf7/recipe/Makefile.inc#L20
If I see this correctly, this option is not set in the current build script (https://github.com/conda-forge/scotch-feedstock/blob/main/recipe/build-scotch.sh). I've looked in the recipe folder and could not find this flag being applied anywhere.
Moreover, it also seems that the flags DCOMMON_PTHREAD and DSCOTCH_PTHREAD are no longer set for scotch 7, whereas they were set for 6.0.9.
I guess that these are the compiler flags which are responsible for the (non)-determinism of parallel runs. Based on the change log, these flags should still be available for scotch 7.
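To make the comparison between the two recipes less error-prone than reading them by eye, here is a minimal sketch of a check for the three flags named above. The helper name `find_defines` and the sample line are hypothetical and only for illustration; the sample is not a copy of the actual Makefile.inc, and I am assuming the flags appear on the compiler command line in the usual `-D<NAME>` form.

```python
import re

# Preprocessor defines discussed in this thread (without the leading "-D").
FLAGS = ("COMMON_RANDOM_FIXED_SEED", "COMMON_PTHREAD", "SCOTCH_PTHREAD")


def find_defines(text, names=FLAGS):
    """Return {name: bool} telling whether "-D<name>" occurs in text.

    The trailing word boundary keeps longer names that merely start
    with the same prefix from counting as a match.
    """
    result = {}
    for name in names:
        pattern = r"-D" + re.escape(name) + r"\b"
        result[name] = re.search(pattern, text) is not None
    return result


# Hypothetical flags line, for illustration only.
sample = "CFLAGS = -O3 -DCOMMON_PTHREAD -DCOMMON_RANDOM_FIXED_SEED -DSCOTCH_RENAME"

for name, present in find_defines(sample).items():
    print(f"{name}: {'set' if present else 'not found'}")
```

Running the same function over the text of the old Makefile.inc and the current build script would show which of the defines each recipe passes explicitly. It would not detect defines that the upstream CMake files add on their own.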
In the cmake build that we use, these flags always seem to be set: https://github.com/live-clones/scotch/blob/82ec87f558f4acb7ccb69a079f531be380504c92/src/CMakeLists.txt#L49
So I’m not sure what’s causing the issue. Maybe someone else knows - I’m not overly familiar with scotch.
Okay, thanks a lot. Also the DCOMMON_PTHREAD and DSCOTCH_PTHREAD flags seem to be set there - so this is not the issue.
If anyone else has an idea what could cause this non-determinism I would be really happy. I can also provide some examples with FEniCS that show the difference in behavior with scotch 7.0.4 and 6.0.9.
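As a rough idea of how such a comparison could be set up, here is a minimal sketch that calls a partitioning routine several times and checks whether the results are identical. All names in it (`fingerprint`, `is_reproducible`, and the two stand-in functions) are hypothetical. The stand-ins only mimic a seeded and an unseeded partitioner with Python's `random` module; in a real test, `run` would wrap the FEniCS mesh-partitioning call, which is not shown here.

```python
import hashlib
import random


def fingerprint(values):
    """Hash a sequence of integers, e.g. a per-cell partition assignment."""
    data = ",".join(str(int(v)) for v in values).encode()
    return hashlib.sha256(data).hexdigest()


def is_reproducible(run, repeats=5):
    """Call run() several times; True if every result hashes identically."""
    digests = {fingerprint(run()) for _ in range(repeats)}
    return len(digests) == 1


def fixed_seed_partition():
    # Stand-in for a partitioner that reseeds identically on every call.
    rng = random.Random(0)
    return [rng.randrange(4) for _ in range(100)]


_shared_rng = random.Random(1)


def drifting_partition():
    # Stand-in for a partitioner whose random state carries over between calls.
    return [_shared_rng.randrange(4) for _ in range(100)]


print("fixed seed reproducible:", is_reproducible(fixed_seed_partition))
print("drifting reproducible:", is_reproducible(drifting_partition))
```

Comparing hashes rather than the full arrays keeps the check cheap and makes it easy to log one line per run for each scotch version.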
Comment:
Hello everyone,
I have a question / issue with scotch 7. Recently, I was able to update the dependencies of my code to use scotch 7. However, I have since experienced some issues when running my code in parallel - so I guess this is related to ptscotch / libptscotch, but I am not entirely sure. I am using FEniCS, which in turn uses scotch for mesh partitioning and graph reordering. Since switching to scotch 7, some tests at https://github.com/sblauth/cashocs/ fail irreproducibly / non-deterministically when run in parallel. I believe this problem is related to scotch, as changing the mesh partitioner to ParMETIS (which FEniCS also supports) makes the failures disappear. Due to licensing issues, I would prefer to stick with scotch as the mesh partitioning tool.
I have investigated the recipe for the conda-forge build a bit and it seems to me that the previous version (6.0.9), which works fine for me, sets some deterministic build flags, whereas version 7.0.4, with which I have the issues, does not?
As far as I have seen, this could be addressed dynamically in scotch 7 now (using contexts) - however, as I am using FEniCS from python, I have no idea how to do so - it seems that this cannot be done with environment variables.
Are my observations regarding determinism in the conda-forge build correct? Is there any way for me, as someone who uses scotch via FEniCS, to restore the parallel determinism? Or would it be possible to provide a deterministic conda-forge scotch build? I am also happy to provide further information if this is required.
Thanks a lot in advance, Sebastian