bh107 / bohrium

Automatic parallelization of Python/NumPy, C, and C++ codes on Linux and MacOSX
http://www.bh107.org
Apache License 2.0

Race conditions when running Bohrium via MPI #598

Closed dionhaefner closed 5 years ago

dionhaefner commented 5 years ago

For the most part, running distributed Bohrium via MPI works beautifully, but I get a lot of these warnings:

Warning: couldn't write JIT kernels to disk to "/groups/ocean/software/bohrium/gcc/31072018/var/bohrium/cache". boost::filesystem::copy_file: File exists: "/tmp/bh_d31b/obj/d308041ab8c99a7f_22f66ca440bbfd49.so", "/groups/ocean/software/bohrium/gcc/31072018/var/bohrium/cache/d308041ab8c99a7f_22f66ca440bbfd49.so"

and sometimes it even crashes with

terminate called after throwing an instance of 'boost::filesystem::filesystem_error'
  what():  boost::filesystem::last_write_time: No such file or directory: "/groups/ocean/software/bohrium/gcc/31072018/var/bohrium/cache/d308041ab8c99a7f_75b7c177b01cedc.so"

I guess that is because all processes write to the same cache directory? Would using a different cache for each process be a solution?

madsbk commented 5 years ago

@dionhaefner, could you set BH_OPENMP_CACHE_DIR to different locations for each MPI process just after importing mpi4py but before importing bohrium?

If it works, maybe we should have the Python bridge detect that mpi4py is imported and set BH_OPENMP_CACHE_DIR automatically.
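Roughly something like this (untested sketch; the cache path is only an example):

import os
from mpi4py import MPI  # importing mpi4py first initializes MPI

# give every MPI rank its own JIT cache directory
rank = MPI.COMM_WORLD.Get_rank()
os.environ["BH_OPENMP_CACHE_DIR"] = "/tmp/bh_cache_rank_%d" % rank

import bohrium  # the cache directory is picked up when bohrium is imported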

dionhaefner commented 5 years ago

I guess that would fix it, but I'd still like all processes to read from the global cache (just not write to it if RANK > 0).

dionhaefner commented 5 years ago

This is getting out of hand when running many processes (e.g. 128) in parallel. I have to try 4-6 times before I can get a run started that doesn't trigger the race condition.

madsbk commented 5 years ago

Is #606 enough? Or do we have to detect this automatically? There is no portable way to detect the MPI rank, but we could try to look for OMPI_COMM_WORLD_RANK, PMI_RANK, PMI_ID, etc. Or maybe have Python set CACHE_READONLY before importing Bohrium.
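Something along these lines for the detection (just a sketch; it only covers the launchers I know set these variables):

import os

def guess_mpi_rank():
    # rank variables set by common launchers (Open MPI, MPICH/Slurm PMI)
    for var in ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "PMI_ID"):
        if var in os.environ:
            return int(os.environ[var])
    return None  # no recognized launcher, probably not an MPI run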

dionhaefner commented 5 years ago

It should be enough for now; I can just set the environment variable from within Python (by the way, is there a way to do runtime configuration via the Python API instead of handling environment variables?).

However, I think this is not only related to the persistent cache but also to the temporary cache workdirs (in /tmp/bh_abcd). Maybe a bad hash? I'll try to reproduce it so you can see the error messages.

dionhaefner commented 5 years ago

Got it:

Warning: boost::filesystem::create_directory: Permission denied: "/tmp/bh_087a/src" (1 attempt)
boost::filesystem::create_directory: Permission denied: "/tmp/bh_087a/src"
Warning: boost::filesystem::create_directory: Permission denied: "/tmp/bh_087a/src" (2 attempt)
boost::filesystem::create_directory: Permission denied: "/tmp/bh_087a/src"
Warning: boost::filesystem::create_directory: Permission denied: "/tmp/bh_087a/src" (3 attempt)
boost::filesystem::create_directory: Permission denied: "/tmp/bh_087a/src"
Warning: boost::filesystem::create_directory: Permission denied: "/tmp/bh_087a/src" (4 attempt)
boost::filesystem::create_directory: Permission denied: "/tmp/bh_087a/src"
terminate called after throwing an instance of 'boost::filesystem::filesystem_error'
  what():  boost::filesystem::create_directory: Permission denied: "/tmp/bh_087a/src"
[node441.cluster:mpi_rank_26][error_sighandler] Caught error: Aborted (signal 6)
srun: error: node441: task 26: Aborted (core dumped)

If I look at the parent directory of the one it tries to create, it already exists but belongs to another user (hence the permission error), so this does seem to be a hash collision.

madsbk commented 5 years ago

@dionhaefner, can I get you to apply this change and see if it fixes the problem?

diff --git a/core/jitk/codegen_util.cpp b/core/jitk/codegen_util.cpp
index cdaf35fc..ab9a3ad2 100644
--- a/core/jitk/codegen_util.cpp
+++ b/core/jitk/codegen_util.cpp
@@ -65,10 +65,10 @@ boost::filesystem::path get_tmp_path(const ConfigParser &config) {
     // On some systems `boost::filesystem::unique_path()` throws a runtime error
     // when `LC_ALL` is undefined (or invalid). In this case, we set `LC_ALL=C` and try again.
     try {
-        unique_path = boost::filesystem::unique_path("bh_%%%%");
+        unique_path = boost::filesystem::unique_path("bh_%%%%_%%%%_%%%%");
     } catch(std::runtime_error &e) {
         setenv("LC_ALL", "C", 1); // Force C locale
-        unique_path = boost::filesystem::unique_path("bh_%%%%");
+        unique_path = boost::filesystem::unique_path("bh_%%%%_%%%%_%%%%");
     }
     return tmp_path / unique_path;
 }

Basically, it calls unique_path = boost::filesystem::unique_path("bh_%%%%_%%%%_%%%%"); with some more randomness.
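For a rough feeling for the numbers (back-of-the-envelope only, assuming each % becomes one of 16 hex digits):

from math import exp

def collision_probability(n_names, n_possible):
    # birthday-problem approximation: P ~= 1 - exp(-n*(n-1) / (2*N))
    return 1.0 - exp(-n_names * (n_names - 1) / (2.0 * n_possible))

print(collision_probability(128, 16**4))   # "bh_%%%%": ~0.12 for 128 fresh names
print(collision_probability(128, 16**12))  # "bh_%%%%_%%%%_%%%%": ~3e-11

And that does not even count the stale bh_* directories that other users leave behind in /tmp, which is exactly the collision in the log above.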

dionhaefner commented 5 years ago

Unfortunately, I have no simple way to rebuild Bohrium and our Python stack on the cluster, so testing is a bit difficult at these scales. More randomness sounds like a good idea, though. Is there a way to recover from a collision, e.g. by trying again with a different name?

madsbk commented 5 years ago

Does #609 fix the problem?

dionhaefner commented 5 years ago

Sounds like a good idea. I'll try it as soon as I can and let you know.

dionhaefner commented 5 years ago

Seems to work fine. I also moved the persistent cache to my user folder, which might be a sensible default (not much to gain from different users writing to / cleaning up a global cache at the same time).
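
For reference, this is roughly all I changed (assuming BH_OPENMP_CACHE_DIR is the right knob for the persistent cache, as discussed above; the path is just what I picked):

import os
# point the persistent JIT cache at my home directory instead of the shared install
os.environ["BH_OPENMP_CACHE_DIR"] = os.path.expanduser("~/.cache/bohrium")
import bohrium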