Closed: PedrooHR closed this issue 2 years ago.
Hi,
I'm testing VeloC with a heatdis example using the single mode (VELOC_Init_single) option. My cfg file contains: (see attachment). I'm using MPICH version 3.3.2 and the VeloC 1.4 release; I'm not launching veloc-backend before running my program, and I'm using a single machine.
The issue is that when I run my program and let the VeloC library start the backend by itself, my program doesn't finish (I think it gets stuck in the VELOC_Finalize function). The backend log seems to be normal. If I start the backend before running the program, everything goes fine.
Any idea of what is going on?
Hi @PedrooHR, can you please give us more details about your issue? Which heatdis test are you running, what did you change and what command lines are you using?
Hi @bnicolae, thanks for the reply.
This is the heatdis test; I changed a few lines of one of the FTI tests. And this is the config file.
It is not a problem that occurs only with this example; it occurs with any application where I follow the same steps. The command lines are the following:
mpicc heatdis.c -o heatdis -lveloc-client -lm
mpirun -np 3 ./heatdis 4
Any time I run this without an active veloc-backend (letting the VeloC lib launch it, as available in the VeloC 1.4 release), the application gets stuck at the end.
Logs:
As the log shows, after "Execution finished in ... seconds." the program won't finish.
If I launch the backend beforehand, or one is already active, or sync mode is used in the config file, everything goes fine.
I'm a researcher, and in our project we are leveraging VeloC as a checkpoint library as part of our MPI fault-tolerance lib. This FT lib will be transparent to the final user, so it would be nice to be able to use this VeloC 1.4 feature.
@PedrooHR, your test program has multiple problems:
(1) you are specifying a relative path in the config file for persistent and scratch (which means the client and the backend may use different directories, depending on where they are launched from);
(2) you do not check the result of any VELOC operation (which means the initialization or checkpointing may fail and you simply continue regardless);
(3) you are using a hardcoded config file name, again with a relative path (is it in the same directory you run your program from?).
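For reference, a minimal config with absolute paths would look something like this (a sketch; scratch, persistent, and mode are the standard VeloC options, and the /tmp paths are just placeholders):
scratch = /tmp/scratch
persistent = /tmp/persistent
mode = async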
Hi @bnicolae, I've changed all paths to absolute ones, both in the config file (now /tmp/scratch and /tmp/persistent) and in the cpp file (in the VELOC_Init_single call). And I'm now checking every VeloC function following this example pattern:
if (VELOC_Checkpoint("heatdis", ++v) != VELOC_SUCCESS) {
printf("CP Failed\n");
return 1;
}
I'm also not sure I understood what you mean by "hardcoded config file name" in (3).
The backend and application are running on the same machine.
The problem is still the same as before.
Did you also check VELOC_Init_single to make sure it returns VELOC_SUCCESS? By (3) I mean you specify "heatdis.cfg" in VELOC_Init_single, which is relative to the current directory. Can you please attach the log of the active backend?
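For reference, checking the init call with an absolute config path would look something like this (a sketch; it assumes VELOC_Init_single takes a caller-chosen unique id plus the config path, and the absolute path below is hypothetical):
if (VELOC_Init_single(0, "/home/user/heatdis.cfg") != VELOC_SUCCESS) {
    printf("Init failed\n");   // bail out instead of continuing uninitialized
    return 1;
}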
I've checked VELOC_Init_single too; all functions are working.
I've already changed the cfg file path passed to VELOC_Init_single to an absolute path in the previous comment.
Here is the log of the active backend with the latest modifications. (As you can see, it's similar to what was reported in this comment.)
Thanks in advance.
Ok, in that case you may want to check where you installed VELOC. Did you install a previous version too? Maybe the client is running an old veloc-backend. Make sure you run "export VELOC_BIN=<veloc_install_dir>/bin" so that the client picks up the right backend.
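For example (a sketch; this assumes VELOC_BIN should point at the directory containing the veloc-backend binary, with <veloc_install_dir> standing in for the actual install path):
export VELOC_BIN=<veloc_install_dir>/bin
ls "$VELOC_BIN/veloc-backend"   # confirm this is the new installation's binary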
Yes, I've installed previous versions of VeloC, but I'm sure that the backend is from this latest version. I've not set the VELOC_BIN env variable, but I have <veloc_install_dir>/lib in my LD_LIBRARY_PATH and LIBRARY_PATH, and <veloc_install_dir>/bin in my PATH, so the lib can find veloc-backend through PATH. I've double-checked, and only one bin path and one lib path can be reached (those of a new, cleaner installation I made); the previous installations of VeloC are not reachable.
As I said, I've made a new installation of VeloC (which veloc-backend returns the bin path of the new installation).
The ctest --verbose run is fine, as you can check here.
This is the new installation log ($ ./auto-install.py install).
I'm also having the same problem running inside the Docker container we use in our CI (the container has the latest versions of VeloC and MPICH). I understand these problems could be specific to my personal computer, but they should not happen inside the container, which has only one VeloC version installed.
Ok, can you please share your full code and build/test script as a zip? I'll try to see if I can reproduce this problem.
Hi @bnicolae,
Here is the zip with the test case; just run make and make test (with no veloc-backend active) to test the program.
Thanks.
@PedrooHR I cannot reproduce this error. For me, VeloC is working just fine. Can you tell me more about your setup? What Linux distribution are you using? Maybe you are using a customized older version (1.65.1) of Boost? Normally auto-install should download and use the latest version automatically (which as of now is 1.74.0).
Alternatively you can try to build VELOC without Boost, like so:
./auto-install.py --protocol socket_queue
Hi @bnicolae, I've tested
./auto-install.py --protocol socket_queue <install_dir>
and I'm having the same problem.
I've checked the installation log: on my PC VeloC was using Boost version 1.65.1, but in the container (where I experienced the same problem) the Boost version was the latest one, 1.74.0.
My notebook runs Ubuntu 18.04, and the container is also based on Ubuntu 18.04.
@PedrooHR, can you try to run a different distribution in your container? If this doesn't work, can you provide a Dockerfile so I can recreate your container?
Hi @bnicolae,
Here is the container we are using in our research. VeloC is installed under /opt/veloc/. You can check line 46 in the image layers to see how VeloC was installed.
This image doesn't contain the test I sent you a few messages ago, so you will probably need to copy the test into the container.
Thanks.
Just updating: if you could not get the container from the link above in time, you can look here and search for ubuntu18.04-cuda10.2-mpich to find the right container.
Thanks.
Thanks @PedrooHR, I'll take a look next week after SC20 is over; lots of stuff going on right now.
@PedrooHR, I can confirm your issue. However, this is not due to VELOC; it's because of Hydra, the process launcher of mpich. Apparently it keeps track of all processes launched through MPI, including the processes spawned by the MPI ranks, and waits for them to finish. Normally, it should instead track process sessions (so that you can "detach" processes just like daemons do). In any case, we will discuss this with the mpich team. Until then, you can try newer versions of mpich (maybe this is fixed), switch to OpenMPI, or simply launch the backend in a script you supply to mpirun, like this:
$ mpirun -np N <script.sh> <app> <parameters>
script.sh:
#!/bin/bash
# Start the VeloC backend in the background, then run the application
# (passed as this script's arguments) in the foreground.
veloc-backend &
"$@"
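For example, with the heatdis run from earlier, the invocation would look like this (assuming script.sh is executable and sits in the working directory):
chmod +x script.sh
mpirun -np 3 ./script.sh ./heatdis 4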
Thanks @bnicolae, I'll try that.
@PedrooHR: we have a new mode for VELOC where the active backends run as threads inside existing MPI ranks of the application (one rank per node is elected as leader to run the active backend). You can try that out by setting the "threaded = true" configuration option in the VELOC config file. Let me know if this works.
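For example, extending the config sketch from earlier (again a sketch; only the threaded = true line is the new option, the other keys are as before):
scratch = /tmp/scratch
persistent = /tmp/persistent
mode = async
threaded = true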