ECP-VeloC / VELOC

Very-Low Overhead Checkpointing System
http://veloc.rtfd.io
MIT License

Program not finishing in async mode #33

Closed PedrooHR closed 2 years ago

PedrooHR commented 4 years ago

Hi,

I'm testing VeloC with a heatdis example in single mode (initialized with VELOC_Init_single). My cfg file contains:

scratch = /tmp/scratch
persistent = /tmp/persistent
mode = async

I'm using MPICH version 3.3.2 and the VeloC 1.4 release. I'm not launching veloc-backend before running my program, and I'm using a single machine.

The issue is that when I run my program and let the VeloC library start the backend by itself, my program doesn't finish (I think it gets stuck in the VELOC_Finalize function). The backend log seems to be normal.

If I start the backend before running the program everything goes fine.

Any idea of what is going on?

bnicolae commented 4 years ago

Hi @PedrooHR, can you please give us more details about your issue? Which heatdis test are you running, what did you change and what command lines are you using?

PedrooHR commented 4 years ago

Hi @bnicolae, thanks for the reply.

This is the heatdis test. I changed a few lines from one of the FTI tests. And this is the config file.

This problem does not occur only with this example; it occurs with any application where I follow the same steps. The command lines are the following:

  1. mpicc heatdis.c -o heatdis -lveloc-client -lm
  2. mpirun -np 3 ./heatdis 4

Any time I run this without an active veloc-backend (letting the VeloC lib launch it, as available in the VeloC 1.4 release), the application gets stuck at the end.

Logs:

In the log, the program does not exit after the line "Execution finished in ... seconds."

If I launch the backend before, or one is active, or the sync mode is used in the config file, everything goes fine.

I'm a researcher, and in our project we are leveraging VeloC as a checkpoint library that is part of our MPI Fault Tolerance lib. This FT lib will be transparent to the final user, so it would be nice to use this feature of VeloC 1.4.

bnicolae commented 4 years ago

@PedrooHR, your test program has multiple problems: (1) you are specifying a relative path in the config file for persistent and scratch (which means the client and the backend may use different directories, depending on where they are launched from); (2) you do not check the result of any VELOC operation (which means the initialization or checkpointing may not be successful but you don't care and simply continue); (3) you are using a hardcoded config file name, again with a relative path (is it in the same directory where you run your program from?)

PedrooHR commented 4 years ago

Hi @bnicolae, I've changed all paths to absolute paths, both in the config file (using /tmp/scratch and /tmp/persistent) and in the cpp file (in the VELOC_Init_single call). I'm also now checking every VeloC function, following this example pattern:

if (VELOC_Checkpoint("heatdis", ++v) != VELOC_SUCCESS) { 
   printf("CP Failed\n"); 
   return 1; 
} 

I'm also not sure I understood what you mean by "hardcoded config file name" in (3).

The backend and application are running on the same machine.

The problem is still the same as before.

bnicolae commented 4 years ago

Did you also check VELOC_Init_single to make sure it returns VELOC_SUCCESS? By (3) I mean you specify "heatdis.cfg" in VELOC_Init_single, which is relative to the current directory. Can you please attach the log of the active backend?
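To illustrate point (3), here is a minimal C sketch that turns a possibly relative config file name into an absolute path before it is handed to the checkpointing library. The helper name `resolve_config_path` is hypothetical and not part of VeloC; the sketch only uses the POSIX `realpath` call and assumes the file already exists.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not a VeloC function): resolve cfg, which may be
 * relative to the current directory, into an absolute path stored in out.
 * out must provide at least PATH_MAX bytes.
 * Returns 0 on success, -1 if the path cannot be resolved. */
int resolve_config_path(const char *cfg, char *out)
{
    assert(cfg != NULL && out != NULL);
    if (realpath(cfg, out) == NULL) {
        perror(cfg);   /* e.g. the file does not exist in this directory */
        return -1;
    }
    return 0;
}
```

The resolved string could then be passed as the config file argument of VELOC_Init_single, so the client no longer depends on the directory the program was started from.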

PedrooHR commented 4 years ago

I've checked VELOC_Init_single too; all functions are working.

I've already changed the cfg file path to be an absolute path in VELOC_Init_single in the previous comment.

Here is the log of the active backend with the last modifications. (As you can see, it's similar to the one reported in this comment.)

Thanks in advance.

bnicolae commented 4 years ago

Ok, in that case you may want to check where you installed VELOC. Did you install a previous version too? Maybe the client is running an old veloc-backend. Make sure you run "export VELOC_BIN=". If that does not help either, try running "ctest --verbose" in your "build" folder. If this fails, please include a log of the auto-install.py script (used to compile and install VELOC).

PedrooHR commented 4 years ago

Yes, I've installed previous versions of VeloC, but I'm sure that the backend is from this latest version. I haven't set the VELOC_BIN environment variable, but I have <veloc_install_dir>/lib in my LD_LIBRARY_PATH and LIBRARY_PATH, and <veloc_install_dir>/bin in my PATH, so the lib can find veloc-backend from PATH. I've double-checked, and only one bin and lib path can be reached (from a new, clean installation I made); the previous installations of VeloC are not reachable.

As I said, I've made a new installation of Veloc (which veloc-backend returns the bin path of the new installation).
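The same check that `which veloc-backend` performs can be sketched in C, which may be handy inside a test harness that wants to log which backend binary would be picked up. The function name `find_in_path` is hypothetical, and the sketch assumes the backend is located through PATH as described above; it does not consult VELOC_BIN.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: search the directories in PATH, in order, for an
 * executable regular file called name. On success the full path is written
 * to out and 0 is returned; otherwise out is emptied and -1 is returned. */
int find_in_path(const char *name, char *out, size_t out_len)
{
    assert(name != NULL && out != NULL && out_len > 0);

    const char *path = getenv("PATH");
    if (path == NULL) {
        out[0] = '\0';
        return -1;
    }

    for (;;) {
        size_t dir_len = strcspn(path, ":");
        /* An empty PATH entry stands for the current directory. */
        int n = (dir_len == 0)
            ? snprintf(out, out_len, "./%s", name)
            : snprintf(out, out_len, "%.*s/%s", (int)dir_len, path, name);

        struct stat st;
        if (n > 0 && (size_t)n < out_len &&
            stat(out, &st) == 0 && S_ISREG(st.st_mode) &&
            access(out, X_OK) == 0)
            return 0;              /* first match wins, like `which` */

        if (path[dir_len] == '\0')
            break;                 /* last PATH entry checked */
        path += dir_len + 1;       /* skip past the ':' separator */
    }

    out[0] = '\0';
    return -1;
}
```

Calling `find_in_path("veloc-backend", buf, sizeof buf)` and printing `buf` shows the first match, which is the one an older leftover installation would shadow if its bin directory came earlier in PATH.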

The ctest --verbose is fine, as you can check here

This is the new installation log ($ ./auto-install.py install).

I'm also having the same problem running inside the docker container we use in our CI (the container has the latest version of Veloc and MPICH). I understand those problems could be a thing on my personal computer, but they should not happen inside the container with only one version.

bnicolae commented 4 years ago

Ok, can you please share your full code and build/test script as a zip? I can try to see if I can reproduce this problem

PedrooHR commented 4 years ago

Hi @bnicolae

Here is the zip with the test case. Just run make and then make test (with no veloc-backend active) to test the program.

Thanks.

bnicolae commented 4 years ago

@PedrooHR I cannot reproduce this error. For me, VeloC is working just fine. Can you tell me more about your setup? What Linux distribution are you using? Maybe you are using a customized older version (1.65.1) of Boost? Normally auto-install should download and use the latest version automatically (which as of now is 1.74.0).

Alternatively you can try to build VELOC without Boost, like so: ./auto-install.py --protocol socket_queue

PedrooHR commented 4 years ago

Hi @bnicolae, I've tested ./auto-install.py --protocol socket_queue <install_dir> and I'm having the same problem.

I've checked the installation log. On my PC, VeloC was using version 1.65.1 of Boost, but in the container (where I experienced the same problem) the version of Boost was the latest one, 1.74.0.

My notebook runs Ubuntu 18.04, and the container is also based on Ubuntu 18.04.

bnicolae commented 4 years ago

@PedrooHR, can you try to run a different distribution in your container? If this doesn't work can you provide a Docker file so I can recreate your container?

PedrooHR commented 4 years ago

Hi @bnicolae

Here is the container we are using in our research. Veloc is installed under /opt/veloc/. You can check line 46 in the Image layers to see how veloc was installed.

This image doesn't contain the test I sent you a few messages ago, so you will probably need to copy the test into the container.

Thanks.

PedrooHR commented 3 years ago

Just updating, if you could not get the container from the link above in time, you can look here, search for ubuntu18.04-cuda10.2-mpich for the right container.

Thanks.

bnicolae commented 3 years ago

Thanks @PedrooHR, I'll take a look next week after SC20 is over, lots of stuff going on right now

bnicolae commented 3 years ago

@PedrooHR, I can confirm your issue. However, this is not due to VELOC, it's because of Hydra, the process launcher of mpich. Apparently they keep track of all processes launched through MPI, including the processes launched by the MPI ranks, and wait for them to finish. Normally, they should not do this but rather keep track of process sessions (so that you can "detach" processes just like daemons do). In any case, we will discuss this with the mpich team. Until then, you can try newer versions of mpich (maybe this is fixed), switch to OpenMPI or simply launch the backend in a script you supply to mpirun like this:

$ mpirun -np N <script.sh> <app> <parameters>

script.sh:
#!/bin/bash
# Start the active backend in the background, then run the application with its arguments.
veloc-backend &
"$@"

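For readers unfamiliar with the "detach like a daemon" idea mentioned above, here is a minimal C sketch of the classic pattern: fork, start a new session with `setsid`, fork again, and drop the inherited standard streams before `execvp`. The function name `spawn_detached` is hypothetical and this is not how VeloC launches its backend. Whether a process detached this way is enough for Hydra to stop waiting is not established in this thread.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0] as a daemon-like process in its own
 * session. Returns 0 once the detached process has been forked, -1 on error.
 * A return value of 0 does not confirm that the exec itself succeeded. */
int spawn_detached(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* First child: lead a new session without a controlling terminal. */
        if (setsid() < 0)
            _exit(1);

        /* Second fork: the grandchild is re-parented (typically to init)
         * as soon as this intermediate process exits. */
        pid_t inner = fork();
        if (inner < 0)
            _exit(1);
        if (inner > 0)
            _exit(0);

        /* Grandchild: replace inherited stdio with /dev/null so that no
         * pipe shared with the launcher stays open. */
        int devnull = open("/dev/null", O_RDWR);
        if (devnull >= 0) {
            dup2(devnull, STDIN_FILENO);
            dup2(devnull, STDOUT_FILENO);
            dup2(devnull, STDERR_FILENO);
            if (devnull > STDERR_FILENO)
                close(devnull);
        }
        execvp(argv[0], argv);
        _exit(127);                /* only reached if exec failed */
    }

    /* Parent: reap the short-lived first child so no zombie is left. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The double fork means the caller never has to wait for the long-running process, only for the intermediate child, which exits immediately.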

PedrooHR commented 3 years ago

Thanks @bnicolae I'll try that.

bnicolae commented 3 years ago

@PedrooHR: we have a new mode for VELOC where the active backends run as threads in existing MPI ranks of the application (one rank per node is elected as leader to run the active backend). You can try that out by setting the "threaded = true" configuration option in the VELOC config file. Let me know if this works.