LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

Account for BBAPI post-stage in SCR #243

Open adammoody opened 4 years ago

adammoody commented 4 years ago

Using its post-stage functionality, the BBAPI will continue to transfer files in the background even after an application ends. LSF allows the user to register post-stage scripts in the bsub command to detect and act on the status of post-stage transfers. To allow for BBAPI transfers at the end of an SCR job, we need to write some logic to finalize those transfers. There are numerous items to address.

In AXL, we'd need to effectively wait/resume on each of those transfers in order to finish them, e.g., rename files from their temporary to their final file names and set metadata on the final files: https://github.com/ECP-VeloC/AXL/issues/75

In SCR, we'd need to update the SCR index file to mark the transfer as good or bad. Assuming that most transfers succeed, it would be nice to wait and mark the checkpoint as valid, so that any subsequent run could then restart from that checkpoint. However, that also introduces a race condition in which job 1 flushes a checkpoint in post-stage, but job 2 starts up before that transfer has completed. In that case, job 2 would restart from an older checkpoint, and then it would likely rewrite and reflush the same checkpoint as job 1, perhaps while the system is still busy flushing job 1's checkpoint. At that point, we'd have two different jobs trying to write the same checkpoint, and that's going to break things.

We could update the SCR scavenge logic to use AXL/BBAPI to start a transfer and exit the job allocation instead of synchronously copying files to the file system at the end of the allocation. We'd need a post-stage script that both waits on the transfer to finish and executes any rebuild logic, as our scavenge normally does.

We need to modify the existing scavenge logic to detect and deal with async transfers in case the dataset it's trying to scavenge is the one being transferred. We'd want to at least add the redundancy files. That could be done as a separate transfer, or we could cancel the first and restart it after adding the redundancy files.

As a short term fix, we can modify SCR to avoid starting any async transfers when the job gets close to the end of its allocation time limit. All transfers would switch to use synchronous mode.

tonyhutter commented 4 years ago

Copying my comment from https://github.com/ECP-VeloC/AXL/issues/75#issuecomment-693626517:

One idea is to have the post-stage script call axl_cp -U <state_file> to "finalize" the transfers (mark them as done, rename the files to their final filenames). This would work right now without any code changes, but it is a little weird to have a user app call a test program like axl_cp. Another idea would be to create a new axl_finalize <state_file> binary whose sole purpose would be to finalize transfers. The post-stage script could then call that to finalize the transfers.
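
For example, an AXL-only post-stage script under the first idea could be as small as this sketch (the state file location is illustrative):

#!/bin/sh
# Finalize each recorded transfer: mark it done and rename files from
# their temporary names to their final names.
for sf in /path/to/state_files/*.state_file; do
    axl_cp -U "$sf"
done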

tonyhutter commented 4 years ago

Copying my comment from https://github.com/ECP-VeloC/AXL/issues/75#issuecomment-696956530

The more I think about this, the more I like this setup:

  1. The post-stage script for AXL-only users would be axl_cp -U <state_file>. This requires no code changes.

  2. The post-stage script for SCR users would be a new scr_finalize command we'd add to SCR. That command would just call AXL_Resume()/AXL_Wait() to finalize transfers. I think it makes sense to have the SCR post-stage command in SCR itself, since that's the software the user is interacting with (rather than AXL, which is just a low-level component they shouldn't need to concern themselves with). It also means the user doesn't need AXL binaries in their $PATH for post-stage; they would only need the path to SCR's binaries, which would be less awkward for them to set up. Also, since SCR would be managing the user's state_file (or multiple state_files), you could just call scr_finalize with no arguments in the post-stage script and SCR would be smart enough to finalize all the user's SCR transfers for their SCR config.
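
With that, the user-visible piece stays tiny; a sketch of the post-stage script, assuming the hypothetical scr_finalize above:

#!/bin/sh
# scr_finalize would discover every outstanding transfer for the user's
# SCR config and AXL_Resume()/AXL_Wait() each one; no arguments needed.
exec scr_finalize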

tonyhutter commented 4 years ago

> The post-stage script for SCR users would be a new scr_finalize command we'd add to SCR.

We could also call it scr_poststage

adammoody commented 4 years ago

> The post-stage script for SCR users would be a new scr_finalize command we'd add to SCR.
>
> We could also call it scr_poststage

Yes, I like the idea of an SCR variant like an scr_poststage command. SCR will have additional work to do that a standalone AXL user does not. In particular, we need to update the status of the dataset in the SCR index file after the transfer ends (success or failure). We'll need to add code in SCR to track all outstanding transfers (and which transfer handles belong to which dataset id) and then issue axl_cp for each one. Then depending on the status as each finishes, we can run scr_index --add and scr_index --current commands to update the index file. Eventually, for scavenge operations, we'll also want to attempt an scr_index --build within the scr_poststage script.
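
For example, that per-dataset sequence might look something like this for dataset id 1 (one state file per rank assumed; the option spellings are guesses at this point):

# Finalize each rank's transfer, then update the index on success.
for sf in $SCR_PREFIX/.scr/scr.dataset.1/*.state_file; do
    axl_cp -U "$sf" || exit 1
done
scr_index -p $SCR_PREFIX --add=ckpt.1
scr_index -p $SCR_PREFIX --current=ckpt.1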

To start, let's focus on getting things working for a standalone AXL user. To emulate what a standalone user will be doing, we can use a test that runs a multi-process MPI job, where each process transfers a file (e.g., part of a checkpoint). We should extend that test to also handle a case where each process has started multiple outstanding transfers (e.g., each process wrote a checkpoint file and an output file, but those came at two different times in the program, so they were started as separate transfers).

adammoody commented 4 years ago

A large job will end up with lots of transfer handles, and we should test this at scale. For example, a single transfer at Sierra scale, running one MPI process per core, would lead to 4096*40 = 163,840 transfer handles. A job may have k outstanding transfers, which adds another factor to that expression: 4096*40*k.

We haven't stress tested the IBM software for poststage operations, so there may be scalability bugs or performance issues hiding for us to uncover.

tonyhutter commented 4 years ago

Just speaking to scr_poststage implementation details:

The best-case scenario is for the user to call scr_poststage (provided by SCR) and have it handle everything. That is, the user wouldn't need to keep track of a state_file or list of src/dst paths to pass back to scr_poststage. They would just call bsub -stage storage=X:out=scr_poststage and the transfer completions would "just work".

In order to make that happen, we'd need a way for scr_poststage to look up the transfers for its parent job. The minimum information needed would be a state_file or a list of the src/dst paths. Could scr_poststage just look in the SCR prefix directory (which would either be $SCR_PREFIX or the working directory of the post-stage script) to look up that info? Is there enough information in the prefix directory to get the list of src/dst paths for the transfers? Or could we possibly shove the state_files into the prefix directory?

Also, I dunno if this is relevant, but I noticed that the post-stage environment has LSF_STAGE_JOBID and LSB_SUB_JOB_ID set to the parent's job ID. So there is a way for the post-stage script to know its parent job ID if we need that.
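
For example, the post-stage script could do something as simple as:

# LSB_SUB_JOB_ID is set in the post-stage environment (see above).
echo "finalizing transfers for parent job $LSB_SUB_JOB_ID"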

adammoody commented 4 years ago

Some things that will come into play here.

SCR maintains a "flush" file that tracks the set of dataset ids in cache that could be flushed and the status for each of those datasets, e.g., currently flushing via async or already flushed. This file is stored at $SCR_PREFIX/.scr/flush.scr.

Then for each dataset, there is a directory where SCR stores any metadata specific to that dataset. For dataset id = X, this is at $SCR_PREFIX/.scr/scr.dataset.X. We could put axl state files in there.

So one can look at the flush file to figure out whether any datasets are being transferred, and then for any dataset, look in its dataset directory for the state files.
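
Concretely, a post-stage script could inspect those two spots, e.g. (kvtree_print_file is the helper used in the dumps later in this thread; the flush.scr layout is SCR-internal):

# Which datasets are in cache and their flush status:
$ ~/kvtree_print_file $SCR_PREFIX/.scr/flush.scr
# Per-dataset metadata, including any axl state files we put there:
$ ls $SCR_PREFIX/.scr/scr.dataset.X/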

tonyhutter commented 4 years ago

@adammoody thanks, having a $SCR_PREFIX/.scr/scr.dataset.X/state_file seems like the way to go then. I'm trying to put together a test that will exercise all the moving parts...

tonyhutter commented 3 years ago

Quick update -

I'm currently able to do a checkpoint, cancel the transfer midway through, finish the transfer, and then manually create the summary file and add it to the index:

# Use test_api to create a checkpoint, but AXL_DEBUG_PAUSE_AFTER will halt the AXL
# transfer partway through to simulate the job ending.
$ AXL_DEBUG_PAUSE_AFTER=1 SCR_CONF_FILE=~/myscr.conf  ./test_api

# Verify that we can't load from the partially transmitted checkpoint
$ ./scr_index -l -p $BBPATH
DSET VALID FLUSHED             CUR NAME
   1 NO                            ckpt.1

# "Finish" the transfer by renaming the file to it's final name.  This is simulating
# the "BBAPI transferred the file in the background between job runs"
$ mv $BBPATH/ckpt.1/rank_0.ckpt._AXL $BBPATH/ckpt.1/rank_0.ckpt     

# Create a summary.scr file for our "finished" transfer, and update the flush file's
# "LOCATION" entry to say the file is flushed.

$ ./scr_flush_file --d $BBPATH -s 1 -S
ckpt.1

# Add our checkpoint into the index 
$ ./scr_index -p $BBPATH --add=ckpt.1
Found `ckpt.1' as dataset 1 at $BBPATH/.scr/scr.dataset.1
Adding `ckpt.1' to index

# Verify our index now sees the finished checkpoint
$ ./scr_index -l -p $BBPATH
DSET VALID FLUSHED             CUR NAME
   1 YES   2020-11-09T17:05:01     ckpt.1

The next step would be to actually do the AXL resume in scr_flush_file (the AXL_Init/AXL_Create/AXL_Resume steps). The issue with that is that AXL requires you to pass the same xfer type as is in the state_file if you want to resume. We currently don't store this in any of the kvtrees, so scr_flush_file wouldn't know what the old xfer type was. The simplest solution would be to add a new AXL_XFER_STATE_FILE transfer type to AXL to tell it to "just use the old transfer type listed in the state_file". I've opened a PR to do just that: https://github.com/ECP-VeloC/AXL/pull/86

tonyhutter commented 3 years ago

Update: I've now added an -r option to scr_flush_file to AXL_Resume() a checkpoint transfer for a particular dataset. I'm getting closer to actually writing the scr_poststage script that calls scr_flush_file/scr_index.
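
For example, something like this (reusing the flags from my earlier scr_flush_file example; the exact spelling of the resume invocation may change):

# Resume and wait on dataset 1's transfers via AXL_Resume()/AXL_Wait():
$ ./scr_flush_file --d $BBPATH -s 1 -r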

> However, that also introduces a race condition in which job 1 flushes a checkpoint in post-stage, but job 2 starts up before that transfer has completed. In that case, job 2 would restart from an older checkpoint, and then it would likely rewrite and reflush the same checkpoint as job 1, perhaps while the system is still busy flushing job 1's checkpoint. At that point, we'd have two different jobs trying to write the same checkpoint, and that's going to break things.

I think now is the time to decide what the behavior should be. The simplest would be to just cancel any ongoing transfers in SCR_Init(). That might be the way to go, as the user could simply call scr_poststage as their stage-in script if they wanted to wait for the old transfers before their job runs. That could run into issues, though. It's possible someone could execute a stage-in scr_poststage to resume the transfer, and then a few minutes later the previous stage-out scr_poststage could run. One possible solution would be to have scr_poststage set some state to only allow one scr_poststage instance per dataset to run at any time (maybe using a flock() on the dataset's state_file?).
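
As a sketch, the script-side equivalent of that flock() idea, using flock(1) and a hypothetical per-dataset lock file:

(
    # Bail out if another scr_poststage already holds this dataset's lock.
    flock -n 9 || exit 0
    # ... resume and finalize this dataset's transfers here ...
) 9>>$SCR_PREFIX/.scr/scr.dataset.1/poststage.lock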

tonyhutter commented 3 years ago

Side note - since users may want to call scr_poststage as their stage-in script (to wait for the old checkpoint transfer to complete), we may want to consider giving it a different name.

adammoody commented 3 years ago

Based on my understanding of the IBM BB software, I don't think that second job allocation can either cancel or wait on a transfer that was started in an earlier job allocation. It could probably sync by waiting for a state change in some SCR file though.

adammoody commented 3 years ago

In addition to the race condition, we have a couple more details to figure out even for a single job.

For a multi-rank job, we need to come up with a scheme to know whether the transfers succeeded for all ranks. Consider a two process job. If our scr_poststage script only sees one state file, and if the transfer associated with that file succeeds, is the transfer as a whole good or not? It could be that the second process had no files to transfer, so that it never started a transfer and never wrote a state file. On the other hand, maybe it did have files to transfer, but it failed before it wrote out its state file. In the first case, the transfer is complete and in the second case it is not. We'll need to be able to distinguish between those two.
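
For example, both cases leave identical artifacts on disk for a hypothetical two-process job where only rank 0 wrote a state file:

ckpt.7/
    rank_0.ckpt._AXL
    rank_0.state_file

Nothing here tells us whether rank 1 had no files to transfer or died before writing rank_1.state_file.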

I think we'll want to be able to support multiple transfers. For example, the job might have transferred two datasets, say a checkpoint and an output dataset, just before it shuts down. In this case, each process will have multiple state files, one for each of its transfers.

tonyhutter commented 3 years ago

> Based on my understanding of the IBM BB software, I don't think that second job allocation can either cancel or wait on a transfer that was started in an earlier job allocation.

In general this is true. There is a hack though: if you export LSB_JOBID=<your old job ID> and know your old transfer handle, you can look up info on the old transfer and cancel it (I don't know about waiting on it, but I assume yes). I tested this out back in May, so they may have updated the BB software since then to prevent it. I dunno if we would ever actually want to do this, since it's not exactly kosher to fake the job ID, even temporarily. If we don't want to do that, we may be able to "cancel" a transfer from SCR's perspective by simply deleting the state_files from the index dir. Those transfers will finish at some point, but it doesn't really matter, as they'll be missing their state_file, and their transfers would never be finalized. Thus, they wouldn't be seen as valid checkpoints by SCR. You'd just have some leftover rank_X.ckpt._AXL files lying around.

> For a multi-rank job, we need to come up with a scheme to know whether the transfers succeeded for all ranks. Consider a two process job. If our scr_poststage script only sees one state file, and if the transfer associated with that file succeeds, is the transfer as a whole good or not? It could be that the second process had no files to transfer, so that it never started a transfer and never wrote a state file. On the other hand, maybe it did have files to transfer, but it failed before it wrote out its state file. In the first case, the transfer is complete and in the second case it is not. We'll need to be able to distinguish between those two.

I was envisioning that scr_poststage would blindly process and finalize all the state_files that exist in the index directory. So let's say you had ranks 0 and 1 kick off a transfer of their dataset 6 checkpoints using BBAPI at the end of their allocation. The index would look like:

ckpt.6/
    rank_0.ckpt._AXL
    rank_1.ckpt._AXL
    rank_0.state_file
    rank_1.state_file

After the files transfer, scr_poststage runs. It sees rank_0.state_file and rank_1.state_file and finalizes them. Afterwards you have:

ckpt.6/
    rank_0.ckpt
    rank_1.ckpt

So far so good - you have a valid dataset. Now imagine the same thing happening, but rank 1 doesn't write its state file due to a failure. In that case, scr_poststage would only see rank_0.state_file and finalize it, leaving dataset 6 with only rank_0.ckpt. I would assume then that SCR would see dataset 6 as incomplete and fall back to loading dataset 5? If so, would that be the behavior we would want?

adammoody commented 3 years ago

Oh, I see. That might work, though there could be some rough edges that we'll want to clean up.

I had a different design in mind. My first thought was that our poststage script would follow the model of the scr_postrun/scavenge scripts and mark the state of the checkpoint in the index file based on whether it succeeded or failed to finalize everything. Really, we shouldn't be updating the index file to mark the checkpoint as valid unless we know that we successfully got all of the files. If files are missing for some rank, then when the user runs scr_index, they'd ideally see it marked as VALID=NO:

./scr_index -l -p $BBPATH
DSET VALID FLUSHED             CUR NAME
   1 NO    2020-11-09T17:05:01     ckpt.1

Having said that, if we do optimistically mark the dataset as COMPLETE=1 so that VALID=YES, the fetch logic in SCR that reads the checkpoint back in on restart should detect that some ranks are missing files and then update the index file to mark that checkpoint as FAILED, at which point the state would change to VALID=NO. The fetch logic would then fall back to attempt the next oldest checkpoint it can find.

We should review that code in SCR where we delete some files. It could use some double-checking and fresh testing since SCR was ported to the components.

If we go this route, to be clean about it, we should also consider defining a third state, like VALID=MAYBE. The reason is that these checkpoints often serve as output datasets. If the user sees VALID=YES, they will likely assume that they have all of their simulation output and stop running. VALID=YES is meant to imply that all files in the dataset have been successfully written to the parallel file system, and so they are likely to still be valid unless something external to SCR has since messed them up. This issue becomes more acute when we circle back to add scavenge rebuild support on top of BB transfers.
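
For example, a dataset that was finalized in post-stage but not yet verified complete might list as (hypothetical output):

./scr_index -l -p $BBPATH
DSET VALID FLUSHED             CUR NAME
   1 MAYBE 2020-11-09T17:05:01     ckpt.1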

Anyway, let's keep going with what you have in mind. That will give us a good starting point, even if we find we need to change it.

tonyhutter commented 3 years ago

@adammoody thanks, I see what you're saying now. Yea, having scr_poststage mark VALID=YES only if the dataset is complete would be the best way to do it. It'd be nice if we had a scr_index --scan option that scr_poststage could call that looks to see if all the expected checkpoint files are present and updates VALID accordingly. Let me look into how difficult that would be to implement.

adammoody commented 3 years ago

For scavenge, the scr_postrun script executes scr_index --build <id> for a similar kind of scan. However, that particular function expects to find a bunch of additional files that the scavenge logic creates that will not exist in this case: https://github.com/LLNL/scr/blob/f877d26e5e18c57306f018c6cab285b54647cdf5/scripts/common/scr_postrun.in#L144

Perhaps we could define a second type of scan or hook into that --build option. We'd need to read the rank2file map and check that each file exists.
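
A rough sketch of that check, assuming a hypothetical helper list_rank2file that prints each file path recorded in the dataset's rank2file map:

missing=0
for f in $(list_rank2file $BBPATH/.scr/scr.dataset.1); do
    [ -f "$f" ] || { echo "missing: $f"; missing=1; }
done
# Only mark the dataset in the index if nothing was missing.
[ $missing -eq 0 ] && ./scr_index -p $BBPATH --add=ckpt.1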

tonyhutter commented 3 years ago

More updates/braindump:

  1. When our scr_poststage script runs, it will only have the information in the state file to finish the transfer, like:
$ ~/kvtree_print_file ./.scr/scr.dataset.1/rank_11.state_file
  FILE
    /tmp/bblv_hutter2_132955/tmp/hutter2/scr.defjobid/scr.dataset.1/rank_11.ckpt
      STATUS
        1
      DEST
        /p/gpfs1/hutter2/prefix/ckpt.1/rank_11.ckpt._AXL
  STATE
    1
  STATUS
    1
  NAME
    ckpt.1
  TYPE
    2
  STATE_FILE
    /p/gpfs1/hutter2/prefix/.scr/scr.dataset.1/rank_11.state_file

It will not have access to the original source file in the burst buffer. So, we need to encode the file size into the state_file so that scr_poststage can know whether the file is complete or not. Additionally, we need a way to set the file mode bits when we finalize. Luckily, we already have axl_meta_encode()/axl_meta_apply(), which construct a kvtree of the metadata (including size/mode bits), so we just need to include that kvtree in our state_file (see the sketch after this list). Along with this, we should set AXL_COPY_METADATA to be on by default.

  2. We need scr_poststage to be the 2nd post-stage script. The out=script1,script2 arg allows for two post-stage scripts that get run at different times after the job. We want the 2nd one, since that's the one that runs after the transfer is done. Just use an empty comma for the first script, since we aren't using it, like:
bsub -nnodes 2 -stage storage=10:out=,scr_poststage ~/my_job_script
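
For the metadata in item 1, the state file could grow per-file SIZE and MODE entries along these lines (hypothetical layout; the values shown are illustrative):

$ ~/kvtree_print_file ./.scr/scr.dataset.1/rank_11.state_file
  FILE
    /tmp/bblv_hutter2_132955/tmp/hutter2/scr.defjobid/scr.dataset.1/rank_11.ckpt
      STATUS
        1
      DEST
        /p/gpfs1/hutter2/prefix/ckpt.1/rank_11.ckpt._AXL
      SIZE
        1048576
      MODE
        33188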

Just recording this info for posterity...