ECP-VeloC / AXL

Asynchronous Transfer Library
MIT License

Define an AXL BB post-stage script #75

Open adammoody opened 4 years ago

adammoody commented 4 years ago

The IBM burst buffer (BB) will transfer files even after the user job has ended. One can register a post-stage script that runs after the transfer ends. That transfer may either succeed or fail.

Can we provide a default AXL BB post-stage script that people can use?

For example, this script can wait for any outstanding transfer to end. For any transfer that succeeds, it might rename the temporary _AXL file to the correct name. For any transfer that fails, it might delete the temporary, incomplete _AXL file.
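
Roughly, the policy would be something like the sketch below. This is just an illustration: the destination directory, the _AXL naming convention, and the transfer_succeeded check are placeholders for however the script learns each transfer's outcome (from the BB software or from AXL state), which is exactly what needs to be worked out.

  #!/bin/bash
  # Sketch of the post-stage policy: finalize successful transfers, clean up failed ones.
  # DEST_DIR, the _AXL naming, and transfer_succeeded are placeholders, not AXL/BB interfaces.
  DEST_DIR=/p/gpfs1/$USER/ckpt

  for tmpfile in "$DEST_DIR"/*_AXL; do
    final="${tmpfile%_AXL}"             # strip the temporary _AXL suffix
    if transfer_succeeded "$tmpfile"; then
      mv "$tmpfile" "$final"            # success: rename temp file to its final name
    else
      rm -f "$tmpfile"                  # failure: delete the incomplete temp file
    fi
  done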

tonyhutter commented 3 years ago

One idea is to have the post-stage script call axl_cp -U <state_file> to "finalize" the transfers (mark them as done, rename the files to their final names). This would work right now without any code changes, but it is a little weird to have a user app call a test program like axl_cp. Another idea would be to create a new axl_finalize <state_file> binary whose sole purpose would be to finalize transfers. The post-stage script could then call that to finalize the transfers.

tonyhutter commented 3 years ago

The more I think about this, the more I like this setup:

  1. The post-stage script for AXL-only users would be axl_cp -U <state_file> (sketched after this list). This requires no code changes.

  2. The post-stage script for SCR users would be a new scr_finalize command we'd add to SCR. That command would just call AXL_Resume()/AXL_Wait() to finalize transfers. I think it makes sense to have the SCR post-stage command in SCR itself, since that's the software the user is interacting with (rather than AXL, which is just a low-level component they shouldn't need to concern themselves with). It also means the user doesn't need AXL binaries in their $PATH for post-stage; they would only need the path to SCR's binaries, which is less awkward for them to set up. Also, since SCR would be managing the user's state_file (or multiple state_files), you could just call scr_finalize with no arguments in the post-stage script and SCR would be smart enough to finalize all the user's SCR transfers for their SCR config.
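
As a rough sketch, the two post-stage scripts would boil down to one of the following. The state file path is a placeholder, and scr_finalize does not exist yet.

  # Option 1 (AXL-only users): finalize one transfer from its state file
  # using the existing axl_cp test program; the path is a placeholder.
  axl_cp -U /p/gpfs1/$USER/axl_state_file

  # Option 2 (SCR users): the proposed scr_finalize command (not implemented yet)
  # would find and finalize all of the user's SCR transfers with no arguments.
  scr_finalize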

adammoody commented 3 years ago

It may help to break this into two steps:

  1. Define a BB post-stage script that can be used by a pure AXL user (who is not using SCR at all).
  2. Define a BB post-stage script for SCR users.

For 1) let's test using a pattern that we can expect from a typical HPC application. We'd expect each process in an MPI job to write one or more files, with those files potentially scattered through a directory tree. As one concrete example, the pattern below is pretty common:

timestep.0.root (written by rank 0)
timestep.0/
  data.0 (written by rank 0)
  data.1 (written by rank 1)
  ...
  data.N (written by rank N)

Under AXL, each process will have started its own transfer. We'll have one transfer handle and one state file per process.

We should test for scaling, since it's at larger scale where BB transfers provide the most benefit. On a full system run on sierra, we'll have 4*4096 = 16k procs. We can get to about 4*800 = 3200 procs on lassen. I get the feeling that we'll find some performance or functionality issues as we scale up.

We could then look at what additional logic we need to add for SCR.

adammoody commented 3 years ago

With axl_cp -U <state_file>, a few things come to mind... We could say that, in order to use the AXL post-stage script, the user is required to place all of their state files under one directory. The user could then specify that directory path as an argument to the AXL post-stage script (if we can do that), or they could hack the script and hardcode the directory path in it, or something else...

The AXL post-stage script would then scan the directory to get the full list of state files, and it would invoke axl_cp -U <state_file> for each one. At full sierra scale, note that this means the script will need to call axl_cp 16k times, and that's where I get nervous that we'll turn up problems. Up to this point, we haven't done much testing in this part of the BB software.
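
Something like the sketch below, where the state directory and the state file naming are placeholders rather than an AXL convention. Note it launches one short-lived axl_cp process per rank in series, which is where the scaling concern comes from.

  #!/bin/bash
  # Post-stage sketch: run axl_cp -U on every state file found under one
  # agreed-upon directory. STATE_DIR and the *.state naming are placeholders.
  STATE_DIR=${1:-/p/gpfs1/$USER/axl_state}

  for state_file in "$STATE_DIR"/*.state; do
    # At full sierra scale this loop runs ~16k times, once per process.
    axl_cp -U "$state_file" || echo "finalize failed for $state_file" >&2
  done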

adammoody commented 3 years ago

Side note: the original plan for SCR+BB, before we carved AXL into its own component, was to initiate a collective BB transfer in SCR and list all files under a single BB transfer id. In that case, a post-stage script could wait on a single transfer id rather than dealing with one id per process. The scalability problem moves into the BB software under that model.

tonyhutter commented 3 years ago

Thanks @adammoody all that detail helps.

One thing I did want to mention is that I plan to do away with the state file for resumes (axl_cp -U -S <state_file>) and move to the user always specifying the list of src/dst files when they do a resume. That gets around the state_file corruption issues and the "app dies halfway through AXL_Add()ing its files" problem (https://github.com/ECP-VeloC/AXL/issues/83).

adammoody commented 3 years ago

I thought we agreed in our meeting last week that neither of those is really a problem.

First, the kvtree code needs to be fixed according to https://github.com/ECP-VeloC/KVTree/issues/40, which will fix the corrupt file problem. We'll need this kvtree fix for other components even if we decide AXL doesn't require it.

Second, I think it's fine for us to say that it doesn't make sense to "resume" a transfer that was never started. If a job dies after AXL_Create and before AXL_Dispatch, we can change AXL_Resume to return an error and require the user to start over with a fresh transfer using a new AXL_Create/Add/Dispatch.

With the caveat that I haven't thought through everything, we should keep going with the state file until we're sure it doesn't work. Requiring the user to list all files moves the bookkeeping work from AXL back to the user, but I don't think it makes the task of bookkeeping any easier.

adammoody commented 3 years ago

For more background: the IBM BB software defines two types of post-stage scripts, and we eventually want to consider both.

In this first go at things, we're looking to define a script that plugs in as the second BB post-stage script. The second script runs on the job launch node after the user has lost access to their compute nodes. The BB software will still complete any transfer the user started before they gave up their allocation. This second script lets us wait on those transfers and take action when they complete. Each transfer will either complete successfully, or it will error out. If it completes successfully, we then want to finalize the files (e.g., rename and set metadata). If the transfer errors out, it's not possible for the user to start a new transfer of those files at that point -- they're just out of luck. The best we can do for them then is to delete the temporary files.

One example of where things get tricky if we require the user to list all files again is actually the success case. On success, we may want to set the metadata on the destination files to match the source files. However, at the point this second script runs, we can't access the source files anymore (only the BB software can). In particular, we can't stat() the source files to get the metadata values to use. So the user would need to save the source/dest list, including all metadata for the source files, and then hand that to axl_cp in order for us to update the destination files appropriately.

Meanwhile, we already store this metadata in our state file.
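
For reference, the finalize work that needs those saved values looks roughly like this. The values, paths, and the _AXL suffix below are placeholders; in practice they would come from the state file, since the source files can't be stat()'d at this point.

  # Sketch of restoring destination metadata from saved source values.
  dest=/p/gpfs1/$USER/ckpt/data.0_AXL       # placeholder temp file name
  uid=1234; gid=5678; mode=0644             # placeholder saved values
  mtime="2021-01-01 00:00:00"; atime="2021-01-01 00:00:00"

  chown "$uid:$gid" "$dest"       # restore ownership (where permitted)
  touch -m -d "$mtime" "$dest"    # restore modification time
  touch -a -d "$atime" "$dest"    # restore access time
  chmod "$mode" "$dest"           # restore mode bits last
  mv "$dest" "${dest%_AXL}"       # rename temp file to its final name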

adammoody commented 3 years ago

I suppose we could get that working without a state file if we also encode the metadata values as part of the temporary destination name. We currently preserve the uid, gid, mode bits, mtime, and atime values. But that leads us down the path of implementing this bookkeeping in two different places.

tonyhutter commented 3 years ago

Regarding the no-state_file method -

One idea would be to pre-create the temporary files before the transfer with all the metadata values set except the permission bits. The permission bits would have to be encoded in the temp file name extension (and set on finalize). This is to get around the case where you're transferring a read-only file.
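
The finalize step under that scheme might look roughly like this. The ._AXL.<mode> naming is purely for illustration, not an existing AXL convention.

  # Hypothetical finalize step when the mode bits ride in the temp file extension.
  # Assumes names like data.0._AXL.0444 (final name + ._AXL. + octal mode).
  DEST_DIR=${1:-.}                 # placeholder destination directory

  for tmpfile in "$DEST_DIR"/*._AXL.*; do
    mode="${tmpfile##*.}"          # trailing extension holds the octal mode bits
    final="${tmpfile%._AXL.*}"     # strip ._AXL.<mode> to recover the final name
    chmod "$mode" "$tmpfile"       # apply permissions last, so a read-only file could still be written during transfer
    mv "$tmpfile" "$final"         # other metadata was set when the temp file was pre-created
  done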

tonyhutter commented 3 years ago

Somewhat related to this issue: https://github.com/IBM/CAST/issues/981