ECP-VeloC / AXL

Asynchronous Transfer Library
MIT License

Mpi #69

Closed adammoody closed 3 years ago

adammoody commented 4 years ago

Add optional MPI interface, which extends all AXL API calls with a _comm version that takes a communicator argument. This enables transfer operations to be collective across a set of processes. Return codes are the same on all ranks.
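
For illustration, the collective variants might look roughly like this (a sketch with hypothetical signatures, not the actual header from this PR):

```c
/* Hypothetical sketch of the collective "_comm" variants described above;
 * exact names and signatures may differ from the PR. */
#include <mpi.h>
#include "axl.h"

/* Initialize AXL collectively across all ranks in comm. */
int AXL_Init_comm(MPI_Comm comm);

/* Start a transfer on all ranks as a collective; directory creation can be
 * coordinated so shared destination directories are only created once. */
int AXL_Dispatch_comm(int id, MPI_Comm comm);

/* Wait on all ranks; the result is reduced so every rank sees the same
 * return code, as described above. */
int AXL_Wait_comm(int id, MPI_Comm comm);

/* Cancel collectively, e.g. after a restart. */
int AXL_Cancel_comm(int id, MPI_Comm comm);
```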

Uses DTCMP to identify the union of directories in order to minimize mkdir calls if ranks have shared destination directories.

Depends on https://github.com/ECP-VeloC/AXL/pull/71

tonyhutter commented 4 years ago

Just so I'm understanding it right, the benefits of an MPI-aware AXL would let us do:

  1. Efficient directory creation (all ranks don't try to create the same directory)
  2. A shared transfer handle for BBAPI transfers (which rank0 could easily write to a state file on GPFS)

Is there anything else I'm missing?

adammoody commented 4 years ago

That's right, those are the two main things for now.

Actually, there are two BBAPI versions we could try:

1. Rank 0 allocates a single, shared handle and then MPI_Bcasts it to all ranks. This would be nice since there is one handle id we can use to test and cancel the transfer. However, in earlier testing, shared-handle performance was much worse than handle-per-process.

2. Allocate a handle per process as we're doing now, but MPI_Gather them to rank 0 for writing to a single file, which can be MPI_Scatter'ed back during AXL_Init_comm after a restart to do a cancel.
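
A rough sketch of option 2, assuming the BBAPI handle is a 64-bit id and using a hypothetical state-file path (names are illustrative, not the PR's actual code):

```c
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Option 2 sketch: each rank owns its own BBAPI transfer handle (assumed to
 * be a 64-bit id here); rank 0 gathers them and persists them to a single
 * file on the parallel file system. */
void save_handles(uint64_t my_handle, const char* state_file, MPI_Comm comm)
{
    int rank, ranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &ranks);

    uint64_t* all = NULL;
    if (rank == 0) {
        all = malloc(ranks * sizeof(uint64_t));
    }

    /* collect one handle per process on rank 0 */
    MPI_Gather(&my_handle, 1, MPI_UINT64_T, all, 1, MPI_UINT64_T, 0, comm);

    if (rank == 0) {
        FILE* fp = fopen(state_file, "wb");  /* e.g. a file on GPFS */
        fwrite(all, sizeof(uint64_t), ranks, fp);
        fclose(fp);
        free(all);
    }
}

/* After a restart, rank 0 reads the file back and scatters one handle to
 * each rank so that rank can cancel its own transfer. */
uint64_t load_handle(const char* state_file, MPI_Comm comm)
{
    int rank, ranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &ranks);

    uint64_t* all = NULL;
    if (rank == 0) {
        all = malloc(ranks * sizeof(uint64_t));
        FILE* fp = fopen(state_file, "rb");
        fread(all, sizeof(uint64_t), ranks, fp);
        fclose(fp);
    }

    uint64_t my_handle = 0;
    MPI_Scatter(all, 1, MPI_UINT64_T, &my_handle, 1, MPI_UINT64_T, 0, comm);
    if (rank == 0) {
        free(all);
    }
    return my_handle;
}
```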

tonyhutter commented 4 years ago

I have mixed feelings about this PR. On one hand, I do like the efficient directory creation aspect and agree that MPI would be the optimal way to do it. On the other hand, it's an additional code path that we will all have to test/maintain/support for the lifetime of AXL. It's also not compatible with current callers of AXL (like VeloC, filo). It would help if we could quantify how many users have the "millions of nodes mkdir'ing the same dir" problem. And for the users that do have the problem, what percentage of the total checkpoint time is the mkdir. Like, is the inefficient mkdir only adding 4 seconds to one specific user's 60min checkpoint? Just to throw out some numbers, we can do ~172k mkdir()/sec and ~465k opendir()/sec and ~454k access()/sec on butte GPFS.

My gut feeling is to put this PR on ice for now, and try to implement some non-MPI solutions to improve the directory creation first. That has the benefit of speeding up all existing software calling AXL. If these solutions don't give us the speed-up we need, then go back to this PR.

Ideas:

1. Have AXL write all files to a unique per-node temp directory first, and then mkdir/move them to their final destination at random times during the transfer (see the sketch after this example). For example:

a). A million nodes all call AXL_Dispatch() at the same time requesting a copy to /gpfs/my_data/ckpt_[RANK]

b). On all nodes, AXL actually starts writing to /gpfs/my_data._[NODE HOSTNAME]/ckpt_[RANK].

c). On all nodes, after a random delay from the start of the transfer, AXL tries to mkdir() /gpfs/my_data and moves ckpt_[RANK] in there. The transfer to the file continues (since we're just moving the inode).

d). The transfer finishes.

This amortizes the collective mkdir() of /gpfs/my_data/ across the lifetime of the transfer.
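
A minimal sketch of step (c), assuming the temp dir and final destination are on the same file system so rename() just relinks the inode, and that another node may win the race to create the directory (paths are the hypothetical ones from the example):

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Sketch of step (c): create the shared destination directory (tolerating
 * the case where another node created it first) and move the in-progress
 * file into it.  The open file descriptor used by the transfer keeps
 * writing to the same inode, so the copy is not interrupted. */
int move_into_final_dir(const char* final_dir,   /* e.g. /gpfs/my_data */
                        const char* tmp_path,    /* e.g. /gpfs/my_data._nodeX/ckpt_0 */
                        const char* final_path)  /* e.g. /gpfs/my_data/ckpt_0 */
{
    if (mkdir(final_dir, 0700) != 0 && errno != EEXIST) {
        return -1;  /* real failure, not just "someone else made it first" */
    }
    if (rename(tmp_path, final_path) != 0) {
        return -1;
    }
    return 0;
}
```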

2. Another thing we could do is randomize the transfer ordering of the files in the transfer list. So if each node had added these files in its transfer list (with different file sizes):

/path1/big_dataset
/path2/small_dataset
/path3/more_data
/path4/more_stuff

...they could randomly choose which one to transfer 1st, 2nd, 3rd, and 4th. That would help spread out the open()/mkdir()/close() operations for these files, so that all nodes aren't creating, say, /path1/big_dataset at the same time (a minimal shuffle sketch follows).
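
For instance, each node could shuffle the indices of its transfer list before dispatch; a minimal sketch (the seed choice and list layout are placeholders):

```c
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Sketch: Fisher-Yates shuffle of the indices into a node's transfer list,
 * so each node works through its files in a different order.  Mixing the
 * pid into the seed keeps nodes from all picking the same order. */
void shuffle_order(int* order, int nfiles)
{
    srand((unsigned)time(NULL) ^ (unsigned)getpid());
    for (int i = nfiles - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}
```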

I also don't know if MPI should be the ultimate solution to the "how to cancel BBAPI transfers" problem. We'd basically be saying: "if you want to use BBAPI, you have to use all these new AXL_*_comm() functions, because in the off-chance your job dies, it would be awkward to cancel the transfers." Instead of using MPI, we could try recording all the transfer handles to /gpfs, which, while theoretically inefficient, has never been benchmarked, and may fall into "it's fast enough, we don't care" territory. This: https://github.com/ECP-VeloC/AXL/issues/57#issuecomment-642219896 may also be a "good enough" solution for cancelling BB transfer without MPI.

So my vote would be, let's leave this PR as-is, implement the other solutions first, and then benchmark them against this PR. If we find the MPI speed-up is worth the added code complexity, then we should move forward with the PR.

adammoody commented 4 years ago

I do agree the mkdir gain is not enough by itself. We're doing one mkdir per file, so it doesn't really make the scalability worse. Also, the higher layers that call AXL can execute the mkdirs themselves and tell AXL not to bother with the mkdir.

adammoody commented 4 years ago

I'm fine with putting it on ice. I think it just comes down to the IBM BB integration. There are other ways to handle that.

On the other hand, it's an additional code path that we will all have to test/maintain/support for the lifetime of AXL.

I don't understand this point. The collective logic here has to go somewhere, whether it's Filo or SCR. Wherever it goes, it has to be tested. We don't save ourselves any work by not including it in AXL; it just gets moved.

It's also not compatible with current callers of AXL (like VeloC, filo).

I don't see how it breaks compatibility. One can still use the non-MPI interface just fine, and in fact one can build AXL without the MPI interface at all. It gets interesting if someone tries to mix usage within the same app, but those problems exist in AXL already.

tonyhutter commented 4 years ago

b). On all nodes, AXL actually starts writing to /gpfs/my_data._[NODE HOSTNAME]/ckpt_[RANK]. c). On all nodes, after a random delay from the start of the transfer, AXL tries to mkdir() /gpfs/my_data and moves ckpt_[RANK] in there. The transfer to the file continues (since we're just moving the inode).

On second thought, my idea here doesn't really help. Each node would still have to do one mkdir() of its temp dir at the beginning of the transfer, which would take the same amount of time as doing a mkdir of the final destination dir. In fact, it is probably slower, since you could get lucky and do an access() of the final dir (which is 2.6x faster than a mkdir) and have it already exist, avoiding the need for a mkdir.
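
That cheaper fast path might look like the following sketch (error handling trimmed):

```c
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: only pay for the mkdir() when the destination directory does not
 * already exist; per the numbers above, access() is the cheaper check. */
int ensure_dir(const char* path)
{
    if (access(path, F_OK) == 0) {
        return 0;  /* already there, skip the mkdir */
    }
    if (mkdir(path, 0700) != 0 && errno != EEXIST) {
        return -1;
    }
    return 0;
}
```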

tonyhutter commented 4 years ago

On the other hand, it's an additional code path that we will all have to test/maintain/support for the lifetime of AXL.

I don't understand this point. The collective logic here has to go somewhere, whether it's Filo or SCR. Wherever it goes, it has to be tested. We don't save ourselves any work by not including it in AXL; it just gets moved.

My point was that if we can roll the mkdir reduction into our existing code, we can use our existing tests, and our existing APIs. No new documentation, tests, user-facing headers, or updates to filo/VeloC are needed. Additionally, if we benchmark the "1 million mkdirs at once" and find that the extra mkdir time is negligible, no changes are needed.

It's also not compatible with current callers of AXL (like VeloC, filo).

I don't see how it breaks compatibility.

Sorry, "compatible" was the wrong word to use. I meant that VeloC and filo wouldn't be able to take advantage of an MPI-aware AXL without code changes. They don't get the performance increases "for free" like they would if we were able to somehow bake them into existing APIs.

adammoody commented 3 years ago

This logic has moved to SCR where it is being maintained.

https://github.com/LLNL/scr/blob/develop/src/axl_mpi.h

If it would be useful to define a collective interface for AXL, we can pull the latest code from there.