Open tonyhutter opened 4 years ago
You can assign this to me for now
An early observation:
Below is the script I'm using to test the BBAPI transfer. To test, do srun -n4 -N4 <this script>
from within the BB mount point:
#!/bin/bash
# Get BBAPI mount point
mnt=$(mount | grep -Eo /mnt/bb_[a-z0-9]+)
rm -fr /tmp/ssd
mkdir -p /tmp/ssd
cat << EOF > ~/myscr.conf
SCR_COPY_TYPE=FILE
SCR_CLUSTER_NAME=butte
SCR_FLUSH=1
STORE=/tmp GROUP=NODE COUNT=1
STORE=$mnt GROUP=NODE COUNT=1 TYPE=bbapi
STORE=/tmp/persist GROUP=WORLD COUNT=100
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/tmp TYPE=XOR SET_SIZE=4
CKPT=1 INTERVAL=4 GROUP=NODE STORE=$mnt TYPE=XOR SET_SIZE=2 OUTPUT=1
CNTLDIR=/tmp BYTES=1GB
CACHEDIR=/tmp BYTES=1GB
CACHEDIR=/tmp/ssd BYTES=50GB
SCR_CACHE_BASE=/dev/shm
SCR_DEBUG=10
EOF
SCR_CONF_FILE=~/myscr.conf ~/scr/build/examples/test_api
What I'm seeing is that when AXL tries to copy the file from /tmp/[blah]/ckpt to /mnt/[BB mount]/ckptdir/ckpt using BBAPI, it's falling back to using pthreads. This happens because AXL's BBAPI path first checks whether both the source and destination support FIEMAP (file extents). If so, it continues to use BBAPI to transfer the files; if not, it falls back to using pthreads. In cases where the destination file doesn't exist yet, it falls back to calling FIEMAP on the destination directory. This is problematic, as the /tmp and BBAPI filesystems don't support FIEMAP on directories, while ext4 does (which is what I originally tested on). To get around this, we may want to fall back to doing a statfs() on the destination directory and checking f_type against a whitelist of filesystems that we know support extents.
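Roughly, the check I have in mind would look something like the sketch below. This is not the actual AXL code; the GPFS magic number and the exact whitelist contents are assumptions.

```c
/* Sketch of a statfs()-based whitelist check.  The GPFS magic value is an
 * assumption; EXT4's comes from linux/magic.h. */
#include <sys/vfs.h>
#include <linux/magic.h>

#ifndef XFS_SUPER_MAGIC
#define XFS_SUPER_MAGIC  0x58465342   /* "XFSB" */
#endif
#define GPFS_SUPER_MAGIC 0x47504653   /* assumption: "GPFS" in ASCII */

static int fs_supports_extents(const char* path)
{
    struct statfs st;
    if (statfs(path, &st) != 0) {
        return 0;   /* can't tell; be conservative and fall back */
    }
    switch (st.f_type) {
        case EXT4_SUPER_MAGIC:
        case XFS_SUPER_MAGIC:
        case GPFS_SUPER_MAGIC:
            return 1;   /* whitelisted: assume extent-based transfers are OK */
        default:
            return 0;   /* unknown filesystem: fall back to pthreads */
    }
}
```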
Sounds good. Related to this, as far as I understand it, supporting extents is a necessary but not a sufficient requirement for the BBAPI. The BB software uses extents, but even if a file system supports extents, there is no guarantee that it will work with the BBAPI.
IBM says that the BBAPI should work for transfers between a BB-managed file system on the SSD and GPFS, but any other combination only works by chance if at all. No other combinations of transfer pairs have been tested or designed to work.
Some quirks I'm noticing using the BBAPI to transfer:
Can I transfer using BBAPI?

src | dst | can xfer? |
---|---|---|
xfs | ext4 | Yes |
ext4 | xfs | Yes |
xfs | gpfs | Yes |
ext4 | gpfs | No |
gpfs | ext4 | No |
ext4 | ext4 | Yes |
tmpfs | any FS | No |
We may want to update AXL to not only whitelist BBAPI based on the src/dst filesystem types, but also check whether BBAPI can actually perform the transfer, and fall back to pthreads if necessary. So in the table above, we'd fall back to pthreads on /tmp <-> /p/gpfs transfers, and use BBAPI for the others.
If the user explicitly requests the BBAPI in their configuration and it doesn't work, I think it's fine to just error out. I think that would either be a configuration error on their part or a system software error, either of which they probably would like to know about so that it can get fixed.
Regarding the table in https://github.com/LLNL/scr/issues/163#issuecomment-611726077: /tmp on the system I tested on was actually EXT4, not tmpfs. I've since tested tmpfs (transfers don't work at all with BBAPI), and updated the table.
@adammoody I just opened https://github.com/ECP-VeloC/AXL/pull/64 which disables fallback to pthreads by default, but you can enable it again in cmake with -DENABLE_BBAPI_FALLBACK. It's useful for me for testing, so I'd at least like to make it configurable. I've also updated it to use an FS type whitelist instead of checking whether the FS supports extents.
I've been testing using the following script on one of our BBAPI machines:
#!/bin/bash
# Bypass mode is default - disable it to use AXL
export SCR_CACHE_BYPASS=0
ssdmnt=$(mount | grep -Eo /mnt/bb_[a-z0-9]+)
rm -fr /tmp/ssd
mkdir -p /tmp/ssd
gpfsmnt=$(mount | awk '/gpfs/{print $3}')
mkdir -p $gpfsmnt/`whoami`/testing
gpfsmnt=$gpfsmnt/`whoami`/testing
cat << EOF > ~/myscr.conf
SCR_COPY_TYPE=FILE
SCR_CLUSTER_NAME=`hostname`
SCR_FLUSH=1
STORE=/tmp GROUP=NODE COUNT=1 TYPE=pthread
STORE=$ssdmnt GROUP=NODE COUNT=1 TYPE=bbapi
STORE=$gpfsmnt GROUP=WORLD COUNT=1 TYPE=bbapi
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/tmp TYPE=XOR SET_SIZE=4
CKPT=1 INTERVAL=4 GROUP=NODE STORE=$ssdmnt TYPE=XOR SET_SIZE=2 OUTPUT=1
CKPT=2 INTERVAL=8 GROUP=NODE STORE=$gpfsmnt TYPE=XOR SET_SIZE=2 OUTPUT=1
CNTLDIR=/tmp/hutter2 BYTES=1GB
CACHEDIR=/tmp BYTES=1GB
CACHEDIR=/tmp/ssd BYTES=50GB
SCR_CACHE_BASE=/dev/shm
SCR_DEBUG=10
EOF
SCR_CONF_FILE=~/myscr.conf ~/scr/build/examples/test_api
The output from the script shows SCR successfully using AXL in pthreads and BBAPI mode:
AXL 0.3.0: lassen788: Read and copied /mnt/bb_2c90f9ab469844e39364103cb0b0d928/hutter2/scr.defjobid/scr.dataset.44/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.44/rank_0.ckpt sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452
AXL 0.3.0: lassen788: axl_pthread_func: Read and copied /tmp/hutter2/scr.defjobid/scr.dataset.45/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.45/rank_0.ckpt, rc 0 @ axl_pthread_func /g/g0/hutter2/scr/deps/AXL/src/axl_pthread.c:209
AXL 0.3.0: lassen788: axl_pthread_func: Read and copied /tmp/hutter2/scr.defjobid/scr.dataset.46/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.46/rank_0.ckpt, rc 0 @ axl_pthread_func /g/g0/hutter2/scr/deps/AXL/src/axl_pthread.c:209
AXL 0.3.0: lassen788: axl_pthread_func: Read and copied /tmp/hutter2/scr.defjobid/scr.dataset.47/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.47/rank_0.ckpt, rc 0 @ axl_pthread_func /g/g0/hutter2/scr/deps/AXL/src/axl_pthread.c:209
AXL 0.3.0: lassen788: Read and copied /p/gpfs1/hutter2/testing/hutter2/scr.defjobid/scr.dataset.48/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.48/rank_0.ckpt sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452
So we know BBAPI and pthreads are working in the basic case.
Great! So far, so good.
A tip for getting the BB path: an alternative is to use the $BBPATH variable, which will be defined in the environment of your job. Matching on /mnt/bb_ should work if all goes well. However, the BB software is still buggy, and it often leaves behind stray /mnt/bb_ directories. If you end up on a node where the BB failed to clean up, you'll see multiple /mnt/bb_ paths, only one of which is valid for your job.
While testing last week, I ran across a FILO bug and created a PR: https://github.com/ECP-VeloC/filo/pull/9
So this is a little bizarre:
I noticed in my tests that SCR was reporting that it successfully transferred a 1MB file called "rank_0" from the SSD to GPFS using the BBAPI:
SCR v1.2.0: rank 0 on butte20: Initiating flush of dataset 22
AXL 0.3.0: butte20: Read and copied /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/tmp/hutter2/scr.defjobid/scr.dataset.22/rank_0 to /p/gpfs1/hutter2/rank_0 sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452
SCR v1.2.0: rank 0 on butte20: scr_flush_sync: 0.285768 secs, 1.048576e+06 bytes, 3.499343 MB/s, 3.499343 MB/s per proc
SCR v1.2.0: rank 0 on butte20: scr_flush_sync: Flush of dataset 22 succeeded
But the resulting file was 0 bytes.
$ ls -l /p/gpfs1/hutter2/rank_0
-rw------- 1 hutter2 hutter2 0 May 20 09:55 /p/gpfs1/hutter2/rank_0
$ ls -l /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/tmp/hutter2/scr.defjobid/scr.dataset.22/rank_0
-rw------- 1 hutter2 hutter2 1048576 May 20 09:49 /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/tmp/hutter2/scr.defjobid/scr.dataset.22/rank_0
I tried the same test with axl_cp -X bbapi ..., and got the same 0 byte file. I then manually created a 1MB file on the SSD and transferred it using the BBAPI to GPFS, and it worked:
$ dd if=/dev/zero of=/mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb bs=1M count=1
$ ls -l /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb
-rw------- 1 hutter2 hutter2 1048576 May 20 10:03 /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb
$ axl_cp -X bbapi /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb /p/gpfs1/hutter2/zero_1mb
AXL 0.3.0: butte20: Read and copied /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb to /p/gpfs1/hutter2/zero_1mb sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452
$ ls -l /p/gpfs1/hutter2/zero_1mb
-rw------- 1 hutter2 hutter2 1048576 May 20 10:04 /p/gpfs1/hutter2/zero_1mb
I see the same failure on two of our IBM systems. I'm still investigating...
Issue 2: when I log in to a machine for the first time and run my test, I always hit this error:
SCR v1.2.0: rank 0: Initiating flush of dataset 1
AXL 0.3.0 ERROR: AXL Error with BBAPI rc: -1 @ bb_check /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:131
AXL 0.3.0 ERROR: AXL Error with BBAPI details:
"error": {
"func": "queueTagInfo",
"line": "2903",
"sourcefile": "/u/tgooding/cast_proto1.5/CAST/bb/src/xfer.cc",
"text": "Transfer definition for contribid 436 already exists for LVKey(bb.proxy.44 (192.168.129.182),bef4bbc8-2214-4077-8264-75d8f993c296), TagID((1093216,835125249),2875789842327411), handle 4294969993. Extents have already been enqueued for the transfer definition. Most likely, an incorrect contribid was specified or a different tag should be used for the transfer."
},
If I re-run the test the problem always goes away. Looks like a stale transfer handle issue. I'll try it with axl_cp and see if the same thing happens.
I'm not seeing it with axl_cp. I used axl_cp -X bbapi to copy a new file from SSD->GPFS right after logging in, and didn't get the transfer definition error. It's possible SCR/filo is exercising AXL in a more advanced way that causes the transfer error.
Fun fact: BBAPI appears to be way slower than a vanilla copy. I timed how long it took to copy a 10GB random-data file from SSD to GPFS with axl_cp -X sync, cp, and axl_cp -X bbapi, and just to make sure there was no funny business, I also timed how long it took to md5sum the file afterwards:
type | copy | md5 |
---|---|---|
axl_cp -X sync | 1.8s | 32s |
cp | 1.8s | 32s |
axl_cp -X bbapi | 16.5s | 32s |
BBAPI traffic is throttled in the network, so I’m not surprised it’s slower... but that is quite slow.
It's possible that the copy was faster because the whole file was in the page cache. That would mean the regular copies were just reading the data from memory rather than the SSD. I believe BBAPI just does a direct device-to-device SSD->GPFS transfer, which would bypass the page cache. Unfortunately, I was unable to test with zeroed caches (echo 3 > /proc/sys/vm/drop_caches) since I'm not root.
The 0B rank_0 file issue arises because it is a sparse file.
$ fiemap /mnt/bb_131fa614a608da727b038ed08e6eaad4/tmp/hutter2/scr.defjobid/scr.dataset.5/rank_0
ioctl success, extents = 0
Since BBAPI transfers extents, it would make sense that it would create a zero byte file for a sparse file. Proof:
# create one regular 1M file, and one sparse file
$ dd if=/dev/zero of=$BBPATH/file1 bs=1M count=1
$ truncate -s 1m $BBPATH/file2
# lookup extents
$ fiemap $BBPATH/file1
ioctl success, extents = 1
$ fiemap $BBPATH/file2
ioctl success, extents = 0
# both files appear to be 1MB to the source filesystem
$ ls -l $BBPATH/file1
-rw------- 1 hutter2 hutter2 1048576 May 20 14:00 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file1
$ ls -l $BBPATH/file2
-rw------- 1 hutter2 hutter2 1048576 May 20 14:00 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file2
# transfer them using BBAPI
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi $BBPATH/file1 /p/gpfs1/hutter2/
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi $BBPATH/file2 /p/gpfs1/hutter2/
# sizes after transfer
$ ls -l /p/gpfs1/hutter2/
total 1024
-rw------- 1 hutter2 hutter2 1048576 May 20 14:01 file1
-rw------- 1 hutter2 hutter2 0 May 20 14:02 file2
It gets really interesting when you create and transfer a partially sparse file:
$ echo "hello world" > /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
$ truncate -s 1M /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
$ ls -l /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
-rw------- 1 hutter2 hutter2 1048576 May 20 14:11 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
$ fiemap /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
ioctl success, extents = 1
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi $BBPATH/file3 /p/gpfs1/hutter2/
$ ls -l /p/gpfs1/hutter2/file3
-rw------- 1 hutter2 hutter2 65536 May 20 14:12 file3
$ cat /p/gpfs1/hutter2/file3
hello world
I then did another test where I created a sparse section between two extents:
echo "hello world" > file4
truncate -s 1M file4
echo "the end" >> file4
When I transferred that file using the BBAPI, it was the correct size and had the correct data.
TL;DR: don't end your file with sparse sections, or they're not going to get transferred correctly with BBAPI.
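For reference, the extent count that the fiemap helper above reports presumably comes from the FIEMAP ioctl; a trimmed-down, Linux-only sketch of that query:

```c
/* Ask the kernel how many extents back a file.  With fm_extent_count = 0 the
 * kernel only fills in fm_mapped_extents and returns no extent records. */
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap */

int count_extents(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct fiemap fm;
    memset(&fm, 0, sizeof(fm));
    fm.fm_start        = 0;
    fm.fm_length       = ~0ULL;   /* whole file */
    fm.fm_extent_count = 0;       /* just count, don't return extent records */

    int rc = ioctl(fd, FS_IOC_FIEMAP, &fm);
    close(fd);
    return (rc < 0) ? -1 : (int)fm.fm_mapped_extents;
}
```

A fully written 1MB file reports one or more extents; a fully sparse file reports zero, which matches the 0-byte files BBAPI produced.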
Nice find, @tonyhutter ! Yeah, we should file that as a bug with IBM. It should be keeping the correct file size, even if sparse.
Can you create an issue describing the problem here: https://github.com/ibm/cast/issues
Another BBAPI observation:
I used test_ckpt to create a 9GB checkpoint, and then killed it off when it was flushing the checkpoint from SSD to GPFS (using BBAPI). The process was killed, but I noticed the file kept transferring until it reached the size it should have been:
$ ls -l /p/gpfs1/hutter2/
-rw------- 1 hutter2 hutter2 8053063680 May 20 17:15 rank_0
$ ls -l /p/gpfs1/hutter2/
-rw------- 1 hutter2 hutter2 8321499136 May 20 17:15 rank_0
$ ls -l /p/gpfs1/hutter2/
-rw------- 1 hutter2 hutter2 8589934592 May 20 17:15 rank_0
$ ls -l /p/gpfs1/hutter2/
-rw------- 1 hutter2 hutter2 9395240960 May 20 17:15 rank_0
This makes sense, as the BBAPI daemon is going to keep transferring the file independently of the calling process. I also tried killing the process with a SIGSEGV (segfault) to simulate a job crashing, and with a SIGKILL, and saw the same thing: the file kept transferring until completion.
We could add an on_exit() call to AXL to have it cancel all existing transfers if the process is killed. That would solve the issue for all non-SIGKILL terminations, which would be the common case.
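A minimal sketch of what such an exit hook could look like. The handle-tracking table and cancel wrapper are hypothetical names, not existing AXL symbols, and exit handlers only run on normal termination or exit(), not on SIGKILL:

```c
/* Hypothetical AXL exit hook: cancel any BBAPI transfers we started. */
#include <stdlib.h>
#include <stdint.h>

#define AXL_MAX_HANDLES 64
static uint64_t axl_active_handles[AXL_MAX_HANDLES];  /* hypothetical tracking table */
static int      axl_num_active;

/* hypothetical wrapper around the BBAPI cancel routine */
extern int axl_bbapi_cancel_handle(uint64_t handle);

static void axl_cancel_on_exit(void)
{
    for (int i = 0; i < axl_num_active; i++) {
        axl_bbapi_cancel_handle(axl_active_handles[i]);
    }
}

/* AXL_Init() (or similar) would register the hook once at startup */
void axl_register_exit_hook(void)
{
    atexit(axl_cancel_on_exit);
}
```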
I think that would be a good option. I can envision cases where AXL users would actually still want their transfer to complete even though their process has exited, so we'd probably want this to be configurable. Perhaps by individual transfer?
Someone has also asked whether AXL can delete their source files for them after the transfer. That might be another nice feature.
Fun fact: BBAPI appears to be way slower than a vanilla copy. I timed the amount of time it took to copy a 10GB, random-data, file from SSD to GPFS
That is expected. BB transfers are intentionally throttled via InfiniBand QoS at ~0.6GBps per node to minimize network impacts to MPI and demand I/O. So 10GB should take around 16-17 seconds when transferring in the SSD->GPFS direction. The transfers in the GPFS->SSD direction are not governed and you should see near native SSD write speeds.
@tgooding out of curiosity, can the throttling be adjusted or disabled?
"Adjusting" can be done. Its a cluster wide parameter and would require changing the InfiniBand switch settings - not something you'd want to do often.
"Disabling" is easier. I/O on port 4420 is throttled, so you just need to change to use the other defined NVMe over Fabrics port (4421).
I think you can edit the configuration file /etc/ibm/bb.cfg and change "readport" to 2. Then restart bbProxy.
Alternatively, you can pass in a custom config template via bbactivate --configtempl=<myaltconfig>. The original/default template is at /opt/ibm/bb/scripts/bb.cfg
@tgooding thank you! :+1:
@adammoody regarding "should we cancel/not-cancel existing transfers on AXL restart", I forgot about this thread from around a year ago:
https://github.com/ECP-VeloC/AXL/issues/57
I'll put my comments in there since it already has a lot of good discussion.
Regarding:
See: https://github.com/ECP-VeloC/AXL/issues/57#issuecomment-634226157 .
TL;DR: sync/pthread transfers get killed with the job automatically, and I don't see how we could cancel BBAPI transfers.
@tonyhutter : if you haven't, look at using BB_GetTransferList() at SCR startup time for getting the active handles.
@tonyhutter , we also have a working test case in which a later job step cancels the transfer started by the previous job step. It's in the LLNL bitbucket, so I'll point you to that separately.
@tgooding according to the documentation, BB_GetTransferList() only returns the transfer handles for a specific jobid/jobstepid.
/**
* \brief Obtain the list of transfers
* \par Description
* The BB_GetTransferList routine obtains the list of transfers within the job that match the
* status criteria. BBSTATUS values are powers-of-2 so they can be bitwise OR'd together to
* form a mask (matchstatus). For each of the job's transfers, this mask is bitwise AND'd
* against the status of the transfer and if non-zero, the transfer handle for that transfer
* is returned in the array_of_handles.
*
* Transfer handles are associated with a jobid and jobstepid. Only those transfer handles that were
* generated for the current jobid and jobstepid are returned.
*
* \param[in] matchstatus Only transfers with a status that match matchstatus will be returned. matchstatus can be a OR'd mask of several BBSTATUS values.
* \param[inout] numHandles Populated with the number of handles returned. Upon entry, contains the number of handles allocated to the array_of_handles.
* \param[out] array_of_handles Returns an array of handles that match matchstatus. The caller provides storage for the array_of_handles and indicates the number of available elements in numHandles. \note If array_of_handles==NULL, then only the matching numHandles is returned.
* \param[out] numAvailHandles Populated with the number of handles available to be returned that match matchstatus.
* \return Error code
* \retval 0 Success
* \retval errno Positive non-zero values correspond with errno. strerror() can be used to interpret.
* \ingroup bbapi
*/
extern int BB_GetTransferList(BBSTATUS matchstatus, uint64_t* numHandles, BBTransferHandle_t array_of_handles[], uint64_t* numAvailHandles);
https://github.com/IBM/CAST/blob/master/bb/include/bbapi.h#L261
If a job is restarted it's going to get a new jobid, so I don't see how you could use it to get the list of old transfers.
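For what it's worth, the two-call pattern implied by that header comment looks roughly like this; the all-statuses mask below is an assumption, and real code would OR together specific BBSTATUS values:

```c
/* Sketch: count matching handles first, then allocate and fetch them.
 * Only handles for the current jobid/jobstepid come back, which is the
 * limitation discussed above. */
#include <stdint.h>
#include <stdlib.h>
#include <bbapi.h>

int list_my_transfers(void)
{
    BBSTATUS match = (BBSTATUS)(~0);   /* assumption: match every status */
    uint64_t num = 0, avail = 0;

    /* First call with a NULL array just to learn how many handles match. */
    if (BB_GetTransferList(match, &num, NULL, &avail) != 0) return -1;
    if (avail == 0) return 0;

    BBTransferHandle_t* handles = calloc(avail, sizeof(*handles));
    if (handles == NULL) return -1;

    num = avail;   /* tell the API how many slots we allocated */
    int rc = BB_GetTransferList(match, &num, handles, &avail);

    /* ... inspect or cancel handles[0..num-1] here ... */
    free(handles);
    return rc;
}
```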
Some more observations from testing:
I was able to cancel a transfer across job IDs by saving the transfer handle and cancelling it. So we would need to save this state somewhere. This is going to be an issue with AXL, as saving the state is optional (via the optional save_file passed to AXL_Init()).
I confirmed that BB_GetTransferList() will only list the transfers for your job id. So if your job dies and you re-launch, you can't use BB_GetTransferList() to find your old transfers and cancel them. I assume this is some sort of access control policy.
Cancelling a BB API transfer is not immediate. A 25GB copy from the burst buffer to GPFS takes ~1min 30sec, and BB_CancelTransfer() then takes ~30sec to cancel that transfer. Interestingly, it also took ~30s to cancel a 50GB transfer, but only 13s to cancel a 10GB transfer.
If we have multiple transfers to cancel, we should cancel them in parallel in separate threads, as BB_CancelTransfer() will block until the transfer is cancelled.
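A sketch of that parallel cancel, where bb_cancel_one() is a hypothetical wrapper around the real BBAPI cancel call:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

extern int bb_cancel_one(uint64_t handle);   /* hypothetical BBAPI cancel wrapper */

static void* cancel_thread(void* arg)
{
    bb_cancel_one(*(uint64_t*)arg);   /* blocks until this transfer is cancelled */
    return NULL;
}

/* Cancel every handle concurrently instead of paying ~30s per transfer serially. */
void cancel_all(uint64_t* handles, int n)
{
    pthread_t* tids = malloc(n * sizeof(*tids));
    if (tids == NULL) return;
    for (int i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, cancel_thread, &handles[i]);
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    free(tids);
}
```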
Regarding:
Check that SCR manages any final BBAPI transfer that is completed at the end of an allocation. In particular, SCR normally runs some finalization code to register that a transfer completed successfully, so that it knows the corresponding checkpoint is valid to be used during a restart in the next job allocation. I don't think we're accounting for that right now.
I ran some tests using test_api to see what would happen if I corrupted a checkpoint file. If I changed the file size, it would detect that it was corrupted and correctly fall back to an earlier checkpoint. If I wrote random data to the checkpoint but kept the same file size, it caught the corruption, but only because there's a check within test_api.c itself. When I removed that manual check to see if SCR would internally detect the corruption, it did not detect it.
This is important with BB API transfers, as there is a period of time during the transfer when the destination file is already its final size but not all of the data is present. I saw this by running stat on the file while it was being transferred (look at the block count):
# start transfer in the background
$ axl_cp -X bbapi $BBPATH/bigfile /p/gpfs1/hutter2/bigfile &
$ while [ 1 ] ; do sleep 1 && stat /p/gpfs1/hutter2/bigfile | grep Size ; done
Size: 21206401024 Blocks: 34045952 IO Block: 16777216 regular file
Size: 21206401024 Blocks: 34766848 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 37388288 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 39256064 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41254912 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
exit_func: exiting, status 0, id 1
[1]+ Done rm -f /p/gpfs1/hutter2/bigfile && ~/AXL/build/test/axl_cp -X bbapi $BBPATH/bigfile /p/gpfs1/hutter2/bigfile
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
Size: 21474836480 Blocks: 41943040 IO Block: 16777216 regular file
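One crude way to spot that in-between state is to compare the reported size against the blocks actually allocated; a heuristic sketch only (it also flags legitimately sparse files, and it's not something SCR or AXL does today):

```c
#include <sys/stat.h>

/* Returns 1 if the file's allocated blocks cover its claimed size. */
int looks_fully_populated(const char* path)
{
    struct stat st;
    if (stat(path, &st) != 0) return 0;
    /* st_blocks is counted in 512-byte units regardless of the FS block size */
    long long allocated = (long long)st.st_blocks * 512;
    return allocated >= (long long)st.st_size;
}
```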
Thanks for exploring the space of the BB, @tonyhutter . This is great.
We used to compute a CRC32 for each file when it was copied to the parallel file system, and we stored that in the SCR metadata. When reading the file back, we computed the CRC32 again and checked against the stored value. We dropped that feature when moving to the components, but it'd be nice to have back.
Even so, SCR should avoid fetching a checkpoint that did not finalize. SCR assumes a checkpoint is bad unless it has explicitly marked it as complete by writing a flag into the .scr/index.scr file. It does that after the flush completes. In the case of an async flush, that happens here: https://github.com/LLNL/scr/blob/d0777f85539d2b88e5bcb4987aa690e6ebdbef5e/src/scr_flush_async.c#L225 then here: https://github.com/LLNL/scr/blob/d0777f85539d2b88e5bcb4987aa690e6ebdbef5e/src/scr_flush.c#L364 The SCR library has to be running for this to work. What's new with the BB is that this transfer will continue even after the application has stopped. If it errors out, then we're ok because SCR will not try to read it, since it never marked it as complete.
A more interesting case is what happens if it actually succeeds? In that case, it'd be nice to have something that then updates the SCR index file to indicate that the checkpoint is actually good. The BB lets us register a script that runs in poststage. We could create a custom script that runs in poststage and then updates the index file if it finds the checkpoint succeeded.
Actually, browsing through that code, I need to double check that it's properly setting that flag. It's not obvious to me, so it may have been lost in translation. Anyway, that is what it is supposed to be doing.
Ok, before starting the flush, SCR writes a COMPLETE=0 flag: https://github.com/LLNL/scr/blob/d0777f85539d2b88e5bcb4987aa690e6ebdbef5e/src/scr_flush.c#L318 Then it only updates that flag after the flush finishes. So we should be covered.
Though this TODO suggests the code that writes the complete marker needs to do some more checking: https://github.com/LLNL/scr/blob/d0777f85539d2b88e5bcb4987aa690e6ebdbef5e/src/scr_flush.c#L245 We don't catch the case right now if the flush completes but some process detected an error.
One more thing. Though SCR is meant to rely on its index file to avoid reading back a partially flushed checkpoint, other AXL users might appreciate built-in support to better handle this. Perhaps AXL could transfer the file without the read bit enabled, and then flip the read bit on after it completes. Or AXL could flush to a temporary file name and then rename the file.
Or AXL could flush to a temporary file name and then rename the file.
I like this idea, it's simple and reliable. We'd probably want to tack on an ._AXL extension to the filename instead of using a temp name, like:
ckpt_0._AXL
That way, it gives a hint to the user that "my ckpt_0 file is still being transferred by AXL", and they could ls -l it to track progress.
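A sketch of how the suffix-then-rename flush could look inside AXL; copy_file() is a hypothetical stand-in for the real transfer path:

```c
#include <stdio.h>

extern int copy_file(const char* src, const char* dst);   /* hypothetical */

int flush_with_rename(const char* src, const char* final_dst)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s._AXL", final_dst);   /* e.g. ckpt_0._AXL */

    if (copy_file(src, tmp) != 0) {
        return -1;   /* leave the partial ._AXL file behind for inspection */
    }
    /* rename() is atomic within a filesystem, so readers only ever see the
     * final name once the file is complete */
    return rename(tmp, final_dst);
}
```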
Opened AXL issue for the file rename: https://github.com/ECP-VeloC/AXL/issues/66
So, I think I now have some answers to the original questions:
Test using AXL with pthreads to transfer checkpoints. Does it work correctly? Do we need to add logic to throttle the transfer so it doesn't interfere with application performance? (possibly not, but just want to check to see what the performance is like)
I tested SCR with AXL pthreads transfers and they work. They shouldn't adversely affect application performance, since AXL pthreads already self-limits the number of transfer threads to the lesser of the number of files being transferred and MAX_THREADS (currently 16). So AXL pthreads could only ever use a maximum of 16 threads. Additionally, the CPU is going to task-switch while waiting on IOs to finish, so it's not like AXL would be using those CPUs 100% of the time.
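In miniature, that cap amounts to the following (names are illustrative, not the actual AXL symbols):

```c
#define MAX_THREADS 16   /* illustrative; AXL's current cap */

static int transfer_thread_count(int num_files)
{
    /* use one thread per file, but never more than MAX_THREADS */
    return (num_files < MAX_THREADS) ? num_files : MAX_THREADS;
}
```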
The 16 thread number was chosen because transfer performance really doesn't increase after 16 threads (at least in the limited testing I did). Here are some benchmarks I did back when I was testing pthreads:
Time to transfer 200 files with files sizes ranging between 1-50MB:
Threads | Time (seconds) |
---|---|
1 | 4 |
2 | 3 |
4 | 2.6 |
8 | 2.4 |
16 | 2.4 |
32 | 2.4 |
Time to transfer 20,000 files with files sizes ranging between 1-50KB:
Threads | Time (seconds) |
---|---|
1 | 16.5 |
2 | 12.2 |
4 | 9.6 |
8 | 8.5 |
16 | 8 |
32 | 7.9 |
Does the initiation and completion check for transfers work correctly? (at small and large scales?)
In the common case, yes. SCR will not load a checkpoint that isn't the right size or that isn't marked internally within SCR as complete. If the file is corrupted after being successfully transferred, SCR will not detect it, but that's really not something SCR should be expected to do.
Need to be able to cancel transfer of files on restart. Requires tracking set of transfer handles used (with redundancy to account for failures). Ensure everything was cancelled successfully in SCR_Init during restart. Ensure final transfers are cancelled or complete in the case of scavenge/post-run transfers.
Sync and pthreads transfers will be cancelled when an application dies, since it is the application's own threads doing the transfer.
BB API transfers will continue after the application dies, and there's currently no easy way to cancel those transfers on job restart. In order to cancel a BB transfer on application restart, we would need to keep track of the BB API transfer handles across jobs. We don't currently do this. It's possible we could embed the transfer handles into the AXL state_file, but since the state_file is technically not required, it's not a great solution. A better solution would be to modify the BB API to allow BB_GetTransferList() to return a list of all the user's transfer handles, rather than artificially limiting it to the current job ID's handles. I opened an issue with IBM requesting this change:
https://github.com/IBM/CAST/issues/922
Check that SCR manages any final BBAPI transfer that is completed at the end of an allocation. In particular, SCR normally runs some finalization code to register that a transfer completed successfully, so that it knows the corresponding checkpoint is valid to be used during a restart in the next job allocation. I don't think we're accounting for that right now.
SCR does detect whether a checkpoint file's size is correct, and will not load a previous checkpoint if it is not.
According to @adammoody (https://github.com/LLNL/scr/issues/163#issuecomment-635036312), SCR internally doesn't mark a transfer as complete unless it finishes flushing. This means that if the checkpoint was the right size, but Filo never actually said "I'm done transferring the file", then SCR won't load the checkpoint (note: I didn't actually test this). This is a good thing. It should get around an issue with BBAPI transfers where the destination file is the correct final size but not all the blocks have finished transferring (see https://github.com/LLNL/scr/issues/163#issuecomment-635018484).
Overall, file transfer integrity should be done at a lower level like FILO or AXL. Perhaps we could add an AXL_Checksum(int id) function that the user could optionally call after AXL_Wait()? The nice thing about doing it in AXL is that we could do the checksum in parallel, since AXL is already pthreads-aware.
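A sketch of the per-file work a hypothetical AXL_Checksum() could do, here using zlib's crc32():

```c
#include <stdio.h>
#include <zlib.h>

/* Stream a file through crc32(); the caller would compare the result
 * against a value recorded at flush time. */
unsigned long file_crc32(const char* path)
{
    unsigned char buf[1 << 16];
    FILE* f = fopen(path, "rb");
    if (f == NULL) return 0;

    uLong crc = crc32(0L, Z_NULL, 0);   /* initial CRC value */
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        crc = crc32(crc, buf, (uInt)n);
    }
    fclose(f);
    return crc;
}
```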
Additionally, we proposed increasing file transfer integrity by having AXL use a temporary file name while it's transferring (https://github.com/ECP-VeloC/AXL/issues/66)
Great progress, @tonyhutter . On the application interference question, we also need to do some "noise" tests at medium or large scale perhaps during a DST after a system update. One thing we want check is whether a background transfer slows down the application.
To do this, we can put together an "app" that does MPI_Allreduce(1 double) in a tight loop. We then measure the time it takes to complete some number of those, both with and without a background transfer. It's simple to recreate, but we have a "rednoise" benchmark we can use for this.
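The core of such a noise test is tiny; something like the following, where the iteration count and output are placeholders rather than the actual rednoise benchmark:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000000;          /* placeholder iteration count */
    double in = 1.0, out = 0.0;

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        printf("%d allreduces in %.3f s (%.2f us each)\n",
               iters, t1 - t0, 1e6 * (t1 - t0) / iters);
    }
    MPI_Finalize();
    return 0;
}
```

Run it once with no transfer in the background and once during a flush, and compare the per-iteration times.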
Some background info... For tracking transfers to either cancel them during restart or wait on them to complete or error out within a poststage script, we've made this problem harder on ourselves when we moved to components. It used to be that both Filo and AXL were all part of the monolithic SCR library. In that case, SCR knew the full set of files that had to be transferred, and it was easy to treat that transfer as one logical operation.
While co-designing the BB, we asked IBM to support shared transfer handles. That enables multiple processes, which may even be running on different nodes, to all individually register their own files as part of one larger shared transfer operation. In that case, one process like rank 0 allocates a single handle value, broadcasts it to all processes, and then all processes register their files under that single handle id. You can then query the status or cancel the entire transfer with a single integer id value.
The plan was to record that single value somewhere safe, like replicate it on each node or store it on the parallel file system. With that, it would then be trivial to cancel a transfer on restart. We just need to make sure rank 0 can get the id back and then just rank 0 executes the cancel while everyone else waits in a barrier.
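In code, the shared-handle pattern is roughly the following; get_shared_bb_handle() and register_my_files() are hypothetical stand-ins for the corresponding BBAPI calls:

```c
#include <stdint.h>
#include <mpi.h>

extern int get_shared_bb_handle(uint64_t* handle);   /* hypothetical */
extern int register_my_files(uint64_t handle);       /* hypothetical */

int start_shared_transfer(MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    uint64_t handle = 0;
    if (rank == 0) {
        get_shared_bb_handle(&handle);   /* one handle for the whole job */
    }
    /* every rank now shares the same single integer id */
    MPI_Bcast(&handle, 1, MPI_UINT64_T, 0, comm);

    return register_my_files(handle);    /* each rank adds its own files */
}
```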
This suddenly got much harder now that we have SCR / Filo / AXL.
While we're on it, I keep thinking we should reevaluate whether we should move Filo back into SCR or merge it into AXL as an MPI option. That might help solve some of this. For example, at least Filo understands there is a wider set of files across processes and we might be able to use a single handle id again. That's assuming shared handle performance is acceptable, which we'd want to retest before making that switch. The other option is to record the full set of handle ids somewhere so that we can get them all back. An MPI gather would make that easier, too.
Another thing that comes to mind... Filo is called by every process in MPI_COMM_WORLD during a flush or fetch, and every Filo instance invokes AXL. If the user is running P processes per node, each compute node then runs P AXL instances in parallel. On some machines, people run one MPI process per CPU core, so you can easily get tens of processes per node today.
We may do better if Filo can use a single AXL instance per node. For pthreads, we'd reduce the number of threads we create on the node, and for BB, we'd cut back on the number of handles. To do this, Filo would have to identify the set of ranks that can share an AXL instance and delegate the work to one Filo process. We can determine whether that's worth the effort by testing.
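One way Filo could pick a single AXL leader per node is to split the world communicator by shared-memory domain; a sketch, not existing Filo code:

```c
#include <mpi.h>

/* Build a per-node communicator and flag local rank 0 as the AXL leader. */
int node_leader(MPI_Comm world, MPI_Comm* node_comm, int* is_leader)
{
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, node_comm);

    int node_rank;
    MPI_Comm_rank(*node_comm, &node_rank);
    *is_leader = (node_rank == 0);   /* only this rank would invoke AXL */
    return MPI_SUCCESS;
}
```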
On some machines, people run one MPI process per CPU core, so you can easily get tens of processes per node today.
Keep in mind that AXL is only going to spawn 16 threads if there are >= 16 files to transfer in a single transfer. The user can always switch to sync or some other transfer type, or do transfers in smaller batches of files at a time, to limit the number of threads.
Jitter test
I created 3000 files with random sizes between 0-9MB on the SSD (around 14GB total) and copied them to GPFS. While they were copying, I used a utility that @adammoody gave me to measure the jitter between MPI_Allreduce() calls. The utility sampled for 20,000,000 iterations (about 6 sec). Note that I had to run the bbapi copy with only 1000 files to get around an error ("Too many open files").
Ticks | sync | pthreads | bbapi |
---|---|---|---|
0-9: | 0 | 0 | 0 |
10-19: | 0 | 0 | 0 |
20-29: | 0 | 0 | 0 |
30-39: | 0 | 0 | 0 |
40-49: | 0 | 0 | 0 |
50-59: | 0 | 0 | 0 |
60-69: | 0 | 0 | 0 |
70-79: | 0 | 0 | 0 |
80-89: | 0 | 0 | 0 |
90-99: | 0 | 0 | 0 |
100-109: | 799515 | 701805 | 736913 |
110-119: | 19030830 | 19107556 | 19088667 |
120-129: | 165606 | 143548 | 170098 |
130-139: | 1426 | 1500 | 1089 |
140-149: | 884 | 858 | 879 |
150-159: | 430 | 781 | 621 |
160-169: | 149 | 39395 | 302 |
170-179: | 91 | 344 | 113 |
180-189: | 32 | 210 | 51 |
190-199: | 18 | 212 | 32 |
200-209: | 8 | 310 | 28 |
210-219: | 11 | 243 | 32 |
220-229: | 9 | 238 | 40 |
230-239: | 9 | 1510 | 92 |
240-249: | 9 | 324 | 47 |
250-259: | 12 | 61 | 29 |
260-269: | 12 | 48 | 16 |
270-279: | 7 | 35 | 13 |
280-289: | 7 | 33 | 12 |
290-299: | 2 | 27 | 8 |
300-309: | 5 | 17 | 1 |
310-319: | 1 | 7 | 2 |
320-329: | 0 | 11 | 2 |
330-339: | 2 | 8 | 1 |
340-349: | 3 | 2 | 0 |
350-359: | 0 | 5 | 1 |
360-369: | 1 | 1 | 1 |
370-379: | 1 | 5 | 1 |
380-389: | 2 | 4 | 4 |
390-399: | 0 | 3 | 14 |
400-409: | 1 | 3 | 19 |
410-419: | 1 | 2 | 12 |
420-429: | 3 | 2 | 6 |
430-439: | 2 | 7 | 1 |
440-449: | 1 | 3 | 2 |
450-459: | 2 | 0 | 0 |
460-469: | 6 | 6 | 2 |
470-479: | 5 | 2 | 4 |
480-489: | 12 | 8 | 8 |
490-499: | 15 | 16 | 19 |
500 | 870 | 850 | 818 |
There's not a huge difference in the results. This is on one of our beefy IBM nodes with a burst buffer and a ton of cores.
Cool! Glad to see you got that working already, @tonyhutter . That was fast.
For single node tests, the fwqmpi benchmark will be of most interest. In the past, we used to run this with one MPI rank per core, since that was how apps often ran. I think we should still do that for lassen, but we might also run one rank per GPU, which is more typical of CORAL apps.
This benchmark has a perl script you can use to generate a graph, where each core generates a scatter plot of points.
The last time we tried this, we ran on cab and had 16 cores/node. Lassen is now at 40-42, so I'm not sure how well all of this will hold up.
Also, it'd be helpful to collect results with no transfer to server as a baseline.
@kathrynmohror and @adammoody have identified some additional things we should test/verify with SCR: