Check destination file sizes before resuming transfers

tonyhutter commented 3 years ago

This changes AXL_Dispatch() to first check the destination file sizes before resuming the copy. That way, if the files were already fully transferred at the time of resume and the source files are gone (as would be the case with an SCR post-stage), we simply finalize the destination files (remove their ._AXL extension, apply metadata, etc).

adammoody commented 3 years ago

Even in post-stage, we need to call the BBAPI to determine whether the transfer succeeded. Ideally, our normal Create/Resume/Wait logic should handle that.

Where do things break down if we don't do this?

tonyhutter commented 3 years ago

You're right, I'll move this block:

            rc = axl_check_file_sizes(id);
            if (rc == AXL_SUCCESS) {
                /* Destination files are already the correct size, we're done */
                kvtree_util_set_int(file_list, AXL_KEY_STATUS, AXL_STATUS_DEST);
                goto end;
            }

...to both axl_sync_resume() & axl_pthread_resume(), since in those cases the file size will always tell us if the transfer is complete or not.
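For illustration, a size check of this kind could look roughly like the sketch below. This is not the actual axl_check_file_sizes() implementation: dest_sizes_match() is a hypothetical helper, and it assumes the expected sizes were recorded earlier (for example, in the state file) so the check still works once the source files are gone.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper (not part of AXL): returns 1 if every destination
 * file exists and has exactly the expected size, 0 otherwise. */
static int dest_sizes_match(const char **dest_paths, const off_t *expected, int n)
{
    assert(dest_paths != NULL && expected != NULL);

    for (int i = 0; i < n; i++) {
        struct stat st;
        if (stat(dest_paths[i], &st) != 0) {
            return 0; /* destination is missing or cannot be inspected */
        }
        if (st.st_size != expected[i]) {
            return 0; /* partially written, or otherwise the wrong size */
        }
    }
    return 1;
}
```

A caller would treat a return of 1 as "sizes look complete" and fall back to the normal resume path on 0.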

Note that for SCR post_stage, we're finalizing the transfer from the login node, which isn't running the BB client and thus can't check the BB transfer status. To get around this, poststage will do an AXL_XFER_SYNC resume on the old BB transfer, which just does the simple file size check and finalization. This could be problematic if the file size was correct, but the BB status wasn't "BBFULLSUCCESS". I'm not sure how you'd get around this though, short of running the BB client on the login node, doing some ugly pdsh call to a BB node to get the transfer status, or doing a CRC check on the file.

adammoody commented 3 years ago

A couple of things.

This could be problematic if the file size was correct, but the BB status wasn't "BBFULLSUCCESS".

Yeah, that's exactly the point I'm worried about. There is no guarantee that a correct file size implies the file contents are correct. We have no control over how the BB software actually moves the bytes of a file. They may truncate to the correct size and then fill in the contents, or they may write the file in arbitrary order. Even if they tell us we could use the size in their implementation today, they might drop a software release tomorrow that breaks it. The agreement with IBM was that we'd check the BB transfer status.
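To make the concern concrete, here is a small sketch of the "truncate to the correct size, then fill in the contents" case. It says nothing about how the BB software actually behaves; preallocate_file() and contents_match() are hypothetical helpers that only show that a file can report the right size while holding none of the expected data.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` and extend it to `size` bytes
 * without writing any data. Returns 0 on success, -1 on error. */
static int preallocate_file(const char *path, off_t size)
{
    assert(path != NULL);
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }
    int rc = ftruncate(fd, size);
    close(fd);
    return rc;
}

/* Hypothetical helper: returns 1 if the first `len` bytes of `path`
 * equal `expected`, 0 otherwise. Limited to 256 bytes to keep the
 * sketch short. */
static int contents_match(const char *path, const char *expected, size_t len)
{
    char buf[256];
    assert(path != NULL && expected != NULL);
    if (len > sizeof(buf)) {
        return 0;
    }
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return 0;
    }
    ssize_t n = read(fd, buf, len);
    close(fd);
    return n == (ssize_t)len && memcmp(buf, expected, len) == 0;
}
```

After preallocate_file(path, 5), stat() reports a size of 5, yet contents_match(path, "hello", 5) returns 0 because the extended region reads back as zero bytes. A size-only check would accept that file.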

Also, the poststage script won't technically run on a login node -- it runs on the job script node. I don't remember if the BBAPI is valid from the job script node, but I think the bbcmd is meant to work from the job script node. Somewhere in the IBM reports, I think they have example poststage scripts that wait on and check the status of a transfer. We'll need to copy what they did. We might have tested that in Ben's work, too.

adammoody commented 3 years ago

In a quick search, here is the best example phase 2 stageout script I've seen so far:

https://github.com/IBM/CAST/blob/master/bb/scripts/stageout_user_phase2_bscfs.pl

This queries for the list of transfer handles and checks the status of each one.

tonyhutter commented 3 years ago

Ah good, if the BBAPI is available in post-stage then we probably don't even need this PR. I'll do a quick sanity test.

adammoody commented 3 years ago

@tonyhutter, I haven't tried the above bscfs script, but it should be a close fit to what we need. Does a modified version of that work as far as getting the set of transfer handles and the status for each one?

tonyhutter commented 3 years ago

@adammoody ideally I'd rather not do that, as you'd have AXL spawning off a perl script to test the transfer status, which is ugly.

If it turns out we're unable to check the transfer status in poststage, then we could checksum the source files, add the checksums to the state_file, and then verify the checksums in post_stage. Hopefully it doesn't come to that...
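As a rough sketch of that fallback, the checksum could be computed as below. CRC-32 is an assumption here (the thread mentions a CRC check but names no algorithm), and crc32_update() and crc32_file() are hypothetical helpers rather than AXL functions. Storing the value in the state_file and comparing it in post_stage is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). Pass 0 as the
 * initial crc; the return value can be passed back in to continue
 * over further chunks. */
static uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift out one bit; XOR in the polynomial if that bit was set */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Hypothetical helper: compute the CRC-32 of a whole file.
 * Returns 0 and stores the checksum in *out, or -1 on I/O error. */
static int crc32_file(const char *path, uint32_t *out)
{
    assert(path != NULL && out != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        return -1;
    }

    unsigned char buf[4096];
    uint32_t crc = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
        crc = crc32_update(crc, buf, n);
    }

    int err = ferror(fp);
    fclose(fp);
    if (err) {
        return -1;
    }
    *out = crc;
    return 0;
}
```

The idea would be to call crc32_file() on each source file before the transfer, record the result, and call it again on the destination in post_stage. The obvious cost is reading every file in full twice.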

tonyhutter commented 3 years ago

I'm finding that bbcmd may be doing some things outside of the BBAPI to make it work on the post-stage node. I've opened:

Can't connect to bbproxy when calling BB_InitLibrary() in post-stage https://github.com/IBM/CAST/issues/1002

adammoody commented 3 years ago

Yeah, I seem to remember that it's not valid to call the BBAPI from the launch nodes. I think the BBAPI is only valid from compute nodes. One has to use bbcmd from the launch nodes. Though I forget why.

tonyhutter commented 3 years ago

@adammoody yea, Tom Gooding just confirmed that in the bug I opened. I asked him if it was sufficient to just check the file sizes and am waiting on an answer. That is, does the destination file size reported by the BBAPI/GPFS always represent valid data, or can the data be wrong if there was an error in the transfer?

adammoody commented 3 years ago

Let's use bbcmd to check the transfer status. That was the plan we worked up with IBM when we co-designed this all with them.

tonyhutter commented 3 years ago

@adammoody ok we'll spawn bbcmd and scrape the output.
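A minimal sketch of the spawn-and-scrape step, using popen(). The exact bbcmd invocation and its output format are not shown in this thread, so the command line is a parameter here, and matching on a substring such as "BBFULLSUCCESS" is an assumption about what the output contains. command_output_contains() is a hypothetical helper, not AXL code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: run `cmd` through the shell and scan its stdout.
 * Returns 1 if any output line contains `needle`, 0 if none does, and
 * -1 if the command could not be started or exited with nonzero status. */
static int command_output_contains(const char *cmd, const char *needle)
{
    assert(cmd != NULL && needle != NULL);

    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL) {
        return -1;
    }

    char line[1024];
    int found = 0;
    /* Read all output so the child is never blocked on a full pipe */
    while (fgets(line, sizeof(line), pipe) != NULL) {
        if (strstr(line, needle) != NULL) {
            found = 1;
        }
    }

    int status = pclose(pipe);
    if (status != 0) {
        return -1; /* treat a failed command as "status unknown" */
    }
    return found;
}
```

Anything other than a return of 1 should be treated as "not confirmed complete", so a bbcmd failure never gets mistaken for a successful transfer.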

tonyhutter commented 3 years ago

We can probably close this PR, since we're now spawning off bbcmd to check the final transfer status when resuming BBAPI transfers.

ECP-VeloC / AXL

Check destination file sizes before resuming transfers #95