Closed tonyhutter closed 4 years ago
Looks good to me. Thanks @tonyhutter .
@tonyhutter , would you please update this PR to use 4-space indents and add braces even for single-line if/while blocks throughout?
Can you remind me, what is the plan to look up the BB transfer ID for a cancel?
I thought I had that straight, but now I'm forgetting again. The AXL user will call either AXL_Cancel(id) or AXL_Stop().
Sorry about the indents - I've fixed those in my latest push.
Correct, it's AXL_Stop() or AXL_Cancel(id) in the normal case. For old transfers, the plan is to add a new function to AXL to deal with cleaning them up. We would either update AXL_Finalize() -> AXL_Finalize(char *path) or add a new AXL_Cleanup(char *path), where path is an individual file or a directory of files to clean up. That function would remove old transfer files with the ._AXL extension, or for BBAPI, it would check the transfer status for the file, and rename the file to its final name or delete it depending on the BB transfer status (success or failure). This is to be done in a follow-on PR.
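As a rough illustration of the non-BBAPI half of that cleanup, here is a minimal sketch that sweeps a directory and deletes entries ending in ._AXL. The names has_tmp_ext and sweep_stale_tmp_files are hypothetical and not part of AXL, and the BBAPI branch (query the transfer status, then rename or delete) is left out because it depends on the vendor API.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TMP_EXT "._AXL" /* temporary extension described in this PR */

/* Return 1 if name has a non-empty stem followed by the temporary
 * extension, else 0. */
static int has_tmp_ext(const char *name)
{
    size_t n = strlen(name);
    size_t e = strlen(TMP_EXT);

    return n > e && strcmp(name + n - e, TMP_EXT) == 0;
}

/* Hypothetical helper: delete every entry in dir whose name ends in
 * "._AXL". Returns the number of files removed, or -1 if the directory
 * cannot be opened. */
static int sweep_stale_tmp_files(const char *dir)
{
    DIR *d;
    struct dirent *ent;
    char path[4096];
    int removed = 0;

    assert(dir != NULL);
    d = opendir(dir);
    if (d == NULL) {
        return -1;
    }

    while ((ent = readdir(d)) != NULL) {
        int n;

        if (!has_tmp_ext(ent->d_name)) {
            continue;
        }
        n = snprintf(path, sizeof(path), "%s/%s", dir, ent->d_name);
        if (n < 0 || (size_t)n >= sizeof(path)) {
            continue; /* path too long for the buffer, skip it */
        }
        if (unlink(path) == 0) {
            removed++;
        }
    }

    closedir(d);
    return removed;
}
```

The sketch only handles the directory form of path; a single-file path would need a stat() check first.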
What do you mean by "normal case" vs "old transfers"?
normal case - your application calls AXL_Cancel(id) to purposefully cancel an ongoing transfer that it started.
old transfers - your application starts a transfer, the app crashes and restarts, and now your app is dealing with your old transfer files (the ones with the ._AXL extensions)
Ok, yes, we'll need to think through our use cases with the cleanup API.
On a related topic, the reason the AXL state file exists is so that one can call Cancel and Stop on transfers that were initiated by an earlier process than the one making the Cancel/Stop call. AXL should record whatever it needs in this file to maintain its state through process failure.
The plan was for SCR's integration to look like this:
<run 1 starts>
AXL_Init(/path/to/state/file)
<define AXL transfer>
<AXL records info about that transfer in its state file>
<start AXL transfer>
<run 1 dies>
<run 2 starts>
AXL_Init(/path/to/state/file)
<AXL reads in its state file, which contains info about all transfers that might be ongoing>
AXL_Stop()
The state file can record the mapping of AXL transfer id to things like the list of source and destination files associated with that id, the metadata for each of those files, the transfer method being used, and any vendor-specific information like the BB transfer handle. With this, the application can use any or all of Test/Wait/Cancel/Stop after being restarted, but to do so, the application is required to provide a path to a file where AXL can store its state.
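To make that mapping concrete, here is a minimal sketch of what one state-file record could hold and how it might be written out. The struct layout, the enum values, and the text format are all assumptions for illustration and are not AXL's actual state-file format. Per-file metadata is omitted to keep the sketch short.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical transfer methods, for illustration only. */
enum xfer_method { XFER_SYNC, XFER_PTHREAD, XFER_BBAPI };

/* One source/destination pair belonging to a transfer. */
struct file_pair {
    const char *src;
    const char *dst;
};

/* Hypothetical record: everything needed to Test/Wait/Cancel/Stop a
 * transfer after the process that started it has died. */
struct state_record {
    int id;                           /* AXL transfer id */
    enum xfer_method method;          /* transfer method in use */
    unsigned long long vendor_handle; /* e.g. BB transfer handle, 0 if unused */
    size_t nfiles;                    /* number of entries in files */
    const struct file_pair *files;    /* source/destination list */
};

/* Write one record as text: a header line, then one tab-separated
 * "src<TAB>dst" line per file. Returns 0 on success, -1 on error.
 * Paths containing tabs or newlines are not handled by this sketch. */
static int state_record_write(FILE *fp, const struct state_record *rec)
{
    size_t i;

    assert(fp != NULL && rec != NULL);
    if (fprintf(fp, "id=%d method=%d handle=%llu nfiles=%zu\n",
                rec->id, (int)rec->method, rec->vendor_handle,
                rec->nfiles) < 0) {
        return -1;
    }
    for (i = 0; i < rec->nfiles; i++) {
        if (fprintf(fp, "%s\t%s\n", rec->files[i].src,
                    rec->files[i].dst) < 0) {
            return -1;
        }
    }
    return fflush(fp) == 0 ? 0 : -1;
}
```

Flushing after each record keeps the file as current as possible if the process dies shortly after starting a transfer; a real implementation would likely also want an atomic write-then-rename.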
Sure, you could do an AXL_Stop(), but all the other functions require an AXL ID. You'd need to add something to the AXL API for the user to recover those IDs.
Alternatively, if they pass a state_file in AXL_Init(), you could simply do:
That's right. With the current AXL API, the caller of AXL is responsible for saving the AXL ID values somewhere if it wants to be able to use them across process failure.
Right now, SCR just calls AXL_Stop() on every restart. But that's overkill.
From past experience, we found that a significant fraction of failures (30%) just kill the application and otherwise leave all compute nodes intact. For those, the restarted job lands on the same nodes, all files exist, and there is no need to move files around. In that case, there is no need to actually stop the ongoing transfers.
A nice optimization that we could add to SCR is to allow the transfer to continue/restart whenever SCR identifies that all source files are still valid, which changes the above to this:
<run 2 starts>
AXL_Init(/path/to/state/file)
<AXL reads in its state file, which contains info about all transfers that might be ongoing>
<SCR determines whether any source file must be rebuilt or shuffled>
if so
    AXL_Stop() // stop all transfers
else
    AXL_Wait(id) // just let the previous transfer finish
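The decision in that flow can be sketched as a small pure function. The names below (restart_action, choose_restart_action, and the per-file validity flags) are hypothetical; how SCR actually decides whether a source file is still valid is not covered here, and the caller would map the result onto AXL_Stop() or AXL_Wait(id).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of the restart check. */
enum restart_action {
    RESTART_STOP_ALL,     /* some source file is unusable: stop all transfers */
    RESTART_WAIT_PREVIOUS /* all sources intact: let the old transfer finish */
};

/* src_valid[i] is nonzero when source file i survived the failure and
 * needs no rebuild or shuffle. An empty list counts as "all valid". */
static enum restart_action choose_restart_action(const int *src_valid,
                                                 size_t n)
{
    size_t i;

    assert(src_valid != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!src_valid[i]) {
            return RESTART_STOP_ALL;
        }
    }
    return RESTART_WAIT_PREVIOUS;
}
```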
Perhaps we want an AXL_Restart(id) or something to make things work better in the general case like pthreads. Or AXL could just kick off a new pthread transfer when it sees an existing, unfinished pthread transfer in its state file during AXL_Init.
I think we have like 80% of what we need to support this test/wait/cancel/stop after a restart already working in AXL, but the requirements for that last 20% are not well defined yet. We'll need to work from the SCR side for a while to figure out where the gaps are. I can pitch in on the SCR side by later next week.
I opened #76 for continued discussion on the state_file.
This patch will add a temporary AXL extension to the file while it is copying. The extension is removed after the file copy is complete. For example, if you're copying to /tmp/file1, AXL will actually copy to /tmp/file1._AXL, and then rename to /tmp/file1 at the end of the transfer.
Additionally, a BB API transfer will also encode the transfer handle number into the extension like:
/tmp/file1._AXL13456
That way we can later recover the transfer handle number.
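The naming scheme can be sketched as a pair of helpers, one that builds the temporary name and one that recovers the final name and handle from it. The function names tmp_name_build and tmp_name_parse are hypothetical and not taken from the patch, and the handle is assumed to be an unsigned integer printed in decimal.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TMP_EXT "._AXL" /* temporary extension described in this PR */

/* Build "<dest>._AXL", or "<dest>._AXL<handle>" when has_handle is
 * nonzero. Returns 0 on success, -1 if the result does not fit in out. */
static int tmp_name_build(const char *dest, int has_handle,
                          unsigned long long handle,
                          char *out, size_t outlen)
{
    int n;

    assert(dest != NULL && out != NULL);
    if (has_handle) {
        n = snprintf(out, outlen, "%s%s%llu", dest, TMP_EXT, handle);
    } else {
        n = snprintf(out, outlen, "%s%s", dest, TMP_EXT);
    }
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Split a temporary name into the final name and the optional handle.
 * Returns 0 on success, -1 if tmp is not a well-formed temporary name. */
static int tmp_name_parse(const char *tmp, char *dest, size_t destlen,
                          int *has_handle, unsigned long long *handle)
{
    const char *ext = NULL;
    const char *p = tmp;
    const char *digits;
    size_t len;

    /* Locate the last occurrence of the extension. */
    while ((p = strstr(p, TMP_EXT)) != NULL) {
        ext = p;
        p++;
    }
    if (ext == NULL) {
        return -1;
    }

    /* Anything after the extension must be decimal digits. */
    digits = ext + strlen(TMP_EXT);
    for (p = digits; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p)) {
            return -1;
        }
    }

    len = (size_t)(ext - tmp);
    if (len == 0 || len >= destlen) {
        return -1;
    }
    memcpy(dest, tmp, len);
    dest[len] = '\0';

    *has_handle = (*digits != '\0');
    *handle = *has_handle ? strtoull(digits, NULL, 10) : 0;
    return 0;
}
```

With these helpers, building a name for /tmp/file1 with handle 13456 and parsing it back yields the original destination and handle, and a name without the extension is rejected.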
Fixes: #66