ECP-VeloC / AXL

Asynchronous Transfer Library
MIT License

Use temporary file name when transferring #73

Closed (tonyhutter closed this 4 years ago)

tonyhutter commented 4 years ago

This patch adds a temporary AXL extension to the file name while the file is being copied. The extension is removed after the copy is complete. For example, if you're copying to /tmp/file1, AXL will actually copy to /tmp/file1._AXL, and then rename it to /tmp/file1 at the end of the transfer.
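
The copy-then-rename pattern described above can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the code in this PR: the function name copy_via_temp_name and the TMP_EXT macro are hypothetical, and the copy is a simple synchronous stdio loop.

```c
#include <stdio.h>

/* The PR describes "._AXL" as the temporary extension; the macro name is hypothetical. */
#define TMP_EXT "._AXL"

/* Hypothetical helper: copy src to "<dest>._AXL", then rename it to dest
 * once all data has been written. Returns 0 on success, -1 on failure. */
int copy_via_temp_name(const char *src, const char *dest)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s%s", dest, TMP_EXT);
    if (n < 0 || (size_t)n >= sizeof(tmp)) {
        return -1; /* destination path too long */
    }

    FILE *in = fopen(src, "rb");
    if (in == NULL) {
        return -1;
    }
    FILE *out = fopen(tmp, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    char buf[65536];
    size_t got;
    int rc = 0;
    while ((got = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, got, out) != got) {
            rc = -1;
            break;
        }
    }
    if (ferror(in)) {
        rc = -1;
    }
    fclose(in);
    if (fclose(out) != 0) {
        rc = -1;
    }

    if (rc != 0) {
        remove(tmp); /* do not leave a partial temporary file behind */
        return -1;
    }

    /* The final name only appears after the data is complete. */
    if (rename(tmp, dest) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

With this sketch, copy_via_temp_name("/ssd/file1", "/tmp/file1") would write /tmp/file1._AXL first, so a reader never sees a partially written /tmp/file1.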

Additionally, a BB API transfer will encode the transfer handle number into the extension, like:

/tmp/file1._AXL13456

That way we can later recover the transfer handle number.
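
As a rough sketch of how the handle could be encoded into and recovered from the file name: the helper names make_bb_tmp_name and parse_bb_handle are hypothetical, and the code assumes the handle fits in an unsigned 64-bit integer written in decimal. It is not taken from the patch itself.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TMP_EXT "._AXL" /* extension named in this PR; macro name is hypothetical */

/* Hypothetical helper: build "<dest>._AXL<handle>". Returns 0 on success, -1 if out is too small. */
int make_bb_tmp_name(char *out, size_t outlen, const char *dest, uint64_t handle)
{
    int n = snprintf(out, outlen, "%s%s%" PRIu64, dest, TMP_EXT, handle);
    if (n < 0 || (size_t)n >= outlen) {
        return -1;
    }
    return 0;
}

/* Hypothetical helper: recover the handle from a name like "/tmp/file1._AXL13456".
 * Returns 0 on success, -1 if the name carries no handle. */
int parse_bb_handle(const char *path, uint64_t *handle)
{
    const char *ext = NULL;
    const char *p = path;

    /* Find the last occurrence of "._AXL" in the path. */
    while ((p = strstr(p, TMP_EXT)) != NULL) {
        ext = p;
        p += strlen(TMP_EXT);
    }
    if (ext == NULL) {
        return -1;
    }

    const char *digits = ext + strlen(TMP_EXT);
    if (*digits < '0' || *digits > '9') {
        return -1; /* plain "._AXL" with no handle, or unexpected text */
    }

    errno = 0;
    char *end = NULL;
    unsigned long long value = strtoull(digits, &end, 10);
    if (errno != 0 || *end != '\0') {
        return -1; /* overflow or trailing non-digit characters */
    }
    *handle = (uint64_t)value;
    return 0;
}
```

A plain "._AXL" name (no digits) is deliberately rejected by the parser, so the two kinds of temporary file can be told apart.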

Fixes: #66

adammoody commented 4 years ago

Looks good to me. Thanks @tonyhutter .

adammoody commented 4 years ago

@tonyhutter , would you please update this PR to use 4-space indents and add braces even for single-line if/while blocks throughout?

adammoody commented 4 years ago

Can you remind me, what is the plan to look up the BB transfer ID for a cancel?

I thought I had that straight, but now I'm forgetting again. The AXL user will call either AXL_Cancel(id) or AXL_Stop().

tonyhutter commented 4 years ago

Sorry about the indents - I've fixed those in my latest push.

Correct, it's AXL_Stop() or AXL_Cancel(id) in the normal case. For old transfers, the plan is to add a new function to AXL to deal with cleaning them up. We would either update AXL_Finalize() -> AXL_Finalize(char *path) or add a new AXL_Cleanup(char *path), where path is an individual file or a directory of files to clean up. That function would remove old transfer files with the ._AXL extension or, for BBAPI, check the transfer status for the file and then rename the file to its final name or delete it, depending on the BB transfer status (success or failure). This is to be done in a follow-on PR.
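
For the non-BBAPI half of that cleanup, a minimal sketch might look like the following. It assumes POSIX directory calls and a hypothetical helper name, cleanup_tmp_files; it only handles a directory (not a single file) and is not the proposed AXL_Cleanup() itself, whose design was still open at this point.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TMP_EXT "._AXL" /* extension named in this PR; macro name is hypothetical */

/* True if name ends with exactly "._AXL" (no BB handle digits appended). */
static int ends_with_tmp_ext(const char *name)
{
    size_t n = strlen(name);
    size_t e = strlen(TMP_EXT);
    return n > e && strcmp(name + n - e, TMP_EXT) == 0;
}

/* Hypothetical helper: delete leftover "*._AXL" files directly inside dir.
 * Returns the number of files removed, or -1 if dir cannot be opened. */
int cleanup_tmp_files(const char *dir)
{
    DIR *d = opendir(dir);
    if (d == NULL) {
        return -1;
    }

    int removed = 0;
    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        if (!ends_with_tmp_ext(ent->d_name)) {
            continue;
        }
        char path[4096];
        int n = snprintf(path, sizeof(path), "%s/%s", dir, ent->d_name);
        if (n < 0 || (size_t)n >= sizeof(path)) {
            continue; /* path too long; skip it */
        }
        if (unlink(path) == 0) {
            removed++;
        }
    }
    closedir(d);
    return removed;
}
```

Files whose extension carries a BB handle (for example ._AXL13456) are left alone here, since those need a status check before they can be renamed or deleted.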

adammoody commented 4 years ago

What do you mean by "normal case" vs "old transfers"?

tonyhutter commented 4 years ago

normal case - your application calls AXL_Cancel(id) to purposefully cancel an ongoing transfer that it started.

old transfers - your application starts a transfer, the app crashes and restarts, and now your app is dealing with your old transfer files (the ones with the ._AXL extensions).

adammoody commented 4 years ago

Ok, yes, we'll need to think through our use cases with the cleanup API.

On a related topic, the reason the AXL state file exists is so that one can call Cancel and Stop on transfers that were initiated before the process making the Cancel/Stop call was started. AXL should record whatever it needs in this file to maintain its state through process failure.

The plan was for SCR's integration to look like this:

<run 1 starts>
AXL_Init(/path/to/state/file)
<define AXL transfer>
<AXL records info about that transfer in its state file>
<start AXL transfer>
<run 1 dies>

<run 2 starts>
AXL_Init(/path/to/state/file)
<AXL reads in its state file, which contains info about all transfers that might be ongoing>
AXL_Stop()

The state file can record the mapping of AXL transfer id to things like the list of source and destination files associated with that id, the metadata for each of those files, the transfer method being used, and any vendor-specific information like the BB transfer handle. With this, the application can use any or all of Test/Wait/Cancel/Stop after being restarted, but to do so, the application is required to provide a path to a file where AXL can store its state.
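
To make the idea of a persisted id-to-transfer mapping concrete, here is a toy sketch. Everything in it is an assumption for illustration: the xfer_record_t layout, the tab-separated text format, and the save_records/load_records names are hypothetical and do not describe AXL's actual state file format. Paths containing tab characters are not supported by this simple format.

```c
#include <stdio.h>

/* Hypothetical record: one source/destination pair for a transfer id. */
typedef struct {
    int id;                        /* AXL transfer id */
    char src[256];                 /* source file */
    char dest[256];                /* destination file */
    char method[32];               /* transfer method, e.g. "pthread" or "bbapi" */
    unsigned long long bb_handle;  /* vendor-specific handle, 0 if unused */
} xfer_record_t;

/* Write all records to path, one tab-separated line each. Returns 0 on success. */
int save_records(const char *path, const xfer_record_t *recs, int count)
{
    FILE *f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }
    for (int i = 0; i < count; i++) {
        fprintf(f, "%d\t%s\t%s\t%s\t%llu\n",
                recs[i].id, recs[i].src, recs[i].dest,
                recs[i].method, recs[i].bb_handle);
    }
    return (fclose(f) == 0) ? 0 : -1;
}

/* Read up to max records back after a restart. Returns the number read;
 * a missing state file is treated as "no outstanding transfers". */
int load_records(const char *path, xfer_record_t *recs, int max)
{
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return 0;
    }
    int n = 0;
    while (n < max &&
           fscanf(f, "%d\t%255[^\t]\t%255[^\t]\t%31[^\t]\t%llu",
                  &recs[n].id, recs[n].src, recs[n].dest,
                  recs[n].method, &recs[n].bb_handle) == 5) {
        n++;
    }
    fclose(f);
    return n;
}
```

In the flow above, run 1 would call something like save_records() before starting the transfer, and run 2 would call load_records() during initialization to learn which ids, files, and handles may still be in flight.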

tonyhutter commented 4 years ago

"With this, the application can use any or all of Test/Wait/Cancel/Stop after being restarted, but to do so, the application is required to provide a path to a file where AXL can store its state."

Sure, you could do an AXL_Stop(), but all the other functions require an AXL ID. You'd need to add something to the AXL API for the user to recover those IDs.

Alternatively, if they pass a state_file in AXL_Init(), you could simply do:

  1. For non-BB API transfers, remove all the old ._AXL files in the transfer list.
  2. For BB API transfers, check the BB API transfer status.
    • If it reports a successful transfer, rename the file to its final filename, since it must have transferred in the background between jobs.
    • If it reports the transfer is still in progress, cancel the transfer and delete the ._AXL file.
    • If it reports any other status, delete the ._AXL file.
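
The decision rules in the list above can be written as a small pure function. This is a sketch only: the enum types and the name choose_cleanup_action are hypothetical, and the status values are stand-ins rather than the real BB API status codes.

```c
/* Hypothetical, simplified transfer status; not the BB API's own status type. */
typedef enum {
    XFER_SUCCESS,
    XFER_IN_PROGRESS,
    XFER_OTHER
} xfer_status_t;

/* What to do with a leftover temporary file found at startup. */
typedef enum {
    ACT_DELETE_TMP,             /* delete the ._AXL file */
    ACT_RENAME_TO_FINAL,        /* rename the ._AXL file to its final name */
    ACT_CANCEL_THEN_DELETE_TMP  /* cancel the transfer, then delete the ._AXL file */
} cleanup_action_t;

/* Map (transfer type, status) to a cleanup action, following the rules above.
 * For non-BB API transfers the status argument is ignored. */
cleanup_action_t choose_cleanup_action(int is_bbapi, xfer_status_t status)
{
    if (!is_bbapi) {
        return ACT_DELETE_TMP;
    }
    switch (status) {
    case XFER_SUCCESS:
        return ACT_RENAME_TO_FINAL;
    case XFER_IN_PROGRESS:
        return ACT_CANCEL_THEN_DELETE_TMP;
    default:
        return ACT_DELETE_TMP;
    }
}
```

Keeping the decision separate from the file operations makes each rule easy to check on its own before any file is renamed or removed.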

adammoody commented 4 years ago

That's right. With the current AXL API, the caller of AXL is responsible for saving the AXL ID values somewhere if it wants to be able to use them across process failure.

adammoody commented 4 years ago

Right now, SCR just calls AXL_Stop on every restart. But that's overkill.

From past experience, we found that a significant number of failures (30%) just kill the application and otherwise leave all compute nodes intact. For those, the restarted job lands on the same nodes, all files exist, and there is no need to move files around. In that case, there is no need to actually stop the ongoing transfers.

A nice optimization that we could add to SCR is to allow the transfer to continue/restart whenever SCR identifies that all source files are still valid, which changes the above to this:

<run 2 starts>
AXL_Init(/path/to/state/file)
<AXL reads in its state file, which contains info about all transfers that might be ongoing>

SCR determines whether any source file must be rebuilt or shuffled
if so
  AXL_Stop() // stop all transfers
else
  AXL_Wait(id) // just let the previous transfer finish

Perhaps we want an AXL_Restart(id) or something similar to make things work better in the general case, such as pthreads. Alternatively, AXL could just kick off a new pthread transfer when it sees an existing, unfinished pthread transfer in its state file during AXL_Init.

adammoody commented 4 years ago

I think we have like 80% of what we need to support test/wait/cancel/stop after a restart already working in AXL, but the requirements for that last 20% are not well defined yet. We'll need to work from the SCR side for a while to figure out where the gaps are. I can pitch in on the SCR side later next week.

tonyhutter commented 4 years ago

I opened #76 for continued discussion on the state_file.