Expected behavior of AXL after being killed

tonyhutter commented 5 years ago

Let's say you SIGKILL or crash an app in the middle of an AXL transfer, and then reload it with the same state file (via AXL_Init), what is the expected behaviour? Should the failed transfer be resumed as part of the call to AXL_Init? Should AXL_Init return an error and do nothing?

I started writing a test case for this but didn't know what to expect.

gonsie commented 5 years ago

from today's meeting, if AXL Init notices that it is restarting (i.e., a previous state file exists) it should do nothing / have no smarts. It would be up to the orchestrating layer above AXL to call stop or resume.

tonyhutter commented 5 years ago

So basically applications calling AXL should do this:

<app is running>
<crash>
<re-run app>
    AXL_Init(state_file)
    id = AXL_Create(xtype, name) //should give you back your old ID
    AXL_Cancel(id)
    AXL_Free(id)

gonsie commented 5 years ago

For apps that didn't want to deal with ID's it could be:

AXL_Init(state_file);
AXL_Stop();

but we have to be clear that this stops all transfers and isn't safe if AXL is being used by 2 separate components.

adammoody commented 5 years ago

FYI, here is where SCRV stops flushes during SCR_Init:

SCR_Init() --> scr_flush_async_stop(): https://github.com/LLNL/scr/blob/82a3c43d8243c61fac331b64539535f6cdfa3961/src/scr.c#L1645

scr_flush_async_stop() --> Filo_Flush_stop(): https://github.com/LLNL/scr/blob/82a3c43d8243c61fac331b64539535f6cdfa3961/src/scr_flush_async.c#L37

Filo_Flush_stop() --> filo_axl_stop --> AXL_Stop: https://github.com/ECP-VeloC/filo/blob/4c1bd39aae0208b17061a772ed05d80e07f38f58/src/filo.c#L953

tonyhutter commented 5 years ago

Now that I'm digging into the code I see there's no way to get your old ID back. So my example in https://github.com/ECP-VeloC/AXL/issues/57#issuecomment-501430547 doesn't work. Here's what would happen:

<app is running with id = 1>
<crash>
<re-run app>
    AXL_Init(state_file)
    id = AXL_Create(xtype, name) //gives you a new id of 2
    AXL_Cancel(id)
    AXL_Free(id)

So in reality the application would have to remember it's id from the time it crashed, and do this:

    AXL_Init(state_file)
    AXL_Cancel(old_id)
    AXL_Free(old_id)

This is a problem, as I'm guessing most apps are not going to diligently save their id to disk so that they can do a proper cleanup in case they crash.

One solution is to update AXL_Create(xtype, name) to cancel/cleanup any transfers with the same name that are not associated with a running process. That's way, the user doesn't have to do anything special to cleanup after a crash. They just restart the app (with the same name), and things get automatically cleaned up. We'd need to store the caller's process id in the kvtree when we call AXL_Create so that we could later use it to detect if the old process is still running. Yes, it's possible the PIDs can wrap and you could have some random process that has the same PID as a defunct transfer, but in that case the old transfer simply doesn't get immediately cleaned up, which isn't a big deal.

Alternatively, we can do the same "cancel/cleanup all transfers for dead processes" as part of AXL_Init(). The upside of this is that we can guarantee it will clean up all old transfers. The downside to that is that the caller incurs the start-up penalty for cancelling/cleaning up all transfers, including transfers not associated with it.

tonyhutter commented 4 years ago

Alternatively, we can do the same "cancel/cleanup all transfers for dead processes" as part of AXL_Init(). The upside of this is that we can guarantee it will clean up all old transfers. The downside to that is that the caller incurs the start-up penalty for cancelling/cleaning up all transfers, including transfers not associated with it.

Additional info after some digging for https://github.com/LLNL/scr/issues/163:

We don't need to cleanup transfers for sync/pthreads since they only transfer when the process is running. If the process is killed, the transfer is killed with it.
BBAPI transfers run independently of AXL. So if you kill off an AXL process, it won't kill the transfer (I verified this experimentally). We could look up the BB transfer handle if the user passed a state_file to AXL_Init, but I don't know if anyone is actually doing that. Even if we know the BB transfer handle, I don't know if BBAPI would let us cancel the transfer if we didn't have the same same job ID and jobstep ID. I say that because BB_GetTransferList() only returns the list of transfers for the current job/jobstep ID. So if you ran a job, it got killed, and you re-ran it, you'd have a new job ID, and couldn't look up the old transfer you initiated. It sounds like BBAPI may be using job/jobstep ID as an ACL mechanism (so you don't kill other people's transfers).

TL;DR: I don't see the need to cancel transfers for sync/pthread, and I don't see how we could cancel them for BBAPI.

tonyhutter commented 4 years ago

Questions for IBM related to this: https://github.com/IBM/CAST/issues/920

tonyhutter commented 4 years ago

I'd like to propose one possible solution to the "how do I cancel BB API transfers if my job restarts" problem:

Imagine your using AXL to copy a file called checkpoint1 from the SSD to GPFS:

axl_cp -X bbapi /dev/bb/checkpoint1  /p/gpfs/hutter/checkpoint1

What you could do is first copy it to a temp file called /p/gpfs/hutter/checkpoint1._AXL-123456 where 123456 is the transfer handle. After the transfer is done, AXL renames it to checkpoint1. We've already talked about wanting to do file renames after completion in https://github.com/ECP-VeloC/AXL/issues/66.

Now imagine the job is cancelled halfway though the transfer. We'd still have a /p/gpfs/hutter/checkpoint1._AXL-123456 file being transferred. Your job restarts and attempts to write /p/gpfs/hutter/checkpoint1 again using AXL. However, before writing the file, AXL first sees if there's an existing checkpoint1._AXL-* file there. It sees the old file, it parses the filename to get the transfer handle and cancels the transfer (and deletes the old file).

Alternatively, instead of using a temporary filename to store the transfer handle, you could store it in as xattr in checkpoint1.

The downside of this whole solution is that you'd have to do some extra metadata calls to destination before the transfer starts. It also doesn't solve the problem of cancelling all old transfers, it just cancels the ones that you attempt to transfer again, which may be good enough.

adammoody commented 4 years ago

Attaching the handle id to a temporary file name is a clever idea.

There is a race condition we'd need to worry about where the application could die after it submits the transfer and then be restarted before the BB has started the transfer. In that case, the file would not exist when the process restarts and looks to cancel it.

A possible work around to that would be to have the process touch the file before it even submits the transfer, but then we have to wait on the file system to create all of the inodes before we can initiate the transfer. I also don't know if this work around eliminates the race or just reduces the window of exposure.

Having said that, I do think this idea would be helpful to other users. Let's keep it in mind.

tonyhutter commented 4 years ago

There is a race condition we'd need to worry about where the application could die after it submits the transfer and then be restarted before the BB has started the transfer. In that case, the file would not exist when the process restarts and looks to cancel it.

The race is possible, but unlikely. Keep in mind that the new process is going to have to do several BB API calls for setup before it even starts transferring it's files. So if the bbserver is slow starting up the previous transfer, it's presumably going to be slow for the new process calling BB_InitLibrary(),BB_CreateTransferDef(),BB_GetTransferHandle(),BB_GetLastErrorDetails() as well.

Even if it did race, the worst that would happen is that you'd have an old checkpoint1._AXL-123456 laying around in GPFS, which would be harmless (and would get canceled/deleted if the process restarted yet again).

adammoody commented 4 years ago

True, but it's not the file on GPFS that is the problem.

It's the source file on the SSD that SCR will be overwriting and truncating and/or deleting while the BB software is still trying to read it. I'd prefer to know that the BB is done trying to access that source file before SCR starts to mess with it.

tonyhutter commented 4 years ago

It shouldn't matter. The worst that would happen would be that job1's leftover checkpoint1._AXL-123456 would be corrupted by job2 writing to the same extents on the SSD. But since we don't care about that leftover file anyway, no harm is done to job2.

adammoody commented 4 years ago

Yeah, maybe. I get the feeling we'll trip up problems if we start messing with the source file while the BB is still reading from it.

We need to be sure that the source file on the SSD can be successfully overwritten, truncated, renamed, have its permissions, group, and timestamps changed, and/or be unlinked. For a truncate or unlink, we also need the physical space on the disk to be freed up so that it can be used by new files. We need all of that to work without having the BB software hang or crash.

We could run that question by IBM.

tonyhutter commented 4 years ago

Yea, we'll see what they say: https://github.com/IBM/CAST/issues/931

tonyhutter commented 4 years ago

@adammoody you can see IBM's response here: https://github.com/IBM/CAST/issues/931#issuecomment-643463592

TL;DR there shouldn't be an issue.

adammoody commented 4 years ago

It's not entirely clear yet. I don't remember, have all files been created on GPFS when Start_transfer returns?

I don't think that has to be true. Start_transfer might just queue the files in a list and return to the caller, and then the files get created at some later point when the BB server gets around to it.

tonyhutter commented 4 years ago

Correct, BB_StartTransfer() just kicks off the transfer. The files are transferred soon after. However, as I understand it, once the transfer has started (from the server's point it view), those SSD extents are locked, and can not be allocated to another job until the transfer is complete. That prevents a new job from reading or overwriting those extents.

adammoody commented 4 years ago

There's a subtle detail here between "job" and "job step" we should be clear about. A subsequent job (that runs under a different bsub allocation) will get its own logical volume and will not bother the extents created by the first job. These two jobs have different job ids under bsub.

We don't know yet whether a second job step, one that runs in the same bsub allocation as another job step can trample on the extents from the first job step.

ECP-VeloC / AXL

Expected behavior of AXL after being killed #57