ECP-VeloC / AXL

Asynchronous Transfer Library
MIT License

Use 'bbcmd' to check BB transfer status in post-stage #96

Closed: tonyhutter closed this 3 years ago

tonyhutter commented 3 years ago

We want to be able to check the BBAPI transfer status while running from the job node. This is needed when we do a BBAPI resume transfer from the 2nd post-stage script on IBM machines. In that case we don't have access to the BBAPI, but we can spawn bbcmd to see whether the transfer we wanted to resume has already completed successfully. This patch adds the bbcmd spawn and check for those cases.
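For readers following along, here is a minimal sketch of the "spawn a command and check its output" idea, not the code in this patch. The helper name `transfer_succeeded` is hypothetical, the command line is passed in by the caller, and the assumption that a finished transfer shows up as the text `BBFULLSUCCESS` on stdout has not been checked against real bbcmd output.

```python
import subprocess


def transfer_succeeded(cmd, success_marker="BBFULLSUCCESS", timeout=60):
    """Run a status command and report whether it indicates success.

    cmd is an argv list chosen by the caller. The success_marker default is
    an ASSUMPTION about what the status output contains, not verified
    bbcmd behavior.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        # The command is missing, not executable, or hung: treat as "unknown",
        # which the caller should handle like "not yet transferred".
        return False
    return result.returncode == 0 and success_marker in result.stdout
```

A `False` result only means success could not be confirmed, so the caller would still go ahead with the resume in that case.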

adammoody commented 3 years ago

Thanks, @tonyhutter. A quick skim looks good. I'll take a closer look later. Thanks for figuring that out.

The note about BB startup performance is also interesting. I don't know why interactive jobs would be faster at startup than batch jobs. Strange.

I think your approach here is the cleanest option given the existing AXL API and the fact that we need to invoke bbcmd.

My comment about handling multiple procs is jumping ahead to the next step, where we plug this into SCR and run at scale to finalize a checkpoint transfer. In that case, we'll have one transfer handle per process. At the full scale of Sierra, that would be 16k handles. With some work on SCR, I think we could cut that 16k to 4k if we launch a single transfer per node rather than a transfer per process. Either way, SCR needs to make sure each individual transfer is complete to know that the checkpoint as a whole is complete. I'm still worried we have gotten ourselves into a scalability bind with the current AXL API, as we don't have a way for AXL to consider all of those individual transfers as a set. Anyway, I still don't know if any of this will be a problem, and we won't know until we test it.
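To make the "set of transfers" concern concrete, here is a rough sketch of the bookkeeping a caller would do today: the checkpoint counts as complete only when every individual handle is complete. The names `split_by_completion` and `checkpoint_complete` are hypothetical, as is the `is_complete` callback that stands in for a per-handle status query. None of this is part of the AXL or SCR API.

```python
def split_by_completion(handles, is_complete):
    """Partition handles into (done, pending) using a per-handle status check.

    is_complete is a caller-supplied callable (hypothetical); each call
    represents one status query, so the cost grows with the handle count.
    """
    done, pending = [], []
    for handle in handles:
        (done if is_complete(handle) else pending).append(handle)
    return done, pending


def checkpoint_complete(handles, is_complete):
    """A checkpoint is complete only if no individual transfer is pending."""
    _, pending = split_by_completion(handles, is_complete)
    return not pending
```

With one handle per process this loop makes 16k status queries, and with one handle per node about 4k, which is the scaling cost described above. An empty handle list counts as complete here, which is a choice made only for the sketch.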

tonyhutter commented 3 years ago

One side note on the scaling: I anticipate the scr_poststage script is going to be resuming/finalizing each node's transfer in parallel. So if we're calling poststage on a 16k node job, and the post-stage node has 128 cores (which ours do), each core only has to run about 128 bbcmds (16k / 128), which isn't great, but seems somewhat doable.
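The arithmetic and the fan-out can be sketched as follows. This is an illustration only: `checks_per_core` and `check_all` are hypothetical names, `check_one` stands in for whatever per-transfer status check is used, and a thread pool is just one way to run the checks in parallel (scr_poststage may do this differently).

```python
import math
from concurrent.futures import ThreadPoolExecutor


def checks_per_core(num_transfers, cores):
    """Upper bound on status checks each core runs if work is spread evenly."""
    return math.ceil(num_transfers / cores)


def check_all(handles, check_one, workers):
    """Run check_one(handle) for every handle, at most `workers` at a time.

    Returns a dict mapping each handle to its result. Threads are enough
    here because each check would mostly wait on a spawned process.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check_one, handles))
    return dict(zip(handles, results))
```

Taking 16k as 16 * 1024 transfers and 128 cores, `checks_per_core` gives 128 checks per core, matching the estimate above.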

adammoody commented 3 years ago

For querying a single transfer handle, I believe that bbcmd can be given a handle id as an argument. That could help reduce the work that bbcmd has to do in this context, since then it can look up and return info about a single handle rather than every handle it knows about.

adammoody commented 3 years ago

It might require a different command to do that. Checking bbcmd --help, perhaps bbcmd getstatus --handle=id ... would work.
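As a small sketch of the per-handle query suggested above, the command line could be built like this. The `getstatus` subcommand and `--handle=` option are taken from this comment and have not been verified here. Any other options bbcmd may require are deliberately left out, and `getstatus_argv` is a hypothetical helper name.

```python
def getstatus_argv(handle_id, bbcmd="bbcmd"):
    """Build an argv list for querying one transfer handle.

    ASSUMPTION: `bbcmd getstatus --handle=<id>` is the right invocation, as
    suggested in the discussion; further required options are not modeled.
    """
    # int() rejects non-numeric input so nothing unexpected reaches the
    # command line.
    return [bbcmd, "getstatus", f"--handle={int(handle_id)}"]
```

Passing the result as a list to a process-spawning call, rather than as a shell string, avoids quoting problems.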

tonyhutter commented 3 years ago

@adammoody thanks for looking this over. I added your changes, including switching to bbcmd getstatus.

adammoody commented 3 years ago

Thanks @tonyhutter.