ECP-VeloC / AXL

Asynchronous Transfer Library
MIT License

WIP - NNFDM implementation in AXL #131

Open mcfadden8 opened 1 year ago

mcfadden8 commented 1 year ago

It compiles, but it cannot be fully tested until we have an operational server.

adammoody commented 1 year ago

Thanks, @mcfadden8. Nicely done.

I haven't checked things under this context, but it would be good to think through whether the HPE API supports SCR's scalable restart and scavenge operations.

In the case of a scalable restart, we normally try to cancel any outstanding flush. Since there is no way to cancel, I think the restarted job would need to be able to resume and/or wait on any outstanding flush that was started by a previous run. That is, I don't think we'd want the restarted job to initiate a new flush of files that are already being flushed from a prior run.
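As a rough illustration of that restart decision, here is a minimal sketch. It assumes the restarted job can obtain a map from file path to the state of a flush started by a prior run; the `FlushState` enum, the `plan_restart_flush` function, and the map itself are hypothetical and are not part of the HPE API or AXL.

```python
from enum import Enum


class FlushState(Enum):
    """Hypothetical states for a flush started by a previous run."""
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


def plan_restart_flush(files, prior_requests):
    """Decide, per file, whether to wait, skip, or start a new flush.

    prior_requests is an assumed mapping of file path -> FlushState,
    standing in for whatever status the server can actually report.
    """
    wait_on, start_new, done = [], [], []
    for path in files:
        state = prior_requests.get(path)
        if state is FlushState.IN_PROGRESS:
            # Already being flushed by a prior run: wait, do not re-flush.
            wait_on.append(path)
        elif state is FlushState.COMPLETED:
            # Nothing left to do for this file.
            done.append(path)
        else:
            # No prior request, or the prior one failed: start a new flush.
            start_new.append(path)
    return wait_on, start_new, done
```

The point of the split is only that files with an in-progress transfer never land in the "start a new flush" list.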

For scavenge, is there a way for the job script to see the status of a flush started by the last run? If not, will there be problems if we try to copy the files again while a flush may still be ongoing?
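If the status is visible, one conservative option for scavenge is to copy only files that have no flush in flight and defer the rest. A minimal sketch, assuming the job script can get a list of paths with an ongoing flush (`split_for_scavenge` and the `in_flight` list are hypothetical names, not an existing interface):

```python
def split_for_scavenge(files, in_flight):
    """Separate files that can be copied now from those to defer.

    in_flight is an assumed list of paths whose flush from the last
    run is still ongoing; how it is obtained is left open here.
    """
    busy = set(in_flight)
    # Preserve the input order in both result lists.
    copy_now = [f for f in files if f not in busy]
    defer = [f for f in files if f in busy]
    return copy_now, defer
```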

mcfadden8 commented 1 year ago

Hi @adammoody, for scalable restart, I agree that we would need an API to cancel any outstanding flushes.

For both scalable restart and scavenge, I think we will need a way to list the requests that are still in progress from any previous runs.

I think that these requests may already be documented, but we should discuss to be sure I am understanding things correctly.

mcfadden8 commented 1 year ago

@adammoody - I've integrated with the latest C++ API provided by HPE. My next step will be to add their new API for canceling and enumerating old jobs, which should allow us to support scalable restart and scavenge.

adammoody commented 1 year ago

Nice. Thanks, @mcfadden8