Defer output file transfers.

eirrgang commented 3 years ago

Defer output file transfers when not known to be necessary (e.g. through a subscription in the workflow or through a call to result()).

Most data dependencies are needed only at the execution site. At the very least, staging output to the client should not block dependent tasks that run locally in the execution environment.

Generically, commands may have output data that the client does not actually need locally.

In the near term, we should probably stage all output data all of the time, but we need to provide more flexible behavior for workflows that involve large data.

This functionality will require metadata tracking on both the client and agent sides, and support for dynamically issuing Pilot data staging commands, possibly requiring Pilots to be launched if no workflow is currently executing.
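To make the idea concrete, here is a rough sketch of client-side deferral, not a proposed implementation. The class `DeferredOutput`, its fields, and the injected `fetch` callable are all hypothetical names for illustration; none of them exist in scalems or RP. The handle only records where the output lives, and the transfer happens the first time `result()` is called.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeferredOutput:
    """Hypothetical handle for an output file that has not been staged yet."""

    remote_url: str   # where the file lives at the execution site (metadata only)
    local_path: str   # where the client wants it, if it is ever needed
    # Hypothetical transfer hook: called as fetch(remote_url, local_path).
    fetch: Callable[[str, str], None]
    _staged: bool = field(default=False, init=False)

    def result(self) -> str:
        """Stage the file on first access, then return the local path."""
        if not self._staged:
            self.fetch(self.remote_url, self.local_path)
            self._staged = True
        return self.local_path


# Usage: record transfers in a list instead of moving real data.
transfers = []
output = DeferredOutput(
    remote_url="sftp://host.example.org/sandbox/task.000000/out.dat",  # made-up URL
    local_path="/tmp/out.dat",
    fetch=lambda src, dst: transfers.append((src, dst)),
)
# Nothing has been transferred at this point; calling output.result() triggers it.
```

Dependent tasks at the execution site would consume `remote_url` directly, so only an explicit `result()` (or a subscription that resolves to one) costs a transfer.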

Also relates to #75.

andre-merzky commented 3 years ago

> possibly requiring Pilots to be launched

Wouldn't in that case a remote data access suffice?

eirrgang commented 3 years ago

> > possibly requiring Pilots to be launched
>
> Wouldn't in that case a remote data access suffice?

Yes. Would you recommend using RP for that? SagaUtils? Should we convert pilot:/// and task:/// references to a more complete resource reference to facilitate retrieval from outside the Session?

andre-merzky commented 3 years ago

SAGA would be the right tool for this. It does not understand the RP-specific staging schemas, but the absolute sandbox URLs are available on the pilot (pilot.sandbox) and task (task.sandbox) instances. Those URLs should probably be retrieved and stored anyway, as the schemas are not useful outside of the current RP session.
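As an illustration of storing the sandbox URLs and using them later, a minimal sketch might look like the following. The function `resolve_reference` and the `sandboxes` mapping are hypothetical, and the URLs are made up; in practice the mapping would be filled from the sandbox attributes mentioned above while the session is still alive. Only the standard library is used, so nothing here touches RP or SAGA.

```python
from typing import Mapping
from urllib.parse import urlsplit


def resolve_reference(ref: str, sandboxes: Mapping[str, str]) -> str:
    """Rewrite a session-relative reference into an absolute URL.

    `sandboxes` maps a schema name ("task", "pilot") to the absolute sandbox
    URL recorded earlier. References with any other scheme are returned
    unchanged.
    """
    parts = urlsplit(ref)
    base = sandboxes.get(parts.scheme)
    if base is None:
        return ref  # already absolute, or a schema we did not record
    relative = parts.path.lstrip("/")
    return base.rstrip("/") + "/" + relative


# Hypothetical stored metadata for one task and its pilot.
sandboxes = {
    "pilot": "sftp://host.example.org/sandbox/pilot.0000",
    "task": "sftp://host.example.org/sandbox/pilot.0000/task.000003/",
}

absolute = resolve_reference("task:///out/traj.trr", sandboxes)
# absolute == "sftp://host.example.org/sandbox/pilot.0000/task.000003/out/traj.trr"
```

The resolved URL is what a tool outside the session would be handed for retrieval.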

eirrgang commented 2 years ago

This issue should be updated once we clarify the behavior of a proposed scalems.write for explicit user file staging.

SCALE-MS / scale-ms

Defer output file transfers. #97