use case: SCR needs to hold onto resources while it copies data to global storage

grondo commented 5 years ago

As described by @kathrynmohror in #2040:

From what Dong said, you are discussing the case where something like SCR would need to hold on to allocation resources after the main application finishes successfully or terminates abnormally. I agree that is a needed capability. For SCR we will want to drain data from SSDs or memory (or whatever storage) to the parallel file system, or we may also restart the job in the same allocation.

We should ensure that we have an initial plan to support this sort of thing via plugin or cleanup/epilog script etc.

grondo commented 5 years ago

Also from #2040:

I kind of wonder if SCR support will be a function of the "job shell" (the user process that actually runs user compute tasks). Resources aren't released until the job shell exits, so an SCR-enabled job shell could just keep the job shell from exiting until it has finished draining data.

More specifically, resources are not released on a rank until the job shell exits and any subsequent cleanup has been issued and completes successfully on that rank. Here cleanup includes executing a set of cleanup/epilog scripts, as well as killing all stray processes and destroying any cgroups and namespaces for the job.

Therefore other plugins and tools can hold resources for the job by adding themselves to the system cleanup/epilog configuration (probably via scripts). Note that work in the cleanup phase is executed as a privileged user by the IMP, and therefore available plugins/tools would have to be configured by system administrators (possibly selectable at runtime).

In the future it might be interesting if unneeded resources could optionally be released before or during cleanup for long-running cleanup tasks. For example, a cleanup task that copies data off local storage could (somehow) release all resources except a core and the local storage itself for the duration of the copy.

ofaaland commented 2 years ago

As described by @kathrynmohror in https://github.com/flux-framework/flux-core/issues/2040:

From what Dong said, you are discussing the case where something like SCR would need to hold on to allocation resources after the main application finishes successfully or terminates abnormally. I agree that is a needed capability. For SCR we will want to drain data from SSDs or memory (or whatever storage) to the parallel file system, or we may also restart the job in the same allocation.

We should ensure that we have an initial plan to support this sort of thing via plugin or cleanup/epilog script etc.

Note that one form of "application terminates abnormally" is that one of the nodes failed, e.g. crashed due to a Lustre bug or a hardware issue, in case that affects your architecture.

grondo commented 4 months ago

Coming back around to some old issues and noticed this one.

There's currently a plan to move the administrative epilog/cleanup scripts to a "housekeeping" service which would not be associated with the job. Therefore, it is probably no longer (or soon will no longer be) appropriate to use the epilog script(s) to handle this movement of data.

A couple of alternatives come to mind here. I'm not sure of the status of SCR and Flux so these may not be good solutions either, but I'll throw them out there for reference:

- A job shell plugin can take a "completion reference" via flux_shell_add_completion_ref(). The job shell itself will not consider the job complete until all completion references are dropped via the corresponding flux_shell_remove_completion_ref().
- There's a similar option for jobtap plugins to hold on to resources by issuing an epilog-start event via a call to flux_jobtap_epilog_start(). Once resources can be released, flux_jobtap_epilog_finish() should be called with a matching description to release the reference.
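
To make the hold/release semantics of both options concrete, here is a small self-contained C model. It is only a sketch and does not call the Flux API: `struct toy_shell`, `toy_add_ref()`, `toy_remove_ref()`, and `toy_shell_complete()` are hypothetical names invented for illustration, and the name "scr-drain" is likewise made up. The assumption being modeled is that a reference is identified by a name (or description), and that completion is blocked until every reference has been dropped under the same name it was taken with.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_REFS 8
#define NAME_LEN 32

/* Hypothetical stand-in for a job shell; NOT a Flux type. */
struct toy_shell {
    char refs[MAX_REFS][NAME_LEN];  /* empty string marks a free slot */
    int nrefs;                      /* number of references currently held */
};

void toy_shell_init (struct toy_shell *sh)
{
    assert (sh != NULL);
    memset (sh, 0, sizeof (*sh));
}

/* Take a named reference. Returns 0 on success, -1 if the name is
 * empty or too long, or if no slot is free. */
int toy_add_ref (struct toy_shell *sh, const char *name)
{
    assert (sh != NULL && name != NULL);
    if (name[0] == '\0' || strlen (name) >= NAME_LEN)
        return -1;
    for (int i = 0; i < MAX_REFS; i++) {
        if (sh->refs[i][0] == '\0') {
            strcpy (sh->refs[i], name);
            sh->nrefs++;
            return 0;
        }
    }
    return -1;
}

/* Drop a reference by name. Returns 0 on success, -1 if no reference
 * with that name is currently held. */
int toy_remove_ref (struct toy_shell *sh, const char *name)
{
    assert (sh != NULL && name != NULL);
    if (name[0] == '\0')
        return -1;
    for (int i = 0; i < MAX_REFS; i++) {
        if (strcmp (sh->refs[i], name) == 0) {
            sh->refs[i][0] = '\0';
            sh->nrefs--;
            return 0;
        }
    }
    return -1;
}

/* The job counts as complete only when its tasks have exited AND no
 * references remain outstanding. */
bool toy_shell_complete (const struct toy_shell *sh, bool tasks_exited)
{
    assert (sh != NULL);
    return tasks_exited && sh->nrefs == 0;
}
```

In this model, an SCR-like component would call `toy_add_ref (&sh, "scr-drain")` before the application finishes, drain its data, and then call `toy_remove_ref (&sh, "scr-drain")`. Until that second call, `toy_shell_complete()` stays false even though all tasks have exited, which is the behavior the two real interfaces above are described as providing.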

flux-framework / flux-core

use case: SCR needs to hold onto resources while it copies data to global storage #2080