provide a better way for flux-sched to override sched-simple

grondo commented 5 years ago

Currently, flux-sched has to flux module remove sched-simple before loading the qmanager, since both modules have to register the sched service, engage job-manager in the hello protocol, etc. This works fine, but seems like wasted activity at startup, and also results in errors in rc3 (unless flux-sched reloaded sched-simple in its stop scripts, which seems silly).

Ideally, a newly loaded scheduler could "take over" an existing scheduler via some protocol, and we could leave sched-simple loaded. However, I think we stopped short of providing that support in dynamic registration.

Perhaps for now we could move the load of sched-simple to its own rc script in flux-core, named based on the provided service: sched, and then do something like alternatives to link to the current provider from rc1.d/sched?

flux-sched would then update the /etc/flux/rc1.d/sched link to point to its alternative rc1 script?

Yeah, I agree, not the greatest approach...

Maybe we need a higher-level mechanism than loading single modules: one that can load "services", which are provided by name from scripts outside of the rc1.d/* directory. The flux service load (or whatever) command would load configuration from a /etc/flux/services/* directory. Each package that provides a named service drops a config entry into this directory, and the last entry loaded wins (so 99-sched-fluxion would override 00-sched-simple, for example).

Instead of calling flux module load sched-simple the flux-core rc1 script(s) would instead use flux service load sched and let the flux-service command handle calling the right script. Similarly, the service config could denote a rc3 script for each service provider which would be called from flux service remove/unload.
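
As a rough illustration of the "last entry loaded wins" rule, here is a minimal sketch. Nothing in it is an existing flux interface: the function name resolve_provider and the entry naming scheme NN-service-provider are assumptions, the latter taken only from the 99-sched-fluxion / 00-sched-simple example above.

```python
def resolve_provider(entries, service):
    """Pick the provider for `service` from hypothetical config entry names.

    Entries are assumed to be named NN-<service>-<provider>, for example
    "99-sched-fluxion". They are visited in sorted order, so the last
    matching entry wins.
    """
    winner = None
    for entry in sorted(entries):
        _prio, name, provider = entry.split("-", 2)
        if name == service:
            winner = provider
    return winner


entries = ["99-sched-fluxion", "00-sched-simple", "50-kvs-default"]
print(resolve_provider(entries, "sched"))  # fluxion
```

With only 00-sched-simple installed the same call would return simple, which is the fallback behavior described above.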

grondo commented 5 years ago

Related: #1039

grondo commented 5 years ago

Related to #1039, flux service load would need a way to force an implementation even though it is not the highest priority provider, e.g. flux service load --force simple-sched sched.

grondo commented 5 years ago

Final note: one other idea is that all services currently loaded in rc1 could be split into service files. Each service script would flux service load the services it depends on. Then the "rc" script for an instance running the default scheduler would just be:

flux service load sched

And the flux service command would take care of loading services in the correct order. If your instance just needed kvs (e.g. in testing), you could potentially initialize with just flux service load kvs.
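
Loading services "in the correct order" amounts to a topological sort over declared dependencies. A minimal sketch follows, assuming each service declares what it depends on. The DEPENDS table is made up for illustration and is not the real flux-core dependency graph, and load_order is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: service -> services it depends on.
DEPENDS = {
    "sched": ["job-manager", "resource"],
    "job-manager": ["kvs"],
    "resource": ["kvs"],
    "kvs": [],
}


def load_order(service, depends):
    """Return the services needed for `service`, dependencies first."""
    needed = {}
    stack = [service]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed[name] = depends[name]
            stack.extend(depends[name])
    # static_order() yields each node after all of its predecessors.
    return list(TopologicalSorter(needed).static_order())


print(load_order("kvs", DEPENDS))    # ['kvs']
print(load_order("sched", DEPENDS))  # kvs first, sched last
```

Only the services reachable from the requested one are loaded, which is what would make a kvs-only test instance cheap.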

garlick commented 5 years ago

It would be good to figure this one out sooner rather than later.

What about our TOML config? Could we potentially have TOML fragments for each module that express dependencies, default options, etc.? Then flux-sched could provide one for its install prefix?

I like the idea of a higher level command to deal with modules + dependencies, sort of like modprobe(8). Good candidate for implementation in python IMHO.

dongahn commented 5 years ago

But seems like wasted activity startup, and also results in errors in rc3 (unless flux-sched reloaded sched-simple in its stop scripts, which seems silly)

For now, I can load sched-simple in qmanager's stop scripts to shut up the error message.

dongahn commented 5 years ago

I don't know what the right solution would be here. But I can say it would be pretty important to make it easy to specialize our scheduling behaviors at different levels.

Ultimately, I can see that at the top level we will run the conservative policy on a pretty fine-grained resource graph, but at a child instance level we run an HTC-oriented policy on a coarse-grained graph, for instance.

Right now, if the scheduler configuration users want is different than what rc scripts offer, they have to unload and reload qmanager/resource with different parameters. (or use a kludge NOOP environment variable trick, which would be error prone with nesting)

If we can do this in such a way that users can effect this scheduling specialization without unloading/reloading (or using some kludge NOOP environment variable trick, which is error prone with nesting), this will be ideal.

garlick commented 4 years ago

Ideally, a newly loaded scheduler could "take over" an existing scheduler via some protocol, and we could leave sched-simple loaded. However, I think we stopped short of providing that support in dynamic registration.

A couple of recent developments would make this challenging nowadays for schedulers:

- Schedulers call resource.acquire to obtain and monitor resources. If two schedulers do that, then the first one wins and the second one fails. We'd have to go with a slightly different semantic for resource acquisition.
- It's become more common to perform save/restore of module data to/from the KVS during module initialization/finalization, which would be wasted effort for unused modules.
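
The "first one wins" behavior can be pictured with a toy model. This is not the resource module's actual implementation or RPC interface. ResourceModel and its acquire method are stand-ins invented for illustration.

```python
class ResourceModel:
    """Toy stand-in for a service that allows only one acquirer."""

    def __init__(self):
        self.owner = None

    def acquire(self, scheduler):
        # The first caller becomes the owner; any later caller is rejected.
        if self.owner is not None:
            raise RuntimeError(f"resources already acquired by {self.owner}")
        self.owner = scheduler


resource = ResourceModel()
resource.acquire("sched-simple")
try:
    resource.acquire("second-scheduler")
except RuntimeError as exc:
    print(f"second acquire failed: {exc}")
```

A take-over protocol would need the equivalent of a release or hand-off step here, which is the different semantic mentioned above.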

Ack! I wanted to say more but I just realized I'm late for an appointment!

grondo commented 4 years ago

A couple of recent developments would make this challenging nowadays for schedulers:

Good points. I think the idea of scheduler take-over was good at the time and would be convenient. But you are correct that design shouldn't be considered anymore.

garlick commented 4 years ago

@grondo said in #2946

How do you enforce order of sched module loading (if that happens to be required)?

Ah, well that was a dumb question, since the proposed config key is an array. Sorry not thinking clearly today I suppose.

Thinking about use cases, it would be nice if there was a way to encapsulate the scheduler choice into a single string, e.g.

$ flux start -o,-S sched=fluxion

instead of

$ flux start -o,-S,sched.modules=sched-fluxion-qmanager,sched-fluxion-resource

Which also gives the user the opportunity to cause modules to load in the wrong order, if order matters.

Could each scheduler (or other replaceable service) provide a TOML config, with enough info to load itself, in a well-known location under a specific name? Then with flux-start or flux-broker we add an option to select from these named configs?

In fact, since reading a toml table will override the previous table, would it work to have a default

[sched]
modules = [ 'sched-simple' ]

Then if another sched config is selected, the default is overridden?
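
A small sketch of that idea, with plain dicts standing in for parsed TOML tables. The names DEFAULT, NAMED, and select_scheduler are hypothetical, and the fluxion module list simply copies the order from the command line above. Whether that is the right load order is not something this sketch establishes.

```python
# Plain dicts standing in for parsed TOML; all names here are hypothetical.
DEFAULT = {"sched": {"modules": ["sched-simple"]}}
NAMED = {
    "fluxion": {
        "sched": {
            "modules": ["sched-fluxion-qmanager", "sched-fluxion-resource"]
        }
    },
}


def select_scheduler(name=None):
    """Return the module list for the selected scheduler config."""
    config = dict(DEFAULT)
    if name is not None:
        # Like reading a second TOML table: the whole [sched] table is replaced.
        config.update(NAMED[name])
    return config["sched"]["modules"]


print(select_scheduler())           # ['sched-simple']
print(select_scheduler("fluxion"))
```

Because the provider ships the module list, the user only supplies the single string and cannot get the order wrong.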

garlick commented 4 years ago

Selecting the scheduler by a single name and hiding the details of module loading seems good!

I'm not seeing how flux-sched (say) could install a TOML fragment someplace that gets pulled in conditionally. Have to ponder that for a bit I think. Could the TOML config reference a script provided by flux-sched? Then at least the script is conditionally invoked rather than being just another rc fragment that gets invoked unconditionally...

grondo commented 4 years ago

I'm not seeing how flux-sched (say) could install a TOML fragment someplace that gets pulled in conditionally. Have to ponder that for a bit I think. Could the TOML config reference a script provided by flux-sched?

An rc script, conditionally loaded by name via broker attribute (or some other flux-start option) would be even better. I had only referenced a config file since I was following the initial idea in #2946.

However, loading config fragments from the flux-start/broker command line may be very useful too, so it would be nice if we could support that as well, especially if tables could be updated instead of overwritten.

For example, an advanced scheduler may have many tunable parameters. Once a workflow user has determined the right configuration for a scheduler, it would be nice if they could drop a TOML config in their homedir and reference that on the command line when starting an instance.

Or, a site could provide a few different "named" scheduler configurations which could be selected at runtime by a single string (I guess this could also be accomplished by multiple rc scripts, though). This isn't just applicable to the scheduler config, either (I'm thinking content-store, job-archive, etc.).

I don't remember exactly where TOML config is loaded by broker, but if it is loaded early, could the broker use the following steps to allow config fragments to be pulled in conditionally?

1. Load all default TOML from the built-in config glob first.
2. Support loading "named" configuration files from directories in a built-in PATH that includes: sysconfdir/flux/configs:~/.flux/configs
   - For each config name provided on the command line, first load system config, then user config. It is an error if neither exists.
   - Load TOML files using json_object_update() so that named config can override individual table values, instead of the whole table. Thus users can update individual keys instead of whole tables.

Users could also have ~/.flux/configs/default/*.toml files that are always loaded for their own instances.
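
The loading steps above could be sketched like this, with dicts standing in for parsed TOML files. Everything here is an assumption made for illustration: the helper names, the queue-depth key, and the use of a recursive merge, which is what overriding individual keys inside a table seems to require.

```python
import copy


def deep_update(base, override):
    """Merge `override` into `base` in place, key by key within tables."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = copy.deepcopy(value)
    return base


def load_named_configs(names, layers, defaults=None):
    """Build a config from named fragments.

    `layers` is an ordered list of {name: table} mappings, system first,
    then user. A name that appears in no layer is an error.
    """
    config = deep_update({}, defaults or {})
    for name in names:
        found = [layer[name] for layer in layers if name in layer]
        if not found:
            raise KeyError(f"no config named {name!r}")
        for fragment in found:
            deep_update(config, fragment)
    return config


defaults = {"sched": {"modules": ["sched-simple"]}}
system = {"fluxion": {"sched": {"modules": ["sched-fluxion-qmanager",
                                            "sched-fluxion-resource"],
                                "queue-depth": 32}}}
user = {"fluxion": {"sched": {"queue-depth": 8}}}

print(load_named_configs(["fluxion"], [system, user], defaults))
```

Here the user fragment changes only the hypothetical queue-depth key, while the module list from the system fragment survives.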

Apologies if the above extemporaneous description is ill conceived. I just wanted to throw an idea out there that described my high-level thoughts on the matter.

garlick commented 4 years ago

That seems like it solves a lot of problems! I like it!

One point (neither here nor there really): currently there is no default config unless a user sets --config-path or FLUX_CONF_DIR env var. The systemd unit file sets --config-path=sysconfdir/flux/system/conf.d but the default is an empty config object. If we keep it that way, it just skips step 1 above and makes all config loading explicit, which seems OK to me.

A refinement might be to add support for something resembling an "include" directive so that configs could reference other named configs?

provide a better way for flux-sched to override sched-simple #2273