flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Suggestion: reject configurations that don't match rank 0 #6391

Open kkier opened 3 hours ago

kkier commented 3 hours ago

Reference https://github.com/flux-framework/flux-core/issues/6389

My understanding is that there's no situation where a mismatched configuration would be desirable. If that's true, it'd be useful to outright reject connections from nodes with a mismatched configuration, preferably with an error message indicating which setting doesn't match. This would be similar to the way version mismatches cause a rejection with no further processing, and it would prevent downstream issues.
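As a rough illustration of the error reporting suggested here, a connect-time check could diff the two configurations and name each setting that differs. This is a Python sketch only, not flux-core code: `flatten`, `config_mismatches`, and the sample keys are hypothetical, and it assumes each configuration is already available as a nested dict.

```python
_MISSING = object()  # sentinel for "key not present on this side"


def flatten(config, prefix=""):
    """Flatten a nested dict into {"table.key": value} form."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def _show(value):
    """Render a value for the error message."""
    return "<unset>" if value is _MISSING else repr(value)


def config_mismatches(rank0_config, peer_config):
    """Return one message per setting that differs between the two configs."""
    a, b = flatten(rank0_config), flatten(peer_config)
    messages = []
    for key in sorted(a.keys() | b.keys()):
        va, vb = a.get(key, _MISSING), b.get(key, _MISSING)
        if va != vb:
            messages.append(f"{key}: rank 0 has {_show(va)}, peer has {_show(vb)}")
    return messages


# Made-up sample data, not real Flux configuration keys.
rank0 = {"tbon": {"timeout": "30s"}, "exec": {"shell": "/bin/a"}}
peer = {"tbon": {"timeout": "60s"}, "exec": {"shell": "/bin/a"}}

for line in config_mismatches(rank0, peer):
    print(line)
```

An empty list would mean the peer can connect; a non-empty list would give the rejection message something specific to say.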

garlick commented 3 hours ago

Hmm, that's definitely possible to check at connect time, although two concerns:

1) Most configuration is not sensitive to being different. In fact, a lot of it only applies to rank 0.

2) Some config can be updated on the fly. How could we ensure an updated config matches upstream without making it really awkward to push out an update?

wihobbs commented 3 hours ago

> My understanding is that there's no situation where a mismatched configuration would be desirable.

At connection time, this is probably correct. However, for a running instance with a running connection, we use "mismatched" configurations to test prologs/epilogs (this is somewhat related to #5531). Start a job across a bunch of nodes, then mess with the imp to test out a new prolog/epilog. Just something to keep in mind.

kkier commented 2 hours ago

> Hmm, that's definitely possible to check at connect time, although two concerns:
>
> 1. Most configuration is _not_ sensitive to being different. In fact a lot of it only applies to rank 0
>
> 2. Some config can be updated on the fly. How could we ensure an updated config matches upstream without making it really awkward to push out an update?

Re: 1 - Are there settings that aren't sensitive to being different and where you might actually want them to differ, beyond test cases? It definitely adds some complexity if we'd have to create a list of what does and doesn't matter.

> > My understanding is that there's no situation where a mismatched configuration would be desirable.
>
> At connection time, this is probably correct. However, for a running instance with a running connection, we use "mismatched" configurations to test prologs/epilogs (this is somewhat related to #5531). Start a job across a bunch of nodes, then mess with the imp to test out a new prolog/epilog. Just something to keep in mind.

Oh, yes, I'm just thinking in terms of connection time. Constantly checking the config seems very out of scope.

wihobbs commented 2 hours ago

Sorry I misunderstood! Just wanted to throw that sort of niche edge case in there.

kkier commented 2 hours ago

> Sorry I misunderstood! Just wanted to throw that sort of niche edge case in there.

It's a good point. Taken as written, you can totally read it as "if the config changes, drop the node," which is also a much more complicated issue.