flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

mismatched nodelist and node count after additional ranks added to R #6243

Open · grondo opened this issue 2 months ago

grondo commented 2 months ago

On elcap, some nodes were added to R and placed at the end of the nodelist (so that the nodelist was no longer sorted).

This resulted in an error from the resource module (number of hosts doesn't match number of nodes).
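For illustration, here is a minimal Python sketch of the kind of consistency check that can produce this error: expand a nodelist (hostlist) string, count its hosts, and compare against the number of ranks in a rank idset. It is not Flux's implementation (the issue mentions the C function hostlist_count(), which this does not reproduce). The parser is simplified and handles at most one bracket group per entry; the host names and the error text are hypothetical. The point it shows is that host count should not depend on whether the nodelist is sorted.

```python
import re


def split_top_level(s):
    """Split a hostlist on commas that are not inside [...] brackets."""
    parts, depth, cur = [], 0, []
    for ch in s:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    if cur:
        parts.append("".join(cur))
    return parts


def expand_ranges(spec):
    """Expand a range spec like '1-3,5' into ['1', '2', '3', '5'], keeping zero padding."""
    out = []
    for r in spec.split(","):
        if "-" in r:
            lo, hi = r.split("-")
            width = len(lo)
            out.extend(str(i).zfill(width) for i in range(int(lo), int(hi) + 1))
        else:
            out.append(r)
    return out


def expand_hostlist(hl):
    """Expand e.g. 'node[4-7],node[0-1]' into individual host names (order preserved)."""
    hosts = []
    for entry in split_top_level(hl):
        # Simplification: at most one [...] group per entry.
        m = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\](.*)", entry)
        if m:
            prefix, spec, suffix = m.groups()
            hosts.extend(prefix + n + suffix for n in expand_ranges(spec))
        else:
            hosts.append(entry)
    return hosts


def idset_count(ids):
    """Count ranks in an idset string such as '0-5,7'."""
    total = 0
    for r in ids.split(","):
        if "-" in r:
            lo, hi = map(int, r.split("-"))
            total += hi - lo + 1
        else:
            total += 1
    return total


def check_nodelist(nodelist, ranks):
    """Hypothetical check: host count must equal rank count."""
    nhosts = len(expand_hostlist(nodelist))
    nnodes = idset_count(ranks)
    if nhosts != nnodes:
        raise ValueError(f"hosts ({nhosts}) != nodes ({nnodes})")  # hypothetical message
    return nhosts


# Hosts appended out of order (unsorted) still count correctly:
print(check_nodelist("node[4-7],node[0-1]", "0-5"))  # → 6
```

If a check like this fails, the counts themselves disagree. Sort order alone should not trigger it.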

In fact, it appears that in this case hostlist_count() may have a bug. :-(

grondo commented 2 months ago

OK, probably not a hostlist_count() bug. The problem was that I was calling

$ flux R parse-config /etc/flux/system/conf.d | flux R decode --nodelist | flux hostlist --count

which didn't match the output of

$ flux R parse-config /etc/flux/system/conf.d | flux R decode --count=node

As @kkier and @pyrsq pointed out to me, the error is that flux hostlist should be called as flux hostlist --count - in the pipeline above. Otherwise it doesn't read stdin and instead uses the hostlist of the enclosing instance. (Side note: is this UI flaw bad enough that we should change the behavior?)
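The pitfall can be modeled with a small, hypothetical Python function. It mirrors the behavior described above: with no hostlist arguments the command falls back to the enclosing instance's hostlist, and stdin is read only when "-" is given. It is not flux hostlist's actual code. The one-hostlist-per-line stdin format and all host names are assumptions made for illustration.

```python
import io


def resolve_hostlist_source(args, stdin, instance_hostlist):
    """Hypothetical model of how a hostlist command might choose its input.

    - no positional args: use the enclosing instance's hostlist (stdin is ignored)
    - '-' among the args: read hostlists from stdin (assumed: one per line)
    - otherwise: use the args themselves
    """
    if not args:
        return [instance_hostlist]
    sources = []
    for arg in args:
        if arg == "-":
            sources.extend(line.strip() for line in stdin if line.strip())
        else:
            sources.append(arg)
    return sources


# Piped data is silently ignored without '-':
print(resolve_hostlist_source([], io.StringIO("node[0-5]\n"), "host[0-99]"))
# With '-', stdin is actually consumed:
print(resolve_hostlist_source(["-"], io.StringIO("node[0-5]\n"), "host[0-99]"))
```

One possible behavior change, purely as a hypothetical design option, would be to read stdin automatically when no arguments are given and stdin is not a terminal (for example, checked with sys.stdin.isatty()).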

So it is still a mystery why the resource module was complaining. Have we ever tried adding hosts to an existing config before? Could the resource module be using the existing hostlist instead of the one in the new R?