flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

JGF is missing vertices in `rv1` match-format writer #1310

Open jameshcorbett opened 1 month ago

jameshcorbett commented 1 month ago

On rzadams, which was just today configured to use the rv1 match format:

$ flux alloc -N2
flux-job: fqxGU4MP3XV started                                                                 00:00:17
Oct 14 19:16:24.615448 PDT sched-fluxion-resource.err[0]: grow_resource_db_jgf: db.load: unpack_edge: source and/or target vertex not found1654 -> 2196.
Oct 14 19:16:24.615462 PDT sched-fluxion-resource.err[0]: : Invalid argument
Oct 14 19:16:24.615469 PDT sched-fluxion-resource.err[0]: update_resource_db: grow_resource_db: Invalid argument
Oct 14 19:16:24.615473 PDT sched-fluxion-resource.err[0]: update_resource: update_resource_db: Invalid argument
Oct 14 19:16:24.616098 PDT sched-fluxion-resource.err[0]: populate_resource_db_acquire: update_resource: Invalid argument
Oct 14 19:16:24.616106 PDT sched-fluxion-resource.err[0]: populate_resource_db: loading resources using resource.acquire
Oct 14 19:16:24.616108 PDT sched-fluxion-resource.err[0]: init_resource_graph: can't populate graph resource database
Oct 14 19:16:24.616109 PDT sched-fluxion-resource.err[0]: mod_main: can't initialize resource graph database
Oct 14 19:16:24.616397 PDT sched-fluxion-resource.crit[0]: module exiting abnormally
Oct 14 19:16:24.842895 PDT sched-fluxion-qmanager.err[0]: update_on_resource_response: exiting due to sched-fluxion-resource.notify failure: Function not implemented
Oct 14 19:16:24.842907 PDT sched-fluxion-qmanager.err[0]: handshake_resource: update_on_resource_response: Function not implemented
Oct 14 19:16:24.842909 PDT sched-fluxion-qmanager.err[0]: handshake: handshake_resource: Function not implemented
Oct 14 19:16:24.842912 PDT sched-fluxion-qmanager.err[0]: mod_start: handshake: Function not implemented
Oct 14 19:16:24.842934 PDT sched-fluxion-qmanager.crit[0]: module exiting abnormally

I confirmed that vertex 1654 is not in the JGF produced for the scheduler, although 2196 is. 1654 is a rack vertex, 2196 is a node vertex.

jameshcorbett commented 1 month ago

The system instance's resource graph has cluster -> rack -> node. The JGF it writes out for child instances does not include rack vertices, but it still writes out the edges from cluster to rack and from rack to node. My current hypothesis is that the writer is coded to include the root of the graph but then skip any intermediate vertices on its way down to node vertices. Hopefully this will be a simple fix?
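A quick way to confirm this kind of mismatch is to scan the JGF for edges that reference a vertex id with no matching entry in the vertex list. The following is a minimal sketch, not part of flux-sched: it assumes the usual JGF layout (`graph.nodes` entries with an `id`, `graph.edges` entries with `source` and `target`) and that a full R object carries the JGF under a `scheduling` key. The helper name `find_dangling_edges` and the sample data are hypothetical.

```python
import json


def find_dangling_edges(jgf):
    """Return (source, target) pairs for edges that reference a missing vertex."""
    # Assumption: accept either a bare JGF object or an R object whose
    # "scheduling" key holds the JGF.
    graph = jgf.get("scheduling", jgf)["graph"]
    ids = {str(v["id"]) for v in graph["nodes"]}
    return [
        (str(e["source"]), str(e["target"]))
        for e in graph["edges"]
        if str(e["source"]) not in ids or str(e["target"]) not in ids
    ]


# Hypothetical miniature of the failing case: rack vertex 1654 is absent,
# but the edges through it are still present.
sample = {
    "graph": {
        "nodes": [
            {"id": "0", "metadata": {"type": "cluster"}},
            {"id": "2196", "metadata": {"type": "node"}},
        ],
        "edges": [
            {"source": "0", "target": "1654"},
            {"source": "1654", "target": "2196"},
        ],
    }
}

print(json.dumps(find_dangling_edges(sample)))
```

Running the same function on a JGF loaded with `json.load` would list every edge that the reader would reject with "source and/or target vertex not found".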

jameshcorbett commented 1 month ago

Strangely, hetchy does not have this problem: it writes out the rack vertex. Something is off, and since this is the same cluster as #1305, I wonder if the JGF is wrong somehow.

trws commented 1 month ago

Could you pull an example JSON object from each of these? I'm looking at the RV1 code, and it doesn't have anything that would trim vertices. It's possible something in the match code is doing it, but something is clearly fishy here.

jameshcorbett commented 1 month ago

Some nodes on the cluster hit the issue, some don't. Here is the JGF for the overall system, and the JGF for one node that hit the error and another that didn't: bad_jgf_cluster.json, cluster_R.json, good_jgf_cluster.json

jameshcorbett commented 1 month ago

I didn't see any obvious errors in the system JGF but I may well have missed something.

trws commented 1 month ago

It occurred to me while looking at this yesterday that there's something in here we usually don't see in our graphs: the cluster-level graph has nodes under a rack and exactly one node that's directly under the cluster vertex. There's no reason that should cause a problem, but I'm not sure it's tested.
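To spot that kind of mixed topology in a JGF, one option is to count which vertex type sits directly above each node vertex. This is only a sketch under assumptions: it takes each vertex's type from `metadata.type`, treats every edge as pointing from parent to child, and uses a hypothetical helper name (`node_parent_types`) and made-up sample ids.

```python
from collections import Counter


def node_parent_types(jgf):
    """Count the types of vertices that are direct parents of node vertices."""
    # Assumption: accept either a bare JGF object or an R object whose
    # "scheduling" key holds the JGF.
    graph = jgf.get("scheduling", jgf)["graph"]
    vtype = {str(v["id"]): v["metadata"]["type"] for v in graph["nodes"]}
    counts = Counter()
    for e in graph["edges"]:
        src, dst = str(e["source"]), str(e["target"])
        if vtype.get(dst) == "node":
            # A parent id with no vertex entry is reported as "<missing>".
            counts[vtype.get(src, "<missing>")] += 1
    return counts


# Hypothetical graph: two nodes under a rack, one node directly under the cluster.
sample = {
    "graph": {
        "nodes": [
            {"id": "0", "metadata": {"type": "cluster"}},
            {"id": "1", "metadata": {"type": "rack"}},
            {"id": "2", "metadata": {"type": "node"}},
            {"id": "3", "metadata": {"type": "node"}},
            {"id": "4", "metadata": {"type": "node"}},
        ],
        "edges": [
            {"source": "0", "target": "1"},
            {"source": "1", "target": "2"},
            {"source": "1", "target": "3"},
            {"source": "0", "target": "4"},
        ],
    }
}

print(dict(node_parent_types(sample)))
```

More than one parent type in the result means node vertices appear at different depths, which is the shape described above.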