flux-framework / flux-core

core services for the Flux resource management framework

config for combined working and mock experiment #5295

Open vsoch opened 1 year ago

vsoch commented 1 year ago

To follow up to the discussion started here:

https://github.com/flux-framework/flux-sched/issues/1009#issuecomment-1610039068

I'm trying to get this working so that I can run a batch job that has a combination of working (real) nodes and some that are mocked (don't work). We will eventually want the mocked nodes to not accept jobs, period, but that is a next step. Right now I'm trying to get one set actually running and one not. I'm doing this work here: https://github.com/flux-framework/flux-operator/tree/child-broker-experiment/examples/experimental/child-broker#combined

I think I'm close - I've added two queues (batch and debug; not sure why I can't name them something else?) and assigned each respective group (flux-sample for real, burst for fake) to a queue, but for some reason flux thinks the batch queue has a lot more cores than it does! Here is the start.sh:

#!/bin/bash
MATCH_FORMAT=${MATCH_FORMAT:-rv1}
NJOBS=${NJOBS:-10}
NNODES=${NNODES:-6}
printf "MATCH_FORMAT=${MATCH_FORMAT} NJOBS=$NJOBS NODES/JOB=$NNODES\n"

flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
flux config load <<EOF
[sched-fluxion-qmanager]
queue-policy = "easy"
[sched-fluxion-resource]
match-format = "$MATCH_FORMAT"

[resource]
noverify = true
norestrict = true

# Why can't I name these something else?
[queues.debug]
requires = ["debug"]

[queues.batch]
requires = ["batch"]

[[resource.config]]
hosts = "flux-sample[0-3]"
properties = ["batch"]

[[resource.config]]
hosts = "burst[0-99]"
properties = ["debug"]

[[resource.config]]
hosts = "flux-sample[0-3],burst[0-99]"
cores = "0-103"
EOF
flux config get | jq '."sched-fluxion-resource"'
flux module load resource noverify monitor-force-up
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux queue start --all --quiet
flux resource list
t0=$(date +%s.%N)

# These are real jobs
flux submit -N$NNODES --queue=batch \
    --quiet --wait hostname

# These are fake jobs
flux submit -N$NNODES --cc=1-$NJOBS --queue=debug \
    --setattr=exec.test.run_duration=1ms \
    --quiet --wait hostname

ELAPSED=$(echo $(date +%s.%N) - $t0 | bc -l)
THROUGHPUT=$(echo $NJOBS/$ELAPSED | bc -l)
R_SIZE=$(flux job info $(flux job last) R | wc -c)
OBJ_COUNT=$(flux module stats content-sqlite | jq .object_count)
DB_SIZE=$(flux module stats content-sqlite | jq .dbfile_size)

printf "%-12s %5d %4d %8.2f %8.2f %12d %12d %12d\n" \
        $MATCH_FORMAT $NJOBS $NNODES $ELAPSED $THROUGHPUT \
        $R_SIZE $OBJ_COUNT $DB_SIZE
flux jobs -a
flux jobs -a --json | jq .jobs[0]

and what happens:

$ flux batch -n1 ./combined/start.sh 
MATCH_FORMAT=rv1 NJOBS=10 NODES/JOB=6
{
  "match-format": "rv1"
}
     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free batch           4      416        0 flux-sample[0-3]
      free debug         100    10400        0 burst[0-99]
 allocated                 0        0        0 
      down                 0        0        0 
ƒpyjkxF: exception: type=alloc note=alloc denied due to type="unsatisfiable"
rv1             10    6     2.07     4.82       194582          604       466944
       JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
     ƒxPgAgG debug    flux     hostname   CD      6      6   0.311s burst[40-45]
     ƒxPgAgF debug    flux     hostname   CD      6      6   0.485s burst[46-51]
     ƒxLiC7Z debug    flux     hostname   CD      6      6   0.485s burst[52-57]
     ƒxKECqD debug    flux     hostname   CD      6      6   0.468s burst[58-63]
     ƒxGGEGY debug    flux     hostname   CD      6      6   0.452s burst[64-69]
     ƒxGGEGX debug    flux     hostname   CD      6      6   0.436s burst[70-75]
     ƒxALH99 debug    flux     hostname   CD      6      6   0.414s burst[76-81]
     ƒwn6Tuy debug    flux     hostname   CD      6      6   0.384s burst[82-87]
     ƒwaEZfD debug    flux     hostname   CD      6      6   0.355s burst[88-93]
     ƒwRLdy9 debug    flux     hostname   CD      6      6   0.272s burst[94-99]
     ƒpyjkxF batch    flux     hostname    F      6      6        - 
{
  "t_depend": 1687898804.1769416,
  "t_run": 1687898804.9159055,
  "t_cleanup": 1687898805.2271707,
  "t_inactive": 1687898805.2904124,
  "duration": 0,
  "expiration": 4841498804,
  "name": "hostname",
  "cwd": "/tmp/workflow",
  "queue": "debug",
  "ntasks": 6,
  "ncores": 624,
  "nnodes": 6,
  "priority": 16,
  "ranks": "[44-49]",
  "nodelist": "burst[40-45]",
  "success": true,
  "result": "COMPLETED",
  "waitstatus": 0,
  "id": 36356227073,
  "t_submit": 1687898804.149426,
  "state": "INACTIVE",
  "username": "flux",
  "userid": 1000,
  "urgency": 16,
  "runtime": 0.311265230178833,
  "status": "COMPLETED",
  "returncode": 0,
  "dependencies": [],
  "annotations": {},
  "exception": {
    "occurred": false
  }
}

Note that the "real" hostname submit fails, and the fake ones are ok. I think the issue is the 416 cores (indeed my local machine doesn't have that many!) So questions:

  1. How do I get the batch queue to have the correct number of cores so the job runs?
  2. After that, how do I simply add the mocked nodes as down (and not allow running jobs on them)? Is it just a matter of removing that run_duration flag? But I don't want jobs to fail as "unsatisfiable"; I would want them to be scheduled if the mock resources could potentially support them!

Thank you!

grondo commented 1 year ago

I think the issue is the 416 cores (indeed, my local machine doesn't have that many!). So, questions:

You are explicitly configuring all your nodes with 104 cores here:

[[resource.config]]
hosts = "flux-sample[0-3],burst[0-99]"
cores = "0-103"

Also:

# Why can't I name these something else?
[queues.debug]
requires = ["debug"]

You can name them whatever you want. Did you try it and it didn't work?

grondo commented 1 year ago

After that, how do I simply add the mocked nodes as down (and not allow running jobs on them)? Is it just a matter of removing that run_duration flag? But I don't want jobs to fail as "unsatisfiable"; I would want them to be scheduled if the mock resources could potentially support them!

The scheduler will not schedule jobs onto down nodes no matter what. So am I right in understanding what you want is to run a test configured with 4 "real" nodes flux-sample[0-3], but 100 fake nodes burst[0-99], where the real nodes are available, but the burst nodes are down?

Here are some ideas/pointers:

grondo commented 1 year ago

BTW, I think a slightly less kludgy way will be possible once #5184 is merged, since then you can at least override the instance size attribute to include the burst nodes. However, in practice, the scheduler doesn't really care about the instance size; it is just operating on the resource set provided by the resource module's response to resource.acquire.
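
For reference, a quick way to see the distinction from inside an instance (just two existing commands; nothing here is specific to this setup):

# number of broker ranks in this instance (the instance size attribute)
flux getattr size

# the resource set the scheduler is actually working with
flux resource list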

vsoch commented 1 year ago

You can name them whatever you want. Did you try it and it didn't work?

I must have had a bug the first time - it worked this time! I just wanted to rename them to online and offline to be explicit for the example.

The scheduler will not schedule jobs onto down nodes no matter what. So am I right in understanding what you want is to run a test configured with 4 "real" nodes flux-sample[0-3], but 100 fake nodes burst[0-99], where the real nodes are available, but the burst nodes are down?

Yes correct!

Okay, trying to go off of what you said - here are some experiments. First, removing monitor-force-up and adding the burst node section last (and removing it from the main config section):

[sched-fluxion-qmanager]
queue-policy = "easy"
[sched-fluxion-resource]
match-format = "$MATCH_FORMAT"

[resource]
noverify = true
norestrict = true

[queues.offline]
requires = ["offline"]

[queues.online]
requires = ["online"]

[[resource.config]]
hosts = "flux-sample[0-3]"
properties = ["online"]

[[resource.config]]
hosts = "flux-sample[0-3]"
cores = "0-3"

[[resource.config]]
hosts = "burst[0-99]"
properties = ["offline"]

That complains that the resources are not defined:

Jun 27 22:55:37.943311 resource.err[0]: error parsing [resource.config] array: resource.config: burst[0-99] assigned no resources
Jun 27 22:55:37.943324 resource.crit[0]: module exiting abnormally
flux-module: load resource: Connection reset by peer

Trying to add them back - this is closer / almost what we want because I see the correct cores for flux-sample (yay!) however, the other jobs are unsatisfiable.

[sched-fluxion-qmanager]
queue-policy = "easy"
[sched-fluxion-resource]
match-format = "$MATCH_FORMAT"

[resource]
noverify = true
norestrict = true

[queues.offline]
requires = ["offline"]

[queues.online]
requires = ["online"]

[[resource.config]]
hosts = "flux-sample[0-3]"
properties = ["online"]

[[resource.config]]
hosts = "flux-sample[0-3],burst[0-99]"
cores = "0-3"

[[resource.config]]
hosts = "burst[0-99]"
properties = ["offline"]
MATCH_FORMAT=rv1 NJOBS=10 NODES/JOB=6
{
  "match-format": "rv1"
}
     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free online          4       16        0 flux-sample[0-3]
      free offline       100      400        0 burst[0-99]
 allocated                 0        0        0 
      down                 0        0        0 
ƒVby9Ls: exception: type=alloc note=alloc denied due to type="unsatisfiable"
rv1             10    6     0.72    13.95         9718          416       196608
       JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
     ƒaHGsv2 offline  flux     hostname   CD      6      6   0.033s burst[40-45]
     ƒaHGsv1 offline  flux     hostname   CD      6      6   0.033s burst[46-51]
     ƒaHGsuz offline  flux     hostname   CD      6      6   0.032s burst[52-57]
     ƒaHGsuy offline  flux     hostname   CD      6      6   0.032s burst[58-63]
     ƒaFntdg offline  flux     hostname   CD      6      6   0.029s burst[64-69]
     ƒaFntdf offline  flux     hostname   CD      6      6   0.028s burst[70-75]
     ƒaFntde offline  flux     hostname   CD      6      6   0.025s burst[76-81]
     ƒaFntdd offline  flux     hostname   CD      6      6   0.033s burst[82-87]
     ƒaEJuMH offline  flux     hostname   CD      6      6   0.032s burst[88-93]
     ƒa6txwZ offline  flux     hostname   CD      6      6   0.032s burst[94-99]
     ƒVby9Ls online   flux     hostname    F      6      6        - 
{
  "t_depend": 1687906682.6507435,
  "t_run": 1687906682.715299,
  "t_cleanup": 1687906682.7485518,
  "t_inactive": 1687906682.7660186,
  "duration": 0,
  "expiration": 4841506682,
  "name": "hostname",
  "cwd": "/tmp/workflow",
  "queue": "offline",
  "ntasks": 6,
  "ncores": 24,
  "nnodes": 6,
  "priority": 16,
  "ranks": "[44-49]",
  "nodelist": "burst[40-45]",
  "success": true,
  "result": "COMPLETED",
  "waitstatus": 0,
  "id": 21843935235,
  "t_submit": 1687906682.6404624,
  "state": "INACTIVE",
  "username": "flux",
  "userid": 1000,
  "urgency": 16,
  "runtime": 0.03325295448303223,
  "status": "COMPLETED",
  "returncode": 0,
  "dependencies": [],
  "annotations": {},
  "exception": {
    "occurred": false
  }
}

Adding the cores back to the main resource section, I get the weirdness about flux-sample cores again ("0-104" is 105 cores per host, hence the 420 across 4 flux-sample nodes):

[[resource.config]]
hosts = "flux-sample[0-3],burst[0-99]"
cores = "0-104"
     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free online          4      420        0 flux-sample[0-3]
      free offline       100    10500        0 burst[0-99]
 allocated                 0        0        0 
      down                 0        0        0 

And a variant of the closer one - but trying to add cores to the offline spec - same outcome.

[sched-fluxion-qmanager]
queue-policy = "easy"
[sched-fluxion-resource]
match-format = "$MATCH_FORMAT"

[resource]
noverify = true
norestrict = true

[queues.offline]
requires = ["offline"]

[queues.online]
requires = ["online"]

[[resource.config]]
hosts = "flux-sample[0-3]"
properties = ["online"]

[[resource.config]]
hosts = "flux-sample[0-3],burst[0-99]"
cores = "0-3"

[[resource.config]]
hosts = "burst[0-99]"
properties = ["offline"]
cores = "4-103"
MATCH_FORMAT=rv1 NJOBS=10 NODES/JOB=6
{
  "match-format": "rv1"
}
     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free online          4       16        0 flux-sample[0-3]
      free offline       100    10400        0 burst[0-99]
 allocated                 0        0        0 
      down                 0        0        0 
ƒWKUoEX: exception: type=alloc note=alloc denied due to type="unsatisfiable"
rv1             10    6     1.33     7.51       194598          552       450560
       JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
     ƒaV8nAj offline  flux     hostname   CD      6      6   0.307s burst[40-45]
     ƒaTentU offline  flux     hostname   CD      6      6   0.381s burst[46-51]
     ƒaTentT offline  flux     hostname   CD      6      6   0.381s burst[52-57]
     ƒaTentS offline  flux     hostname   CD      6      6   0.357s burst[58-63]
     ƒaTentR offline  flux     hostname   CD      6      6   0.332s burst[64-69]
     ƒaTentQ offline  flux     hostname   CD      6      6   0.309s burst[70-75]
     ƒaTentP offline  flux     hostname   CD      6      6   0.285s burst[76-81]
     ƒaQgpKh offline  flux     hostname   CD      6      6   0.258s burst[82-87]
     ƒaPCq3M offline  flux     hostname   CD      6      6   0.218s burst[88-93]
     ƒaMiqm1 offline  flux     hostname   CD      6      6   0.173s burst[94-99]
     ƒWKUoEX online   flux     hostname    F      6      6        - 
{
  "t_depend": 1687906880.5886397,
  "t_run": 1687906881.019693,
  "t_cleanup": 1687906881.3267233,
  "t_inactive": 1687906881.3751292,
  "duration": 0,
  "expiration": 4841506880,
  "name": "hostname",
  "cwd": "/tmp/workflow",
  "queue": "offline",
  "ntasks": 6,
  "ncores": 624,
  "nnodes": 6,
  "priority": 16,
  "ranks": "[44-49]",
  "nodelist": "burst[40-45]",
  "success": true,
  "result": "COMPLETED",
  "waitstatus": 0,
  "id": 21978152960,
  "t_submit": 1687906880.5760133,
  "state": "INACTIVE",
  "username": "flux",
  "userid": 1000,
  "urgency": 16,
  "runtime": 0.30703043937683105,
  "status": "COMPLETED",
  "returncode": 0,
  "dependencies": [],
  "annotations": {},
  "exception": {
    "occurred": false
  }
}

So that technically is closest to what we want, but we would need an override somewhere that says "allow me to schedule resources that I don't have."

BTW, I think a slightly less kludgy way will be possible once https://github.com/flux-framework/flux-core/pull/5184 is merged, since then you can at least override the instance size attribute to include the burst nodes. However, in practice, the scheduler doesn't really care about the instance size; it is just operating on the resource set provided by the resource module's response to resource.acquire.

Yeah, totally! I can wait until that is merged (I'm already watching it) and then try the above again. Sorry - I get excited about things and then dive in (it would probably be better to wait sometimes).

vsoch commented 1 year ago

oh wait! I think I have a bug in the above - let me fix it quickly.

Update: the bug was asking for NNODES (6) nodes for the local submit (I only have 4), so I changed that to:

# These are real jobs (1 node each)
flux submit -N1 --queue=online \
    --quiet --wait hostname

But I get a weird "lost contact" error - ET phone home!

ƒXrS49D: exception: type=exec note=lost contact with job shell on broker (null) (rank 3)

And the job is reported as failed:

rv1             10    6     1.37     7.32       194598          548       446464
       JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
     ƒcGuuts offline  flux     hostname   CD      6      6   0.214s burst[40-45]
     ƒcFRvcZ offline  flux     hostname   CD      6      6   0.390s burst[46-51]
     ƒcFRvcY offline  flux     hostname   CD      6      6   0.390s burst[52-57]
     ƒcFRvcX offline  flux     hostname   CD      6      6   0.378s burst[58-63]
     ƒcDwwLG offline  flux     hostname   CD      6      6   0.364s burst[64-69]
     ƒcDwwLF offline  flux     hostname   CD      6      6   0.350s burst[70-75]
     ƒcDwwLE offline  flux     hostname   CD      6      6   0.336s burst[76-81]
     ƒcDwwLD offline  flux     hostname   CD      6      6   0.310s burst[82-87]
     ƒcDwwLC offline  flux     hostname   CD      6      6   0.282s burst[88-93]
     ƒcDwwLB offline  flux     hostname   CD      6      6   0.233s burst[94-99]
     ƒXrS49D online   flux     hostname    F      1      1   0.003s flux-sample3

The job detail doesn't give more info:

  "exception": {
    "occurred": true,
    "severity": 0,
    "type": "exec",
    "note": "lost contact with job shell on broker (null) (rank 3)"
  }
grondo commented 1 year ago

error parsing [resource.config] array: resource.config: burst[0-99] assigned no resources

In that example, the burst nodes were assigned no cores (i.e. no resources). That is a configuration error.
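
In other words, every host that appears in the [[resource.config]] array has to pick up a cores assignment from at least one entry, e.g. (the core range is just illustrative):

[[resource.config]]
hosts = "burst[0-99]"
properties = ["offline"]
cores = "0-3"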

Your second attempt looks good to me:

Trying to add them back - this is closer / almost what we want because I see the correct cores for flux-sample (yay!) however, the other jobs are unsatisfiable.

I couldn't find the unsatisfiable job request here. Are you sure you were submitting the jobs to the offline queue? The satisfiability checks take the job requirements into account, so in the online queue a job that requests more than 4 nodes or 16 cores would be rejected as unsatisfiable.
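
For example, against the configuration above (a sketch; the online queue can only ever provide 4 nodes / 16 cores):

# fits within the online queue, so it passes the feasibility check
flux submit -N4 --queue=online hostname

# asks for more nodes than the online queue will ever have: rejected as unsatisfiable
flux submit -N6 --queue=online hostname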

vsoch commented 1 year ago

I couldn't find the unsatisfiable job request here.

This was a bug on my part - I was asking for 6 nodes but I only had 4. When I fixed that:

ƒXrS49D: exception: type=exec note=lost contact with job shell on broker (null) (rank 3)

grondo commented 1 year ago

Ah, ok. Looks like that broker rank went away (flux-sample-3), or somehow the job-exec module otherwise got EHOSTUNREACH from the remote execution of the job shell?

What does flux resource status show after that job fails? What about flux overlay status? (Note that the output will be cluttered with 100 offline burst nodes...)

vsoch commented 1 year ago

Ah, so you led me down the path to debugging this!

$ cat overlay-status.txt 
0 flux-sample-3: full
flux@flux-sample-0:/tmp/workflow$ cat resource-status.txt 
     STATE UP NNODES NODELIST
     avail  ✔      1 flux-sample0
    avail*  ✗    103 flux-sample[1-3],burst[0-99]

I realized we are only actually running the batch job with -n1, so it doesn't actually have all those resources! I updated it to pass them all on to the batch job, like:

- flux batch -n1 ./combined/start.sh
+ flux batch -N 4 ./combined/start.sh

And now - boum!

$ cat overlay-status.txt 
0 flux-sample-0: full
├─ 1 flux-sample-1: full
│  └─ 3 flux-sample-3: full
└─ 2 flux-sample-2: full
flux@flux-sample-0:/tmp/workflow$ cat resource-status.txt 
     STATE UP NNODES NODELIST
     avail  ✔      4 flux-sample[0-3]
    avail*  ✗    100 burst[0-99]

So that works! And that's the outcome we'd want for this early testing. But a question - given that I don't want to pass all of the parent's resources forward to the child instance, how would I know which subset is selected? E.g., if I do:

$ flux batch -N 2 ./combined/start.sh

How would I know which of flux-sample[..] to write into the broker.toml?

vsoch commented 1 year ago

Here is the current state of our experiments - https://github.com/flux-framework/flux-operator/tree/child-broker-experiment/examples/experimental/child-broker#combined I think the next step is to discuss either:

  1. How can we predict the hostnames we will get for a given job (e.g., the -N 2 case when there are 4 nodes to select from)?
  2. How can we allow the "mock" jobs to be scheduled (but not attempted to be run) without the test flag? (It throws up on me, see below.)
     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free online          4       16        0 flux-sample[0-3]
      free offline       100    10400        0 burst[0-99]
 allocated                 0        0        0 
      down                 0        0        0 
Jun 28 04:23:16.258777 job-exec.err[0]: ƒ2he9DVh: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.258917 job-exec.err[0]: ƒ2he9DVh: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.259023 job-exec.err[0]: ƒ2he9DVh: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.259135 job-exec.err[0]: ƒ2he9DVh: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.259244 job-exec.err[0]: ƒ2he9DVh: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.261002 job-exec.err[0]: ƒ2hh7C4P: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.261141 job-exec.err[0]: ƒ2hh7C4P: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.261255 job-exec.err[0]: ƒ2hh7C4P: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.261364 job-exec.err[0]: ƒ2hh7C4P: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.261462 job-exec.err[0]: ƒ2hh7C4P: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.262293 job-exec.err[0]: ƒ2hibBLj: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.262307 job-exec.err[0]: ƒ2hibBLj: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.262316 job-exec.err[0]: ƒ2hibBLj: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.262325 job-exec.err[0]: ƒ2hibBLj: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.262345 job-exec.err[0]: ƒ2hibBLj: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.262861 job-exec.err[0]: ƒ2hk5Ad5: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.262873 job-exec.err[0]: ƒ2hk5Ad5: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.262883 job-exec.err[0]: ƒ2hk5Ad5: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.262891 job-exec.err[0]: ƒ2hk5Ad5: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.262899 job-exec.err[0]: ƒ2hk5Ad5: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.263472 job-exec.err[0]: ƒ2ho39Bm: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.263485 job-exec.err[0]: ƒ2ho39Bm: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.263494 job-exec.err[0]: ƒ2ho39Bm: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.263502 job-exec.err[0]: ƒ2ho39Bm: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.264986 job-exec.err[0]: ƒ2ho39Bm: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.265566 job-exec.err[0]: ƒ2hpX8U7: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.265681 job-exec.err[0]: ƒ2hpX8U7: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.265787 job-exec.err[0]: ƒ2hpX8U7: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.265889 job-exec.err[0]: ƒ2hpX8U7: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266285 job-exec.err[0]: ƒ2hpX8U7: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266639 job-exec.err[0]: ƒ2hr17kT: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266653 job-exec.err[0]: ƒ2hr17kT: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266662 job-exec.err[0]: ƒ2hr17kT: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266670 job-exec.err[0]: ƒ2hr17kT: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266677 job-exec.err[0]: ƒ2hr17kT: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266949 job-exec.err[0]: ƒ2hsV72o: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266961 job-exec.err[0]: ƒ2hsV72o: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266969 job-exec.err[0]: ƒ2hsV72o: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266977 job-exec.err[0]: ƒ2hsV72o: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.266985 job-exec.err[0]: ƒ2hsV72o: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.267253 job-exec.err[0]: ƒ2hww4sq: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.267265 job-exec.err[0]: ƒ2hww4sq: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.267273 job-exec.err[0]: ƒ2hww4sq: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.267282 job-exec.err[0]: ƒ2hww4sq: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.267290 job-exec.err[0]: ƒ2hww4sq: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.267546 job-exec.err[0]: ƒ2iHCuYK: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.267557 job-exec.err[0]: ƒ2iHCuYK: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.267566 job-exec.err[0]: ƒ2iHCuYK: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.267574 job-exec.err[0]: ƒ2iHCuYK: exec_kill: any (rank 4294967295): No such file or directory
Jun 28 04:23:16.267582 job-exec.err[0]: ƒ2iHCuYK: exec_kill: any (rank 4294967295): No such file or directory
ƒ2he9DVh: exception: type=exec note=lost contact with job shell on broker (null) (rank 98)
ƒ2hh7C4P: exception: type=exec note=lost contact with job shell on broker (null) (rank 92)
ƒ2hibBLj: exception: type=exec note=lost contact with job shell on broker (null) (rank 86)
ƒ2hk5Ad5: exception: type=exec note=lost contact with job shell on broker (null) (rank 80)
ƒ2ho39Bm: exception: type=exec note=lost contact with job shell on broker (null) (rank 74)
ƒ2hpX8U7: exception: type=exec note=lost contact with job shell on broker (null) (rank 68)
ƒ2hr17kT: exception: type=exec note=lost contact with job shell on broker (null) (rank 62)
ƒ2hsV72o: exception: type=exec note=lost contact with job shell on broker (null) (rank 56)
ƒ2hww4sq: exception: type=exec note=lost contact with job shell on broker (null) (rank 50)
ƒ2iHCuYK: exception: type=exec note=lost contact with job shell on broker (null) (rank 44)

I think we'd want the burst nodes to be flagged as DOWN.

grondo commented 1 year ago

How can we predict the hostnames we will get for a given job (e.g., the -N 2 case when there are 4 nodes to select from)?

There are two methods:

  1. query the hostnames the job was assigned from within the job itself, i.e., in the start.sh script fetch the nodelist with flux getattr hostlist (this is the preferable method IMO; see the sketch below)
  2. request specific hostnames with the --requires option: e.g. flux batch -N2 --requires host:flux-sample-[0-1] ...
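
A minimal sketch of the first method, run from inside the batch script (the variable name is just illustrative):

# ask the enclosing instance which hosts this job was actually allocated
HOSTS=$(flux getattr hostlist)
echo "allocated hosts: ${HOSTS}"    # e.g. flux-sample-[0-1]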

How can we allow the "mock" jobs to be scheduled (but not attempted to be run) without the test flag? (It throws up on me, see below.)

You can't schedule a job without it attempting to be run (or simulated to be run as with the mock execution). Or do you mean submit a job and have it stay in the SCHED state (i.e. pending) while the burst nodes are down?

Since the burst nodes show as available, I'm guessing you are still using the monitor-force-up option when the resource module is reloaded. Try removing that option and your jobs submitted to the offline queue should be accepted to the queue but not scheduled.
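
In terms of the start.sh above, that would mean just dropping monitor-force-up from the module reload (a sketch of the relevant lines only):

flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
# ... flux config load as before ...
flux module load resource noverify    # no monitor-force-up, so the burst ranks stay down
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager

Jobs submitted to the offline queue should then show up as pending in flux jobs rather than failing.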

vsoch commented 1 year ago

@grondo the flux getattr hostlist worked great! I can ask for fewer nodes, and then get the listing exactly to put into the config. Absolutely spot on.

Since the burst nodes show as available, I'm guessing you are still using the monitor-force-up option when the resource module is reloaded. Try removing that option and your jobs submitted to the offline queue should be accepted to the queue but not scheduled.

Hmm, could this be a bug? I definitely removed that monitor-force-up (e.g., see here), but it still lists them as free, even though an ✗ is next to avail:

     STATE QUEUE      NNODES   NCORES    NGPUS NODELIST
      free online          3        9        0 flux-sample-[1-3]
      free offline       100    10300        0 burst[0-99]
 allocated                 0        0        0 
      down                 0        0        0 
0 flux-sample-1: full
├─ 1 flux-sample-2: full
└─ 2 flux-sample-3: full
     STATE UP NNODES NODELIST
     avail  ✔      3 flux-sample-[1-3]
    avail*  ✗    100 burst[0-99]

Could it be that the flag --setattr=exec.test.run_duration=1ms is still making them fake being up? If I remove it, however, I get the error shown above, e.g., a bunch of these:

Jun 28 18:22:27.869159 job-exec.err[0]: ƒ2eExs1q: exec_kill: any (rank 4294967295): No such file or directory
ƒ2c4wv4X: exception: type=exec note=lost contact with job shell on broker (null) (rank 97)

And the nodes are still not in the DOWN state.

You can't schedule a job without it attempting to be run (or simulated to be run as with the mock execution). Or do you mean submit a job and have it stay in the SCHED state (i.e. pending) while the burst nodes are down?

Yes, that's exactly what we want! For bursting, we will have these potential nodes defined; in the same way that we add faux nodes to a starting broker (and can schedule a job on resources that don't currently exist without it failing), we want to be able to pass that on to a child broker (as in this use case).

grondo commented 1 year ago

Oh, do you have the latest flux-sched? The fix for the bug where all ranks but 0 were marked up when running with an instance size that didn't match the fake resource count was only merged yesterday.

Could it be that the flag --setattr=exec.test.run_duration=1ms

No, that attribute has nothing to do with scheduling; it just enables the mock execution implementation, which simulates a job being executed but doesn't run any job shells.
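
For instance, the mock submissions earlier in this thread rely on exactly that; a minimal standalone example:

# scheduled normally, but execution is only simulated: no job shell is started
flux submit -N1 --setattr=exec.test.run_duration=1ms hostname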

vsoch commented 1 year ago

Oh, do you have the latest flux-sched? The fix for the bug where all ranks but 0 were marked up when running with an instance size that didn't match the fake resource count was only merged yesterday.

Oup, probably not! I will rebuild my base container and try with it.