UM-Bridge / umbridge

UM-Bridge (the UQ and Model Bridge) provides a unified interface for numerical models that is accessible from virtually any programming language or framework.
https://um-bridge-benchmarks.readthedocs.io/en/docs/
MIT License

Parallelism not working with HPC load-balancer #89

Open jonmaddock opened 3 days ago

jonmaddock commented 3 days ago

I wonder if I might be doing something wrong, but it appears that all of my model evaluations are occurring in serial when I believe they should be occurring in parallel.

If I define my allocation_queue.sh as:

./hq alloc add slurm --time-limit 20m \
                   --idle-timeout 5m \
                   --backlog 1 \
                   --workers-per-alloc 4 \
                   --max-worker-count 4 \
                   --cpus=4 \
                   -- \
...
# other account/partition definitions

and job.sh as:

#HQ --cpus=1
#HQ --time-request=3m
#HQ --time-limit=5m
...

I appear to get 4 workers when running the load-balancer. I then submit some model evaluations using a client which uses the QMCPy UM-Bridge wrapper:

...
integrand = qp.integrand.UMBridgeWrapper(
    true_measure=true_measure, model=model, config=config, parallel=4
)
...

Some jobs are seen to start on the load-balancer, but after a while workers 2, 3 and 4 are all closed (exactly at the idle timeout). The evaluations are slow and arrive at regular intervals, as if running in serial. I ssh-ed into each node while all workers were running and found with top that only the first node (i.e. worker 1) had any activity on it. I see the same behaviour with different-sized allocations.

Does anyone have any ideas why this might not be evaluating in parallel, and what I might have done wrong? Thanks!

linusseelinger commented 3 days ago

This looks quite reasonable to me. (@Schlevidon @chun9l do you see anything obvious?)

On the current main branch, we now have a pure SLURM backend for the load balancer (see https://um-bridge-benchmarks.readthedocs.io/en/docs/umbridge/hpc.html ). Instead of an HQ script, you'll have to set up an equivalent plain SLURM script following the given example. Could you give this a try? That would help diagnose whether it's a problem with your HQ configuration or something else (e.g. the client not sending parallel requests).
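For reference, the plain SLURM script looks roughly like this (a sketch only: the server command and port are placeholders, and you should follow the exact form given in the docs):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# Start the model server in the background (hypothetical command/port),
# then write its URL to the file the load balancer polls for,
# named after the SLURM job id.
./my_model_server --port 4243 &
echo "http://$(hostname):4243" > urls/url-$SLURM_JOB_ID.txt
wait
```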

jonmaddock commented 3 days ago

@linusseelinger thanks for this. I pulled the latest main, rebuilt the load-balancer and created a SLURM job.sh. But now when I try to run load-balancer I get:

$ ./load-balancer --scheduler=slurm
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: directory iterator cannot open directory: No such file or directory [urls]
Aborted

This wasn't happening before. I have gcc 13.3.0, and it built successfully. I took a guess and created a urls dir, and the error disappeared. Then I get:

$ ./load-balancer --scheduler=slurm
Argument --port not set! Using port 4242 as default.
Waiting for URL file: urls/url-64901833.txt

... and nothing else happens; it doesn't appear to ever start listening on the port for the client.

The job.sh script appears to run fine by itself when submitted via sbatch. Do you have any other ideas? Thank you for your support with this.

linusseelinger commented 3 days ago

@jonmaddock not quite sure what's going on here... how about we continue this on Slack? (invite link on the main page of the UM-Bridge readthedocs). Might be faster to schedule a quick meeting to look through this together!

Schlevidon commented 3 days ago

@jonmaddock The original issue might have something to do with the --cpus=4 parameter in your allocation_queue.sh. Resource specifications apply to each individual worker created by the allocation queue. In your case, HyperQueue thinks that each worker has 4 CPUs and, since each job only needs 1 CPU, HyperQueue can schedule up to 4 jobs simultaneously onto a single worker. So I think that setting --cpus=1 in your allocation_queue.sh might fix the issue.

In general, specifying resources in HyperQueue can be a bit tricky (the SLURM backend is simpler in this regard). I recommend reading the section on resource specification in the README if you want to keep using the HyperQueue backend.
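Concretely, that would mean keeping the rest of your allocation_queue.sh as-is and only changing the per-worker CPU count:

```shell
./hq alloc add slurm --time-limit 20m \
                     --idle-timeout 5m \
                     --backlog 1 \
                     --workers-per-alloc 4 \
                     --max-worker-count 4 \
                     --cpus=1 \
                     -- \
...
# other account/partition definitions
```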

chun9l commented 3 days ago

Ah, Lev you beat me to it. Just about to say the same thing...

I think it's also worth a try to specify the CPU resources in the SLURM section of this command, e.g., -- --ntasks=4.

If this doesn't fix it, then perhaps it's due to the QMCpy wrapper?

jonmaddock commented 1 day ago

Thanks for everyone's (rapid!) comments. I spoke to @linusseelinger via Slack and he clarified my misunderstanding: the allocation_queue.sh specifies how many workers you have at once, but the resource requirements in that (i.e. --cpus=4) correspond to the resources per worker, not in total. The job.sh specifies the requirements of each job that will run on each worker, which in the simplest case I understand to be 1 job per worker. I hope that's now correct.

Previously, I was specifying --workers-per-alloc 4 and --cpus=4: 4 workers each with 4 cpus. As I specified parallel=4 in my client, I still don't understand why this wasn't running in parallel originally (4 concurrent workers, each with 4 CPUs?).

Retrying with HyperQueue

I tried setting --cpus=1 (as per @Schlevidon 's suggestion), giving 4 workers with 1 CPU each, but this made no difference; the evaluations still occurred in serial. Below I show that workers 1, 2 and 3 all hit the idle timeout despite jobs being left unevaluated, while only worker 4 is evaluating:

Waiting for URL file: urls/url-25.txt
2024-11-19T18:26:22Z INFO Job 25 canceled (1 tasks canceled, 0 tasks already finished)
Waiting for URL file: urls/url-26.txt
2024-11-19T18:26:24Z INFO Worker 2 connection closed (connection: 10.43.81.33:51850)
2024-11-19T18:26:24Z INFO Worker 3 connection closed (connection: 10.43.81.34:35856)
2024-11-19T18:26:27Z INFO Worker 1 connection closed (connection: 10.43.81.31:52942)
2024-11-19T18:26:35Z INFO Job 26 canceled (1 tasks canceled, 0 tasks already finished)
Waiting for URL file: urls/url-27.txt

I also tried @chun9l 's suggestion of -- --ntasks=4, but that made no difference.

Retrying with SLURM

I then retried the SLURM scheduler with ntasks=1, as above. This time it ran (the earlier hang seems to have been down to some file I/O delay on the cluster), but again only in serial.

Conclusion

It appears that I can't get evaluations to run in parallel, whether using HyperQueue or SLURM. All of the above tests were run with parallel=4 in the QMCPy UM-Bridge wrapper. I hope that might help in diagnosing this problem further. I really hope I'm doing something stupid, but I can't figure this one out any further. Thanks!

linusseelinger commented 1 day ago

This is getting spooky; the SLURM backend should be extremely simple and robust when it comes to parallel execution of model instances...

You could try without QMCpy, just to be sure there's no problem on the client side. You can trigger simulation runs via curl commands, such as the following ("name" is the name of your model defined in the server, "input" is your list of input vectors). Could you check if running parallel curl commands gives you parallel jobs when combined with the SLURM backend?

curl http://localhost:4242/Evaluate -X POST -d '{"name": "forward", "input": [[100.0, 50.0]], "config":{"vtk_output": true}}'
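To issue several at once from the shell, something like this should do (using the same placeholder model name and input as above):

```shell
# Fire off 4 evaluation requests concurrently; each should appear as
# its own job at the load balancer if parallelism works end-to-end.
for i in 1 2 3 4; do
  curl http://localhost:4242/Evaluate -X POST \
       -d '{"name": "forward", "input": [[100.0, 50.0]], "config":{"vtk_output": true}}' &
done
wait  # wait for all background curl requests to finish
```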

jonmaddock commented 1 day ago

Yes, this works: I can get parallel SLURM jobs using multiple curl commands. Good! So the problem must lie in my client. It's like the parallel argument doesn't have any effect; I get the same serial evaluations when it's False, True or 4:

integrand = qp.integrand.UMBridgeWrapper(
    true_measure=true_measure, model=model, config=config, parallel=4
)

linusseelinger commented 1 day ago

Ok, so seems like it's not our fault ;) Not sure what's going on here, I don't see any obvious changes in QMCpy that would break this... Do you rely on QMCpy, or is this just for testing? Might be worth an issue at https://github.com/QMCSoftware/QMCSoftware

jonmaddock commented 6 hours ago

This is for testing primarily; I don't mind not using QMCPy, but I was just following the parallel example in the tutorial (https://um-bridge-benchmarks.readthedocs.io/en/docs/tutorial.html#parallelized-uq). I just need to sample in parallel to begin with, otherwise I'm just going to have to go back to writing my own array jobs!

It's not clear to me how I can sample in parallel using any of the other clients in your documentation (https://um-bridge-benchmarks.readthedocs.io/en/docs/umbridge/clients.html#clients); do I need to make a multiprocessing pool and use the Python client, for example? Perhaps I'm missing something obvious.
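For what it's worth, the pattern I have in mind looks something like this (a sketch: the stand-in evaluate function below replaces what would be a call through the umbridge Python client, and a thread pool should be enough since each evaluation is just an HTTP request to the load balancer):

```python
from concurrent.futures import ThreadPoolExecutor

# In a real client this would wrap the UM-Bridge Python model, e.g.:
#   import umbridge
#   model = umbridge.HTTPModel("http://localhost:4242", "forward")
#   def evaluate(theta):
#       return model([theta])
# Stand-in that mimics a model evaluation so the sketch is self-contained:
def evaluate(theta):
    return [[sum(theta)]]

parameters = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Submit all evaluations concurrently; with the load balancer, each
# in-flight request would become its own SLURM job.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, parameters))

print(results)
```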