It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

Generating access files in advance #595

Closed spirali closed 1 year ago

spirali commented 1 year ago

Solves #592. Read cloud.md for usage.

spirali commented 1 year ago

In the current implementation, it is not a problem to put the PID back. I just want to leave the access file with only the connection information, without details about the running instance.

Kobzol commented 1 year ago

Ok, adding the PID to server info so that it can be printed in the CLI would be nice.

spirali commented 1 year ago

PID and start time have been returned to "server info" (not put into access.json)

vsoch commented 1 year ago

Awesome! I am going to try this out, and can report back.

vsoch commented 1 year ago

okay, so it looks like the generate command needs to run on a server with the same hostname that it will be deployed on?

Access token found but HQ server hyperqueue-sample-access:6789 is unreachable.
Try to (re)start the server using `hq server start`

So if I want a fully qualified name I'll need the server itself to report FQDNs as well?

vsoch commented 1 year ago

oh just kidding, I see this!

 --host <HOST>
          Override target host name, otherwise local hostname is used

Will try that!
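For reference, I think the full invocation will look something like this. The `generate-access` subcommand name and the port flags are my guess from reading this thread and cloud.md; only `--host` is verbatim from the help text above, so double-check against cloud.md:

```shell
# Sketch, not verified: generate an access file ahead of time, overriding
# the hostname that gets baked into it (otherwise the local hostname is used).
hq server generate-access ./hq/access.json \
    --host=hyperqueue-sample-access \
    --client-port=6789 \
    --worker-port=1234
```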

vsoch commented 1 year ago

okay one tiny tweak and (I think?) it might work - it looks like I can only define one host:

 --host <HOST>
          Override target host name, otherwise local hostname is used

However, the server and the worker(s) have different addresses, and it looks like I can only set one host. Can we specify (akin to the ports) a separate worker host and client host? Unless I'm setting this up incorrectly? Basically, we have entirely different nodes that will act as workers, and then a central server that starts everything up (and that the workers connect to!)

vsoch commented 1 year ago

And assuming the model is different and the server/worker ports are supposed to be running on the main node (for workers to connect to), when I start it there it looks like this:

root@hyperqueue-sample-server-0-0:/app#     hq server start --access-file=./hq/access.json
2023-06-17T01:41:28Z INFO No online server found, starting a new server
2023-06-17T01:41:28Z INFO Storing access file as '/root/.hq-server/001/access.json'
+------------------+-------------------------------------------------------------------------------+
| Server directory | /root/.hq-server                                                              |
| Server UID       | Lqacwy                                                                        |
| Client host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Client port      | 6789                                                                          |
| Worker host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Worker port      | 1234                                                                          |
| Version          | 0.15.0-dev                                                                    |
| Pid              | 425                                                                           |
| Start date       | 2023-06-17 01:41:28 UTC                                                       |
+------------------+-------------------------------------------------------------------------------+

but then starting the worker node (different hostname, the same but with -worker- instead of -server-), I get:

# hq --server-dir=./hq worker start
2023-06-17T01:43:09Z INFO Detected 16448925696B of memory (15.32 GiB)
2023-06-17T01:43:09Z INFO Starting hyperqueue worker 0.15.0-dev
2023-06-17T01:43:09Z INFO Connecting to: hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local:6789
2023-06-17T01:43:09Z INFO Listening on port 35213
2023-06-17T01:43:09Z INFO Connecting to server (candidate addresses = [10.244.0.61:6789])
Error: Authentication failed: Expected peer role server, got hq-server

It seems to want a peer role server? The worker is definitely hitting the main server, because I see:

2023-06-17T01:43:09Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker
2023-06-17T01:44:20Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker
2023-06-17T01:44:31Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker

I could try random stuffs but I will wait for you to advise! Thank you!

spirali commented 1 year ago

There are two ports, one for connecting clients and one for connecting workers.

The addresses to which workers and clients connect may differ, but they have to point to the same physical machine. I will add configuration options (--worker-host, --client-host). If you want to try it now, you can manually edit the hostnames in the generated access files. Use case: this allows client connections from an outer network (the server has a public name) while worker connections stay inside the inner network.
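Until those options exist, the manual edit can be scripted. A minimal sketch (the file contents mirror the access.json shown in this thread; the two new hostnames are placeholders for illustration):

```shell
# Suppose we have a generated access file like the one in this thread
# (secret keys elided).
mkdir -p ./hq
cat > ./hq/access.json <<'JSON'
{
  "version": "0.15.0-dev",
  "server_uid": "Lqacwy",
  "client": { "host": "old-host", "port": 6789, "secret_key": "..." },
  "worker": { "host": "old-host", "port": 1234, "secret_key": "..." }
}
JSON

# Rewrite the hostnames in place: clients get the public name,
# workers get the cluster-internal name. Both are placeholder names.
python3 - <<'EOF'
import json

path = "./hq/access.json"
with open(path) as f:
    access = json.load(f)

access["client"]["host"] = "hq-public.example.com"
access["worker"]["host"] = "hq-internal.svc.cluster.local"

with open(path, "w") as f:
    json.dump(access, f, indent=2)
EOF

grep '"host"' ./hq/access.json
```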

vsoch commented 1 year ago

I can try again, but I'm pretty sure I got the above error about wanting "peer role hq-client, got worker" when I changed the worker hostname manually for both.

vsoch commented 1 year ago

oh, I think it works! I must not have done the right combination of things yesterday! okay, so here is my main server, and I think this says that clients connect on 6789 and workers on 1234?

# cat hq/access.json 

{
  "version": "0.15.0-dev",
  "server_uid": "Lqacwy",
  "client": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 6789,
    "secret_key": "92abb20c99eca5085f5b3dcbdc4e5caa00074d31f33a23bc9edd53d1254ea8e8"
  },
  "worker": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 1234,
    "secret_key": "2bcc9b0a847d901a35ee23f8e1acbad98faef93671a2b99cb4b600681d98bf1b"
  }
}

Start it up like:

#     hq server start --access-file=./hq/access.json
2023-06-17T11:48:57Z INFO No online server found, starting a new server
2023-06-17T11:48:57Z INFO Storing access file as '/root/.hq-server/002/access.json'
+------------------+-------------------------------------------------------------------------------+
| Server directory | /root/.hq-server                                                              |
| Server UID       | Lqacwy                                                                        |
| Client host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Client port      | 6789                                                                          |
| Worker host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Worker port      | 1234                                                                          |
| Version          | 0.15.0-dev                                                                    |
| Pid              | 428                                                                           |
| Start date       | 2023-06-17 11:48:57 UTC                                                       |
+------------------+-------------------------------------------------------------------------------+

And here is from my worker:

# cat hq/access.json 

{
  "version": "0.15.0-dev",
  "server_uid": "Lqacwy",
  "client": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 6789,
    "secret_key": "92abb20c99eca5085f5b3dcbdc4e5caa00074d31f33a23bc9edd53d1254ea8e8"
  },
  "worker": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 1234,
    "secret_key": "2bcc9b0a847d901a35ee23f8e1acbad98faef93671a2b99cb4b600681d98bf1b"
  }
}

Works!

root@hyperqueue-sample-worker-0-0:/app# hq worker start --server-dir=./hq
2023-06-17T11:53:47Z INFO Detected 16448925696B of memory (15.32 GiB)
2023-06-17T11:53:47Z INFO Starting hyperqueue worker 0.15.0-dev
2023-06-17T11:53:47Z INFO Connecting to: hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local:1234
2023-06-17T11:53:47Z INFO Listening on port 42573
2023-06-17T11:53:47Z INFO Connecting to server (candidate addresses = [10.244.0.61:1234])
+-------------------+------------------------------------+
| Worker ID         | 1                                  |
| Hostname          | hyperqueue-sample-worker-0-0       |
| Started           | "2023-06-17T11:53:47.659056770Z"   |
| Data provider     | hyperqueue-sample-worker-0-0:42573 |
| Working directory | /tmp/hq-worker.H6L7KkcvvKAu/work   |
| Logging directory | /tmp/hq-worker.H6L7KkcvvKAu/logs   |
| Heartbeat         | 8s                                 |
| Idle timeout      | None                               |
| Resources         | cpus: 8                            |
|                   | mem: 15.32 GiB                     |
| Time Limit        | None                               |
| Process pid       | 463                                |
| Group             | default                            |
| Manager           | None                               |
| Manager Job ID    | N/A                                |
+-------------------+------------------------------------+

and I did (from the server):

hq submit echo hello world

and I think it ran?

# hq job list --all
+----+------+----------+-------+
| ID | Name | State    | Tasks |
+----+------+----------+-------+
|  1 | echo | FINISHED | 1     |
+----+------+----------+-------+

vsoch commented 1 year ago

AH and I just found the output on the worker node!

# cat job-1/0.stdout 
hello world
root@hyperqueue-sample-worker-0-0:/app# 

This is great! I did this run manually but next I'll have these steps be fully automated...

vsoch commented 1 year ago

okay, we are in business! I added a retry loop to the worker, because it can often come up before the main server is ready (and then it doesn't retry on its own):

# Keep trying until we connect
until hq --server-dir=./hq worker start
do
    echo "Trying again to connect to main server..."
    sleep 2
done

But then we have them both running!

I built this branch into a custom container base since I couldn't just wget a release binary to use, but next (a little later today, after a bit more sleep) I will try running LAMMPS (will this work with MPI too?). If that is all good, then I say ship it! :rocket:

This is really exciting!

vsoch commented 1 year ago

okay it's all good! I was able to submit a job with --wait and then specify a --log file so I can cat it at the end, and we are in business!

I say ship it - right now I'm building from a custom container with this branch, but after that we should be able to use the release here. Thank you so much for doing this! We are planning experiments that look at job managers in Kubernetes and (with this update) there is a very good chance we can include hq!
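For anyone following along, the pattern I ended up with looks roughly like this (flag spellings as I used them above; the payload and log path are just an example):

```shell
# Submit, block until the job finishes, and collect its output in a log file.
hq submit --wait --log=/tmp/job.log echo hello world

# Then read the collected output.
cat /tmp/job.log
```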

vsoch commented 1 year ago

Thank you for implementing this!