LLNL / merlin

Machine Learning for HPC Workflows

Basic Workflow examples requires another service? #406

Open vsoch opened 1 year ago

vsoch commented 1 year ago

Hi! I'm trying to follow the tutorial - I've installed it, done merlin config (not included in the docs there) and also init'd the workflow. When I run the command as shown here https://merlin.readthedocs.io/en/latest/merlin_workflows.html I get

image

I think I'm possibly missing something - was I supposed to configure a service? Is there a complete tutorial with step by step instructions I could do somewhere? Thanks!

lucpeterson commented 1 year ago

Have you configured merlin per the instructions in the tutorial? https://merlin.readthedocs.io/en/latest/modules/installation/installation.html#id3 In particular, you should run the "merlin server" command to get a container running a server.

vsoch commented 1 year ago

I didn't know there was a tutorial! :laughing: I see redis now. Would it make sense to have a link to that on the workflows page? Like "before you do this, you should have followed the instructions here." I started at the workflows page, then went to the install page to pip install, and didn't realize there was an intermediate step.

vsoch commented 1 year ago

Is it possible to run redis just locally, or via an external service? I'm planning to run this on kubernetes, and the redis service will be another container in a pod, meaning the service will be available (and I don't want merlin to launch singularity or docker or podman). I'm reading here https://merlin.readthedocs.io/en/latest/merlin_server.html.

lucpeterson commented 1 year ago

Yes, you can start your own redis server locally without a container, or have one running on an external server, and point your configuration file at it. This can be complicated, so merlin server aims to make this easier. You can also run with the --local flag to bypass a server and run serially, which is sufficient for the basic example workflows.
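
Before running a workflow against your own server, a plain TCP connection test can confirm that something is listening where the configuration file points. The sketch below uses only the Python standard library. The helper name service_reachable is hypothetical (it is not part of merlin), and a successful connection only shows that the port is open, not that passwords or TLS settings are right.

```python
import socket


def service_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # 6379 is the default redis port; adjust host and port to your setup.
    print("redis reachable:", service_reachable("localhost", 6379))
```

If this prints False for the host and port in your configuration, the problem is with the service or the network rather than with merlin itself.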

lucpeterson commented 1 year ago

Here are some instructions if you want to configure your container to run your own server: https://merlin.readthedocs.io/en/latest/modules/installation/installation.html#id5

lucpeterson commented 1 year ago

Advanced configuration docs are here: https://merlin.readthedocs.io/en/latest/merlin_config.html

lucpeterson commented 1 year ago

Running via Kubernetes would be pretty cool. If you figure out the configuration and/or docs, we should add it!

vsoch commented 1 year ago

The docker-compose example should work! For testing I'm just spinning up redis locally, but for the Flux Operator we have separate containers that provide services: https://flux-framework.org/flux-operator/tutorials/services.html.

There are two modes (and likely I'll test and think about both). For one - the service is provided as a sidecar container. So an indexed job with N containers running merlin would have one sidecar per container. That wouldn't be ideal if all nodes running merlin (via flux) need access to the same database. The second design (which I haven't tested yet, but technically it's simpler than the above) is to bring up a single (shared) service for all the worker nodes to access.

Glancing at these files, the hardest thing (for the Flux Operator) will be to generate the shared config files and volumes in advance. Likely I'll just do this manually for now and make a tutorial, but (if you like this idea generally) this could be an opportunity to make a simple Kubernetes operator just for running these merlin workflows. They are fairly fun to make!

I'm also thinking about if it might be possible to have the concept of a Flux Operator plugin - e.g., adding an ability to use "Flux Operator + Merlin" and then having some of the complexity handled by the plugin. I'll do more research/reading about this today - editing audio first but should have some time later.

Happy Saturday!

vsoch commented 1 year ago

Heyo! I have my docker-compose setup and I think I'm fairly close - but a question. Is redis required to use ssl? I haven't generated certificates for it, and I'm wondering if this is redis or rabbitmq (or possibly I have a configuration error). I also had some trouble matching the environment variables for the various cert files (they generate with very different names) so I might have messed that up. Here is the current debug output:

$ docker exec -it merlin bash
root@eda5eada61b7:/workflow# merlin -lvl DEBUG run feature_demo/feature_demo.yaml

       *      
   *~~~~~                                       
  *~~*~~~*      __  __           _ _       
 /   ~~~~~     |  \/  |         | (_)      
     ~~~~~     | \  / | ___ _ __| |_ _ __  
    ~~~~~*     | |\/| |/ _ \ '__| | | '_ \ 
   *~~~~~~~    | |  | |  __/ |  | | | | | |
  ~~~~~~~~~~   |_|  |_|\___|_|  |_|_|_| |_|
 *~~~~~~~~~~~                                    
   ~~~*~~~*    Machine Learning for HPC Workflows                                 

[2023-03-20 22:48:08: INFO] Loading specification from path: /workflow/feature_demo/feature_demo.yaml
[2023-03-20 22:48:08: DEBUG] Creating Merlin spec object...
[2023-03-20 22:48:08: DEBUG] Successfully loaded specification: 
{'name': '$(NAME)', 'description': 'Run 10 hello worlds.'}
[2023-03-20 22:48:08: DEBUG] Merlin spec object created.
[2023-03-20 22:48:08: DEBUG] Verifying Merlin spec...
[2023-03-20 22:48:08: DEBUG] Merlin spec verified.
[2023-03-20 22:48:08: DEBUG] Creating Merlin spec object...
[2023-03-20 22:48:08: DEBUG] Successfully loaded specification: 
{'name': 'feature_demo', 'description': 'Run 10 hello worlds.'}
[2023-03-20 22:48:08: DEBUG] Merlin spec object created.
[2023-03-20 22:48:08: DEBUG] Creating Merlin spec object...
[2023-03-20 22:48:08: DEBUG] Successfully loaded specification: 
{'name': 'feature_demo', 'description': 'Run 10 hello worlds.'}
[2023-03-20 22:48:08: DEBUG] Merlin spec object created.
[2023-03-20 22:48:08: DEBUG] Creating Merlin spec object...
[2023-03-20 22:48:08: DEBUG] Successfully loaded specification: 
{'name': 'feature_demo', 'description': 'Run 10 hello worlds.'}
[2023-03-20 22:48:08: DEBUG] Merlin spec object created.
[2023-03-20 22:48:08: INFO] Study workspace is '/workflow/studies/feature_demo_20230320-224808'.
[2023-03-20 22:48:08: INFO] Reading app config from file /root/.merlin/app.yaml
[2023-03-20 22:48:08: DEBUG] Broker: connection = amqps
[2023-03-20 22:48:08: DEBUG] Broker: vhost = root
[2023-03-20 22:48:08: DEBUG] Broker: username = root
[2023-03-20 22:48:08: DEBUG] Broker: server = rabbitmq
[2023-03-20 22:48:08: DEBUG] Broker: password filepath = /root/.merlin/rabbit.pass
[2023-03-20 22:48:08: DEBUG] Broker: RabbitMQ using default port = 5671
[2023-03-20 22:48:08: DEBUG] Broker: connection = amqps
[2023-03-20 22:48:08: DEBUG] Broker: vhost = root
[2023-03-20 22:48:08: DEBUG] Broker: username = root
[2023-03-20 22:48:08: DEBUG] Broker: server = rabbitmq
[2023-03-20 22:48:08: DEBUG] Broker: password filepath = /root/.merlin/rabbit.pass
[2023-03-20 22:48:08: DEBUG] Broker: RabbitMQ using default port = 5671
[2023-03-20 22:48:08: DEBUG] broker: amqps://root:******@rabbitmq:5671/root
[2023-03-20 22:48:08: DEBUG] Broker: keyfile not present
[2023-03-20 22:48:08: DEBUG] Broker: certfile not present
[2023-03-20 22:48:08: DEBUG] Broker: ca_certs not present
[2023-03-20 22:48:08: DEBUG] Broker: ssl cert_reqs not present
[2023-03-20 22:48:08: DEBUG] Broker: ssl ssl_protocol not present
[2023-03-20 22:48:08: DEBUG] broker_ssl = True
[2023-03-20 22:48:08: DEBUG] Results backend: redis using default password = 
[2023-03-20 22:48:08: DEBUG] Results backend: password_file = 
[2023-03-20 22:48:08: DEBUG] Results backend: server = redis
[2023-03-20 22:48:08: DEBUG] Results backend: certs_path = None
[2023-03-20 22:48:08: DEBUG] Results Backend: keyfile not present
[2023-03-20 22:48:08: DEBUG] Results Backend: certfile not present
[2023-03-20 22:48:08: DEBUG] Results Backend: ca_certs not present
[2023-03-20 22:48:08: DEBUG] Results Backend: ssl cert_reqs not present
[2023-03-20 22:48:08: DEBUG] Results Backend: ssl ssl_protocol not present
[2023-03-20 22:48:08: DEBUG] Results backend: redis using default password = 
[2023-03-20 22:48:08: DEBUG] Results backend: password_file = 
[2023-03-20 22:48:08: DEBUG] Results backend: server = redis
[2023-03-20 22:48:08: DEBUG] Results backend: certs_path = None
[2023-03-20 22:48:08: DEBUG] results: redis://redis:6379/0
[2023-03-20 22:48:08: DEBUG] results: redis_backed_use_ssl = False
[2023-03-20 22:48:08: INFO] Overriding default celery config with 'celery.override' in 'app.yaml':
    visibility_timeout: 86400
[2023-03-20 22:48:08: DEBUG] Adapter config = {'type': 'local', 'dry_run': False, 'shell': '/bin/bash', 'batch_type': 'local'}
[2023-03-20 22:48:10: DEBUG] Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/merlin/main.py", line 898, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/merlin/main.py", line 191, in process_run
    router.run_task_server(study, args.run_mode)
  File "/opt/conda/lib/python3.10/site-packages/merlin/router.py", line 72, in run_task_server
    run_celery(study, run_mode)
  File "/opt/conda/lib/python3.10/site-packages/merlin/study/celeryadapter.py", line 64, in run_celery
    app.connection().connect()
  File "/opt/conda/lib/python3.10/site-packages/kombu/connection.py", line 274, in connect
    return self._ensure_connection(
  File "/opt/conda/lib/python3.10/site-packages/kombu/connection.py", line 433, in _ensure_connection
    return retry_over_time(
  File "/opt/conda/lib/python3.10/site-packages/kombu/utils/functional.py", line 312, in retry_over_time
    return fun(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/kombu/connection.py", line 877, in _connection_factory
    self._connection = self._establish_connection()
  File "/opt/conda/lib/python3.10/site-packages/kombu/connection.py", line 812, in _establish_connection
    conn = self.transport.establish_connection()
  File "/opt/conda/lib/python3.10/site-packages/kombu/transport/pyamqp.py", line 201, in establish_connection
    conn.connect()
  File "/opt/conda/lib/python3.10/site-packages/amqp/connection.py", line 323, in connect
    self.transport.connect()
  File "/opt/conda/lib/python3.10/site-packages/amqp/transport.py", line 130, in connect
    self._init_socket(
  File "/opt/conda/lib/python3.10/site-packages/amqp/transport.py", line 209, in _init_socket
    self._setup_transport()
  File "/opt/conda/lib/python3.10/site-packages/amqp/transport.py", line 404, in _setup_transport
    self.sock.do_handshake()
  File "/opt/conda/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)

[2023-03-20 22:48:10: ERROR] [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)

Here is my app.yaml

broker:
  name: rabbitmq
  server: rabbitmq
  password: /root/.merlin/rabbit.pass
celery:
  override:
    visibility_timeout: 86400
process:
  kill: kill {pid}
  status: pgrep -P {pid}
results_backend:
  db_num: 0
  name: redis
  port: 6379
  server: redis

and my docker-compose:

version: '3.9'

services:

  # This can also be set up with TLS
  # https://merlin.readthedocs.io/en/latest/modules/installation/installation.html#id7
  # see 2.4.2 "Redis TLS Service"
  redis:
    restart: always
    hostname: redis
    container_name: redis
    image: 'redis:latest'
    ports:
      - "6379:6379"
    networks:
      - rabbitmq

  rabbitmq:
    restart: always

    # The hostname should not be necessary - didn't work either way
    hostname: rabbitmq
    container_name: rabbitmq
    image: rabbitmq:3.8-management
    ports:
      - "15672:15672"
      - "15671:15671"
      - "5672:5672"
      - "5671:5671"
    environment:
      - RABBITMQ_SSL_CACERTFILE=/cert_rabbitmq/client_rabbitmq_certificate.pem
      - RABBITMQ_SSL_KEYFILE=/cert_rabbitmq/server_rabbitmq_key.pem
      - RABBITMQ_SSL_CERTFILE=/cert_rabbitmq/server_rabbitmq_certificate.pem
      - RABBITMQ_SSL_VERIFY=verify_none
      - RABBITMQ_SSL_FAIL_IF_NO_PEER_CERT=false
      - RABBITMQ_DEFAULT_USER=merlinu
      - RABBITMQ_DEFAULT_VHOST=/merlinu
      - RABBITMQ_DEFAULT_PASS=guest
    volumes:
      - ./merlinu/cert_rabbitmq:/cert_rabbitmq
    networks:
      - rabbitmq

  # Yer a weezard Harry!
  merlin:
    build: .
    container_name: merlin
    networks:
      - rabbitmq

    # I only added these because they weren't showing up (didn't change anything)
    # You can try with them removed
    links:
      - rabbitmq 
      - redis

networks:
  rabbitmq:
    driver: bridge

Also note that we now have to pin the rabbitmq management container to that version - if you go higher you'll get an error because they changed it to not accept envars (and only accept a config file).

Thanks for your help - I think I'm close (and am excited to at least try this out in the flux operator, once it's working in compose!)
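
For reference, the two connection strings in the debug output above can be reproduced from the app.yaml fields plus the default ports the log reports. This is a minimal sketch of how such URLs are assembled, not merlin's actual code: the function name and the DEFAULT_PORTS table are assumptions, and the ports are the standard ones (5671 for AMQP over TLS, 5672 for plain AMQP, 6379 for redis).

```python
# Standard default ports; the debug log reports 5671 for the amqps broker.
DEFAULT_PORTS = {"amqps": 5671, "amqp": 5672, "redis": 6379, "rediss": 6379}


def connection_url(scheme, server, path, username=None, password=None, port=None):
    """Build scheme://[user[:password]@]server:port/path."""
    if port is None:
        port = DEFAULT_PORTS[scheme]
    auth = ""
    if username:
        auth = username + (":" + password if password else "") + "@"
    return f"{scheme}://{auth}{server}:{port}/{path}"


# Values taken from the debug output (password masked as in the log).
broker = connection_url("amqps", "rabbitmq", "root", username="root", password="******")
results = connection_url("redis", "redis", "0")
print(broker)
print(results)
```

The broker URL uses the TLS scheme while the results URL does not, which matches broker_ssl = True and redis_backed_use_ssl = False in the log.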

lucpeterson commented 1 year ago

@koning or @ryannova do you have any ideas? Is this related to the weird conda SSL issue that pops up every once in a while (and I’m not sure what the fix is)? How is the “merlin server” container configured, compared to this setup?

vsoch commented 1 year ago

@lucpeterson I just tested removing conda/mamba from the container (and installing to system python alongside flux) and I reproduced the same ssl error, if that helps.

lucpeterson commented 1 year ago

I’ve seen a similar error with incompatibilities between versions of celery/kombu and certifi. Trying different versions of certifi could maybe work?

koning commented 1 year ago

The broker is listed as amqps, so it is expecting an ssl config; you can try setting the broker to amqp or setting cert_reqs: none. The rabbitmq server is also configured with certs, so it may need to have ssl configured for the initial handshake. The TLS configuration is definitely the most complicated component of the server system, and we have seen issues with specific versions of openssl.
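
To make the cert_reqs suggestion concrete: the "unable to get local issuer certificate" error means the client could not find a CA certificate with which to verify the server's certificate. Celery's broker_use_ssl setting accepts a dictionary with keyfile, certfile, ca_certs and cert_reqs entries, where cert_reqs is one of the ssl.CERT_* constants from the Python standard library. The sketch below maps config-file strings onto those constants. The helper name, the accepted string spellings and the example path are illustrative assumptions, not merlin's implementation.

```python
import ssl

# Strings one might write in a config file, mapped to the stdlib constants.
CERT_REQS = {
    "none": ssl.CERT_NONE,          # do not verify the server certificate
    "optional": ssl.CERT_OPTIONAL,
    "required": ssl.CERT_REQUIRED,  # verify against ca_certs
}


def ssl_options(cert_reqs="required", ca_certs=None, certfile=None, keyfile=None):
    """Build a dict in the shape Celery's broker_use_ssl setting accepts."""
    options = {"cert_reqs": CERT_REQS[cert_reqs.lower()]}
    for key, value in (("ca_certs", ca_certs), ("certfile", certfile), ("keyfile", keyfile)):
        if value is not None:
            options[key] = value
    return options


# Skip verification entirely. Reasonable for a local test only, since it
# removes protection against an impersonated server.
print(ssl_options("none"))

# Verify the server against the CA that signed its certificate.
# The path is hypothetical; use wherever the CA file is mounted.
print(ssl_options("required", ca_certs="/cert_rabbitmq/ca_certificate.pem"))
```

With "required", the CA file has to be readable from inside the container that runs merlin, not just from the rabbitmq container.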

vsoch commented 1 year ago

I haven't added the redis ssl yet - it was presented like it was optional in the docs. I can do that now.

koning commented 1 year ago

I think the exception is coming from the rabbitmq broker, since it is raised in the amqp transport.py file.

vsoch commented 1 year ago

I'm also seeing two strategies for rabbitmq - the first docker-compose example shows config via envars (which is deprecated around 3.8/3.9)

image

and the second (which would work for newer versions) is using a config file (below). I can try the latter (which I haven't tried yet).

image

vsoch commented 1 year ago

okay trying this new way - have you seen this before? Here is what redis sees inside the container:

redis  | total 32
redis  | -rw-rw-r-- 1 1000 1000 1281 Mar 21 18:25 ca_certificate.pem
redis  | -rw------- 1 1000 1000 1704 Mar 21 18:25 ca_key.pem
redis  | -rw------- 1 1000 1000 3437 Mar 21 18:25 client_redis.p12
redis  | -rw-rw-r-- 1 1000 1000 1253 Mar 21 18:25 client_redis_certificate.pem
redis  | -rw------- 1 1000 1000 1708 Mar 21 18:25 client_redis_key.pem
redis  | -rw------- 1 1000 1000 3501 Mar 21 18:25 server_redis.p12
redis  | -rw-rw-r-- 1 1000 1000 1338 Mar 21 18:25 server_redis_certificate.pem
redis  | -rw------- 1 1000 1000 1704 Mar 21 18:25 server_redis_key.pem

When I use the entrypoint command:

    command:
      - --port 0
      - --tls-port 6379
      - --tls-ca-cert-file /cert_redis/ca_certificate.pem
      - --tls-key-file /cert_redis/server_redis_key.pem
      - --tls-cert-file /cert_redis/server_redis_certificate.pem
      - --tls-auth-clients no

I get permission denied

redis  | 1:M 21 Mar 2023 18:45:24.753 # Failed to configure TLS. Check logs for more info.
redis  | 1:C 21 Mar 2023 18:45:26.884 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
redis  | 1:C 21 Mar 2023 18:45:26.884 # Redis version=7.0.9, bits=64, commit=00000000, modified=0, pid=1, just started
redis  | 1:C 21 Mar 2023 18:45:26.884 # Configuration loaded
redis  | 1:M 21 Mar 2023 18:45:26.885 # Failed to load private key: /cert_redis/server_redis_key.pem: error:0200100D:system library:fopen:Permission denied
redis  | 1:M 21 Mar 2023 18:45:26.885 # Failed to configure TLS. Check logs for more info.

And that doesn't make sense because the user in the container is root. When I change ownership of that directory to uid 0 it doesn't change the error.
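
One way to narrow down a "Permission denied" like this is to compare the key file's owner and mode with the uid the server process actually runs as, which may differ from the uid of an interactive shell in the same container. The listing above shows the key with mode -rw------- and owner uid 1000, so the ordinary permission bits let only that uid read it. The sketch below is a diagnostic aid with hypothetical helper names; it applies the classic owner/group/other check and deliberately ignores root's override, ACLs and capabilities.

```python
import os
import stat
import sys


def can_read(st: os.stat_result, uid: int, gids: set) -> bool:
    """Owner/group/other read-bit check for a non-root process."""
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IRUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IRGRP)
    return bool(st.st_mode & stat.S_IROTH)


def describe(path: str) -> str:
    """Return an ls-style summary: mode string, owner uid, group gid, path."""
    st = os.stat(path)
    return f"{stat.filemode(st.st_mode)} uid={st.st_uid} gid={st.st_gid} {path}"


if __name__ == "__main__":
    # Usage: python check_perms.py /cert_redis/server_redis_key.pem
    for path in sys.argv[1:]:
        st = os.stat(path)
        readable = can_read(st, os.geteuid(), set(os.getgroups()))
        print(describe(path), "readable by this process:", readable)
```

Running it as the same user as the server process (rather than from a root shell) gives the most useful answer.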

doutriaux1 commented 1 year ago

@vsoch is it possible for you to use launchit ? @bgunnar5 helped me set this up at the lab and it was really smooth.

vsoch commented 1 year ago

No - this would be entirely external to lab resources. Is there someone I can contact that has the Dockerfiles / Kubernetes configs that are driving the working variant?

vsoch commented 1 year ago

Great news! I banged on this a bit more and checked all the things I thought might be wrong (user ids, permissions, configs) and rabbitmq and redis are working! The next bug I hit is with respect to running the demo - it looks like there should be a samples file (but I don't see it):

image

I think maybe there was a silent error - possibly me forgetting to install something called spellbook? (I found this in the feature demo YAML file and tried running it on my own):

image

I think I found it here: https://github.com/LLNL/merlin-spellbook

Then after I installed that, it made some output:

# spellbook make-samples -n 5 -outfile=samples.npy
[[0.35185658 0.69877856]
 [0.04317754 0.27507853]
 [0.88595382 0.67272835]
 [0.00590664 0.07649303]
 [0.63696522 0.45457279]]

I think that might have worked?

image

Do I look at the output file?

# cat studies/feature_demo_20230322-035214/merlin_info/cmd.out 
[[0.78055381 0.02167573]
 [0.34246986 0.81569574]
 [0.12497185 0.54133217]
 [0.23107361 0.2192876 ]
 [0.14349186 0.96902287]
 [0.78745147 0.3541445 ]
 [0.48053942 0.23776201]
 [0.75213343 0.75142001]
 [0.82011063 0.78713369]
 [0.35126391 0.2416619 ]]

I'm not sure if that is right because the workflow description says it will launch 10 hello worlds (I can't find them!)

I then found that I can define the batch type as "flux"

batch:
-    type: local
+    type: flux

And then start the flux instance:

$ # sudo -u fluxuser -E PATH=$PATH -E PYTHONPATH=$PYTHONPATH -E LD_LIBRARY_PATH=$LD_LIBRARY_PATH flux start --test-size=4
$ whoami
fluxuser

And I ran it!

$ merlin run feature_demo/feature_demo.yaml

And that launched flux jobs!

root@a0bda8009b53:/workflow# flux jobs -a
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
   ƒ27zGb3bd root     merlin     CD      1      1   1.028s a0bda8009b53
    ƒjiYAoTm root     merlin     CD      1      1   1.264s a0bda8009b53

And the job attach (output) showed me a warning that (bad me!) I should not be running as root. So this is a good stopping point for tonight, but next I need to go back and redo the build so ownership and the expected user is fluxuser (and not root) otherwise I get that warning:

# flux job attach ƒ27zGb3bd
[2023-03-22 03:59:10: INFO] Loading specification from path: /workflow/feature_demo/feature_demo.yaml
[2023-03-22 03:59:10: INFO] Launching workers from '/workflow/feature_demo/feature_demo.yaml'
[2023-03-22 03:59:10: INFO] Starting workers
[2023-03-22 03:59:10: INFO] Reading app config from file /root/.merlin/app.yaml

       *      
   *~~~~~                                       
  *~~*~~~*      __  __           _ _       
 /   ~~~~~     |  \/  |         | (_)      
     ~~~~~     | \  / | ___ _ __| |_ _ __  
    ~~~~~*     | |\/| |/ _ \ '__| | | '_ \ 
   *~~~~~~~    | |  | |  __/ |  | | | | | |
  ~~~~~~~~~~   |_|  |_|\___|_|  |_|_|_| |_|
 *~~~~~~~~~~~                                    
   ~~~*~~~*    Machine Learning for HPC Workflows                                 

Running a worker with superuser privileges when the
worker accepts messages serialized with pickle is a very bad idea!

If you really want to continue then you have to set the C_FORCE_ROOT
environment variable (but please think about this before you do).

User information: uid=0 euid=0 gid=0 egid=0

But this is great progress - we have our services! Here are the changes I needed and my WIP demo here. I'll update the group email thread, and tomorrow will start figuring out the fluxuser port. If I can get this running here with the fluxuser, that should be enough to try in the flux operator. Some feedback for the docs - the rabbitmq config parameters are different, and the certs had to be bound to the merlin container too. What really tripped me up was the app.yaml because there were so many options, and I didn't realize "rediss" means "redis with ssl" as opposed to "Did a snake write this?" :laughing:

I'll ping again after some more work tomorrow. Thanks for the help and connecting me to the larger team today! The merlin info command (that I saw in the last email) was hugely helpful for debugging.
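
On the "rediss" point: the trailing "s" in a connection scheme marks the TLS variant, the same way amqps relates to amqp. A small sketch with the standard library's URL parser makes it easy to see what a given connection string implies. The helper name and the TLS_SCHEMES table are illustrative assumptions, and the example URLs follow the host and port from the debug output earlier in the thread.

```python
from urllib.parse import urlsplit

# TLS scheme -> plain-text counterpart.
TLS_SCHEMES = {"rediss": "redis", "amqps": "amqp"}


def describe_url(url):
    """Split a connection URL and report whether its scheme implies TLS."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "tls": parts.scheme in TLS_SCHEMES,
    }


print(describe_url("rediss://redis:6379/0"))
print(describe_url("redis://redis:6379/0"))
```

If "tls" comes back True for a service, the client side needs access to the matching certificates as well.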

vsoch commented 1 year ago

okay went a little further - when I try the flux example workflows, it's still trying to use srun:

image

The cmd.sh in the studies / merlin_info looks OK

python3 /workflow/flux/scripts/make_samples.py -dims 2 -n 10 -outfile=/workflow/studies/flux_test_20230322-053246/merlin_info/samples.npy

I did some debugging, and in study/batch.py by the time we get here the workload manager is set to slurm!

And then I realized the pip installed version doesn't even have that logic!

image

So I reinstalled directly from the branch here - that actually seemed to run! I had to look here https://github.com/LLNL/merlin/blob/060826fee4a91502a588fc312a8f21527472e53c/merlin/study/batch.py#L331-L338 to figure out what is going on - it's launching via flux mini alloc, so there is a flux instance / allocation running merlin. I can attach to it (and see there is an issue). It looks like it runs ok - most are status OK but I do see a SOFT FAIL

Restart: None
Scheduled?: True
[2023-03-22 05:56:58,151: INFO] Executing step 'runs' in '/workflow/studies/flux_test_20230322-055655/runs/09'...
[2023-03-22 05:56:58,337: INFO] Execution returned status OK.
[2023-03-22 05:56:58,337: INFO] Step 'runs' in '/workflow/studies/flux_test_20230322-055655/runs/09' finished successfully.
[2023-03-22 05:56:58,394: INFO] Task merlin.common.tasks.merlin_step[4838c7da-7ea5-4aad-bd0f-719f83b94ede] succeeded in 0.24390821799170226s: ReturnCode.OK
[2023-03-22 05:56:58,396: INFO] Task merlin:chordfinisher[4a7c517b-76c9-4410-a108-2de2a3634019] received
[2023-03-22 05:56:58,408: INFO] Task merlin:chordfinisher[4a7c517b-76c9-4410-a108-2de2a3634019] succeeded in 0.01129624602617696s: 'SYNC'
[2023-03-22 05:56:58,409: INFO] Task merlin.common.tasks.expand_tasks_with_samples[1d54784b-e69f-4a1e-a53e-2c7f1c2404b4] received
[2023-03-22 05:56:58,421: INFO] Task merlin.common.tasks.expand_tasks_with_samples[1d54784b-e69f-4a1e-a53e-2c7f1c2404b4] succeeded in 0.011088147992268205s: None
[2023-03-22 05:56:58,422: INFO] Task merlin.common.tasks.merlin_step[5432e58a-a375-4388-a2b7-3e10766ba722] received
[2023-03-22 05:56:58,423: INFO] Directory does not exist. Creating directories to /workflow/studies/flux_test_20230322-055655/data
[2023-03-22 05:56:58,423: INFO] Generating script for data into /workflow/studies/flux_test_20230322-055655/data
[2023-03-22 05:56:58,423: INFO] Running workflow step 'data' locally.
[2023-03-22 05:56:58,423: INFO] Script: /workflow/studies/flux_test_20230322-055655/data/data.slurm.sh
Restart: None
Scheduled?: True
[2023-03-22 05:56:58,423: INFO] Executing step 'data' in '/workflow/studies/flux_test_20230322-055655/data'...
[2023-03-22 05:56:58,503: WARNING] Unrecognized Merlin Return code: 1, returning SOFT_FAIL
[2023-03-22 05:56:58,503: WARNING] *** Step 'data' in '/workflow/studies/flux_test_20230322-055655/data' soft failed. Continuing with workflow.
[2023-03-22 05:56:58,514: INFO] Task merlin.common.tasks.merlin_step[5432e58a-a375-4388-a2b7-3e10766ba722] succeeded in 0.09122752398252487s: ReturnCode.SOFT_FAIL
[2023-03-22 05:56:58,515: INFO] Task merlin:chordfinisher[bc64e65c-f988-4401-85d8-d22e710a9517] received
[2023-03-22 05:56:58,516: INFO] Task merlin:chordfinisher[bc64e65c-f988-4401-85d8-d22e710a9517] succeeded in 0.0006944899796508253s: 'SYNC'

Also note that "flux mini" is getting deprecated - so should update that eventually (not soon if you want backwards compatibility).

fluxuser@f47514a02ad9:/workflow$ flux-mini: WARNING: ⚠️ flux-mini is deprecated, use flux-batch, flux-run, etc.⚠️
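
As the warning says, the flux mini subcommands are now available directly under flux (flux alloc, flux batch, flux run, flux submit). If backwards compatibility matters, one option is to keep the legacy command in one place and rewrite it only when targeting a newer flux. A minimal sketch, assuming the launch command is held as an argument list; the function name is hypothetical and this is not how merlin builds its command.

```python
# Subcommands that moved from "flux mini X" to "flux X".
PROMOTED = {"alloc", "batch", "run", "submit", "bulksubmit"}


def modernize_flux_command(argv):
    """Rewrite ['flux', 'mini', SUB, ...] as ['flux', SUB, ...]."""
    if len(argv) >= 3 and argv[0] == "flux" and argv[1] == "mini" and argv[2] in PROMOTED:
        return [argv[0]] + list(argv[2:])
    # Anything else is returned unchanged (as a new list).
    return list(argv)


print(modernize_flux_command(["flux", "mini", "alloc", "-N", "1"]))
print(modernize_flux_command(["flux", "jobs", "-a"]))
```

Older flux installations only understand the flux mini form, so the rewrite would need to be conditional on the installed version.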

I'm also wondering if it always makes sense to run merlin via an allocation? E.g., for my use case, I'm going to be giving the merlin command to flux start. Theoretically it will already be inside a flux instance, so it could just do flux submit. Maybe that launch command for flux should be more customizable? I also think the merlin run -> merlin run-workers command is a bit confusing for a new user - my expectation is that "run" actually runs the workflow. Perhaps there could be two avenues:

# queue the tasks and then separately run workers
merlin queue -> merlin run-workers

# queues AND runs the workers (so I have a single command that can do both for automation stuffs)
merlin run
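
For the "already inside a flux instance" case, the launcher could be chosen at run time. The sketch below assumes that processes started inside an instance see the FLUX_URI environment variable, as flux start sets it for its initial program. The function names and the choice of flux submit versus flux alloc are hypothetical, only meant to illustrate a more customizable launch command.

```python
import os


def inside_flux_instance(environ=None):
    """True if the environment looks like it belongs to a running flux instance."""
    environ = os.environ if environ is None else environ
    return bool(environ.get("FLUX_URI"))


def launch_prefix(environ=None):
    """Submit into the enclosing instance if there is one, else allocate a new one."""
    if inside_flux_instance(environ):
        return ["flux", "submit"]
    return ["flux", "alloc"]


print(launch_prefix())
```

Passing an explicit dictionary instead of reading os.environ keeps the decision easy to test.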

My branch is now updated with the changes I needed to run as the fluxuser. https://github.com/rse-ops/flux-hpc/tree/add/merlin/merlin-demos. Will try out the other flux examples tomorrow!

bgunnar5 commented 1 year ago

Based on other issues I've been helping users with and now these suggestions too, I think it's time for us to update our script adapters so that they're more consistent with Maestro and up-to-date with flux/slurm/lsf (this would also include updates to the docs to make everything more clear for users). I appreciate your recommendations here and I'll be bookmarking this for when I go take a look at making these updates :relaxed:

koning commented 1 year ago

Thanks for beta testing the new flux native interface. We don't have the version you used installed yet, so the interface change is news to me; we will get a fix in for that. The run and run-workers commands are generally separate because you can start workers on a different machine from the one where the study was submitted. That way you can spin up more workers independently of the study (the producer-consumer model). Separating the scheduler batch/run configuration from the code would make these API changes much easier to implement.
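
The producer-consumer split can be illustrated with a toy example in which an in-process queue stands in for the broker. This is only an analogy: the function names are hypothetical, threads stand in for workers that would normally be separate processes or machines, and none of it reflects merlin's internals.

```python
import queue
import threading


def run_study(task_queue, samples):
    """Producer: enqueue one task per sample and return immediately."""
    for sample in samples:
        task_queue.put(sample)


def worker(task_queue, results, lock):
    """Consumer: process tasks until the queue is empty."""
    while True:
        try:
            sample = task_queue.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append(f"hello from sample {sample}")
        task_queue.task_done()


def run_workers(task_queue, n_workers):
    """Start n_workers consumers and wait for them to drain the queue."""
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(task_queue, results, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


tasks = queue.Queue()            # stands in for the broker
run_study(tasks, range(10))      # like "merlin run": only queues work
print(tasks.qsize(), "tasks queued, none executed yet")
done = run_workers(tasks, 3)     # like "merlin run-workers": consumes the queue
print(len(done), "tasks executed")
```

Because the producer never waits for the consumers, more workers can be added later without touching the code that queued the study.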

koning commented 1 year ago

The new pr #407 should fix the deprecation messages.

vsoch commented 1 year ago

Gotcha! And that makes sense. For the operator we have a commands -> pre block where I can run it before the official "launch the jobs!" command. I'm pretty far into that now - hopefully will have an update soon. I kind of cheated for the redis/rabbitmq containers because I just built them already with the certs they need for the demo. :laughing: In a real production sense you'd want to generate them dynamically and then have read only config map volumes. You could even have a merlin operator to handle this!

vsoch commented 1 year ago

I sent this update via email, but will post here too! I got it mostly working in the flux operator - I had to do an interactive submit mode because (as far as I can tell) there is no single command to give to, for example, flux start that will generate some DAG, submit and wait for all jobs, and exit only when that's done.

https://flux-framework.org/flux-operator/tutorials/services.html#service-containers-alongside-the-cluster

I have a lot of questions about design (and am wondering if things might be simplified) so I'm hoping there is interest to have a meeting so you can show me some of the design internals / interaction with Flux.

lucpeterson commented 1 year ago

Longer term, a cleaner interface might be to create a flux transport channel that celery can hook into directly. Currently celery can use rabbitmq, redis, sqs and zookeeper as brokers (although we only have merlin hooks for the first two), and there are many more backends. Adding a flux transport to the kombu library could make this a lot cleaner:

https://github.com/celery/kombu/tree/a3de6f66c1c62cba5008f078c2df20d97f32dcbe/kombu/transport

vsoch commented 1 year ago

@lucpeterson I like that idea - but how could kombu accept a contribution for a different kind of transport that explicitly is to a job queue (and isn't a general message or event?) Would we try to make some kind of additional plugin to work with it (or similar?)

vsoch commented 1 year ago

There are other abstractions to think about too - e.g., a celery "backend" is more of the database. Here is a random example I found for a custom one: https://github.com/pilwon/celery-backends-rethinkdb/blob/master/rethinkdb_backend.py Arguably if we submit a job to flux, it would serve as running the task and be able to give us the result.

I've never developed for kombu / celery so apologies as I try to get my head around the different components (and what we are interested in).