galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Coexecution jobs don't work with non-default managers #313

Closed: natefoo closed this issue 1 year ago

natefoo commented 1 year ago

A non-default manager is necessary in this case because it's how we route messages to the correct Pulsar via AMQP.

Upon upgrading usegalaxy.org to 23.0, Pulsar Kubernetes runner jobs are failing with the following error (where tacc_k8s is the manager defined in the runner plugin and is also the destination id):

Traceback (most recent call last):
  File "/usr/local/bin/pulsar-submit", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/pulsar/scripts/submit.py", line 20, in main
    run_server_for_job(args)
  File "/usr/local/lib/python3.7/site-packages/pulsar/scripts/submit_util.py", line 29, in run_server_for_job
    manager, app = manager_from_args(config_builder)
  File "/usr/local/lib/python3.7/site-packages/pulsar/scripts/submit_util.py", line 94, in manager_from_args
    manager = pulsar_app.managers[manager_name]
KeyError: 'tacc_k8s'
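The failure mode can be illustrated with a small sketch. This is hypothetical code, not Pulsar's actual internals: the names (`managers_from_app_conf`, `_default_`) are illustrative. The point is that an app conf with only a singular `manager` section yields a single default-named manager, while the submit script looks up a manager by the destination id sent from Galaxy:

```python
# Hypothetical sketch of the failing lookup; names are illustrative,
# not Pulsar's actual internals.
DEFAULT_MANAGER_NAME = "_default_"


def managers_from_app_conf(app_conf):
    # A plural "managers" section defines arbitrarily named managers.
    if "managers" in app_conf:
        return dict(app_conf["managers"])
    # A singular "manager" section configures only the default manager.
    return {DEFAULT_MANAGER_NAME: app_conf.get("manager", {})}


# App conf like the one Galaxy generates for these jobs (see below):
app_conf = {"manager": {"type": "coexecution", "monitor": "background"}}
managers = managers_from_app_conf(app_conf)

try:
    managers["tacc_k8s"]  # destination id supplied by the Galaxy runner
except KeyError as exc:
    print("KeyError:", exc)  # prints: KeyError: 'tacc_k8s'
```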

usegalaxy.org had been running a custom coexecution image, and unfortunately I no longer have any recollection of why or how that image was built. The version of Pulsar in it appears to be from somewhere around d9b710231cba65555285d38ae4da139ef5421395, and I can't see any differences that suggest I manually hacked in a fix. In local testing I can reproduce this error all the way back to very old 0.14.x versions, so I have been unable to figure out how this was working until now. That image might still work if not for the fact that the client now adds --wait to the pulsar-submit args, and the version of pulsar-submit in that image doesn't have --wait.

It may be possible/correct to set a pulsar app config on the Galaxy side that explicitly defines the tacc_k8s manager. The app conf generated for pulsar-submit's --app_conf_base64 option for these jobs contains:

{
  "staging_directory": "/pulsar_staging",
  "message_queue_url": "amqp://user:pass@host:5671//main_pulsar?ssl=1",
  "manager": {
    "type": "coexecution",
    "monitor": "background"
  },
  "persistence_directory": "/pulsar_staging/persisted_data"
}
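For illustration, here is a sketch of how a client could produce the `--app_conf_base64` payload from that app conf; the exact serialization Galaxy uses may differ, but the shape is a base64-encoded JSON document:

```python
import base64
import json

# The app conf shown above.
app_conf = {
    "staging_directory": "/pulsar_staging",
    "message_queue_url": "amqp://user:pass@host:5671//main_pulsar?ssl=1",
    "manager": {"type": "coexecution", "monitor": "background"},
    "persistence_directory": "/pulsar_staging/persisted_data",
}

# Encode to the string passed on the command line, e.g.:
#   pulsar-submit --app_conf_base64 <encoded> ...
encoded = base64.b64encode(json.dumps(app_conf).encode("utf-8")).decode("ascii")

# The receiving side decodes it back into the same structure.
decoded = json.loads(base64.b64decode(encoded))
assert decoded == app_conf
```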

I just wish I understood how this worked before.

natefoo commented 1 year ago

Ok, confirmed: if I configure the following in the config file referenced by pulsar_app_config_path in the destination config in Galaxy, the issue is fixed:

---

message_queue_url: amqp://user:pass@host:5671//main_pulsar?ssl=1
managers:
  tacc_k8s:
    type: coexecution
    monitor: background

This is maybe not ideal, but at least it's a workaround that doesn't require the change in #314.

jmchilton commented 1 year ago

I've tried to get away from having to define managers at all in there, and I'd like to stick with the default manager. I think https://github.com/galaxyproject/pulsar/pull/315 should fix that: you can just define amqp_key_prefix: pulsar_tacc_k8s_ as a destination parameter and it should all just work. I wanted to give you some time to check my work before I rebuild and test, but I guess I should fast-track testing and publishing this, huh? I assume ITs are broken on main?
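As a sketch of what that destination parameter could look like, assuming a YAML-style Galaxy job configuration (the environment and runner ids here are hypothetical, and the exact job conf schema depends on your Galaxy version):

```yaml
# Hypothetical Galaxy job conf fragment: the amqp_key_prefix destination
# parameter routes AMQP messages for this destination while Pulsar itself
# keeps using its default manager.
execution:
  environments:
    tacc_k8s:
      runner: pulsar_coexec        # hypothetical runner plugin id
      amqp_key_prefix: pulsar_tacc_k8s_
```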

jmchilton commented 1 year ago

The type: coexecution and monitor: background settings shouldn't be needed anymore either - you should just use the PulsarCoexecutionJobRunner.
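For context, loading that runner as a plugin might look something like the fragment below; this is a sketch, and the plugin id and module path should be checked against your Galaxy version's job configuration documentation:

```yaml
# Hypothetical Galaxy job conf fragment loading the coexecution runner;
# the "pulsar_coexec" id is illustrative.
runners:
  pulsar_coexec:
    load: galaxy.jobs.runners.pulsar:PulsarCoexecutionJobRunner
```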

jmchilton commented 1 year ago

https://github.com/galaxyproject/galaxy/pull/15535

jmchilton commented 1 year ago

Okay @natefoo - I think this is all ready to go on the dev side - the TL;DR:

natefoo commented 1 year ago

This looks perfect, thanks - I'll try it out. Yes, the reason I had a named manager was for AMQP purposes; having a way to specify the AMQP exchange but still use the default manager seems to me like the best solution for the coexecution case, where named managers don't make sense.

mvdbeek commented 1 year ago

In the absence of news I assume this is working now, but please re-open if that's not the case @natefoo