Closed natefoo closed 1 year ago
Ok, confirmed, if I configure the following in the config file referenced by pulsar_app_config_path
in the destination config in Galaxy, then it fixes the issue:
---
message_queue_url: amqp://user:pass@host:5671//main_pulsar?ssl=1
managers:
tacc_k8s:
type: coexecution
monitor: background
Which is maybe not ideal but at least there's a workaround without the change in #314.
I've tried to get away from having to define managers at all in there and I'd like to stick with the default manager - I think https://github.com/galaxyproject/pulsar/pull/315 should fix that? You can just define amqp_key_prefix: pulsar_tacc_k8s_
as a destination parameter and it all should just work. I wanted to give you some time to check my work before I try to rebuild and test but I guess I should fast track testing and publishing this huh? I assume ITs are broken on main?
The type coexecution and background things shouldn't be needed anymore either - you should just use the PulsarCoexecutionJobRunner.
Okay @natefoo - I think this is all ready to go on the dev side - the TL;DR:
This looks perfect, thanks, I'll try it out. Yes, the reason I had a named manager was for AMQP purposes, having a way to specify the AMQP exchange but still use the default manager seems to me like the best solution for the coexecution case where named managers don't make sense.
In the absence of news I assume this is working now, but please re-open if that's not the case @natefoo
A non-default manager is necessary in this case because it's how we route messages to the correct Pulsar via AMQP.
Upon upgrading to usegalaxy.org, 23.0, Pulsar Kubernetes runner jobs are failing with the following error (where
tacc_k8s
is the manager defined in the runner plugin and is the destination id:usegalaxy.org had been running a custom coexecution image and unfortunately I no longer have any recollection as to why or how that image was built. But the version of Pulsar appears to be from somewhere around d9b710231cba65555285d38ae4da139ef5421395 and I can't see any differences that suggest I manually hacked a fix. But in local testing I can reproduce this error all the way back to very old 0.14.x versions, so I have been unable to figure out how this was working until now. This image might still work if not for the fact that the client is adding
--wait
to thepulsar-submit
args and the version ofpulsar-submit
in that image doesn't have--wait
.It may be possible/correct to set a pulsar app config on the Galaxy side that explicitly defines the
tacc_k8s
manager. The app conf generated forpulsar-submit
's--app_conf_base64
option for these jobs contains:I just wish I understood how this worked before.