galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Large stdout/stderr crashes the acknowledgement manager, results in stuck jobs #283

Open natefoo opened 2 years ago

natefoo commented 2 years ago
2021-08-24 13:52:32,217 DEBUG [pulsar.client.amqp_exchange][acknowledgement-manager] UUID b694576c-02bf-11ec-a39e-566f6d94001a has not been acknowledged, republishing original message on queue status_update
2021-08-24 13:52:32,217 DEBUG [pulsar.client.amqp_exchange][acknowledgement-manager] [publish:0e74b094-0504-11ec-9a41-566f6d94001a] Begin publishing to key pulsar_bridges__status_update
2021-08-24 13:52:32,218 DEBUG [pulsar.client.amqp_exchange][acknowledgement-manager] [publish:0e74b094-0504-11ec-9a41-566f6d94001a] Have producer for publishing to key pulsar_bridges__status_update
2021-08-24 13:52:32,300 ERROR [pulsar.client.amqp_exchange][acknowledgement-manager] Problem with acknowledgement manager, leaving ack_manager method in problematic state!
Traceback (most recent call last):
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/pulsar/client/amqp_exchange.py", line 232, in ack_manager
    self.publish(resubmit_queue, payload)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/pulsar/client/amqp_exchange.py", line 205, in publish
    producer.publish(
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/kombu/messaging.py", line 175, in publish
    return _publish(
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/kombu/connection.py", line 525, in _ensured
    return fun(*args, **kwargs)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/kombu/messaging.py", line 197, in _publish
    return channel.basic_publish(
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/channel.py", line 1775, in _basic_publish
    self.connection.drain_events(timeout=0)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/connection.py", line 522, in drain_events
    while not self.blocking_read(timeout):
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/connection.py", line 528, in blocking_read
    return self.on_inbound_frame(frame)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/method_framing.py", line 53, in on_frame
    callback(channel, method_sig, buf, None)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/connection.py", line 534, in on_inbound_method
    return self.channels[channel_id].dispatch_method(
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/abstract_channel.py", line 143, in dispatch_method
    listener(*args)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/channel.py", line 277, in _on_close
    raise error_for_code(
amqp.exceptions.PreconditionFailed: Basic.publish: (406) PRECONDITION_FAILED - message size 155238738 is larger than configured max size 134217728

Not sure of the best solution here, but maybe Pulsar should post the stdio streams back as job files rather than sending them over the MQ?
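For scale, the limit in the error above is exactly 128 MiB (134217728 bytes), and the rejected message was about 148 MiB. A minimal sketch of a pre-publish size check follows. The helper names `payload_size` and `fits_in_message` are hypothetical, and JSON encoding is an assumption here; the serialization Pulsar and kombu actually use may produce a different size.

```python
import json

# Broker limit quoted in the PRECONDITION_FAILED error above (128 MiB).
BROKER_MAX_MESSAGE_SIZE = 134217728


def payload_size(payload: dict) -> int:
    """Return the size in bytes of the payload when encoded as JSON (assumed encoding)."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_in_message(payload: dict, limit: int = BROKER_MAX_MESSAGE_SIZE) -> bool:
    """Return True if the encoded payload does not exceed the given limit."""
    return payload_size(payload) <= limit


if __name__ == "__main__":
    # Hypothetical status payload; the real message layout is not shown in this thread.
    status = {"job_id": "1", "stdout": "ok", "stderr": ""}
    print(fits_in_message(status))
    # How far the rejected message from the log exceeded the limit, in bytes.
    print(155238738 - BROKER_MAX_MESSAGE_SIZE)
```

A check like this would let the sender fall back to another path (for example, leaving the streams to the file transfer) instead of having the broker close the channel.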

mvdbeek commented 1 year ago

We do (also) send it as a file, and Pulsar has the maximum_stream_size option ... which defaults to -1, i.e. read everything. I think we can add a more sensible default here.
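To illustrate the semantics described here, below is a minimal sketch of a size-capped stream read in which a negative limit means "read everything". This is not Pulsar's actual implementation: `read_stream` is a hypothetical helper, and only the option name `maximum_stream_size` and its -1 default come from this thread. The sketch keeps the beginning of the stream; whether Pulsar keeps the head or the tail is not stated here.

```python
import io
from typing import BinaryIO


def read_stream(stream: BinaryIO, maximum_stream_size: int = -1) -> bytes:
    """Read at most maximum_stream_size bytes; a negative value reads the whole stream."""
    if maximum_stream_size < 0:
        return stream.read()
    return stream.read(maximum_stream_size)


if __name__ == "__main__":
    data = b"line 1\nline 2\nline 3\n"
    # Default of -1: the full contents are returned.
    print(len(read_stream(io.BytesIO(data))))
    # With a cap, only the first maximum_stream_size bytes are returned.
    print(read_stream(io.BytesIO(data), maximum_stream_size=6))
```

With a cap in place, the size of the stream contents placed in the status message is bounded regardless of how much the job wrote.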

natefoo commented 1 year ago

I have that set to 8 MB, and it seems to work fine.

cat-bro commented 3 months ago

Wouldn't truncating the stdout/stderr files affect Galaxy's ability to judge success/failure of the job?

mvdbeek commented 3 months ago

> We do (also) send it as a file,

That covers guessing the job state if the exit code is not the authoritative source. It happens in the metadata script, which reads in the file contents.