galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

curl remote transfers have no timeout #362

Open natefoo opened 2 months ago

natefoo commented 2 months ago

If I am reading the code correctly, when Pulsar is configured with remote_transfer and client.transport.curl is used, transfers go through the get_file() and post_file() functions rather than the PycurlTransport class. PycurlTransport has a timeout option defined, although that option probably does not work the way we would want it to. The get_file() and post_file() path appears to set no timeout at all.

I have a Pulsar server which appears to have had a network interruption at some point. AMQP recovered and subsequent jobs are handled fine, but ones that were preprocessing during the hiccup are just stuck and have written 0 bytes to disk in the many hours since. Preprocessing thread stacks are:

Thread 101101 (idle): "[manager=vgp_jetstream2]-[action=preprocess]-[job=57400350]"
    get_file (pulsar/client/transport/curl.py:119)
    write_to_path (pulsar/client/action_mapper.py:482)
    <lambda> (pulsar/managers/staging/pre.py:20)
    _retry_over_time (pulsar/managers/util/retry.py:93)
    execute (pulsar/managers/util/retry.py:42)
    preprocess (pulsar/managers/staging/pre.py:20)
    do_preprocess (pulsar/managers/stateful.py:114)
    run (threading.py:917)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)

I would expect TCP to do something more reasonable here, so that part is a mystery to me. Either way, we could allow the libcurl low speed limit and low speed time options (CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME) to be configured to mitigate these sorts of issues.
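For illustration, here is a minimal sketch of what setting those options could look like with pycurl. It is not Pulsar's actual code: the helper names (stall_options, apply_options, download) and the default values are hypothetical, and only the pycurl constants and the Curl methods are real. With these options set, libcurl aborts a transfer whose average speed stays below LOW_SPEED_LIMIT bytes per second for LOW_SPEED_TIME seconds, and pycurl raises pycurl.error (operation timed out) instead of blocking indefinitely.

```python
def stall_options(low_speed_limit=1024, low_speed_time=300, connect_timeout=60):
    """Map libcurl option names to values that abort stalled transfers.

    The defaults are illustrative assumptions, not values used by Pulsar.
    """
    return {
        # Give up if the connection cannot be established within this many seconds.
        "CONNECTTIMEOUT": connect_timeout,
        # Abort if the average speed stays below this many bytes per second...
        "LOW_SPEED_LIMIT": low_speed_limit,
        # ...for this many consecutive seconds.
        "LOW_SPEED_TIME": low_speed_time,
    }


def apply_options(curl, options, constants):
    """Set each named option on a curl handle.

    `constants` is any object exposing the option constants by name,
    normally the pycurl module itself.
    """
    for name, value in options.items():
        curl.setopt(getattr(constants, name), value)


def download(url, path, **stall_kwargs):
    """Hypothetical download helper with stall protection applied."""
    import pycurl  # imported lazily so the helpers above work without it

    curl = pycurl.Curl()
    try:
        with open(path, "wb") as fh:
            curl.setopt(pycurl.URL, url)
            curl.setopt(pycurl.WRITEDATA, fh)
            apply_options(curl, stall_options(**stall_kwargs), pycurl)
            # Raises pycurl.error if the transfer stalls past the limits.
            curl.perform()
    finally:
        curl.close()
```

A low speed limit is probably a better fit than a single overall timeout here, since staged files vary widely in size and a fixed total time would either cut off large transfers or be too generous to catch a stalled small one.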