Code is done but still needs documentation, pending the outcome of an internal Slack discussion about whether it's a good idea to expose this option at all: it saves runs from late stochastic failures, but at the cost of possibly not noticing early systemic failures.
Added docs, plus a corresponding change that makes JOB_RETRIES configurable from the config (it was previously hardcoded to 10), so that when ALWAYS_CONTINUE = True the number of retries can be lowered.
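A minimal sketch of how the two options might interact, assuming JOB_RETRIES falls back to the old hardcoded default of 10 when absent and ALWAYS_CONTINUE means "record a job that exhausts its retries as failed and move on" rather than aborting the run. The names `RunnerConfig`, `load_config` and `run_jobs` are hypothetical and not taken from the actual change.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Mapping, Tuple

# Hypothetical: the value that was previously hardcoded.
DEFAULT_JOB_RETRIES = 10


@dataclass
class RunnerConfig:
    always_continue: bool = False
    job_retries: int = DEFAULT_JOB_RETRIES


def load_config(raw: Mapping[str, Any]) -> RunnerConfig:
    """Build a RunnerConfig from an already-parsed config mapping (assumed)."""
    retries = int(raw.get("JOB_RETRIES", DEFAULT_JOB_RETRIES))
    if retries < 0:
        raise ValueError("JOB_RETRIES must be non-negative")
    # Assumes ALWAYS_CONTINUE is already a real boolean, not the string "False".
    return RunnerConfig(
        always_continue=bool(raw.get("ALWAYS_CONTINUE", False)),
        job_retries=retries,
    )


def run_jobs(
    jobs: Iterable[Tuple[str, Callable[[], None]]], config: RunnerConfig
) -> List[str]:
    """Run each job with up to `job_retries` retries after the first attempt.

    Returns the names of jobs that still failed after all retries. Without
    ALWAYS_CONTINUE, the final failure is re-raised and the run stops.
    """
    failed: List[str] = []
    for name, job in jobs:
        for attempt in range(config.job_retries + 1):
            try:
                job()
                break  # success: move on to the next job
            except Exception:
                if attempt == config.job_retries:
                    if not config.always_continue:
                        raise  # surface the failure immediately
                    failed.append(name)  # record it and keep going
    return failed
```

With ALWAYS_CONTINUE enabled, a low JOB_RETRIES keeps a systemically broken job from burning 10 attempts before being recorded as failed, while transient failures still get a few retries.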