MetPX / sarrac

C implementation of (a subset of) Sarracenia (large scale file transfer utility)
GNU General Public License v2.0

post_retry support for mirroring #171

Open petersilva opened 3 weeks ago

petersilva commented 3 weeks ago

The client has concerns about the robustness of mirroring post generation during broker outages. Currently, I think the user jobs will just hang, trying desperately to publish notices to the broker.

The post_retry logic (actually all retry logic) depends on having one retry list per process. Each instance has one file per retry queue (download and post being the two that currently exist). In the context of libsr3shim, this does not make much sense: the processes are typically short-lived non-daemons.

An alternative to the thousands of .pid files would be to post to a pipe, or a named pipe, per node, in which case you need a janitor that reads the named pipe. You end up creating a second IPC network to robustify your IPC network.
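To make the janitor dependency concrete, here is a minimal sketch of what the writer side of a per-node named pipe could look like. The function name `post_to_fifo` and the FIFO path are hypothetical and are not part of sarrac or libsr3shim. It relies only on standard POSIX behaviour: opening a FIFO write-only with `O_NONBLOCK` fails with `ENXIO` when no process has it open for reading.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: hand one notice line to a per-node FIFO.
 * Returns 0 if the whole notice was written, -1 otherwise
 * (including the case where no janitor is reading the FIFO). */
int post_to_fifo(const char *fifo_path, const char *notice)
{
    /* Create the FIFO if it does not exist yet; an existing one is fine. */
    if (mkfifo(fifo_path, 0600) != 0 && errno != EEXIST)
        return -1;

    /* Non-blocking open: fails with ENXIO when there is no reader,
     * instead of hanging the short-lived user process. */
    int fd = open(fifo_path, O_WRONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    size_t len = strlen(notice);
    ssize_t written = write(fd, notice, len);
    close(fd);

    return (written == (ssize_t)len) ? 0 : -1;
}
```

If the janitor is down, the call returns -1 immediately and the notice still has to go somewhere, which is the "second IPC network" problem in miniature.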

Taking the simpler option:

This is one suggested implementation.

petersilva commented 3 weeks ago

@reidsunderland @habilinour what I did not have time to explain during the meeting.

petersilva commented 3 weeks ago

Avoiding contention is probably harder than that... you need to have hostname and pid combined, because you might get pid conflicts. I was thinking we could check the proc table to avoid conflicts, but we would have to check the proc table on all nodes, or run the janitor on all nodes, which feels ridiculously expensive. That's why I was using 1 minute... might need a longer time.
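A rough sketch of the hostname-plus-pid idea, in case it helps the discussion. The file name pattern, the function names, and the reading of "1 minute" as an idle threshold on the file's modification time are all assumptions made for illustration; none of this is existing sarrac code.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical naming scheme: <dir>/post_retry_<hostname>_<pid>
 * Combining hostname and pid keeps names distinct across nodes.
 * Returns 0 on success, -1 on error or if buf is too small. */
int retry_file_name(char *buf, size_t buflen, const char *dir)
{
    char host[256];

    if (gethostname(host, sizeof host) != 0)
        return -1;
    host[sizeof host - 1] = '\0'; /* gethostname may not terminate on truncation */

    int n = snprintf(buf, buflen, "%s/post_retry_%s_%ld",
                     dir, host, (long)getpid());
    return (n > 0 && (size_t)n < buflen) ? 0 : -1;
}

/* Assumed janitor-side check: treat a file as safe to pick up only if
 * it has not been modified for at least min_age_sec seconds.
 * Returns 1 if idle, 0 if recently modified, -1 if stat() fails. */
int retry_file_is_idle(const char *path, time_t min_age_sec)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (time(NULL) - st.st_mtime) >= min_age_sec ? 1 : 0;
}
```

A single janitor could call `retry_file_is_idle(path, 60)` on each file it finds and skip anything still being written, without consulting the proc table on any node. The threshold trades latency for safety, so a longer value only delays the retry.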

I hope we can just run 1 janitor for the whole cluster. During peak times it falls behind and catches up later... everything is late anyway, so minutes don't matter in this situation.