MetPX / sarrac

C implementation of (a subset of) Sarracenia (large scale file transfer utility)
GNU General Public License v2.0

hard ordering constraints in scripts in mirrors. #174

Open petersilva opened 3 weeks ago

petersilva commented 3 weeks ago

It is a clear limitation of the entire method of mirroring that the order of operations is not strictly maintained. Files are posted, mostly in the order the changes are created, and then the events are queued. There are subscribers that pick from the queue and, in the case of HPC mirroring, the queue is shared among 40 subscribers... so any event can be picked up by any of the subscribers.

Since different subscribers can pick up different events, and the subscribers are asynchronous to each other, the events can be executed on the subscriber side out of order with respect to the source. This out-of-order-ness is intrinsic to the copying algorithm. Getting rid of it would force sequential operation, which is expected to have a large performance impact... both in synchronization primitives among the subscribers, and in the reduction of operations that can proceed in parallel.

Note that, between the posting layer and the subscribing layer, there is the winnow layer. The winnow layer delays copies for 30 seconds, and squashes multiple changes to the same file so that only a single update is produced (if the file is re-written 10 times in 30 seconds, a post will be published for that file 30 seconds after the last write.) This is to de-noise the copies... so extremely transient files are not copied.
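For illustration only (this is not the actual winnow code), the delay-and-squash idea amounts to keeping just the most recent event per path and releasing it once the path has been quiet for the delay window:

import time

# sketch of delay-and-squash: keep only the latest event per path, release it
# once the path has been quiet for `delay` seconds.
class Squasher:
    def __init__(self, delay=30):
        self.delay = delay
        self.pending = {}      # path -> (latest_event, time_of_last_change)

    def ingest(self, event):
        self.pending[event['path']] = (event, time.time())

    def release(self):
        now = time.time()
        for path in [p for p, (_, t) in self.pending.items() if now - t >= self.delay]:
            event, _ = self.pending.pop(path)
            yield event        # one post per path, `delay` seconds after the last write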

So we have a script. It does:


rm X
ln -sf y X

But the mirroring logic can re-arrange things so that the ln arrives and is executed before the rm. So instead of X ending up as a link to y, X is just gone.
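For illustration, the same reordering reproduced with plain Python file operations (assuming the link is applied with "-f" semantics, removing the destination first):

import os
import tempfile

# illustrative reproduction of the reordering described above.
workdir = tempfile.mkdtemp()
os.chdir(workdir)

with open('y', 'w') as f:
    f.write('new target\n')
with open('X', 'w') as f:
    f.write('old contents\n')

# the "ln -sf y X" event happens to be processed first:
if os.path.lexists('X'):
    os.unlink('X')            # the -f part: remove whatever is already there
os.symlink('y', 'X')

# ... and then the delayed "rm X" event is applied:
os.unlink('X')

print(os.path.lexists('X'))   # False: X is simply gone, instead of pointing at y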

petersilva commented 3 weeks ago

one approach:

OK.. so in the first example, the rm is useless... make a new example (assume X starts out as a directory):


mv X Z
ln -sf y X

Assuming normal Linux rules:

petersilva commented 3 weeks ago

looking at the logic in the subscriber... it seems like it is always -f... it removes the existing file before creating the link.

https://github.com/MetPX/sarracenia/blob/4a9f6424eaee6ae50f6343bf6a7643a615c3740a/sarracenia/flow/__init__.py#L1638-L1650

and interestingly it looks like it used to do that even for directories, but that got commented out... hmm...
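Paraphrasing what that block appears to do (a sketch, not the actual sarracenia code): applying a 'link' operation removes any existing destination first, i.e. the equivalent of "ln -sf".

import os

# sketch of the described behaviour: force-link, removing any existing destination
def apply_link(target, link_path):
    if os.path.islink(link_path) or os.path.isfile(link_path):
        os.unlink(link_path)
    # the directory case is the part that appears to be commented out upstream:
    # elif os.path.isdir(link_path):
    #     shutil.rmtree(link_path)
    os.symlink(target, link_path)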

petersilva commented 3 weeks ago

going back to the use cases... how the heck is the ln supposed to know that an mv is coming? I guess we could adopt semantics in the messages:

petersilva commented 3 weeks ago

things that come to mind:

option 1: fuse directory rename with link creation into single event.

option 2: use pubtimes to refuse to apply a "remove" too late.
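A rough sketch of what option 2 could look like as a flow callback. The field names (fileOp, pubTime, new_dir, new_file) and the timestr2flt helper are assumptions to be verified against the installed version:

import logging
import os

from sarracenia import timestr2flt     # assumed helper for pubTime -> epoch seconds
from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Reject_late_remove(FlowCB):
    """
       sketch of option 2: drop a 'remove' whose publication time is older
       than the local file's mtime, on the theory that a newer operation has
       already replaced the file. field names are assumptions, not verified.
    """
    def after_accept(self, worklist):
        keep = []
        for m in worklist.incoming:
            if 'fileOp' in m and 'remove' in m['fileOp']:
                local = os.path.join(m['new_dir'], m['new_file'])
                if os.path.lexists(local) and os.lstat(local).st_mtime > timestr2flt(m['pubTime']):
                    logger.info('skipping late remove of %s' % local)
                    worklist.rejected.append(m)
                    continue
            keep.append(m)
        worklist.incoming = keep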

petersilva commented 2 weeks ago

Another perceived ordering constraint:

Create Directory, then Write Files in it: Not a Problem.

One would expect that if the operations are received in the wrong order, the write would fail because the directory x does not exist. But that's not what happens. The directory does get created when the file within it needs to be written. On the other hand, the automated directory creation does not know about special directory permissions (say 711) and so the directory would be created according to some default rules, rather than reflecting the source.

If the mkdir event arrives first, then the permissions are set correctly immediately. If the mkdir event arrives after the file, then when that event is processed, the permissions will be corrected. This leaves a window of time during which the permissions may deviate from what is expected.
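In other words (illustrative only), the subscriber side ends up doing roughly:

import os

# file event arrives first: the missing directory is auto-created with
# default (umask-derived) permissions, not the source's special mode.
os.makedirs('/tmp/mirror/x', exist_ok=True)

# ... later, the mkdir event for x is processed and corrects the mode to
# match the source (say 711), closing the window where permissions deviate.
os.chmod('/tmp/mirror/x', 0o711)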

The impact of this sort of race condition looks minimal.

petersilva commented 2 weeks ago

Transitive Copies: Not a Problem.

interpretations:

These sorts of questions are resolved by waiting a certain amount of time (in the HPC mirror, 30 seconds) and publishing the net result once things have quieted down. So 30 seconds later, the content of a, b, and c is published (which will all be hello) and the result on the mirror should be identical.

After publication, there is a winnowing layer which collapses multiple i/os into a single net result...

petersilva commented 2 weeks ago

One way of getting back ordering:

This approach:

petersilva commented 2 weeks ago

If someone wants the job-id sent in messages, they should be able to add:

header JOBID=${PBS_JOBID}

but looking at the code... I don't think variable substitutions are done on the header values. hmm...
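If substitution really is missing on the Python side, a small callback could be a work-around in the meantime (a sketch; treating the header as a top-level message field is an assumption about the message layout):

import logging
import os

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Add_jobid_header(FlowCB):
    """
       work-around sketch: set the JOBID header from the environment on the
       posting side, instead of relying on ${PBS_JOBID} expansion in the config.
    """
    def after_accept(self, worklist):
        jobid = os.environ.get('PBS_JOBID')
        if not jobid:
            return
        for m in worklist.incoming:
            m['JOBID'] = jobid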

petersilva commented 2 weeks ago

tested it on the C side, and it works there... header home=${HOME} was evaluated properly.

petersilva commented 2 weeks ago

Method for sequencing operations from a single job through a single subscriber.

Each subscriber binds to one of the 40 exchanges.

the needed plugin:


import logging

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Exchange_selected_by_jobid(FlowCB):
    """
       pick the output exchange based on a hash of the job-id, if available.
    """
    def after_accept(self, worklist):
        for m in worklist.incoming:
            if 'JOBID' in m and self.o.post_exchangeSplit:
                # route all operations from one job to the same output exchange
                m['exchangeSplitOverride'] = sum(bytearray(m['JOBID'], 'utf-8')) % self.o.post_exchangeSplit
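
To use it, something like this would presumably go in the posting/winnow config (the callback/module naming is an assumption; post_exchangeSplit has to match the number of output exchanges):

callback exchange_selected_by_jobid
post_exchangeSplit 40
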
petersilva commented 1 week ago

race_conditions.py.txt

Another approach would be to have a separate callback filter that looks at the entire flow to identify hard ordering dependencies... this would mean no change to ops configs, just an auditing subscription. Is this possible?

subscribe to output of current mirror, with a shovel/subscriber like so:

broker amqps://user@broker
exchange xs_user_public

download no
batch 100

logEvents on_housekeeping

callback tally_volume

It will write out, every 5 minutes, a report of files that had multiple operations done on them within that window... Will investigate with the client to see if this helps with auditing.
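For reference, the rough shape of such a callback (illustrative; this is not the attached race_conditions.py, and relPath as the key is an assumption):

import logging

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Tally_operations(FlowCB):
    """
       sketch of an auditing callback: count how many operations touch each
       path between housekeeping runs, and report the paths touched more than
       once (candidates for hard ordering dependencies).
    """
    def __init__(self, options):
        super().__init__(options)
        self.counts = {}

    def after_accept(self, worklist):
        for m in worklist.incoming:
            self.counts[m['relPath']] = self.counts.get(m['relPath'], 0) + 1

    def on_housekeeping(self):
        for path, n in sorted(self.counts.items()):
            if n > 1:
                logger.info('%d operations on %s since last report' % (n, path))
        self.counts = {}
        return True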

petersilva commented 1 week ago

For the job-oriented sequencing method (not auditing), it would be a heck of a lot simpler to specify if exchangeSplit (https://github.com/MetPX/sarracenia/issues/624) were implemented. Then the subscriber would have lines like:

instances 40
exchangeSplit 40

and it would declare 40 queues, each subscribing to one exchange (matching the post_exchangeSplit from the winnow). Without that, we need to define 40 subscribers, each with different queue and exchange settings.

racetted commented 1 week ago

Wouldn't we still get a race condition in a set-up with the statehost option + N nodes, since we would still end up with N subscriber instances bound to one queue per exchange?

petersilva commented 1 week ago

Yes... the nodes need to be taken care of also. I thought all the subscribers were on one node. Is that not the case?

racetted commented 1 week ago

There are a few nodes in the current set-up, yes.