MetPX / sarrac

C implementation of (a subset of) Sarracenia (large scale file transfer utility)
GNU General Public License v2.0

hard ordering constraints in scripts in mirrors. #174

Open petersilva opened 3 weeks ago

petersilva commented 3 weeks ago

It is a clear limitation of the entire method of mirroring that the order of operations is not strictly maintained. Files are posted, mostly in the order the changes are created, and then the events are queued. There are subscribers that pick from the queue and, in the case of HPC mirroring, the queue is shared among 40 subscribers... so any event can be picked up by any of the subscribers.

Since different subscribers can pick up different events, and the subscribers are asynchronous to each other, the events can be executed on the subscriber side out of order with respect to the source. This out-of-order-ness is intrinsic to the copying algorithm. Getting rid of it would force sequential operation, which is expected to have a large performance impact... both in synchronization primitives among the subscribers, and in the reduction of operations that can proceed in parallel.

Note that, between the posting layer and the subscribing layer, there is the winnow layer. The winnow layer delays copies for 30 seconds, and squashes multiple changes to the same file so that only a single update is produced (if the file is re-written 10 times in 30 seconds, a post will be published for that file 30 seconds after the last write.) This is to de-noise the copies... so extremely transient files are not copied.
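For illustration only (this is not the actual winnow code), the delay-and-squash idea amounts to keeping just the most recent event per path and releasing it once the path has been quiet for the delay window:

import time

# sketch of delay-and-squash: keep only the latest event per path, release it
# once the path has been quiet for `delay` seconds.
class Squasher:
    def __init__(self, delay=30):
        self.delay = delay
        self.pending = {}      # path -> (latest_event, time_of_last_change)

    def ingest(self, event):
        self.pending[event['path']] = (event, time.time())

    def release(self):
        now = time.time()
        for path in [p for p, (_, t) in self.pending.items() if now - t >= self.delay]:
            event, _ = self.pending.pop(path)
            yield event        # one post per path, `delay` seconds after the last write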

So we have a script. It does:


rm X
ln -sf y X

But the mirroring logic can re-arrange things so that the ln arrives and is executed before the rm. So instead of X ending up as a link to y, X is just gone.
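For illustration, the same reordering reproduced with plain Python file operations (assuming the link is applied with "-f" semantics, removing the destination first):

import os
import tempfile

# illustrative reproduction of the reordering described above.
workdir = tempfile.mkdtemp()
os.chdir(workdir)

with open('y', 'w') as f:
    f.write('new target\n')
with open('X', 'w') as f:
    f.write('old contents\n')

# the "ln -sf y X" event happens to be processed first:
if os.path.lexists('X'):
    os.unlink('X')            # the -f part: remove whatever is already there
os.symlink('y', 'X')

# ... and then the delayed "rm X" event is applied:
os.unlink('X')

print(os.path.lexists('X'))   # False: X is simply gone, instead of pointing at y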

petersilva commented 3 weeks ago

one approach:

OK.. so in the first example, the rm is useless... make a new example (assume X starts out as a directory):


mv X Z
ln -sf y X

Assuming normal Linux rules:

petersilva commented 3 weeks ago

looking at the logic in the subscriber... it seems like it is always -f... it removes the existing file before creating the link.

https://github.com/MetPX/sarracenia/blob/4a9f6424eaee6ae50f6343bf6a7643a615c3740a/sarracenia/flow/__init__.py#L1638-L1650

and interestingly it looks like it used to do that even for directories, but that got commented out... hmm...
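Paraphrasing what that block appears to do (a sketch, not the actual sarracenia code): applying a 'link' operation removes any existing destination first, i.e. the equivalent of "ln -sf".

import os

# sketch of the described behaviour: force-link, removing any existing destination
def apply_link(target, link_path):
    if os.path.islink(link_path) or os.path.isfile(link_path):
        os.unlink(link_path)
    # the directory case is the part that appears to be commented out upstream:
    # elif os.path.isdir(link_path):
    #     shutil.rmtree(link_path)
    os.symlink(target, link_path)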

petersilva commented 3 weeks ago

going back to the use cases... how the heck is the ln supposed to know that an mv is coming? I guess we could adopt semantics in the messages:

petersilva commented 3 weeks ago

things that come to mind:

option 1: fuse directory rename with link creation into single event.

option 2: use pubtimes to refuse to apply a "remove" too late.
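A rough sketch of what option 2 could look like as a flow callback. The field names (fileOp, pubTime, new_dir, new_file) and the timestr2flt helper are assumptions to be verified against the installed version:

import logging
import os

from sarracenia import timestr2flt     # assumed helper for pubTime -> epoch seconds
from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Reject_late_remove(FlowCB):
    """
       sketch of option 2: drop a 'remove' whose publication time is older
       than the local file's mtime, on the theory that a newer operation has
       already replaced the file. field names are assumptions, not verified.
    """
    def after_accept(self, worklist):
        keep = []
        for m in worklist.incoming:
            if 'fileOp' in m and 'remove' in m['fileOp']:
                local = os.path.join(m['new_dir'], m['new_file'])
                if os.path.lexists(local) and os.lstat(local).st_mtime > timestr2flt(m['pubTime']):
                    logger.info('skipping late remove of %s' % local)
                    worklist.rejected.append(m)
                    continue
            keep.append(m)
        worklist.incoming = keep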

petersilva commented 2 weeks ago

Another perceived ordering constraint:

Create Directory, then Write Files in it: Not a Problem.

One would expect that if the operations are received in the wrong order, the write would fail because the directory x does not exist. But that's not what happens. The directory does get created when the file within it needs to be written. On the other hand, the automated directory creation does not know about special directory permissions (say 711) and so the directory would be created according to some default rules, rather than reflecting the source.

If the mkdir event arrives first, then the permissions are set correctly immediately. If the mkdir event arrives after the file, then when that event is processed, the permissions will be corrected. This leaves a window of time during which the permissions may deviate from what is expected.
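In other words (illustrative only), the subscriber side ends up doing roughly:

import os

# file event arrives first: the missing directory is auto-created with
# default (umask-derived) permissions, not the source's special mode.
os.makedirs('/tmp/mirror/x', exist_ok=True)

# ... later, the mkdir event for x is processed and corrects the mode to
# match the source (say 711), closing the window where permissions deviate.
os.chmod('/tmp/mirror/x', 0o711)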

The impact of this sort of race condition looks minimal.

petersilva commented 2 weeks ago

Transitive Copies: Not a Problem.

interpretations:

These sorts of questions are resolved by waiting a certain amount of time (in the HPC mirror, 30 seconds) and publishing the net result once things have quieted down. So 30 seconds later, the content of a, b, and c is published (which will all be hello) and the result on the mirror should be identical.

After publication, there is a winnowing layer which collapses multiple i/os into a single net result...

petersilva commented 2 weeks ago

One way of getting back ordering:

This approach:

petersilva commented 2 weeks ago

If someone wants the job-id sent in messages, they should be able to add:

header JOBID=${PBS_JOBID}

but looking at the code... I don't think variable substitutions are done on the header values. hmm...
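If substitution really is missing on the Python side, a small callback could be a work-around in the meantime (a sketch; treating the header as a top-level message field is an assumption about the message layout):

import logging
import os

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Add_jobid_header(FlowCB):
    """
       work-around sketch: set the JOBID header from the environment on the
       posting side, instead of relying on ${PBS_JOBID} expansion in the config.
    """
    def after_accept(self, worklist):
        jobid = os.environ.get('PBS_JOBID')
        if not jobid:
            return
        for m in worklist.incoming:
            m['JOBID'] = jobid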

petersilva commented 2 weeks ago

tested it on the C side, and it works there... header home=${HOME} was evaluated properly.

petersilva commented 2 weeks ago

Method for sequencing operations from a single job through a single subscriber.

Each subscriber binds to one of the 40 exchanges.

the needed plugin:


import logging

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Exchange_selected_by_jobid(FlowCB):
    """
       pick the output exchange based on a hash of the job-id, if available.
    """
    def after_accept(self, worklist):
        for m in worklist.incoming:
            if 'JOBID' in m and self.o.post_exchangeSplit:
                # route all operations from one job to the same output exchange
                m['exchangeSplitOverride'] = sum(bytearray(m['JOBID'], 'utf-8')) % self.o.post_exchangeSplit
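
To use it, something like this would presumably go in the posting/winnow config (the callback/module naming is an assumption; post_exchangeSplit has to match the number of output exchanges):

callback exchange_selected_by_jobid
post_exchangeSplit 40
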
petersilva commented 1 week ago

race_conditions.py.txt

Another approach would be to have a separate callback filter that looks at the entire flow to identify hard ordering dependencies... this would mean no change to ops configs, just an auditing subscription. Is this possible?

subscribe to output of current mirror, with a shovel/subscriber like so:

broker amqps://user@broker
exchange xs_user_public

download no
batch 100

logEvents on_housekeeping

callback tally_volume

It will write out, every 5 minutes, a report of files that had multiple operations done on them within that window... Will investigate with the client to see if this helps with auditing.
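For reference, the rough shape of such a callback (illustrative; this is not the attached race_conditions.py, and relPath as the key is an assumption):

import logging

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Tally_operations(FlowCB):
    """
       sketch of an auditing callback: count how many operations touch each
       path between housekeeping runs, and report the paths touched more than
       once (candidates for hard ordering dependencies).
    """
    def __init__(self, options):
        super().__init__(options)
        self.counts = {}

    def after_accept(self, worklist):
        for m in worklist.incoming:
            self.counts[m['relPath']] = self.counts.get(m['relPath'], 0) + 1

    def on_housekeeping(self):
        for path, n in sorted(self.counts.items()):
            if n > 1:
                logger.info('%d operations on %s since last report' % (n, path))
        self.counts = {}
        return True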

petersilva commented 1 week ago

For the job-oriented sequencing method (not auditing), it would be a heck of a lot simpler to specify if exchangeSplit (https://github.com/MetPX/sarracenia/issues/624) were implemented. Then the subscriber would have lines like:

instances 40
exchangeSplit 40

and it would declare 40 queues, each subscribing to one exchange (matching the post_exchangeSplit from the winnow). Without that, we need to define 40 subscribers, each with different queue and exchange settings.

racetted commented 1 week ago

Wouldn't we still get a race condition in a set-up with the statehost option + N nodes, since we would still end up with N subscriber instances bound to one queue per exchange?

petersilva commented 1 week ago

Yes... the nodes need to be taken care of also. I thought all the subscribers were on one node. Is that not the case?

racetted commented 1 week ago

There are a few nodes in the current set-up, yes.