one approach:
OK.. so in the first example, the rm is useless... make a new example (assume X starts out as a directory):
mv X Z
ln -sf y X
Assuming normal Linux rules:
looking at the logic in the subscriber... it seems like it is always -f... it removes the existing file before creating the link.
and interestingly it looks like it used to do that even for directories, but that got commented out... hmm...
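As a rough illustration (a hypothetical helper, not the actual subscriber code), the behaviour described above amounts to something like:

import os

def apply_symlink(link_target, path):
    # behaves like ln -sf: remove any existing file or symlink first.
    if os.path.islink(path) or os.path.isfile(path):
        os.unlink(path)
    # removal of an existing directory is apparently commented out in the
    # real code, so a directory already at `path` would make symlink() fail.
    os.symlink(link_target, path)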
going back to the use cases... how the heck is the ln supposed to know that an mv is coming? I guess we could adopt semantics in the messages:
things that come to mind:
Another perceived ordering constraint:
One would expect that if the operations are received in the wrong order, the write would fail because the directory x does not exist. But that's not what happens: the directory gets created when the file within it needs to be written. On the other hand, the automated directory creation does not know about special directory permissions (say 711), so the directory is created according to some default rules rather than reflecting the source.
If the mkdir event arrives first, then the permissions are set correctly immediately. If the mkdir event arrives after the file, then the permissions will be corrected when that event is processed. This means there is a window during which the permissions may deviate from what is expected.
The impact of this sort of race condition looks minimal.
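To make the two orderings concrete, here is a minimal sketch (hypothetical helper names, not the actual subscriber code):

import os

def write_file(path, data):
    # file event processed first: missing parents get created implicitly,
    # with default permissions rather than the source's (e.g. 711).
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'wb') as f:
        f.write(data)

def process_mkdir_event(path, mode):
    # mkdir event processed (possibly late): create if needed, then
    # correct the permissions to match the source.
    os.makedirs(path, exist_ok=True)
    os.chmod(path, mode)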
interpretations:
These sorts of questions are resolved by waiting a certain amount of time (in the HPC mirror, 30 seconds) and publishing the net result once things have quieted down. So 30 seconds later, publish the content of a, b, and c (which will all be hello), and the result on the mirror should be identical.
After publication, there is a winnowing layer which collapses multiple i/os into a single net result...
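A toy sketch of that winnowing idea (assumed shape, not the real winnow code):

import time

QUIET = 30    # seconds of quiet before publishing, as in the HPC mirror

pending = {}  # path -> (latest event, time of last change)

def on_event(event):
    # later events on the same path simply overwrite earlier ones
    pending[event['relPath']] = (event, time.time())

def flush(publish):
    # publish the net result for any path that has been quiet long enough
    now = time.time()
    for path, (event, t) in list(pending.items()):
        if now - t >= QUIET:
            publish(event)
            del pending[path]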
One way of getting back ordering:
This approach:
if someone wants to have a job-id sent in messages, they should be able to add:
header JOBID=${PBS_JOBID}
but looking at the code... I don't think variable substitutions are done on the header values. hmm...
tested it on the C side, and it works there... header home=${HOME} was evaluated properly.
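Until that works on the Python side, a workaround might be a small callback that expands environment references itself. A hypothetical sketch (ExpandHeaders is not an existing class, and it assumes headers show up as string fields on the message):

import os

from sarracenia.flowcb import FlowCB

class ExpandHeaders(FlowCB):
    """
    hypothetical: expand ${VAR} environment references in string message
    fields, mimicking what the C implementation already does.
    """
    def after_accept(self, worklist):
        for m in worklist.incoming:
            for k, v in m.items():
                if isinstance(v, str) and '${' in v:
                    m[k] = os.path.expandvars(v)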
Each subscriber binds to one of the 40 exchanges.
the needed plugin:
import logging

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Exchange_selected_by_jobid(FlowCB):
    """
    pick the output exchange based on a hash of the job-id, if available.
    """
    def after_accept(self, worklist):
        for m in worklist.incoming:
            if 'JOBID' in m and self.o.post_exchangeSplit:
                # hash the job-id into one of the post_exchangeSplit exchanges
                m['exchangeSplitOverride'] = sum(bytearray(m['JOBID'], 'utf-8')) % self.o.post_exchangeSplit
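Assuming the usual callback loading convention (a module exchange_selected_by_jobid.py containing the class Exchange_selected_by_jobid), the winnow/poster config would then carry something like:

post_exchangeSplit 40
callback exchange_selected_by_jobid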
race_conditions.py.txt
Another approach would be to have a separate callback filter that would look at the entire flow to identify hard ordering dependencies... this would mean no change to ops configs, just an auditing subscription. Is this possible?
subscribe to output of current mirror, with a shovel/subscriber like so:
broker amqps://user@broker
exchange xs_user_public
download no
batch 100
logEvents on_housekeeping
callback tally_volume
It will write out a report every 5 minutes listing files that had multiple operations done on them within 5 minutes... Will investigate with the client to see if this helps with auditing.
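For reference, a rough sketch of what such a tally_volume callback could look like (an assumed implementation, using the on_housekeeping entry point enabled above; the real one may differ):

import logging

from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Tally_volume(FlowCB):
    """
    assumed sketch: count operations per file, and on each housekeeping
    pass (every 5 minutes by default) report files that saw more than
    one operation in the interval.
    """
    def __init__(self, options):
        super().__init__(options)
        self.seen = {}

    def after_accept(self, worklist):
        for m in worklist.incoming:
            self.seen[m['relPath']] = self.seen.get(m['relPath'], 0) + 1

    def on_housekeeping(self):
        for path, count in self.seen.items():
            if count > 1:
                logger.info(f'{path}: {count} operations this interval')
        self.seen = {}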
for the job-oriented sequencing method (not auditing), it would be a heck of a lot simpler to specify if exchangeSplit were implemented (https://github.com/MetPX/sarracenia/issues/624). Then the subscriber would have lines like:
instance 40
exchangeSplit 40
and it would declare 40 queues, each subscribing to one exchange (matching the post_exchangeSplit from the winnow). Without that, we need to define 40 subscribers, each with different queue and exchange settings.
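Without exchangeSplit, each of those 40 configs would look something like this (names purely illustrative, assuming the split suffixes the exchange name):

# hypothetical subscriber config, one of 40:
instance 1
exchange xs_user_public07
queueName q_user.mirror.07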
Wouldn't we still get a race condition in a set-up with the statehost option + N nodes, since we would still end up with N subscriber instances bound to one queue per exchange?
yes... the nodes need to be taken care of also. I thought all the subscribers are on one node. Is that not the case?
There are a few nodes in the current set-up, yes.
It is a clear limitation of the entire method of mirroring that the order of operations is not strictly maintained. Files are posted, mostly in the order the changes are created, then the events are queued. There are subscribers that pick from the queue, and in the case of HPC mirroring, the queue is shared among 40 subscribers... so any event can be in the queue for any of the subscribers.
Since different subscribers can pick up different events, and the subscribers are asynchronous to each other, the events can be executed on the subscriber side out of order with respect to the source. This out-of-orderness is intrinsic to the copying algorithm. Getting rid of it would force sequential operations, which is expected to have a large performance impact... both in synchronization primitives among the subscribers, and in the reduction of operations that can proceed in parallel.
Note that, between the posting layer and the subscribing layer, there is the winnow layer. The winnow layer delays copies for 30 seconds, and squashes multiple changes to the same file so that only a single update is produced (if the file is re-written 10 times in 30 seconds, a post will be published for that file 30 seconds after the last write). This is to de-noise the copies... so extremely transient files are not copied.
So we have a script. It does:
but the mirroring logic can re-arrange things so that the ln arrives and is executed before the rm. So instead of getting the new contents of X linked to y, X is just gone.