facebookarchive / rocks-strata


How does the example cron based backup job work? #20

Open asdf01 opened 8 years ago

asdf01 commented 8 years ago

Hey Guys

Neat project. So many awe-inspiring open source projects on the internet.

I got one question on the example cron based backup job detailed here: https://github.com/facebookgo/rocks-strata/blob/master/examples/backup/run.sh

How would this backup job work? I'm not a Unix expert, but according to the start-stop-daemon documentation I found, it prevents a call from starting a new process if a process with the same name is already running. Is that right?

The way I understand it, because it's running the process as a daemon, wouldn't the call to "backup" return immediately, and wouldn't run.sh then attempt to run the second line ("delete") straight afterwards while "backup" is still running?

So does that mean if "backup" takes more than a few milliseconds to run, "delete" would never actually get run?

I am only questioning the use of start-stop-daemon because I need to find an alternative: the tool isn't available on Amazon Linux. To achieve the purpose of avoiding multiple concurrent writes to the same file in S3, wouldn't it be fine to run the 3 strata calls sequentially, one after the other, in run.sh? When I tested the strata calls manually, they ran in the foreground and didn't return control to the bash console until they finished. Would this work?

Thanks

AGFeldman commented 8 years ago

You're right that manually calling strata will run in the foreground and not return control to the bash console until it is finished.

The concern is that for a given replica ID, only one write-capable strata operation should run at once. Suppose you put 3 sequential strata calls in a bash script, and then execute that script once every two hours. What if the script takes longer than 2 hours to finish? Then your first invocation would still be running when your second invocation starts, and you would risk running two write-capable strata operations at once.

This is the problem that I tried to address in the example with start-stop-daemon. But you might be right about start-stop-daemon's behavior; I'm not very familiar with it, and we do not actually use it to run strata.

tredman commented 8 years ago

At Parse we actually ended up using a python wrapper to chain the commands together, with a file lock to protect against multiple backups running at once. Here is the code for reference, maybe it can help you.

def _do_strata_backup(replicaset, hostname, bucket_name="mybucket",
            region="us-east-1", bucket_prefix="mongo-rocks"):
    # get flock
    if not helper.flock(BACKUP_FLOCK_PATH):
        error("Failed to obtain lock on %s. Is there another backup running?"
                % BACKUP_FLOCK_PATH)

        return 1

    # kick off strata
    # backup
    strata_cmd = "/usr/bin/strata " \
            "--bucket=" + bucket_name + " " \
            "--region=" + region + " " \
            "--bucket-prefix=" + bucket_prefix + " " \
            "backup " \
            "--replica-id=" + replicaset + "_" + hostname
    return_code = helper.run_shell_cmd(strata_cmd)

    # only do metadata cleanup if the backup succeeded
    if return_code == 0:
        delete_cmd = "/usr/bin/strata " \
            "--bucket=" + bucket_name + " " \
            "--region=" + region + " " \
            "--bucket-prefix=" + bucket_prefix + " " \
            "delete " \
            "--replica-id=" + replicaset + "_" + hostname + " " \
            "--age=" + ROCKS_BACKUP_RETENTION
        delete_code = helper.run_shell_cmd(delete_cmd)

        # only run gc if metadata delete succeeded
        if delete_code == 0:
            gc_cmd = "/usr/bin/strata " \
                "--bucket=" + bucket_name + " " \
                "--region=" + region + " " \
                "--bucket-prefix=" + bucket_prefix + " " \
                "gc " \
                "--replica-id=" + replicaset + "_" + hostname + " "
            gc_code = helper.run_shell_cmd(gc_cmd)
            if gc_code != 0:
                warn("Got error code %d when running gc" % gc_code)
        else:
            warn("Got error code %d when running metadata cleanup" %
                    delete_code)

    # clear flock
    if not helper.unflock(BACKUP_FLOCK_PATH):
        warn("Failed to cleanly release lock on %s" % BACKUP_FLOCK_PATH)

    return return_code

run_shell_cmd is just a wrapper around subprocess.Popen:

import shlex
import subprocess

def run_shell_cmd(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT):
    args = shlex.split(command)
    proc = subprocess.Popen(args, stdout=stdout, stderr=stderr)

    while proc.poll() is None:
        cmd_out = proc.stdout.readline()
        info(cmd_out.rstrip())

    return proc.returncode

and flock:

import fcntl
import os

def flock(path):
    try:
        # Leave the descriptor open on purpose: the lock is held for the
        # life of the process unless it is explicitly released.
        f = os.open(path, os.O_CREAT)
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        return False

    return True
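For anyone who would rather avoid Python, the same non-blocking lock semantics are available from the command line via flock(1) (part of util-linux). A minimal sketch — the lock path here is made up for the demo, not anything from the project:

```shell
# Try to take an exclusive, non-blocking lock, mirroring the Python
# flock() helper: succeed immediately or report that the lock is held.
lockfile="${TMPDIR:-/tmp}/strata-backup-demo.lock"

# Open the lock file on file descriptor 9, then attempt the lock.
exec 9>"$lockfile"
if flock -n 9; then
    result="lock acquired"
else
    result="another process holds the lock"
fi
echo "$result"
```

The lock is released automatically when the process (and thus fd 9) goes away, so there is no stale-lock cleanup to worry about.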

igorcanadi commented 8 years ago

@AGFeldman @tredman glad to have both of you back! :)

asdf01 commented 8 years ago

Hey @AGFeldman @tredman

Thanks for the quick response.

What if the script takes longer than 2 hours to finish

Yes, I see why start-stop-daemon was used now. Unfortunately it is not available for RHEL 6 / Amazon Linux. I did find some start-stop-daemon RPMs people have built for Red Hat, but they're black boxes and I can't be completely sure what they do. I have a picture in my head of me living under a bridge and having to explain to my children how I lost my job because I downloaded something off the internet and put it on a production database server.

The at command seems to have a queue feature. But when I tested it out, it didn't behave anything like a queue. False advertising, hours wasted.

Here is the code for reference

Thanks for supplying me with your python code. I don't want to sound ungrateful but I probably won't use it because it is yet another language for me to learn.

At the moment I think the best way might be to write a bash script that creates a PID file, and cron that script at regular intervals. If the process from the last scheduled run is still running, the current one will simply exit, similar to how the start-stop-daemon based script was intended to work.
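That PID-file approach can be sketched roughly like this. The paths and the placeholder command are assumptions, not the project's run.sh; in a real cron job the guarded command would be the sequential strata backup / delete / gc calls shown in the Python code above, and the PID file would live at a fixed path such as /var/run/strata-backup.pid:

```shell
#!/bin/sh
# Sketch of a PID-file guard for a cron'd backup script.
# mktemp -u gives a unique name so this demo never collides; a real
# deployment needs a fixed path shared by every cron invocation.
PIDFILE="$(mktemp -u)"

run_backup_once() {
    # If the previous run's PID file exists and that process is still
    # alive, skip this run rather than start a second concurrent backup.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo "previous backup still running; skipping"
        return 0
    fi
    echo $$ > "$PIDFILE"
    # "$@" stands in for the real work, i.e. the sequential
    # strata backup && strata delete && strata gc chain.
    "$@"
    status=$?
    rm -f "$PIDFILE"
    return $status
}

run_backup_once echo "backup, delete, and gc would run here"
```

If flock(1) is available (it ships with util-linux, so it is usually present even where start-stop-daemon is not), `flock -n /var/run/strata-backup.lock /path/to/backup.sh` gives the same guarantee with less code and without the stale-PID-file edge case (a leftover file whose PID has been recycled by an unrelated process).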

Thanks again for openly sharing all your wisdom. Without people like you guys, there would probably be no internet and people like me probably wouldn't have jobs. Thanks.

thapakazi commented 6 years ago

there there... @asdf01 thanks to you too, you spoke my head clear :neckbeard: