CovertLab / wcEcoli

Whole Cell Model of E. coli

Fireworks getting locked on large runs #202

Closed: tahorst closed this issue 6 years ago

tahorst commented 6 years ago

I've been trying to do some larger runs with Fireworks and noticed that, a lot of the time, a firework will get locked and prevent further execution. I think this is a recent issue, starting with the switch to Sherlock 2.0 or possibly with some updated packages. Trying to run 64 seeds with qlaunch, I was unable to complete the first generation for any simulation and got the following message at the end of most:

2018-07-11 00:55:21,494 INFO Task completed: {{wholecell.fireworks.firetasks.simulation.SimulationTask}} 
2018-07-11 01:00:29,188 INFO fw_id 1249 locked. Can't refresh!
2018-07-11 01:00:29,192 INFO Rocket finished

I think this happens when multiple fireworks try to access the database at the same time and then time out. The jobs never get marked as COMPLETED, so the next sim can't start. When running 8 seeds I saw the same behavior, but it was more sporadic, with some jobs able to complete before the database got locked. We might need to increase the delay between launching runs, but that would significantly increase the time it takes to run larger sets.

jmason42 commented 6 years ago

I've typically had no issue running hundreds of (admittedly short) jobs directly via Slurm commands. Do you think the issue is the remote database? In the short term we could just pay for a fast/stable database and use that to get our paper simulations executed.

tahorst commented 6 years ago

The issue is in this file /share/PI/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/launchpad.py:

class WFLock(object):
    """
    Lock a Workflow, i.e. for performing update operations
    Raises a LockedWorkflowError if the lock couldn't be acquired within expire_secs and kill==False.
    Calling functions are responsible for handling the error in order to avoid database inconsistencies.
    """

    def __init__(self, lp, fw_id, expire_secs=WFLOCK_EXPIRATION_SECS, kill=WFLOCK_EXPIRATION_KILL):
        """
        Args:
            lp (LaunchPad)
            fw_id (int): Firework id
            expire_secs (int): max waiting time in seconds.
            kill (bool): force lock acquisition or not
        """
        self.lp = lp
        self.fw_id = fw_id
        self.expire_secs = expire_secs
        self.kill = kill

    def __enter__(self):
        ctr = 0
        waiting_time = 0
        # acquire lock
        links_dict = self.lp.workflows.find_one_and_update({'nodes': self.fw_id,
                                                            'locked': {"$exists": False}},
                                                           {'$set': {'locked': True}})
        # could not acquire lock b/c WF is already locked for writing
        while not links_dict:
            ctr += 1
            time_incr = ctr/10.0+random.random()/100.0
            time.sleep(time_incr)  # wait a bit for lock to free up
            waiting_time += time_incr
            if waiting_time > self.expire_secs:  # too much time waiting, expire lock
                wf = self.lp.workflows.find_one({'nodes': self.fw_id})
                if not wf:
                    raise ValueError("Could not find workflow in database: {}".format(self.fw_id))
                if self.kill:  # force lock acquisition
                    self.lp.m_logger.warn('FORCIBLY ACQUIRING LOCK, WF: {}'.format(self.fw_id))
                    links_dict = self.lp.workflows.find_one_and_update({'nodes': self.fw_id},
                                                                       {'$set': {'locked': True}})
                else:  # throw error if we don't want to force lock acquisition
                    raise LockedWorkflowError("Could not get workflow - LOCKED: {}".format(self.fw_id))
            else:
                # retry lock
                links_dict = self.lp.workflows.find_one_and_update(
                    {'nodes': self.fw_id, 'locked': {"$exists": False}}, {'$set': {'locked': True}})

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.lp.workflows.find_one_and_update({"nodes": self.fw_id}, {"$unset": {"locked": True}})

It seems the find_one_and_update call (coming from a MongoClient object) is failing to get back a workflow document that isn't locked. I think the DB takes some time to update, since it seemed to work when I inserted a breakpoint before the exception is raised.

1fish2 commented 6 years ago

This Fireworks intro says there's only one LaunchPad server that manages workflows, and that the workers communicate with it rather than with the DB. If LaunchPad is the only process that accesses the DB (other than manual queries and maybe the web interface), it really shouldn't have DB contention. More likely, 'locked' is ordinary data on the workflow document containing the firework node. This pymongo API seems to corroborate that.
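For reference, the flag is easy to inspect (or manually clear) with a plain pymongo query. This is just a sketch assuming the default LaunchPad database and collection names; adjust it for our actual LaunchPad config:

# Rough sketch, assuming the default LaunchPad layout (db 'fireworks',
# collection 'workflows') and a locally reachable MongoDB; adjust the
# host/credentials for our actual setup.
from pymongo import MongoClient

workflows = MongoClient('localhost', 27017)['fireworks']['workflows']
fw_id = 1249  # the firework reported as locked in the log above

# Is the workflow containing this firework currently marked locked?
print(workflows.find_one({'nodes': fw_id}, {'locked': 1, 'state': 1}))

# Manually clear a stale lock (the same update WFLock.__exit__ performs).
workflows.find_one_and_update({'nodes': fw_id}, {'$unset': {'locked': True}})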

Q. Does it help if you run lpad reset then start over?

(I've had to reset on a few occasions lately when running the workflow started by running analysis tasks. Maybe I used Ctrl-C to exit on the previous run.)

Q. Could it be deadlocking at a higher level by, say, trying to allocate tasks to more workers than available?

Q. Could the fw_id values be colliding?

tahorst commented 6 years ago

I tracked the issue to the completion of fireworks. When one completes, it locks the workflow in the database while it updates the status of all dependent fireworks. For us, this hangs (~40 seconds for 8 gens/64 seeds) on the cohort analysis task because it has so many dependencies to check through. See this line for the main time drain.

Adding a fail-fast check drops this down to < 1 second and should scale to any point in the sim because, given our firetask order, it will check the later sims first.

        # what are the parent states?
        for p in self.links.parent_links.get(fw_id, []):
            if self.id_fw[p].state not in completed_parent_states:
                m_state = 'WAITING'
                break
        else:  # not DEFUSED/ARCHIVED, and all parents are done running. Now the state depends on the launch status
            # my state depends on launch whose state has the highest 'score' in STATE_RANKS
            m_launch = self._get_representative_launch(fw)
            m_state = m_launch.state if m_launch else 'READY'
            m_action = m_launch.action if (m_launch and m_launch.state == "COMPLETED") else None

            # report any FIZZLED parents if _allow_fizzled_parents allows us to handle FIZZLED jobs
            if fw.spec.get('_allow_fizzled_parents') and "_fizzled_parents" not in fw.spec:
                parent_fws = [self.id_fw[p].to_dict() for p in self.links.parent_links.get(fw_id, [])
                              if self.id_fw[p].state == 'FIZZLED']
                if len(parent_fws) > 0:
                    fw.spec['_fizzled_parents'] = parent_fws
                    updated_ids.add(fw_id)

I can probably submit a PR to Fireworks with this fix, but we might also want to manually change it in our environment in the meantime.

tahorst commented 6 years ago

Alternatively, we could trim down the dependencies that we create (e.g. cohort only needs to depend on the last gen of each of the sims) or run without the analysis tasks.

jmason42 commented 6 years ago

So are cohort analysis tasks dependent on the completion of every simulation? The way fw_queue.py is written makes it hard to tell what the dependencies are. Reducing the dependency tree to the simplest equivalent representation seems like something Fireworks should be doing under the hood... Something to consider if we write our own workflow manager.

tahorst commented 6 years ago

These lines add the dependencies for each sim. We should be able to check if k == N_GENS-1 before adding the cohort and variant analysis dependencies.
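Roughly something like this (the names here are made up for illustration, not the actual variables in fw_queue.py):

# Illustrative sketch only -- names are placeholders, not the actual ones in
# fw_queue.py. Every generation keeps its per-sim analysis link, but only the
# final generation gates the cohort/variant analyses.
from collections import defaultdict

N_GENS = 8  # assumed setting

def build_links(sim_fw_ids, single_analysis_fw_id, cohort_fw_id, variant_fw_id):
    """sim_fw_ids: one fw_id per generation k = 0..N_GENS-1 for a single seed."""
    links = defaultdict(list)
    for k, sim_fw_id in enumerate(sim_fw_ids):
        links[sim_fw_id].append(single_analysis_fw_id)
        if k == N_GENS - 1:
            # only the last generation needs to block the aggregate analyses
            links[sim_fw_id].append(cohort_fw_id)
            links[sim_fw_id].append(variant_fw_id)
    return dict(links)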

1fish2 commented 6 years ago

I need to read up on Fireworks but we seem to be overloading its dependency manager.

This Q&A on the fireworkflows Google Group (you have to join the group to read it) asks about an extreme case, ~50,000 fireworks with no interdependencies, leading to lock expirations and lots of these messages:

2018-07-05 15:13:52,532 INFO fw_id 63526 locked. Can't refresh!

[Alas, web searches don't find error messages in closed Google Groups.]

The solution:

if there are no interdependencies between the Fireworks, then why use only a single workflow (instead of putting each FW in its own workflow)? The main reason to put multiple FWs into one workflow is to ensure that dependencies are executed correctly.

the lock is specific to a workflow. If you instead used 50,000 workflows, each with a single FW (instead of 1 workflow with 50,000 Fireworks) then you probably wouldn't run into the locking issue.
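In code, that pattern would look roughly like this (a sketch with a placeholder task argument, not our actual fw_queue.py setup):

# Sketch of the one-workflow-per-firework pattern from the Google Group answer.
# The SimulationTask arguments are placeholders, not the real ones.
from fireworks import Firework, LaunchPad, Workflow

from wholecell.fireworks.firetasks.simulation import SimulationTask

lpad = LaunchPad.auto_load()

for seed in range(64):
    fw = Firework(
        [SimulationTask(seed=seed)],  # placeholder arguments
        name='simulation_seed_{}'.format(seed))
    # Each add_wf call creates an independent single-node workflow, so a lock
    # on one workflow can't stall refreshes of the others.
    lpad.add_wf(Workflow([fw], name='sim_seed_{}'.format(seed)))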

An aggregate analysis thingie [whatever the right term] could just depend on an aggregate simulation thingie. In theory that means less parallelization but in practice that might not matter even if the dependency manager didn't trip over itself.

If we want to change the library in the short term, can we do it by monkey-patching its methods rather than forking the code?

I want to make fw_queue.py more modular. Pieces like the dependency network would then be easier to understand and change.

tahorst commented 6 years ago

Great to know about the Google Group! But yeah, at a certain point we're just going to have too many fireworks for it to update quickly enough. We could think about having a simulation workflow, then an analysis workflow, then a compression workflow. We might lose out on some parallel computation time, but it should reduce some of the dependencies, especially if one workflow could depend on another.

I can monkey-patch it. I'll test it a little more rigorously but I think the code above should work. Should we document the change somewhere in case we update fireworks sometime?

1fish2 commented 6 years ago

Yes, please do document the change. This can go on the wiki page about the runtime environment.

[When publishing the repo, we'll need to put that wiki page into a doc in the repo.]

I was thinking about a utility that monkey-patches Fireworks at runtime, e.g. the WFLock __enter__() method.
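As a sketch (the patched body and the helper name apply_fireworks_patches are placeholders, just to show the pattern; the same approach would apply to patching Workflow.refresh with the fail-fast parent check above):

# Sketch of a runtime monkey-patch utility. The wrapper body here only adds
# logging as a placeholder for whatever behavior change we settle on.
import functools

from fireworks.core.launchpad import WFLock

_original_enter = WFLock.__enter__

@functools.wraps(_original_enter)
def _patched_enter(self):
    # Extra logging (or a longer expiration, etc.) before falling through to
    # the stock implementation.
    self.lp.m_logger.info('WFLock: acquiring lock for fw_id {}'.format(self.fw_id))
    return _original_enter(self)

def apply_fireworks_patches():
    """Install our patches once at startup, before launching any rockets."""
    WFLock.__enter__ = _patched_enter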

It turns out that we no longer have any dependencies on cobra (a library for genome-scale modeling of metabolic networks). That lets us update to the latest Fireworks 1.7.5 and pip 10.0.1. I'll put that in the same PR as updating to matplotlib 2.2.2 and maybe some other updates as well. Anyway, updating Fireworks makes monkey-patching fragile.

tahorst commented 6 years ago

OK, I applied the fix to the wcEcoli2 pyenv fireworks file and documented it in the wiki. I think this should address the majority of the slow updates with fireworks. I'll see if I can submit a PR to them so hopefully it will be fixed in future versions, if we stick with fireworks.

Good to know we can remove cobra. Sounds like a good plan for updating.