Opentrons / opentrons

Software for writing protocols and running them on the Opentrons Flex and Opentrons OT-2
https://opentrons.com
Apache License 2.0
question: how to safely handle a keyboard interrupt in a protocol #11360

Open laura-wyzer opened 2 years ago

laura-wyzer commented 2 years ago

Overview

A program we wrote with the Opentrons API uses a try/except block so that it can stop, home the robot, and drop tips when the user hits Ctrl+C. This had been working for weeks, then all of a sudden it behaved unexpectedly when we interrupted the program while it was picking up a tip. After homing the robot, the pipette rapidly traversed the deck and jammed into the loaded labware, causing the tip to bend. The error below was produced:

SENDTOOPENTRONS

We understand that interrupting the program while it was picking up a tip could cause an error (even though we don't know how that would happen), but we are unsure why the pipette flew across the deck.

Steps to reproduce

  1. Create a program that uses the API to run a protocol and catches keyboard interruptions that occur while running the protocol.
  2. Cause a keyboard interruption right when the robot is picking up a tip

note: do not do this with the labware on the deck since this broke our pipette tip

Current behavior

  1. Keyboard interruptions that occur while a tip is being picked up cannot be handled
  2. When the interruption is not handled, the robot traverses the deck and ignores the loaded labware

Expected behavior

Upon any keyboard interruption at any point in the protocol, the robot should follow the defined steps in our try/except block (homing the robot, dropping the tips, ending the program)

Operating system

Windows

System and robot setup or anything else?

mcous commented 2 years ago

Are you able to post your protocol, or some minimal reproduction that shows how and where exactly you are using a try/except, and what's in your recovery block? If your try/except is overly broad, it can prevent proper operation of the protocol execution system. There's also a high chance it interferes with the hardware control layer's understanding of robot state (e.g. is there a tip on the pipette?)

laura-wyzer commented 2 years ago

Hi Mike, thank you for getting back to me! We have a nondisclosure policy that prevents us from posting our code. However, if there is a confidential way to send it to a member of Opentrons, I can certainly do that as long as it is not shared with a third party.

It usually seems like the robot maintains an understanding of whether or not there is a tip on the pipette: the recovery block accurately drops the tip if there is one attached (following code that we wrote), whereas it just homes the robot and ends as instructed if there are no tips on the pipette. There is nothing in our code that tells the robot to move back across the deck after homing (which is what happened when the tip broke), so I think it is more likely that our program is preventing proper operation of the protocol execution system.

Please let me know how best to share our code, and thank you for the help! Best, Laura Drepanos

mcous commented 2 years ago

You may email me directly at mike at opentrons dot com. However, a full protocol may be difficult for me to read or reason with. A small reproduction protocol would be much more helpful, if possible.

It would also be helpful simply to see how you are constructing your try/except block, without any of the contents of the try, e.g.

laura-wyzer commented 2 years ago

Hi Mike, I see what you are saying now! Here is the general structure of the program and the lines that may be important:

import traceback

import opentrons.execute
from opentrons import protocol_api

protocol = opentrons.execute.get_protocol_api(protocol_api.MAX_SUPPORTED_VERSION)

def run(protocol: protocol_api.ProtocolContext):
    # (protocol defined within the run function)
    ...

try:
    run(protocol)
    protocol.home()
except KeyboardInterrupt:
    print(" user stopped program!")
    protocol.home()
    for i in protocol.loaded_instruments.values():
        if i.has_tip:
            i.drop_tip()
except Exception as e:
    traceback.print_exc()
    protocol.home()
    for i in protocol.loaded_instruments.values():
        if i.has_tip:
            i.drop_tip()

So to answer your questions:

Best, Laura

mcous commented 2 years ago

Thanks for the snippet, that is helpful! I definitely think this is an unsafe construct. The fact that it worked for a while just means you got lucky with the timing of your Ctrl+C presses.

It's very important that the hardware is told to halt before continuing with any cleanup activities, like drop tip. In fact, this is exactly how protocols run via the app/HTTP API behave when you issue a cancel:

  1. Halt the hardware to stop all movement / prevent future movement
  2. Reset the hardware
  3. Proceed with any homing and drop tips

Without the halt, this exception handler as written may start interleaving requests to the hardware layer (which is running asynchronously in another thread), causing unexpected movements like the ones you observed.

I'm going to need to look into how best to do this in Jupyter / the command line, but it will likely be something along the lines of "move recovery to its own script / process so that everything can settle after a KeyboardInterrupt happens".
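One way to read "move recovery to its own script / process" is sketched below. It is only an illustration under stated assumptions: `run_with_recovery` and the two command lists are hypothetical names, not part of the Opentrons API, and the sketch does not halt the hardware itself. It only ensures that the recovery command starts after the protocol process has fully exited.

```python
import subprocess


def run_with_recovery(protocol_cmd, recovery_cmd, settle_timeout=10.0):
    """Run protocol_cmd as a child process.

    On Ctrl+C, stop the child, wait until it has exited, and only then
    run recovery_cmd in a second, separate process.
    Returns the protocol process's exit code.
    """
    child = subprocess.Popen(protocol_cmd)
    try:
        return child.wait()
    except KeyboardInterrupt:
        # A terminal Ctrl+C usually reaches the child as well, but ask it
        # to stop explicitly in case it did not.
        child.terminate()
        try:
            child.wait(timeout=settle_timeout)
        except subprocess.TimeoutExpired:
            # The child did not exit in time; force it and wait again.
            child.kill()
            child.wait()
        # No protocol process is alive at this point, so the recovery
        # script is the only thing issuing commands.
        subprocess.run(recovery_cmd, check=False)
        return child.returncode
```

A call might look like `run_with_recovery(["python3", "my_protocol.py"], ["python3", "recover.py"])`, where both script names are placeholders. Keeping the home and drop-tip logic in its own script means it never shares an interpreter with the interrupted protocol.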

laura-wyzer commented 2 years ago

Hi Mike, I see what you are saying, thanks for the help. I'm not sure how I would halt the hardware using python code since there isn't anything about this on the website, so that would be very helpful if you are able to find any options! Or if there are any alternative implementations for replicating the app's feature of homing the robot and dropping tips when the program is cancelled, I can look into those. Thank you, Laura

mcous commented 2 years ago

@laura-wyzer it's going to take me until Monday to really start testing this out, but in the meantime, do you know what happens to your protocol if you remove the try/except block and simply press Ctrl+C? Does the protocol halt, or does it continue executing?

laura-wyzer commented 2 years ago

Hi Mike, the protocol does halt if you remove the try/except block and press ctrl+c.

caroline-wyzer commented 2 years ago

Hi Mike, I'm a colleague of Laura's. Any updates on how this might work? We would definitely like to implement something other than just the ctrl+c, since we'll be running most of our programs through ssh anyways. I'm looking through the code on here now and trying to figure out what to replicate. Thanks!

croots commented 1 year ago

Hi,

I don't know if it's useful, but I wanted to pitch in and say that we are seeing a similar issue on our end that matches the behavior @mcous is describing. It seems like whenever a set of instructions is halted mid-action (pick up tip, drop tip) and another action is immediately queued, the robot will finish whatever it was doing after the next queued action.

This behavior persists across separate jupyter notebook blocks. If I interrupt the interpreter and immediately queue the 'recover and reset' block, the 'recover and reset' block will run and then the last behavior (ex 'pick up tip') that was interrupted will complete (ex 'home the pipette above where the tips were picked up').

All of this points to whatever handles robot actions on the back end being vulnerable to race conditions.

mcous commented 1 year ago

The existing protocol execution system for Python protocols is pretty fundamentally susceptible to race conditions if you're trying to reach in and cancel a protocol run from the same place you are triggering protocol commands. In other words, SSH and Jupyter.

We've been on a multi-year journey of rearchitecting this system, and we're making progress! JSON protocols have been moved to the new system, but Python protocols are still a ways out.

The Opentrons App, however, does not suffer these problems, because it communicates with the robot over an HTTP API. It uses this API to upload a protocol file and kick off a run. Starting, pausing, and stopping the run can all be accomplished with subsequent HTTP requests, and since they come in externally, they can safely and gracefully shut down the run.

If you are able to use HTTP instead of Jupyter or SSH, I think you might have a better experience. Is this something that your workflow would be amenable to? For example, you could write a Python script that runs from your own computer to upload the protocol, run it, and wait for it to complete. In that script, you could wire a KeyboardInterrupt to send an HTTP stop request.
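A minimal sketch of that wiring follows, assuming a run has already been created and its ID is known. The names `build_action_request`, `get_status`, and `run_until_done` are hypothetical helpers. The endpoint paths, the `Opentrons-Version` header value, port 31950, and the status strings reflect my understanding of the robot's HTTP API and should be checked against the API documentation served by your robot before relying on them.

```python
import json
import time
import urllib.request

# Assumption: the robot server accepts this header value; confirm it
# against your robot's HTTP API documentation.
HEADERS = {"Opentrons-Version": "*", "Content-Type": "application/json"}

# Assumption: these are the statuses that mean the run is over.
FINISHED = {"succeeded", "failed", "stopped"}


def build_action_request(base_url, run_id, action_type):
    """Build a POST to /runs/{run_id}/actions ("play", "pause" or "stop")."""
    body = json.dumps({"data": {"actionType": action_type}}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/runs/{run_id}/actions",
        data=body,
        headers=HEADERS,
        method="POST",
    )


def get_status(base_url, run_id, send=urllib.request.urlopen):
    """Fetch the run and return its current status string."""
    request = urllib.request.Request(f"{base_url}/runs/{run_id}", headers=HEADERS)
    with send(request) as response:
        return json.load(response)["data"]["status"]


def run_until_done(base_url, run_id, send=urllib.request.urlopen, poll_seconds=1.0):
    """Start the run, poll until it finishes, and send "stop" on Ctrl+C."""
    send(build_action_request(base_url, run_id, "play")).close()
    try:
        while True:
            status = get_status(base_url, run_id, send)
            if status in FINISHED:
                return status
            time.sleep(poll_seconds)
    except KeyboardInterrupt:
        # The stop request arrives from outside the protocol's own
        # process, so the robot can shut the run down itself.
        send(build_action_request(base_url, run_id, "stop")).close()
        return "stop-requested"
```

For example, `run_until_done("http://<robot-ip>:31950", run_id)` would block until the run ends or Ctrl+C is pressed. Pausing and resuming would use the same `build_action_request` helper with `"pause"` and `"play"`.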

I can add more details to this thread if you're interested!

hibazou commented 4 months ago

Hi! Could you please add more details to this thread? I already have a Python script that uploads my protocol and runs it. Can you give me guidelines for wiring a KeyboardInterrupt to send an HTTP stop/pause/continue request?

caroline-wyzer commented 4 months ago

Hi! The way I ended up solving this for my purposes was not using keyboard interrupt; it was simpler for us to simply kill the process on the robot and then home the robot. Here's the code we run:

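    import os
    import subprocess
    import traceback

    import opentrons.execute
    import opentrons.protocol_api

    try: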

        # change working directory to jupyter notebooks
        os.chdir("/var/lib/jupyter/notebooks")

        # find all currently running processes and make them into a nice list
        current_processes = subprocess.check_output(["ps", "aux"])
        current_processes = current_processes.decode("UTF-8")
        current_processes = current_processes.split()

        # pull a list of all files on jupyter notebook
        possible_files = []
        for root, dirs, files in os.walk("."):
            for filename in files:
                possible_files.append(filename)
        python_files = []

        # find all files with a python extension. Want to make sure we only kill python files.
        for file in possible_files:
            extension = file[len(file) - 2:]
            if (extension == "py") and (file != "cancel.py"):
                python_files.append(file)

        # match currently running python files to pids
        for i, item in enumerate(current_processes):
            if item in python_files:
                PID_to_kill = current_processes[i - 3]

        # actually kills
        try:
            os.system(f"kill -9 {PID_to_kill}")
        except NameError:
            pass

        # connects to robot for homing
        protocol = opentrons.execute.get_protocol_api(opentrons.protocol_api.MAX_SUPPORTED_VERSION)

        protocol.set_rail_lights(False)

        try:
            for pip in protocol.loaded_instruments.values():
                pip.drop_tip()
        except Exception as e:
            log_output(message=traceback.format_exc())

        protocol.home()
        os.system("~.")

    except Exception as e:
        log_output(message=traceback.format_exc())

        # connects to robot
        protocol = opentrons.execute.get_protocol_api(opentrons.protocol_api.MAX_SUPPORTED_VERSION)

        protocol.home()

        protocol.set_rail_lights(False)

        os.system("~.")

When we want to cancel something, we send the above code to the robot and execute it by running this code on the computer we have connected to the OT2:

    def cancel(self) -> None:
        """
        Cancel the ongoing protocol execution.

        Returns:
            None
        """
        # Construct the SSH command
        script = "cancel.py"
        file = f"/var/lib/jupyter/notebooks/execute_program.sh {script}"
        command = f"sh -l -c '{file}'"

        # Transfer python file
        py_file = fr"{self._tab.path}\Scripts\Back_End\OT2_Control\OT2_Programs\{script}"
        py_transfer = fr"scp -i {self._tab.run.key} {py_file} root@{self._tab.run.ip}:/var/lib/jupyter/notebooks"
        os.system(py_transfer)

        cancel_run = Thread(target=self._call_cancel,
                            args=(command,))
        cancel_run.start()

    def _call_cancel(self, command: str) -> None:
        """
        Execute a cancel command via SSH to interrupt protocol execution.

        Parameters:
            command (str): The cancel command.

        Returns:
            None
        """
        self._conn = subprocess.Popen(
            [
                "ssh",
                "-t",
                "-i",
                self._tab.run.key,
                f"root@{str(self._tab.run.ip)}",
                command
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            stdin=subprocess.PIPE,
            bufsize=1,
            universal_newlines=True,
            shell=False
        )

This slows down our cancel by about 10 seconds. If you store the cancel script on the robot itself and execute the cancel directly through the command line, it's pretty much instantaneous. Hopefully this helps!
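The PID lookup in the cancel script relies on a fixed token offset within the whitespace-split `ps` output, which can silently pick the wrong field if the column layout differs. A line-based variant is sketched below as one possible alternative. `find_pids` is a hypothetical helper, and it assumes the first line of the `ps` output is a header row containing a `PID` column, which may not hold for every `ps` build.

```python
import os


def find_pids(ps_output, script_names):
    """Return PIDs of processes whose command line names one of script_names.

    ps_output is the decoded text from `ps`; its first line must be a
    header row that includes a "PID" column.
    """
    lines = ps_output.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    if "PID" not in header:
        raise ValueError("ps output has no PID column")
    pid_col = header.index("PID")

    pids = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= pid_col or not fields[pid_col].isdigit():
            continue
        # Compare by file name so that full paths also match.
        if any(os.path.basename(field) in script_names for field in fields):
            pids.append(int(fields[pid_col]))
    return pids
```

Leaving `cancel.py` out of `script_names` keeps the cancel script from matching itself, and returning a list makes the "nothing to kill" case explicit instead of relying on a `NameError`.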

hibazou commented 4 months ago

Thank you!