Opentrons / opentrons

Software for writing protocols and running them on the Opentrons Flex and Opentrons OT-2
https://opentrons.com
Apache License 2.0
414 stars 177 forks source link

bug: Threads spawned by a protocol will keep running when that protocol is canceled #7742

Open SyntaxColoring opened 3 years ago

SyntaxColoring commented 3 years ago

Overview

Some protocols spawn their own threads. Our Science and Applications Engineering teams have numerous examples of these. A common use case seems to be flashing the rail lights in a secondary thread while the main thread blocks on a protocol.pause() or protocol.delay().

The problem is that if you cancel a protocol before it ends, any secondary threads that it spawned will generally leak, and keep running on the OT-2 until the OT-2 is restarted.

Steps to reproduce

Upload and run this protocol. Cancel it during the delay step.

import threading
import time

metadata = {
    'apiLevel': '2.9'
}

def flash_lights_infinitely(protocol):
    while True:
        protocol.set_rail_lights(True)
        time.sleep(1)
        protocol.set_rail_lights(False)
        time.sleep(1)

def run(protocol):
    tip_rack = protocol.load_labware('opentrons_96_tiprack_300ul', 8)
    pipette = protocol.load_instrument('p300_multi_gen2', 'left')

    if not protocol.is_simulating():
        thread = threading.Thread(target=flash_lights_infinitely, args=(protocol,))
        thread.start()

    pipette.pick_up_tip(tip_rack.wells()[0])
    pipette.drop_tip(tip_rack.wells()[0])
    protocol.delay(minutes=5)
    pipette.pick_up_tip(tip_rack.wells()[0])
    pipette.drop_tip(tip_rack.wells()[0])

Current behavior

With the above protocol...

  1. The rail lights will start flashing when the protocol starts.
  2. After you cancel the protocol, the lights will temporarily stop flashing as the gantry homes. (I presume this is because some lock is held during the entire homing process.)
  3. After the gantry finishes homing, when the OT-2 is idle, the lights will continue flashing, indicating the protocol-spawned thread is still running.
  4. Additionally, if you then upload and run a different protocol, the lights will continue flashing.

Expected behavior

Either:

Or:

SyntaxColoring commented 3 years ago

In general, after canceling a protocol, all the threads spawned by that protocol should be automatically terminated.

I think implementing this would require us to execute Python protocols in their own dedicated processes, instead of within the robot-server process like we do today. Python threads can't generally be killed without their cooperation, so we'd need to silo them into their own processes to SIGTERM/SIGKILL them.

We on the CPX squad want to do this for APIv3. For APIv2, I presume it would be harder, but I haven't looked at the code, so I don't know specifically how hard.

https://docs.opentrons.com should explain this gotcha and suggest a protocol-level workaround.

Sketch of possible workaround:

SyntaxColoring commented 3 years ago

Here's a protocol-level workaround:

import contextlib
import threading

from opentrons import protocol_api

metadata = {
    'apiLevel': '2.5'  # Required for set_rail_lights() and rail_lights_on.
}

@contextlib.contextmanager
def flashing_rail_lights(
    protocol: protocol_api.ProtocolContext, seconds_per_flash_cycle=1.0
):
    """Flash the rail lights on and off in the background.

    Source: https://github.com/Opentrons/opentrons/issues/7742

    Example usage:

        # While the robot is doing nothing for 2 minutes, flash lights quickly.
        with flashing_rail_lights(protocol, seconds_per_flash_cycle=0.25):
            protocol.delay(minutes=2)

    When the ``with`` block exits, the rail lights are restored to their original
    state.

    Exclusive control of the rail lights is assumed. For example, within the ``with``
    block, you must not call `ProtocolContext.set_rail_lights` yourself, inspect
    `ProtocolContext.rail_lights_on`, or nest additional calls to
    `flashing_rail_lights`.
    """
    original_light_status = protocol.rail_lights_on

    stop_flashing_event = threading.Event()

    def background_loop():
        while True:
            protocol.set_rail_lights(not protocol.rail_lights_on)
            # Wait until it's time to toggle the lights for the next flash or we're
            # told to stop flashing entirely, whichever comes first.
            got_stop_flashing_event = stop_flashing_event.wait(
                timeout=seconds_per_flash_cycle/2
            )
            if got_stop_flashing_event:
                break

    background_thread = threading.Thread(
        target=background_loop, name="Background thread for flashing rail lights"
    )

    try:
        if not protocol.is_simulating():
            background_thread.start()
        yield

    finally:
        # The ``with`` block might be exiting normally, or it might be exiting because
        # something inside it raised an exception.
        #
        # This accounts for user-issued cancelations because currently (2021-05-04),
        # the Python Protocol API happens to implement user-issued cancellations by
        # raising an exception from internal API code.
        if not protocol.is_simulating():
            stop_flashing_event.set()
            background_thread.join()

        # This is questionable: it may issue a command to the API while the API
        # is in an inconsistent state after raising an exception.
        protocol.set_rail_lights(original_light_status)

def run(protocol: protocol_api.ProtocolContext):
    tip_rack = protocol.load_labware("opentrons_96_tiprack_300ul", 1)
    pipette = protocol.load_instrument("p300_multi_gen2", mount="left")

    pipette.pick_up_tip(tip_rack.wells()[0])

    with flashing_rail_lights(protocol, seconds_per_flash_cycle=0.5):
        protocol.delay(minutes=2)

    pipette.return_tip(tip_rack.wells()[0])

I have a few 👍s in Slack for officially recommending this to Opentrons' internal Support, Science, and Applications Engineering teams.

We're not yet sure whether we want to document this workaround for external users, though. We might instead want to add a flash_lights parameter to pause() and delay() and encourage people to use that, for example.