INTER: mirror in and setting sm_angle race condition [TIMEBOX: 4.5 days]

rerpha commented 1 month ago

when:

putting SM IN, then immediately changing sm_angle (or within ~5s), then sample_y not updating based on the second sm_angle - it seems to use the first angle when setting SM IN instead of the new one. this isn't consistent and sometimes works OK. I think it's an issue within the reflectometry server where we aren't syncing the beam path frequently enough.

acceptance criteria

[ ] unit test is added to cover this situation
[ ] cause of issue is identified
[ ] issue is fixed
[ ] create follow on ticket if timebox ended before all acceptance criteria completed

notes

we can simulate an INTER beamline either by simulating the whole thing through a twincat instance or just by using galils in RECSIM. this may help to solve the issue. If not it is somewhat reproducible on the real INTER.

Discussed ~1:36 20/06/24

Tom-Willemsen commented 5 days ago

This has been narrowed down to a beckhoff problem rather than a reflectometry server problem (sample_y is a beckhoff virtual axis on INTER). @rerpha and I will draft an email to motion team tomorrow about it with some diagnostics we've been able to collect.

Specifically, to reproduce:

Set up INTER using beckhoff-level simulation (problem does not occur with a beckhoff real-axis simulation as per the existing beckhoff system tests - it needs virtual axes to cause the problem)
Change sample_y (or another virtual axis such as measurement_angle) to 20
caput MOT:MTR0802 19 && caput MOT:MTR0802 18 (assuming MOT:MTR0802 is sample_y)
- i.e. a move to 19 that then gets retargeted to 18 mid-move
Beckhoff virtual axes will correctly receive a new setpoint of 18 after the initial move finishes (checked using XAE) but will ignore it. Beckhoff "real" axes and galils etc will correctly move to 18.
- Note the motor record handles the retargeting stuff, so 18 isn't actually sent to the beckhoff immediately, rather it is sent soon after the initial move to 19 completes. Delay is approx 70ms. Therefore I think the problem here is that the beckhoff is reporting that the initial move is "complete" too soon - before it is actually ready to accept a subsequent move command.

Notes to self if we want to try to hide/compensate for this problem in the IBEX side rather than fixing it on the beckhoff end:

Try .DLY field of motor record - see if this helps? Unclear whether this delay would get added between two moves caused by retargeting, or whether it will only get added right at the end.
Enabling retries on the relevant axis might be enough to hide the problem, the delay between multiple retries might be sufficient to allow the beckhoff to accept a new move. This is really hacky and horrible though, and not sure whether it might cause some nasty side-effects.

rerpha commented 1 day ago

we narrowed this down to not being refl server related.

the issue seems to be two SPs in quick succession to the beckhoff virtual axes, which seem to get registered as SPs but never executed. we have been testing this without epics entirely to prove it with scripts like this:

import pyads
import time

def move(plc, pos, axis, velocity):
    # does exactly what the motor record does for an absolute move
    plc.write_by_name(f"GVL.astAxes[{axis}].stControl.fVelocity", velocity)
    plc.write_by_name(f"GVL.astAxes[{axis}].stControl.fPosition", pos)

    plc.write_by_name(f"GVL.astAxes[{axis}].stControl.eCommand", 0, pyads.PLCTYPE_INT)
    plc.write_by_name(f"GVL.astAxes[{axis}].stControl.bExecute", 1)

# connect to plc and open connection

def poll_instead_of_sleeping(plc, axis, t):
    start = time.time()
    n = 0
    while time.time() < start + t:
        plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.bMoving")
        n += 1

    print(f"spammed {n} bmobings")

PRE_DELAY = 0.5

def waitfor_move(plc, axis, post_delay):
    start = time.time()
    time.sleep(PRE_DELAY)
    while True:
        time.sleep(0.2)
        is_moving = plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.bMoving")
        pos = plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.fActPosition")
        sp = plc.read_by_name(f"GVL.astAxes[{axis}].stControl.fPosition")

        if is_moving != 1:
            break
    time.sleep(post_delay)

    end = time.time()
    print(f"waitfor_move time {end-start}, final pos {pos}, final sp {sp}")

    return (pos, sp)

def do_test(post_delay):

    plc = pyads.Connection('192.168.0.25.1.1', 852)
    with plc:
        axis = 26

        plc.write_by_name(f"GVL.astAxes[{axis}].stControl.bReset", 1)

        p = plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.fActPosition")
        v = 2
        print(f"start position is {p}")

        #initial move 
        move(plc, 20, axis, v)

        # wait a bit
        #time.sleep(5)
        # poll_instead_of_sleeping(plc, axis, 5)
        (p, sp) = waitfor_move(plc, axis, post_delay)
        if sp != 20.0:
            print("IT DIDN'T TAKE THE STUPID SP")
        if p < 19.99 or p > 20.01:
            return False

        # final_pos = plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.fActPosition")
        # if final_pos < 19.99 or final_pos > 20.01:
        #     return False

        # send three moves in succession
        move(plc, 18, axis, v)

        (p, sp) = waitfor_move(plc, axis, post_delay)
        if sp != 18.0:
            print("IT DIDN'T TAKE THE STUPID SP MK2")
        if p < 17.99 or p > 18.01:
            return False

        # for _ in range(100):
        #     plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.bMoving")

        error_num =plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.nErrorID")

        print(f"errors are {error_num}")

        move(plc, 16, axis, v)

        (p, sp) = waitfor_move(plc, axis, post_delay)

        final_pos = plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.fActPosition")
        final_sp = plc.read_by_name(f"GVL.astAxes[{axis}].stControl.fPosition")

        print(f"final sp is {final_sp}, position is {final_pos}")

        return 15.99 < final_pos < 16.01

def do_full_test():
    m = {}
    for post_delay in range(0, 5000, 2):
        print(f"Testing post_delay {post_delay}")
        n = 0
        passed = 0
        for it in range(50):
            print(it)
            good = do_test(post_delay/1000.0)

            if good:
                passed += 1
            n += 1

            print(f"{100*(passed/n)}% good")

        m[post_delay] = 100*(passed/n)
        print(m)

do_full_test()

which produced a plot like this:

pyadstest_postmovedelaypluspollratefix

which suggests that a 200ms delay after a SP sent seems to be the sweet spot for not having moves ignored.

some of this may be caused by polling the beckhoff which slows things down - #7618 helped thisby reducing the poll rate from 1ms to 200ms so this may help further ( though can't easily be patched onto the beckhoff instruments )

A settle time can be added to a move from the motor record by setting the .DLY field. This script has been testing successful moves for this and this seems to stop it from happening when .DLY is set to 0.5 (500ms):

from genie_python import genie as g

MTR = "MOT:MTR0802"
beckhoff_axis = 26

beckhoff_errors = 0
position_errors = []
success = 0
unknown_errors = 0

g.toggle.exceptions_raised(True)

def reset_all_beckhoff_errors():
    for i in range (1, 30+1):
        g.set_pv(f"TC_01:ASTAXES_{i}:STCONTROL-BRESET", 1, is_local=True)
        g.set_pv(f"TC_01:ASTAXES_{i}:STCONTROL-BENABLE", 1, is_local=True)

def move_and_check():
    global beckhoff_errors, success
    # clear beckhoff errors
    reset_all_beckhoff_errors()

    # move to the initial position
    g.set_pv(MTR + ".VAL", 30, is_local=True)
    g.waitfor_move()
    err = g.get_pv(f"TC_01:ASTAXES_{beckhoff_axis}:STSTATUS-BERROR", is_local=True)
    if  err != 0:
        beckhoff_errors += 1
        return

    pos = g.get_pv(MTR + ".RBV", is_local=True)
    assert 29.99 < pos < 30.01, f"initial: {pos}"

    g.set_pv(MTR + ".VAL", 19, is_local=True)
    import time
    time.sleep(1)
    g.set_pv(MTR + ".VAL", 18, is_local=True)
    g.waitfor_move()
    err = g.get_pv(f"TC_01:ASTAXES_{beckhoff_axis}:STSTATUS-BERROR", is_local=True)
    if  err != 0:
        beckhoff_errors += 1
        return
    pos = g.get_pv(MTR + ".RBV", is_local=True)
    assert 17.99 < pos < 18.01, f"final: {pos}"
    success += 1

def test_race():
    global success
    global unknown_errors
    global beckhoff_errors
    while (True):
        try:
            move_and_check()
        except AssertionError as e:
            position_errors.append(e)
        except KeyboardInterrupt:
            break
        except Exception as e:
            print(f"unexpected exception: {e.__class__.__name__}: {str(e)}")
            unknown_errors += 1

        print(f"num position errors: {len(position_errors)}, num unknown errors: {unknown_errors}, success = {success} beckhoff_errors = {beckhoff_errors}")

    print(position_errors)

we are now testing this for less settle time (200ms) to reduce waitfor delays etc. without losing robustness. Edit: 200ms is fine as well - we'll use this for all beckhoff axes going forwards.

Tom-Willemsen commented 1 day ago

INTER emailed to ask for some testing time with real hardware.

rerpha commented 1 day ago

we may be able to change https://github.com/ISISComputingGroup/EPICS-motor/blob/master/motorApp/Db/basic_asyn_motor.db if not used elsewhere other than TC and just pass in DLY of 0.2 in https://github.com/ISISComputingGroup/EPICS-ioc/blob/master/TC/TC-IOC-01App/Db/single_axis.substitutions

ISISComputingGroup / IBEX

INTER: mirror in and setting sm_angle race condition [TIMEBOX: 4.5 days] #8339

acceptance criteria

notes