ISISComputingGroup / IBEX

Top level repository for IBEX stories
4 stars 2 forks source link

INTER: mirror in and setting sm_angle race condition [TIMEBOX: 4.5 days] #8339

Open rerpha opened 1 month ago

rerpha commented 1 month ago

when:

putting SM IN, then immediately changing sm_angle (or within ~5s), then sample_y not updating based on the second sm_angle - it seems to use the first angle when setting SM IN instead of the new one. this isn't consistent and sometimes works OK. I think it's an issue within the reflectometry server where we aren't syncing the beam path frequently enough.

acceptance criteria

notes

we can simulate an INTER beamline either by simulating the whole thing through a twincat instance or just by using galils in RECSIM. this may help to solve the issue. If not it is somewhat reproducible on the real INTER.

Discussed ~1:36 20/06/24

Tom-Willemsen commented 5 days ago

This has been narrowed down to a beckhoff problem rather than a reflectometry server problem (sample_y is a beckhoff virtual axis on INTER). @rerpha and I will draft an email to motion team tomorrow about it with some diagnostics we've been able to collect.

Specifically, to reproduce:

Notes to self if we want to try to hide/compensate for this problem in the IBEX side rather than fixing it on the beckhoff end:

rerpha commented 1 day ago

we narrowed this down to not being refl server related.

the issue seems to be two SPs in quick succession to the beckhoff virtual axes, which seem to get registered as SPs but never executed. we have been testing this without epics entirely to prove it with scripts like this:

import pyads
import time

def move(plc, pos, axis, velocity):
    # does exactly what the motor record does for an absolute move
    plc.write_by_name(f"GVL.astAxes[{axis}].stControl.fVelocity", velocity)
    plc.write_by_name(f"GVL.astAxes[{axis}].stControl.fPosition", pos)

    plc.write_by_name(f"GVL.astAxes[{axis}].stControl.eCommand", 0, pyads.PLCTYPE_INT)
    plc.write_by_name(f"GVL.astAxes[{axis}].stControl.bExecute", 1)

# connect to plc and open connection

def poll_instead_of_sleeping(plc, axis, t):
    start = time.time()
    n = 0
    while time.time() < start + t:
        plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.bMoving")
        n += 1

    print(f"spammed {n} bmobings")

PRE_DELAY = 0.5

def waitfor_move(plc, axis, post_delay):
    start = time.time()
    time.sleep(PRE_DELAY)
    while True:
        time.sleep(0.2)
        is_moving = plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.bMoving")
        pos = plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.fActPosition")
        sp = plc.read_by_name(f"GVL.astAxes[{axis}].stControl.fPosition")

        if is_moving != 1:
            break
    time.sleep(post_delay)

    end = time.time()
    print(f"waitfor_move time {end-start}, final pos {pos}, final sp {sp}")

    return (pos, sp)

def do_test(post_delay):

    plc = pyads.Connection('192.168.0.25.1.1', 852)
    with plc:
        axis = 26

        plc.write_by_name(f"GVL.astAxes[{axis}].stControl.bReset", 1)

        p = plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.fActPosition")
        v = 2
        print(f"start position is {p}")

        #initial move 
        move(plc, 20, axis, v)

        # wait a bit
        #time.sleep(5)
        # poll_instead_of_sleeping(plc, axis, 5)
        (p, sp) = waitfor_move(plc, axis, post_delay)
        if sp != 20.0:
            print("IT DIDN'T TAKE THE STUPID SP")
        if p < 19.99 or p > 20.01:
            return False

        # final_pos = plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.fActPosition")
        # if final_pos < 19.99 or final_pos > 20.01:
        #     return False

        # send three moves in succession
        move(plc, 18, axis, v)

        (p, sp) = waitfor_move(plc, axis, post_delay)
        if sp != 18.0:
            print("IT DIDN'T TAKE THE STUPID SP MK2")
        if p < 17.99 or p > 18.01:
            return False

        # for _ in range(100):
        #     plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.bMoving")

        error_num =plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.nErrorID")

        print(f"errors are {error_num}")

        move(plc, 16, axis, v)

        (p, sp) = waitfor_move(plc, axis, post_delay)

        final_pos = plc.read_by_name(f"GVL.astAxes[{axis}].stStatus.fActPosition")
        final_sp = plc.read_by_name(f"GVL.astAxes[{axis}].stControl.fPosition")

        print(f"final sp is {final_sp}, position is {final_pos}")

        return 15.99 < final_pos < 16.01

def do_full_test():
    m = {}
    for post_delay in range(0, 5000, 2):
        print(f"Testing post_delay {post_delay}")
        n = 0
        passed = 0
        for it in range(50):
            print(it)
            good = do_test(post_delay/1000.0)

            if good:
                passed += 1
            n += 1

            print(f"{100*(passed/n)}% good")

        m[post_delay] = 100*(passed/n)
        print(m)

do_full_test()

which produced a plot like this:

pyadstest_postmovedelaypluspollratefix

which suggests that a 200ms delay after a SP sent seems to be the sweet spot for not having moves ignored.

some of this may be caused by polling the beckhoff which slows things down - #7618 helped thisby reducing the poll rate from 1ms to 200ms so this may help further ( though can't easily be patched onto the beckhoff instruments )

A settle time can be added to a move from the motor record by setting the .DLY field. This script has been testing successful moves for this and this seems to stop it from happening when .DLY is set to 0.5 (500ms):

from genie_python import genie as g

MTR = "MOT:MTR0802"
beckhoff_axis = 26

beckhoff_errors = 0
position_errors = []
success = 0
unknown_errors = 0

g.toggle.exceptions_raised(True)

def reset_all_beckhoff_errors():
    for i in range (1, 30+1):
        g.set_pv(f"TC_01:ASTAXES_{i}:STCONTROL-BRESET", 1, is_local=True)
        g.set_pv(f"TC_01:ASTAXES_{i}:STCONTROL-BENABLE", 1, is_local=True)

def move_and_check():
    global beckhoff_errors, success
    # clear beckhoff errors
    reset_all_beckhoff_errors()

    # move to the initial position
    g.set_pv(MTR + ".VAL", 30, is_local=True)
    g.waitfor_move()
    err = g.get_pv(f"TC_01:ASTAXES_{beckhoff_axis}:STSTATUS-BERROR", is_local=True)
    if  err != 0:
        beckhoff_errors += 1
        return

    pos = g.get_pv(MTR + ".RBV", is_local=True)
    assert 29.99 < pos < 30.01, f"initial: {pos}"

    g.set_pv(MTR + ".VAL", 19, is_local=True)
    import time
    time.sleep(1)
    g.set_pv(MTR + ".VAL", 18, is_local=True)
    g.waitfor_move()
    err = g.get_pv(f"TC_01:ASTAXES_{beckhoff_axis}:STSTATUS-BERROR", is_local=True)
    if  err != 0:
        beckhoff_errors += 1
        return
    pos = g.get_pv(MTR + ".RBV", is_local=True)
    assert 17.99 < pos < 18.01, f"final: {pos}"
    success += 1

def test_race():
    global success
    global unknown_errors
    global beckhoff_errors
    while (True):
        try:
            move_and_check()
        except AssertionError as e:
            position_errors.append(e)
        except KeyboardInterrupt:
            break
        except Exception as e:
            print(f"unexpected exception: {e.__class__.__name__}: {str(e)}")
            unknown_errors += 1

        print(f"num position errors: {len(position_errors)}, num unknown errors: {unknown_errors}, success = {success} beckhoff_errors = {beckhoff_errors}")

    print(position_errors)

we are now testing this for less settle time (200ms) to reduce waitfor delays etc. without losing robustness. Edit: 200ms is fine as well - we'll use this for all beckhoff axes going forwards.

Tom-Willemsen commented 1 day ago

INTER emailed to ask for some testing time with real hardware.

rerpha commented 1 day ago

we may be able to change https://github.com/ISISComputingGroup/EPICS-motor/blob/master/motorApp/Db/basic_asyn_motor.db if not used elsewhere other than TC and just pass in DLY of 0.2 in https://github.com/ISISComputingGroup/EPICS-ioc/blob/master/TC/TC-IOC-01App/Db/single_axis.substitutions