DiamondLightSource / pythonSoftIOC

Embed an EPICS IOC in a Python process
Apache License 2.0

Exceptions in on-update processing can leave record in "active" mode #170

Open AlexanderWells-diamond opened 1 month ago

AlexanderWells-diamond commented 1 month ago

If an exception occurs during processing of an on_update callback, the record will be left with its PACT flag set to true. This blocks all record processing. When the record enters this state, there appears to be no way to rescue the record - even a put to PACT with value 0 will not work (it seems to be ignored).

This issue happens with both Cothread and Asyncio dispatchers, as both follow the same pattern of skipping the "completion" callback if an exception occurs.

I'm unsure what the correct behaviour is. The AppDevGuide (chapter 5.9, page 90) says that if dbProcess finds the record active 10 times in succession, it raises a SCAN_ALARM. That mechanism is aimed at detecting infinite loops rather than crashed processing, and I don't see any documentation specific to this case.
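For illustration, here is a toy Python model of the rule as described above: a record that stays active is skipped, and after 10 consecutive skipped attempts a SCAN_ALARM is raised. This is only a sketch of the description, not EPICS code. `ToyRecord`, `toy_db_process`, and the `MAX_LOCK` name are hypothetical, and the exact attempt on which the alarm fires is an assumption.

```python
MAX_LOCK = 10  # assumed threshold, taken from "10 times in succession"


class ToyRecord:
    """Hypothetical stand-in for a record; not an EPICS or softioc type."""

    def __init__(self):
        self.pact = False        # "processing active" flag
        self.lock_count = 0      # consecutive attempts that found the record active
        self.alarm = "NO_ALARM"


def toy_db_process(record):
    """Return True if the record was processed, False if it was skipped."""
    if record.pact:
        # Record is still active: skip it and count the attempt.
        record.lock_count += 1
        if record.lock_count >= MAX_LOCK:
            record.alarm = "SCAN_ALARM"
        return False
    # Record was idle: reset the counter. Real processing would happen here.
    record.lock_count = 0
    return True


stuck = ToyRecord()
stuck.pact = True  # simulate a record whose completion callback never ran
results = [toy_db_process(stuck) for _ in range(MAX_LOCK)]
print(stuck.alarm)
```

In this model the alarm flags the stuck record but does nothing to clear `pact`, which matches the point that the mechanism detects the condition rather than recovering from it.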

To demonstrate this issue, run the IOC below:

# Import the basic framework components.
from softioc import softioc, builder, asyncio_dispatcher
import asyncio

# Create an asyncio dispatcher, the event loop is now running
dispatcher = asyncio_dispatcher.AsyncioDispatcher()

# Set the record prefix
builder.SetDeviceName("MY-DEVICE-PREFIX")

def success_func(val):
    print("success func")
    pass

def fail_func(val):
    print("fail func")
    if val == 1:
        raise Exception("On update fails")

# Create some records
builder.longOut("SUCCEEDS", on_update=success_func, blocking=True, SCAN='1 second')
builder.longOut("FAILS", on_update=fail_func, blocking=True, SCAN='1 second')

# Boilerplate to get the IOC started
builder.LoadDatabase()
softioc.iocInit(dispatcher)

# Finally leave the IOC running with an interactive shell.
softioc.interactive_ioc(globals())

And then on a command line run:

caput -c MY-DEVICE-PREFIX:FAILS 1
...
caput -c MY-DEVICE-PREFIX:FAILS 2

The second one will fail with a "Write callback operation timed out" error, and the value will not have updated.

coretl commented 3 weeks ago

@evalott100 I think we've seen this on the PandA IOC sometimes...

AlexanderWells-diamond commented 3 weeks ago

Some further thoughts on this topic:

coretl commented 3 weeks ago

Some further thoughts on this topic:

* It may be possible to rescue a record by using database puts, rather than channel access puts, directly to the affected `PACT` field. The end user would then need to write one or more "reset" records to fix any hung processing.

* We could simply catch all exceptions and always ensure we call the completion callback. This would mean the IOC is always fully operational. But there are certain classes of fatal errors that may simply mean we never want to try again - this fix would mean we could introduce crash loops, especially in fast processing records.

I would say that logging the exception and then issuing the completion would be the right thing to do by default. We could give people an UnrecoverableException that they could raise to bypass that, if we ever came across a use case that needed it.
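A minimal sketch of that default, written as a standalone helper rather than a patch to either dispatcher. `run_with_completion` is a hypothetical name, and `UnrecoverableException` is the proposed class, which does not exist in softioc. The sketch assumes the dispatcher is handed the callback, its arguments, a completion function, and the completion arguments.

```python
import logging


class UnrecoverableException(Exception):
    """Hypothetical marker: raise to leave the record un-completed on purpose."""


def run_with_completion(func, func_args, completion, completion_args):
    """Run func, then call completion unless func raised UnrecoverableException.

    Returns True if completion was called, False otherwise.
    """
    try:
        func(*func_args)
    except UnrecoverableException:
        # Opt-out path: the caller has asked for the record to stay blocked.
        logging.exception("Unrecoverable error; completion deliberately skipped")
        return False
    except Exception:
        # Default path: log the failure but still complete processing.
        logging.exception("Exception when running dispatched callback")
    completion(*completion_args)
    return True


completed = []


def fail_func(val):
    if val == 1:
        raise Exception("On update fails")


def fatal_func(val):
    raise UnrecoverableException("Never retry this")


ok = run_with_completion(fail_func, (1,), completed.append, ("FAILS",))
fatal = run_with_completion(fatal_func, (1,), completed.append, ("FATAL",))
print(ok, fatal, completed)
```

The opt-out keeps the crash-loop concern addressable: a callback that knows retrying is pointless can raise the marker exception, while every other failure leaves the record usable.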

evalott100 commented 3 weeks ago

I suppose the following makes sense but just thought it would be worth mentioning...

This only seems to be a problem if blocking=True. On the IOC side:

Python 3.11.3 | packaged by conda-forge | (main, Apr  6 2023, 08:57:19) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> fail func
ERROR:root:Exception when running dispatched callback
Traceback (most recent call last):
  File "/scratch/twj43146/Programming/PandABlocks-ioc/.venv/lib/python3.11/site-packages/softioc/asyncio_dispatcher.py", line 43, in async_wrapper
    ret = func(*func_args)
          ^^^^^^^^^^^^^^^^
  File "/scratch/twj43146/Programming/PandABlocks-ioc/tester.py", line 18, in fail_func
    raise Exception("On update fails")
Exception: On update fails

and the client side:

$ caput -c MY-DEVICE-PREFIX:FAILS 1 
Old : MY-DEVICE-PREFIX:FAILS         0
New : MY-DEVICE-PREFIX:FAILS         1

$ caput -c MY-DEVICE-PREFIX:FAILS 2
Old : MY-DEVICE-PREFIX:FAILS         1
CA.Client.Exception...............................................
    Warning: "Identical process variable names on multiple servers"
    Context: "Channel: "MY-DEVICE-PREFIX:FAILS", Connecting to: 172.23.244.139:5064, Ignored: 192.168.122.1:5064"
    Source File: ../cac.cpp line 1308
    Current Time: Mon Aug 19 2024 13:16:22.444075832
..................................................................

Write callback operation timed out
New : MY-DEVICE-PREFIX:FAILS         1

However, if we set blocking=False, we get the same IOC error:

(InteractiveConsole)
>>> fail func
ERROR:root:Exception when running dispatched callback
Traceback (most recent call last):
  File "/scratch/twj43146/Programming/PandABlocks-ioc/.venv/lib/python3.11/site-packages/softioc/asyncio_dispatcher.py", line 43, in async_wrapper
    ret = func(*func_args)
          ^^^^^^^^^^^^^^^^
  File "/scratch/twj43146/Programming/PandABlocks-ioc/tester.py", line 18, in fail_func
    raise Exception("On update fails")
Exception: On update fails
fail func

but the client put doesn't fail on the second write:

$ caput -c MY-DEVICE-PREFIX:FAILS 1
Old : MY-DEVICE-PREFIX:FAILS         0
New : MY-DEVICE-PREFIX:FAILS         1

$ caput -c MY-DEVICE-PREFIX:FAILS 2
Old : MY-DEVICE-PREFIX:FAILS         1
New : MY-DEVICE-PREFIX:FAILS         2
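The difference between the two runs can be reproduced with a toy model. It assumes that only blocking records hold PACT while waiting for completion. That is inferred from the behaviour above, not checked against the softioc source. `ToyOutRecord` is hypothetical, and `put` only mimics the observed outcome of each `caput -c`.

```python
import logging


class ToyOutRecord:
    """Hypothetical model of an Out record; not the softioc implementation."""

    def __init__(self, on_update, blocking):
        self.on_update = on_update
        self.blocking = blocking
        self.pact = False
        self.value = 0

    def put(self, value):
        """Return True if the write is accepted, False if it would time out."""
        if self.pact:
            return False  # record still "active": the value is not updated
        self.value = value
        self.pact = self.blocking  # assumption: only blocking records hold PACT
        try:
            self.on_update(value)
            self.pact = False  # the "completion" step, skipped on exception
        except Exception:
            logging.exception("Exception when running dispatched callback")
        return True


def fail_func(val):
    if val == 1:
        raise Exception("On update fails")


blocking = ToyOutRecord(fail_func, blocking=True)
results_blocking = [blocking.put(1), blocking.put(2)]

non_blocking = ToyOutRecord(fail_func, blocking=False)
results_non_blocking = [non_blocking.put(1), non_blocking.put(2)]

print(results_blocking, blocking.value)
print(results_non_blocking, non_blocking.value)
```

In the model, the blocking record rejects the second write and keeps the value 1, while the non-blocking record accepts both writes, matching the two client sessions above.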