Open AlexanderWells-diamond opened 1 month ago
@evalott100 I think we've seen this on the PandA IOC sometimes...
Some further thoughts on this topic:
* It may be possible to rescue a record by using database puts, rather than channel access puts, directly to the affected `PACT` field. This would mean the end user would need to write (a) "reset" record(s) to be able to fix any potentially hung processing.
* We could simply catch all exceptions and always ensure we call the completion callback. This would mean the IOC is always fully operational. But there are certain classes of fatal errors that may simply mean we never want to try again - this fix would mean we could introduce crash loops, especially in fast processing records.
I would say that logging the exception then issuing the completion would be the right thing to do by default. We could give people an `UnrecoverableException` that they could raise to bypass that if we ever came across a use case where we needed it.
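To make that suggestion concrete, here is a rough sketch of what "log, then complete by default" could look like. This is not the actual pythonSoftIOC dispatcher code: `run_callback` and its argument names are made up for illustration, it is synchronous for simplicity, and `UnrecoverableException` is only the hypothetical class proposed above.

```python
import logging


class UnrecoverableException(Exception):
    """Hypothetical marker: raise this to deliberately skip the completion."""


def run_callback(func, func_args, completion, completion_args=()):
    """Run func, log any failure, and still issue the completion by default.

    Simplified, synchronous stand-in for a dispatcher wrapper (an assumption,
    not the library's implementation).
    """
    try:
        func(*func_args)
    except UnrecoverableException:
        # Opt-out path: the callback asked for the completion to be skipped.
        logging.exception("Unrecoverable error, completion skipped")
        return
    except Exception:
        # Default path: log the error, then fall through to the completion.
        logging.exception("Exception when running dispatched callback")
    completion(*completion_args)
```

With this shape an ordinary exception in `on_update` is still logged but the completion always runs, while raising `UnrecoverableException` keeps today's behaviour of never completing.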
I suppose the following makes sense but just thought it would be worth mentioning...
This only seems to be a problem if `blocking=True`, with the IOC side:
Python 3.11.3 | packaged by conda-forge | (main, Apr 6 2023, 08:57:19) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> fail func
ERROR:root:Exception when running dispatched callback
Traceback (most recent call last):
File "/scratch/twj43146/Programming/PandABlocks-ioc/.venv/lib/python3.11/site-packages/softioc/asyncio_dispatcher.py", line 43, in async_wrapper
ret = func(*func_args)
^^^^^^^^^^^^^^^^
File "/scratch/twj43146/Programming/PandABlocks-ioc/tester.py", line 18, in fail_func
raise Exception("On update fails")
Exception: On update fails
and the client side:
$ caput -c MY-DEVICE-PREFIX:FAILS 1
Old : MY-DEVICE-PREFIX:FAILS 0
New : MY-DEVICE-PREFIX:FAILS 1
$ caput -c MY-DEVICE-PREFIX:FAILS 2
Old : MY-DEVICE-PREFIX:FAILS 1
CA.Client.Exception...............................................
Warning: "Identical process variable names on multiple servers"
Context: "Channel: "MY-DEVICE-PREFIX:FAILS", Connecting to: 172.23.244.139:5064, Ignored: 192.168.122.1:5064"
Source File: ../cac.cpp line 1308
Current Time: Mon Aug 19 2024 13:16:22.444075832
..................................................................
Write callback operation timed out
New : MY-DEVICE-PREFIX:FAILS 1
However if we set `blocking=False`, we get the same IOC error...
(InteractiveConsole)
>>> fail func
ERROR:root:Exception when running dispatched callback
Traceback (most recent call last):
File "/scratch/twj43146/Programming/PandABlocks-ioc/.venv/lib/python3.11/site-packages/softioc/asyncio_dispatcher.py", line 43, in async_wrapper
ret = func(*func_args)
^^^^^^^^^^^^^^^^
File "/scratch/twj43146/Programming/PandABlocks-ioc/tester.py", line 18, in fail_func
raise Exception("On update fails")
Exception: On update fails
fail func
but the client put doesn't fail on the second write:
$ caput -c MY-DEVICE-PREFIX:FAILS 1
Old : MY-DEVICE-PREFIX:FAILS 0
New : MY-DEVICE-PREFIX:FAILS 1
$ caput -c MY-DEVICE-PREFIX:FAILS 2
Old : MY-DEVICE-PREFIX:FAILS 1
New : MY-DEVICE-PREFIX:FAILS 2
If an exception occurs during processing of an `on_update` callback, the record will be left with its `PACT` flag set to true. This blocks all record processing. When the record enters this state, there appears to be no way to rescue the record - even a put to `PACT` with value 0 will not work (it seems to be ignored).

This issue happens with both Cothread and Asyncio dispatchers, as both follow the same pattern of skipping the "completion" callback if an exception occurs.
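The pattern described above can be shown with a toy model. This is pure Python and deliberately does not use EPICS or pythonSoftIOC: `ToyRecord` and `dispatch` are invented names, and the only thing modelled is a `pact` flag that is cleared by the completion callback.

```python
import logging


def dispatch(func, func_args, completion):
    """Mimics the problematic pattern: completion is skipped if func raises."""
    try:
        func(*func_args)
        completion()  # never reached when func raises
    except Exception:
        logging.exception("Exception when running dispatched callback")


class ToyRecord:
    """Toy stand-in for a record; models nothing except the PACT flag."""

    def __init__(self):
        self.pact = False

    def process(self, on_update, value):
        if self.pact:
            # Record is still marked active, so this request is not processed.
            return False
        self.pact = True
        dispatch(on_update, (value,), self._complete)
        return True

    def _complete(self):
        # Only the completion callback clears the active flag.
        self.pact = False


def fail_func(value):
    raise Exception("On update fails")


record = ToyRecord()
first = record.process(fail_func, 1)   # starts processing, callback raises
second = record.process(fail_func, 2)  # refused: pact was never cleared
```

In this model the first request is accepted but leaves `pact` set, so every later request is refused, which is the same shape as the hung record described here.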
I'm unsure what the correct behaviour is. The AppDevGuide, in chapter 5.9 on page 90, says "if dbProcess finds the record active 10 times in succession, it raises a SCAN_ALARM". This is more to do with detecting infinite loops, rather than crashed processing, but I don't see any specific documentation on this topic.

To demonstrate this issue run the below IOC:
And then on a command line run:
The second one will fail with a "Write callback operation timed out" error, and the value will not have updated.