DiamondLightSource / pythonSoftIOC

Embed an EPICS IOC in a Python process
Apache License 2.0
31 stars 9 forks source link

Possible deadlock when record.set called in on_update callback #119

Closed GDYendell closed 9 months ago

GDYendell commented 1 year ago

We have had an issue where a python IOC would hang sometimes and have to be restarted. We narrowed it down to calling my_record.set() within the on_update callback for my_record and removing that has fixed the problem.

I have added a branch with a minimal example to reproduce the problem. The instructions in the README should be sufficient to run it. The traceback provoked by faulthandler consistently shows the IOC stuck in db_put_field:

Thread 0x00007f66f4cef700 (most recent call first):
  File "/scratch/development/pythonSoftIoc/venv/lib/python3.7/site-packages/cothread/cothread.py", line 984 in __call__
  File "/scratch/development/pythonSoftIoc/venv/lib/python3.7/site-packages/softioc/cothread_dispatcher.py", line 32 in __call__
  File "/scratch/development/pythonSoftIoc/venv/lib/python3.7/site-packages/softioc/device.py", line 221 in _process
  File "/scratch/development/pythonSoftIoc/venv/lib/python3.7/site-packages/softioc/device_core.py", line 165 in __call_instance_method
  File "/scratch/development/pythonSoftIoc/venv/lib/python3.7/site-packages/softioc/device_core.py", line 145 in <lambda>

Current thread 0x00007f6728ed8740 (most recent call first):
  File "/scratch/development/pythonSoftIoc/venv/lib/python3.7/site-packages/softioc/imports.py", line 24 in db_put_field
  File "/scratch/development/pythonSoftIoc/venv/lib/python3.7/site-packages/softioc/device.py", line 243 in set
  File "/scratch/development/pythonSoftIoc/example/cothread_ioc.py", line 10 in useless_callback
  File "/scratch/development/pythonSoftIoc/venv/lib/python3.7/site-packages/softioc/cothread_dispatcher.py", line 29 in wrapper
  File "/scratch/development/pythonSoftIoc/venv/lib/python3.7/site-packages/cothread/cothread.py", line 971 in callback_events
  File "/scratch/development/pythonSoftIoc/venv/lib/python3.7/site-packages/cothread/cothread.py", line 671 in __run

This was tested with base 7.0.7.0 and it does not seem to happen with 3.14.12.7.

coretl commented 1 year ago

I think the following happens (@Araneidae correct me if I'm wrong):

@mdavidsaver this used to work in 3.14.12.7, but now fails in 7.0.7.0. Can you think what might be going wrong here?

AlexanderWells-diamond commented 9 months ago

I've investigated this a bit, and had a few more observations to add:

mdavidsaver commented 9 months ago

I don't think I can follow all of the details, but I see a call to dbPutField(). This function must not be called from a device support callback, or from any other context in which any record lock is held. For "virtual" access to fields within a device support callback, dbPut() is the way to go.

Also, it looks like your db_put_field() wrapper has the GIL locked when calling dbPutField(). Unless you are zealous to avoid locking the GIL in a device support callback, this is constitutes a lock order violation.

Araneidae commented 9 months ago

The dbPutField() calls are in fact correct, because the on_update callback is called after record processing has completed.

However, thank you very much for the GIL spot! Yep, we need to drop the GIL around field access calls, and that makes a lot of sense. Am pretty sure that's fixed things now.

Araneidae commented 9 months ago

This will have been a long standing bug affecting pythonSoftIOC since the earliest days, and is probably capable of being triggered any time record processing is triggered at the precise instance that .set() is being called on an Out record. This is a perfect example of a timing bug that can be next to impossible to reproduce.

I am very happy that this has been pinned down and fixed!

GDYendell commented 9 months ago

Thanks everyone! I will delete the example branch now that there are test cases for this.