Open coretl opened 1 year ago
I've done a little digging; this started becoming an issue when I first added p4p
to the pyproject.toml
dependencies, on the 3rd of July, when this PR was merged: https://github.com/bluesky/ophyd-epics-devices/pull/24
Initially, the tests were failing due to area detector stuff. What I failed to spot, is when @coretl fixed the area detector stuff then this seg fault was introduced. It would have gone undetected because I think I was using a container with python 3.9 when I was developing it...
The latest version of p4p is 4.1.9, which was released on the 25th of July. This requires pvxslibs<1.3.0a1,>=1.2.2... and the latest version of pvxslibs is 1.2.2 which was released on the 3rd of July.
The CI for p4p seems to pass on python 3.10: https://github.com/mdavidsaver/p4p/actions/runs/5660869421/job/15337538188 with pvxslibs at 1.2.2, which leads me to believe that although we definitely shouldn't be getting segfaults, we must be doing something naughty with p4p.
I've tried going all the way back to p4p==4.1.5 and pvsxlibs==1.1.4 (effectively April this year) and still I'm getting the segfault. So either we're doing something silly or, p4p has always had this problem and it's gone under the radar. I'll investigate.
You could try doing atexit.register(context.close)
on the global context. It might be something to do with the fact we have one context in ophyd/v2/_p4p.py
and another one in ophyd_epics_devices/panda.py
, so maybe try switching to using the one from ophyd and see if the problem goes away? It might also be something about destroying the context in a different thread than the one it is created in. Just a guess though, no evidence to support this...
Ive found something weird while investigating... If I have a test that just makes a panda, i.e.
def test_...():
my_panda = PandA("something")
if I add a __del__
method to the PandA, it gets called. However, if I add the DeviceCollector with block:
def test_something():
async with DeviceCollector(sim=True):
my_panda = PandA("something")
then the __del__
method never gets called.
you're probably right r.e. two contexts, but I thought I'd try and see if I could fix it by managing my context better. From the docs, it looks like I should be calling .close()
and .disconnect()
on the context, for example. I thought I'd do that in the __del__
method, which gets called when the reference count of an object goes to 0. Above, it seems like the DeviceCollector increases the reference count of whatever is being called in it. That sounds problematic. Am I missing something?
Okay, I've fixed it now using your suggestion and there is a PR coming up for it
then the
__del__
method never gets called. you're probably right r.e. two contexts, but I thought I'd try and see if I could fix it by managing my context better. From the docs, it looks like I should be calling.close()
and.disconnect()
on the context, for example. I thought I'd do that in the__del__
method, which gets called when the reference count of an object goes to 0. Above, it seems like the DeviceCollector increases the reference count of whatever is being called in it. That sounds problematic. Am I missing something?
You're right, DeviceCollector is holding onto a reference: https://github.com/bluesky/ophyd/blob/5c413cc0848ef035b9839de328064d6af47b589b/ophyd/v2/core.py#L317C40-L317C40
It should probably clear that dict when it's done with it...
Not sure why, looks like a seg fault in garbage collect, so I guess we pin various modules until it works. I would start with
p4p
andpvxslibs
. I put something in thecheck-segfaults
branch that will print the backtrace