bluesky / ophyd-epics-devices

https://bluesky.github.io/ophyd-epics-devices
Apache License 2.0

Seg faults in python 3.10 #32

Open coretl opened 1 year ago

coretl commented 1 year ago

Not sure why; it looks like a seg fault during garbage collection, so I guess we should pin various modules until it works. I would start with p4p and pvxslibs. I've put something in the check-segfaults branch that will print the backtrace.
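For context, the standard library's `faulthandler` module gives this kind of backtrace-on-crash behaviour out of the box (a minimal sketch; the check-segfaults branch may do something different):

```python
import faulthandler
import sys

# Dump the Python traceback of every thread to stderr if the interpreter
# receives a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(file=sys.stderr, all_threads=True)
```

Running the tests with `python -X faulthandler -m pytest` enables the same thing without touching any code.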

rosesyrett commented 1 year ago

I've done a little digging; this started becoming an issue when I first added p4p to the pyproject.toml dependencies on the 3rd of July, when this PR was merged: https://github.com/bluesky/ophyd-epics-devices/pull/24

Initially, the tests were failing due to area detector stuff. What I failed to spot is that when @coretl fixed the area detector stuff, this seg fault was introduced. It would have gone undetected because I think I was using a container with Python 3.9 when I was developing it...

The latest version of p4p is 4.1.9, which was released on the 25th of July. This requires pvxslibs<1.3.0a1,>=1.2.2... and the latest version of pvxslibs is 1.2.2 which was released on the 3rd of July.

The CI for p4p seems to pass on python 3.10: https://github.com/mdavidsaver/p4p/actions/runs/5660869421/job/15337538188 with pvxslibs at 1.2.2, which leads me to believe that although we definitely shouldn't be getting segfaults, we must be doing something naughty with p4p.

I've tried going all the way back to p4p==4.1.5 and pvxslibs==1.1.4 (effectively April this year) and I'm still getting the segfault. So either we're doing something silly or p4p has always had this problem and it's gone under the radar. I'll investigate.

coretl commented 1 year ago

You could try doing `atexit.register(context.close)` on the global context. It might be something to do with the fact that we have one context in ophyd/v2/_p4p.py and another one in ophyd_epics_devices/panda.py, so maybe try switching to the one from ophyd and see if the problem goes away? It might also be something about destroying the context in a different thread than the one it was created in. Just a guess though, no evidence to support this...

rosesyrett commented 1 year ago

I've found something weird while investigating... if I have a test that just makes a panda, i.e.

```python
def test_...():
    my_panda = PandA("something")
```

then if I add a `__del__` method to the PandA, it gets called. However, if I wrap it in a DeviceCollector block:

```python
async def test_something():
    async with DeviceCollector(sim=True):
        my_panda = PandA("something")
```

then the `__del__` method never gets called. You're probably right re: the two contexts, but I thought I'd try and see if I could fix it by managing my context better. From the docs, it looks like I should be calling `.close()` and `.disconnect()` on the context, for example. I thought I'd do that in the `__del__` method, which gets called when the reference count of an object goes to 0. Above, it seems like the DeviceCollector increases the reference count of whatever is created inside it. That sounds problematic. Am I missing something?
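The refcount behaviour described above can be reproduced stand-alone (the `holder` dict below plays the role the DeviceCollector's internal dict seems to play; it is not real ophyd code):

```python
class Probe:
    """Records whether __del__ has run."""

    deleted = False

    def __del__(self):
        Probe.deleted = True


holder = {}  # mimics a collector caching the objects created inside it


def make() -> None:
    p = Probe()
    holder["p"] = p  # this extra reference outlives the function
    # local p goes out of scope here, but the object stays alive


make()
# __del__ has NOT run: holder still references the Probe
print(Probe.deleted)  # False

holder.clear()
# dropping the last reference runs __del__ (immediately, in CPython)
print(Probe.deleted)  # True
```

So as long as the collector keeps its dict populated, a `__del__`-based cleanup on the device will never fire.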

rosesyrett commented 1 year ago

Okay, I've fixed it now using your suggestion and there is a PR coming up for it.

coretl commented 1 year ago

> then the `__del__` method never gets called. You're probably right re: the two contexts, but I thought I'd try and see if I could fix it by managing my context better. From the docs, it looks like I should be calling `.close()` and `.disconnect()` on the context, for example. I thought I'd do that in the `__del__` method, which gets called when the reference count of an object goes to 0. Above, it seems like the DeviceCollector increases the reference count of whatever is created inside it. That sounds problematic. Am I missing something?

You're right, DeviceCollector is holding onto a reference: https://github.com/bluesky/ophyd/blob/5c413cc0848ef035b9839de328064d6af47b589b/ophyd/v2/core.py#L317C40-L317C40

It should probably clear that dict when it's done with it...
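A rough sketch of what that fix could look like, using a simplified stand-in for DeviceCollector (the real class inspects the caller's locals to find devices; the `add()` method here is a hypothetical shortcut for that):

```python
import asyncio


class Device:
    def __init__(self) -> None:
        self.name = ""


class DeviceCollector:
    """Simplified stand-in for ophyd.v2.core.DeviceCollector."""

    def __init__(self) -> None:
        self._collected: dict[str, Device] = {}

    async def __aenter__(self) -> "DeviceCollector":
        return self

    def add(self, name: str, device: Device) -> None:
        # hypothetical shortcut for the locals() inspection the real class does
        self._collected[name] = device
        device.name = name

    async def __aexit__(self, *exc) -> None:
        # ... name and connect the collected devices here ...
        # the fix: drop the cached references once we're done with them,
        # so the collector no longer keeps the devices alive
        self._collected.clear()


async def main() -> int:
    async with DeviceCollector() as dc:
        dc.add("my_panda", Device())
    return len(dc._collected)


print(asyncio.run(main()))  # 0
```

With the dict cleared on exit, the device's refcount can reach zero once the test's own reference goes away, and `__del__`-style cleanup fires as expected.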