DiamondLightSource / pythonSoftIOC

Embed an EPICS IOC in a Python process
Apache License 2.0
31 stars 8 forks source link

Bug when Out records self-update on a Raspberry Pi #153

Closed codedump closed 5 months ago

codedump commented 5 months ago

Hello,

there appears to be a bug when using pythonSoftIOC on a Raspberry Pi.

Try out the "hello world" IOC example from the documentation (reproduced below without all the comments and without the aIn part, for better overview); then modify the on_update=... parameter to self-update the aOut record.

This is something that is necessary, for instance, to implement specific EPICS motor fields, e.g. HOMF, which are expected to update their own value back to 0 once they're done.

Here is the IOC in question:

from softioc import softioc, builder
import cothread
builder.SetDeviceName("MY-DEVICE-PREFIX")
ao = builder.aOut('AO', initial_value=12.45, always_update=True, on_update=lambda v: ao.set(7.62))
builder.LoadDatabase()
softioc.iocInit()
softioc.interactive_ioc(globals())

Then trying to actually trigger the self-update, by simply just writing a value -- any value:

$ caput MY-DEVICE-PREFIX:AO 3.14

This is how it fails on a Raspberry Pi v4, with fairly recent Debian 12.1 (I think it's Bookworm):

INFO: PVXS QSRV2 is loaded, permitted, and ENABLED.
Starting iocInit
############################################################################
## EPICS 7.0.7.0
## Rev. 7.0.7.99.0.2
## Rev. Date 7.0.7.99.0.2
############################################################################
iocRun: All initialization complete
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> Asynchronous callback raised uncaught exception
Traceback (most recent call last):
  File "/home/specuser/fifer-venv/lib/python3.11/site-packages/cothread/cothread.py", line 964, in callback_events
    action(*args)
  File "/home/specuser/fifer-venv/lib/python3.11/site-packages/softioc/cothread_dispatcher.py", line 29, in wrapper
    func(*func_args)
  File "/home/specuser/tmp/./foo.py", line 6, in <lambda>
    on_update=lambda v: ao.set(7.62))
                        ^^^^^^^^^^^^
  File "/home/specuser/fifer-venv/lib/python3.11/site-packages/softioc/device.py", line 264, in set
    db_put_field(_record.NAME, dbf_code, data, length)
  File "/home/specuser/fifer-venv/lib/python3.11/site-packages/softioc/imports.py", line 24, in db_put_field
    return _extension.db_put_field(name, dbr_type, pbuffer, length)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: Python int too large to convert to C ssize_t

The same IOC code works on as expected an x86 based computer.

How to fix this?

Cheers, F.

PS: sorry, the error message somehow got botched while copy-pasting. It's fixed now.

Araneidae commented 5 months ago

I think this might be a bug here: https://github.com/dls-controls/pythonSoftIOC/blob/f2330b6464856d0d05127a79c84ad166a0d47d2d/softioc/extension.c#L103 Specifically, the "n" argument causes pbuffer in the original call to be parsed as an ssize_t, but here we're actually passing a void * ... which ctypes will have passed to us as an integer.

Unfortunately there doesn't seem to be an intptr_t option, so I'm not sure what the correct fix is here, and there isn't a size_t option either. It would be interesting to know the actual value (in hex, please!) of the pbuffer being passed into imports.db_put_field; I'm assuming it's a 64-bit value with the top bit set being passed as an unsigned integer (am a bit surprised to see the top bit set in userspace code, so maybe something else is going on?)

The buffer being passed is constructed as

ctypes.addressof(ctypes.c_double(7.62))

This is a (large) integer that we somehow need to pass to extension.db_put_field as a pointer. Can you print the hex values you get on your system and compare?

I would expect you to be able to reproduce the problem even more simply: does calling ao.set(7.62) straight after .iocInit() also fail in the same way?

Araneidae commented 5 months ago

A closer look at how things are implemented suggests a possible fix, but we'll need to be able to reproduce the problem. Can you be very specific about the target system where you have the bug, as there are a lot of versions of the RPi around now. The output of uname -a would probably be quite helpful.

Looking at the sources for ctypes I see that addressof is implemented thus: https://github.com/python/cpython/blob/d308d33e098d8e176f1e5169225d3cf800ed6aa1/Modules/_ctypes/callproc.c#L1784

return PyLong_FromVoidPtr(((CDataObject *)obj)->b_ptr);

and according to the documentation we can call PyLong_AsVoidPtr() on this to get the origin value back.

Araneidae commented 5 months ago

This might do the trick:

diff --git a/softioc/extension.c b/softioc/extension.c
index fd71687..d14feac 100644
--- a/softioc/extension.c
+++ b/softioc/extension.c
@@ -98,9 +98,12 @@ static PyObject *db_put_field(PyObject *self, PyObject *args)
 {
     const char *name;
     short dbrType;
-    void *pbuffer;
+    PyObject *buffer_ptr;
     long length;
-    if (!PyArg_ParseTuple(args, "shnl", &name, &dbrType, &pbuffer, &length))
+    if (!PyArg_ParseTuple(args, "shOl", &name, &dbrType, &buffer_ptr, &length))
+        return NULL;
+    void *pbuffer = PyLong_AsVoidPtr(buffer_ptr);
+    if (!pbuffer)
         return NULL;

     struct dbAddr dbAddr;
@@ -133,9 +136,12 @@ static PyObject *db_get_field(PyObject *self, PyObject *args)
 {
     const char *name;
     short dbrType;
-    void *pbuffer;
+    PyObject *buffer_ptr;
     long length;
-    if (!PyArg_ParseTuple(args, "shnl", &name, &dbrType, &pbuffer, &length))
+    if (!PyArg_ParseTuple(args, "shOl", &name, &dbrType, &buffer_ptr, &length))
+        return NULL;
+    void *pbuffer = PyLong_AsVoidPtr(buffer_ptr);
+    if (!pbuffer)
         return NULL;

     struct dbAddr dbAddr;

@AlexanderWells-diamond , can we look into whether we can reproduce this issue? I think the fix above is probably the right way to do things, but it'll need some checking.

codedump commented 5 months ago

A closer look at how things are implemented suggests a possible fix, but we'll need to be able to reproduce the problem. Can you be very specific about the target system where you have the bug, as there are a lot of versions of the RPi around now.

As I said in the bug report, it's a Raspberry Pi v4 with a Debian 12.1.

The output of uname -a would probably be quite helpful.

Linux fifer 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr  3 17:24:16 BST 2023 aarch64 GNU/Linux

The bug is also triggered by a ao.set(...) right after iocInit(), btw, and also by any other EPICS record type (I expressly tried stringOut, longOut and boolOut.

Everything works as expected when using In record types, e.g. boolIn or longIn, BTW.

codedump commented 5 months ago

[...] Unfortunately there doesn't seem to be an intptr_t option, so I'm not sure what the correct fix is here, and there isn't a size_t option either. It would be interesting to know the actual value (in hex, please!) of the pbuffer being passed into imports.db_put_field;

How would I provide that to you? Is there any way to access that from Python? Or do I need to build from source and printf() all over the place?

I'm assuming it's a 64-bit value with the top bit set being passed as an unsigned integer (am a bit surprised to see the top bit set in userspace code, so maybe something else is going on?)

The buffer being passed is constructed as

ctypes.addressof(ctypes.c_double(7.62))

This is a (large) integer that we somehow need to pass to extension.db_put_field as a pointer. Can you print the hex values you get on your system and compare?

This is on a x86 system:

>>> hex(ctypes.addressof(ctypes.c_double(7.62)))
'0x7f687cd8dc20'

And this on the Raspi:

hex(ctypes.addressof(ctypes.c_double(7.62)))
'0xf74eeaf8'

I would expect you to be able to reproduce the problem even more simply: does calling ao.set(7.62) straight after .iocInit() also fail in the same way?

Yes, it fails the same.

But keep in mind (see my other post) that this only fails for ...Out type of records, and it does fail for all of them as far as I can tell.

I don't know about pythonSoftIOC internals, but db_put_field doesn't sound like it's specific only to Out records, is it?

codedump commented 5 months ago

This might do the trick: [...]

I've patched pythonSoftIOC by hand and gave it a pip install -e .... Apparently your patch does indeed work. This is the Python script I'm using to test:

from softioc import softioc, builder
import cothread
builder.SetDeviceName("MY-DEVICE-PREFIX")
ao = builder.longOut('AO', initial_value=0)
builder.LoadDatabase()
softioc.iocInit()
ao.set(1)
softioc.interactive_ioc(globals()) 

Again, the same script failed previously.

Thanks!

Araneidae commented 5 months ago

That printf was probably enough information. Something very odd here ... I was expecting a 64-bit address! I definitely don't understand why that doesn't fit into a ssize_t. One wild idea ... is it possible that your Python is somehow using 32-bit addresses on a 64-bit system? I'm thinking of something like the rather weird x32 libraries ... but I'm pretty sure that's dead now and never made it onto aarch64.

Anyhow, it's excellent news that my fix works for you. We'll apply it our end and run through our regression tests, and I expect it can be a general fix.

Araneidae commented 5 months ago

Fixed in merge #154

Araneidae commented 5 months ago

I don't know about pythonSoftIOC internals, but db_put_field doesn't sound like it's specific only to Out records, is it?

Actually, yes it is! The difference is that .set() on In records is part of the normal driver -> EPICS processing flow and so is forwarded through normal record processing, whereas on Out records the updated value has to be effectively injected into the EPICS processing layer through a low level API: db_put_field. The difference is that In records carry content from the IOC to the outside, and so the IOC is in charge, whereas for Out records the flow of content is "pushed" from outside into the IOC. The In/Out naming convention is back to front.

EmilioPeJu commented 5 months ago

Hi @codedump, apparently you are using an Arch64 ABI called ILP32 in which pointers are 4 bytes, Could you please confirm that is the case? you can do that by building a small c program that prints the value of sizeof(void *), additionally, you could also check the size of ssize_t to know if the problem was about size mismatching or about unsigned/signed conversion

codedump commented 5 months ago

I am sorry for the late reply. I'm away from the system for now and will be for another while, so I can't test, but that being a 32 bit system makes sense.

It's an "older" Pi 4 (meaning: a few years back), and there was a time when Debian for ARM wasn't get stable enough for the Pi 4. The recommended practice back then eas to install the 32-bit version, and this is what might've happened here.

The system is running a "recent Debian" in the sense that it was a recent dist-upgrade.

The C pointer test seems to indicate, anyway, that it was a 32-bit system, and that's plausibility consistent with the.history and age of the system (it's a custom-made sample holder at BESSY-II, about 3 years old).

Im any case, the fix was working properly.

Thanks & Cheers, F.

Araneidae commented 5 months ago

Well, I'm very glad we caught this tricksy bug, and am happy with the fix .. so thanks for trying this on a weird system!