bluesky / ophyd

hardware abstraction in Python with an emphasis on EPICS
https://blueskyproject.io/ophyd
BSD 3-Clause "New" or "Revised" License

Consistent numpy byte order in Ophyd Signal #1186

Open cjtitus opened 7 months ago

cjtitus commented 7 months ago

This is a very niche issue, but important for some of the pod systems I'm running.

Description of the problem

Some time ago, I started running Ophyd and friends in a pod. However, I quickly ran into problems trying to move data over Kafka, which were related to the following behavior of tiled.utils.safe_json_dump when big-endian data is passed in:

In [82]: safe_json_dump({"test": np.array([0.0, 1.1, 2.2], dtype='<f8')})
Out[82]: b'{"test":[0.0,1.1,2.2]}'

In [83]: safe_json_dump({"test": np.array([0.0, 1.1, 2.2], dtype='>f8')})
Out[83]: b'{"test":[0.0,-1.5423487136706484e-180,-1.5423487136574978e-180]}' 

The JSON serializer in Tiled relies directly on the orjson package, which doesn't account for byte order at all, meaning the data does not round-trip.
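Until that is fixed somewhere upstream, a workaround on the producer side is to cast arrays to native byte order before handing them to the serializer. A minimal sketch (safe_json_dump as in the example above; the expected output in the comment is just my reading of that example):

import numpy as np
from tiled.utils import safe_json_dump

arr = np.array([0.0, 1.1, 2.2], dtype=">f8")      # big-endian, like the failing case above

# "=" means native, so astype copies the data into the host's byte order.
native = arr.astype(arr.dtype.newbyteorder("="))

safe_json_dump({"test": native})                  # should now match the '<f8' output above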

But why was my data big-endian in the first place? This turns out to come directly from the Ophyd signal:

In [9]: tes.energies.cl.caget(tes.energies.pvname)
Out[9]: 
array([200.5, 201.5, 202.5, 203.5, 204.5, 205.5, 206.5, 207.5, 208.5,
       ...
       992.5, 993.5, 994.5, 995.5, 996.5, 997.5, 998.5, 999.5],
      dtype='>f8')

This means that something in my container's network stack is prompting Caproto/EPICS to pass around big-endian data, which makes sense: networking is pretty much the only place big-endian data is still used.
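A quick way to confirm the mismatch is to compare the dtype of the returned array against the host's byte order (same tes.energies signal as in the session above):

import sys

arr = tes.energies.cl.caget(tes.energies.pvname)

print(sys.byteorder)         # 'little' on a typical x86/ARM host
print(arr.dtype.byteorder)   # '>' -- big-endian, i.e. non-native here
print(arr.dtype.isnative)    # False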

Where could I fix this?

This could probably be fixed in several places. In rough order of data handling, they are Caproto/EPICS (or my network configuration), Ophyd, the Tiled serializer, and orjson; each option is discussed below.

Why fix this in Ophyd?

I believe that Ophyd is the most plausible/helpful place to correct the endianness issue.

Why not fix Caproto/EPICS/my network configuration?

I personally haven't been able to figure out why my container is causing Caproto to send big-endian data, but even if the underlying issue is in Caproto, or even in EPICS, I believe that Ophyd should be able to handle this case more gracefully for maximum compatibility. Sometimes you'll get big-endian data, apparently, and Ophyd should be able to handle that.

Why not fix the serializer in Tiled?

This was my initial thought -- just add some checks to safe_json_dump in Tiled! Unfortunately, by the time you get to Tiled, you are dealing with highly nested data. It was easy to serialize a solitary array, but to serialize Tiled data correctly I would need to walk the values of the dictionary, check whether any of them is a numpy array, and change its byte order. Now the Tiled serializer has side effects unless it deep-copies all of the content passed in, and it is on the hook for recursing through the nested entries of any dictionary, list, or other container. No thanks!
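For illustration, the kind of recursive walker that would be needed looks something like this (a hypothetical helper, not anything that exists in Tiled):

import numpy as np

def normalize_byte_order(obj):
    # Copy any non-native numpy array to native order, rebuilding every
    # container along the way so the caller's data is never mutated.
    if isinstance(obj, np.ndarray) and not obj.dtype.isnative:
        return obj.astype(obj.dtype.newbyteorder("="))
    if isinstance(obj, dict):
        return {key: normalize_byte_order(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [normalize_byte_order(item) for item in obj]
    # ...and so on for every other container type that might show up.
    return obj

Even this toy version copies every container it touches and needs a branch for every type it might encounter, which is exactly the overhead I'd rather not bake into the serializer.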

Why not fix orjson?

This is plausible, and I have submitted an issue to orjson. I don't know whether they'll fix it, whether they view it as a bug, or whether they consider the serializer to be working as intended. Even if it is fixed in orjson, though, that wouldn't help anybody who is using a different serializer, or who has other issues related to inconsistent endianness in Ophyd.

Summary

I think that fixing this in Ophyd would be the most practical, consistent solution. We can check all the data that we ingest, and at least make sure that it's converted to a consistent byte order in Ophyd.

Where should we fix this in Ophyd?

The most immediate place to fix this, to me, is the existing _fix_type method of EpicsSignalBase. It's a logical place for this kind of kludge, and it's really the first place that Ophyd handles values that come in from the outside.
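For concreteness, the check I have in mind is small. Something along these lines, where ensure_native_byte_order is just a hypothetical helper name and the exact _fix_type plumbing may differ:

import numpy as np

def ensure_native_byte_order(value):
    """Return value unchanged unless it is a numpy array in non-native byte order."""
    if isinstance(value, np.ndarray) and not value.dtype.isnative:
        # astype with "=" (native) copies the data into the host's byte order.
        return value.astype(value.dtype.newbyteorder("="))
    return value

# Roughly where it would slot in: _fix_type is the first place EpicsSignalBase
# touches values coming back from the control layer, so calling
# ensure_native_byte_order(value) there would normalize every array we ingest.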

I don't think it's possible to go lower, as the control layer is really just a shim around pyepics, caproto, etc. But I'm open to suggestions.

tacaswell commented 7 months ago

I'm pretty curious how to accidentally make caproto give you big-endian values!

Is there a reasonable way to test this?

cjtitus commented 7 months ago

If you like, I can whip my containers into shape, try to push working versions, and you can see for yourself. For what it's worth, I don't think the issue is Caproto, in that I don't think big-endian values are being generated by caproto. In my mind, it is probably EPICS somehow dictating that they get serialized and sent over the network that way.

I was as surprised as anyone that this could be a problem, but here we are -- I can't run my pods unless this is fixed.

cjtitus commented 7 months ago

Just as a follow-up, I did open an issue in orjson, and the outcome is that putting big-endian values in will be an error in the future. https://github.com/ijl/orjson/issues/472

So that adds a reason to check for this in Ophyd to make sure we just never end up passing data that will cause problems later on.

tacaswell commented 7 months ago

Digging into this, big endian is the "native" format for CA: https://github.com/caproto/caproto/blob/46020e4816c7dbe8ccae9b7f0af362ce9a7f4f07/caproto/_numpy_backend.py#L11-L23

I think the bug here is actually in caproto in the pyepics_compat layer as it looks like pyepics is converting to native but caproto is not:

✔ 10:10:38 [belanna] @ ipython
Python 3.12.3+ (heads/3.12:04d07964f2c, Apr 10 2024, 21:02:51) [GCC 13.2.1 20230801]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.24.0.dev -- An enhanced Interactive Python. Type '?' for help.

In [1]: import epics

In [2]: pv = epics.get_pv('arr:array_int')
**** The executable "caRepeater" couldn't be located
**** because of errno = "No such file or directory".
**** You may need to modify your PATH environment variable.
**** Unable to start "CA Repeater" process.

In [3]: CA.Client.Exception...............................................
    Warning: "Identical process variable names on multiple servers"
    Context: "Channel: "arr:array_int", Connecting to: 192.168.86.33:5064, Ignored: belanna.tacaswell.github.beta.tailscale.net:5064"
    Source File: ../cac.cpp line 1320
    Current Time: Tue Apr 16 2024 10:10:59.964395438
..................................................................
In [3]: pv.read()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[3], line 1
----> 1 pv.read()

AttributeError: 'PV' object has no attribute 'read'

In [4]: pv.get()
Out[4]: array([3])

In [5]: import caproto.threading.pyepics_compat

In [6]: pv2 = caproto.threading.pyepics_compat.get_pv('arr:array_int')

PV arr:array_int with cid 13138 found on multiple servers. Accepted address is 100.118.211.61:5064. Also found on 127.0.0.1:5064
PV arr:array_int with cid 13138 found on multiple servers. Accepted address is 100.118.211.61:5064. Also found on 192.168.86.33:5064
In [7]: pv2.get()
Out[7]: array([3], dtype='>i4')

In [8]: CA client library is unable to contact CA repeater after 50 tries.
Silence this message by starting a CA repeater daemon
or by calling ca_pend_event() and or ca_poll() more often.

I am inclined to say that this should be fixed farther down.

tacaswell commented 7 months ago

Additionally, the change upstream in orjson is not to reject big-endian; it is to reject non-native byte order.
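For reference, here is how numpy spells that distinction on a little-endian host (it flips on a big-endian one):

import numpy as np

np.dtype("<f8").isnative   # True  -- little-endian matches this host
np.dtype("=f8").isnative   # True  -- '=' always means native
np.dtype(">f8").isnative   # False -- non-native here, native on a big-endian host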

cjtitus commented 7 months ago

Good catch on the specifics of the orjson/numpy endian-specified vs native distinction. I will try to pull your caproto fix branch and make sure it also fixes the problem. Assuming it does, I don't have a major preference about where this gets fixed. Knowing that this is specifically due to a bug in Caproto's pyepics_compat does make me hopeful that once that gets merged, an additional check in Ophyd won't be necessary. And anyway, future problems of this sort will be raised as errors by orjson.

Your caproto fix does make me think that, if we do add an additional check in Ophyd, I should do it the same way: check for non-native byte order and swap to native, rather than setting the endianness explicitly.
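Roughly this distinction (a toy example):

import numpy as np

value = np.array([200.5, 201.5], dtype=">f8")    # e.g. what the control layer handed us

value.astype("<f8")                              # forces little-endian; non-native on a
                                                 # big-endian host, so still a problem there
value.astype(value.dtype.newbyteorder("="))      # swaps to native; correct on any host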