Datatype of uncalibrated waveform in uproot backend causes unexpected behaviour

RNO-G / mattak

RNO-G dataformats

1 stars 1 forks source link

Datatype of uncalibrated waveform in uproot backend causes unexpected behaviour #29

Closed fschlueter closed 4 months ago

fschlueter commented 4 months ago

When getting accessing a uncalibrated waveform via the uproot back and and multiplies it with a float (for example to perform a pseudo calibration), the result is wrong.

E.g. try something like this:

d = mattak.Dataset.Dataset(0, 0, "run8955/", backend="uproot")
d.setEntries(200)
wf_adc = d.wfs()[0]

fig, ax = plt.subplots()
ax.plot(wf_adc)
wf_voltage = wf_adc * 2500 / 4095
ax2 = ax.twinx()
ax2.plot(wf_voltage)
plt.show()

A possible solution would be to convert the wf array to an float, i.e., add to the wfs() function a w = numpy.asarray(w, dtype=float).

fschlueter commented 4 months ago

@cozzyd I can make the fix but please let me know what you think.

cozzyd commented 4 months ago

I'm a little confused, I thought numpy would already change the output data type?

$ ipython3
Python 3.12.1 (main, Dec 18 2023, 00:00:00) [GCC 13.2.1 20231205 (Red Hat 13.2.1-6)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: v = np.array([1,2,3,4],dtype=np.int16)

In [3]: v
Out[3]: array([1, 2, 3, 4], dtype=int16)

In [4]: scaled_v = v * 2500/4095

In [5]: scaled_v
Out[5]: array([0.61050061, 1.22100122, 1.83150183, 2.44200244])

cozzyd commented 4 months ago

maybe the problem is that they're not numpy arrays but awkward arrays or something?

cozzyd commented 4 months ago

It also won't let you do it in-place...

 v*= 2500/4095

UFuncTypeError: Cannot cast ufunc 'multiply' output from dtype('float64') to dtype('int16') with casting rule 'same_kind'

cozzyd commented 4 months ago

So actually the problem is more subtle.

The way you wrote it, first you multiply by 2500, which is an integer and therefore does not result in converting the output to float, but unfortunately overflows and wraps around (and while numpy will convert output to float when you do a float operation, it does not detect overflow and convert to float or a bigger int in that case). Then that result is divided by 4095 which in Python is a floating point operation and results in float output.

If you did instead * (2500/4095) you would not have this problem (and you would also as a side effect avoid looping over the array twice). Whether or not this is something we should fix, I'm not sure. Almost any useful operation will end up converting to float ayway, but there may be cases where you want to save the memory / cycles.

Probably the best thing to do is automatically convert to float by default but add a flag to avoid the cast so that if you "know what you're doing" you can avoid it.

fschlueter commented 4 months ago

Or can we use int32 instead of int16? Not sure if that screws up the "casting" out of the uproot file? And it also only pushes the issue to higher values ..

cozzyd commented 4 months ago

Converting to floats isn't much more expensive than converting to ints, so I'm not sure it would make sense to go to an int32. I guess an int32 takes up less space than a float64, but using a float32 is probably very sensible anyway in our case?

See this pull request: https://github.com/RNO-G/mattak/pull/33

fschlueter commented 4 months ago

In the end we casted to float always, merged with #24