Closed HowardALandman closed 2 years ago
This is possibly another case of the Py_DECREF issue I found under similar circumstances- https://github.com/doceme/py-spidev/pull/96 - but make sure you're running the very latest release of spidev since some of these issues have been fixed.
I typically never use a tuple when interacting with this library, however, so my tests for stability would have overlooked any tuple-specific bugs.
Here's xfer with the unused SPIDEV_SINGLE
and Python version < 3 code branches pruned and some additional comments added. Since I'm useless with valgrind and other leak tracing techniques I usually use intuition and a thorough code review to track these things down:
static PyObject *
SpiDev_xfer(SpiDevObject *self, PyObject *args)
{
uint16_t ii, len;
int status;
uint16_t delay_usecs = 0;
uint32_t speed_hz = 0;
uint8_t bits_per_word = 0;
PyObject *obj;
PyObject *seq;
struct spi_ioc_transfer xfer;
memset(&xfer, 0, sizeof(xfer));
uint8_t *txbuf, *rxbuf;
char wrmsg_text[4096];
if (!PyArg_ParseTuple(args, "O|IHB:xfer", &obj, &speed_hz, &delay_usecs, &bits_per_word))
return NULL;
seq = PySequence_Fast(obj, "expected a sequence"); // seq = New reference
if (!seq) {
PyErr_SetString(PyExc_TypeError, wrmsg_list0);
return NULL;
}
len = PySequence_Fast_GET_SIZE(seq);
if (len <= 0) {
Py_DECREF(seq); // seq dec'd
PyErr_SetString(PyExc_TypeError, wrmsg_list0);
return NULL;
}
if (len > SPIDEV_MAXPATH) {
snprintf(wrmsg_text, sizeof(wrmsg_text) - 1, wrmsg_listmax, SPIDEV_MAXPATH);
PyErr_SetString(PyExc_OverflowError, wrmsg_text);
Py_DECREF(seq); // seq dec'd
return NULL;
}
txbuf = malloc(sizeof(__u8) * len);
rxbuf = malloc(sizeof(__u8) * len);
for (ii = 0; ii < len; ii++) {
PyObject *val = PySequence_Fast_GET_ITEM(seq, ii);
{
if (PyLong_Check(val)) {
txbuf[ii] = (__u8)PyLong_AS_LONG(val);
} else {
snprintf(wrmsg_text, sizeof(wrmsg_text) - 1, wrmsg_val, val);
PyErr_SetString(PyExc_TypeError, wrmsg_text);
free(txbuf);
free(rxbuf);
Py_DECREF(seq);
return NULL;
}
}
}
// If the input object is a tuple, then "seq" will also be a tuple
// Copy it to a list here for convenience since the code below expects it to be mutable
if (PyTuple_Check(obj)) {
Py_DECREF(seq);
seq = PySequence_List(obj);
}
xfer.tx_buf = (unsigned long)txbuf;
xfer.rx_buf = (unsigned long)rxbuf;
xfer.len = len;
xfer.delay_usecs = delay_usecs;
xfer.speed_hz = speed_hz ? speed_hz : self->max_speed_hz;
xfer.bits_per_word = bits_per_word ? bits_per_word : self->bits_per_word;
#ifdef SPI_IOC_WR_MODE32
xfer.tx_nbits = 0;
#endif
#ifdef SPI_IOC_RD_MODE32
xfer.rx_nbits = 0;
#endif
status = ioctl(self->fd, SPI_IOC_MESSAGE(1), &xfer);
if (status < 0) {
PyErr_SetFromErrno(PyExc_IOError);
free(txbuf);
free(rxbuf);
Py_DECREF(seq);
return NULL;
}
for (ii = 0; ii < len; ii++) {
PyObject *val = PyLong_FromLong((long)rxbuf[ii]); // val = New reference
PySequence_SetItem(seq, ii, val); // list item gets new reference, does not steal
Py_DECREF(val); // PySequence_SetItem does not steal reference, must Py_DECREF(val)
}
// WA:
// in CS_HIGH mode CS isn't pulled to low after transfer, but after read
// reading 0 bytes doesnt matter but brings cs down
// tomdean:
// Stop generating an extra CS except in mode CS_HOGH
if (self->mode & SPI_CS_HIGH) status = read(self->fd, &rxbuf[0], 0);
free(txbuf);
free(rxbuf);
// In cases where the input "obj" is a tuple, a tuple is returned
if (PyTuple_Check(obj)) {
// Gymnatics to decrement the reference count for seq, but still convert it to a tuple
// It's possible that this is failing and we could probably stand to rewrite this to something simpler
// Is this creating a new reference and then immediately shadowing it with seq? giving us a leak?
PyObject *old = seq;
seq = PySequence_Tuple(seq);
Py_DECREF(old);
}
return seq;
}
The PyTuple_Check
code branch looks deeply suspicious to me, and I'd be tempted to rewrite it into something like:
if (PyTuple_Check(obj)) {
PyObject *new;
new = PySequence_Tuple(seq);
Py_DECREF(seq);
return new;
}
Since I prefer to return early than try to mutate the return value.
I will run your code and see if I can hit the same error, then try my hunches and see what happens.
$ python3 -m pip show spidev
says I have version 3.4 on all machines.
I don't think tuple-vs-list matters. In my original it was a list and the bug still happened (inside of list_ass_item()
). I changed it to a tuple to see if that would fix it; it didn't.
I'm currently on batch = -57400001
which I assume is 57m calls. Running at 80MHz but wiggling the pins into thin air so it doesn't matter. Will see what I run into at the 107.4M mark.
Edit: now at batch = -118600001
- I deeply suspect this has already been fixed.
Nod, I upgraded to 3.5 and on the RPi4 it's passed 227M with no problem, and 118M on the RPi3B+. I'll let both machines run for a bit longer. If they both look good, I'll run my original full program on the RPi3B+ with the TDC7201 attached, but that'll take a day. And bump the tdc7201 module's setup.py to require spidev>=3.5.
Thanks and good luck! It's handy to have someone else helping bash the rough edges off this library.
Full program has run ~200M full measurement cycles without hitting this error, so I'm marking it fixed on my own software. :-)
Excellent! Thanks for testing so thoroughly. It's good to have some extra confidence on this, since I appreciate this library might have a.... few users 😆
Just to follow up ... my latest stress test has been running for over a week, and has executed over 1.6 billion measurement cycles with no sign of this bug. I am still seeing a few hardware errors, about 1 per 213 thousand measurements, but I don't think that's spidev's fault. And, considering that a few months ago it was 1 per thousand, I think I can live with those numbers for now. :-) Let me know if you ever want a beta version tested.
The attached program, which calls
xfer()
repeatedly in read mode, dies withfree(): invalid pointer
after about 107.4 million calls toxfer()
. It is a simplified version of code in theread_regs24()
method in my tdc7201 library. I'm typically running in Python 3.7.3.This can be hard to debug because
abort()
doesn't return from the C level, and so it's impossible to catch the SIGABRT in Python. Also, the spidev library doesn't work with python3-dbg (which might be a separate bug). I recommend running it under gdb with a break inmalloc_printerr()
. I suppose valgrind might also be helpful if you have the patience. Here's what I get from gdb:Breakpoint 1, malloc_printerr (str=0x76e028f8 "free(): invalid pointer")
at malloc.c:5341
5341 malloc.c: No such file or directory.
(gdb) bt
#0 malloc_printerr (str=0x76e028f8 "free(): invalid pointer") at malloc.c:5341
#1 0x76d44d50 in _int_free (av=0x76e1f7d4 <main_arena>,
p=0x43417c <small_ints.lto_priv+72>, have_lock=<optimized out>) at malloc.c:4165
#2 0x001b4f40 in list_dealloc (op=0x766d0f58) at ../Objects/listobject.c:324
#3 0x7669b660 in ?? ()
from /usr/lib/python3/dist-packages/spidev.cpython-37m-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
On a RPi4 this takes about 57 minutes to hit the bug (usually in batch -1074). But the bug also occurs on a RPi3B+ or a RPi0W. It occurs under both Stretch and Buster (32-bit).
It looks like the error is trying to free the integer 13 more times than it was allocated (
<small_ints.lto_priv+72>
).bug_spi2.py.gz