Dies with free(): invalid pointer after ~107.4M calls to xfer()

HowardALandman commented 4 years ago

The attached program, which calls xfer() repeatedly in read mode, dies with free(): invalid pointer after about 107.4 million calls to xfer(). It is a simplified version of code in the read_regs24() method in my tdc7201 library. I'm typically running in Python 3.7.3.

This can be hard to debug because abort() doesn't return from the C level, and so it's impossible to catch the SIGABRT in Python. Also, the spidev library doesn't work with python3-dbg (which might be a separate bug). I recommend running it under gdb with a break in malloc_printerr(). I suppose valgrind might also be helpful if you have the patience. Here's what I get from gdb:

Breakpoint 1, malloc_printerr (str=0x76e028f8 "free(): invalid pointer") at malloc.c:5341 5341 malloc.c: No such file or directory. (gdb) bt #0 malloc_printerr (str=0x76e028f8 "free(): invalid pointer") at malloc.c:5341 #1 0x76d44d50 in _int_free (av=0x76e1f7d4 <main_arena>, p=0x43417c <small_ints.lto_priv+72>, have_lock=<optimized out>) at malloc.c:4165 #2 0x001b4f40 in list_dealloc (op=0x766d0f58) at ../Objects/listobject.c:324 #3 0x7669b660 in ?? () from /usr/lib/python3/dist-packages/spidev.cpython-37m-arm-linux-gnueabihf.so Backtrace stopped: previous frame identical to this frame (corrupt stack?)

On a RPi4 this takes about 57 minutes to hit the bug (usually in batch -1074). But the bug also occurs on a RPi3B+ or a RPi0W. It occurs under both Stretch and Buster (32-bit).

It looks like the error is trying to free the integer 13 more times than it was allocated (<small_ints.lto_priv+72>).

bug_spi2.py.gz

Gadgetoid commented 4 years ago

This is possibly another case of the Py_DECREF issue I found under similar circumstances- https://github.com/doceme/py-spidev/pull/96 - but make sure you're running the very latest release of spidev since some of these issues have been fixed.

I typically never use a tuple when interacting with this library, however, so my tests for stability would have overlooked any tuple-specific bugs.

Here's xfer with the unused SPIDEV_SINGLE and Python version < 3 code branches pruned and some additional comments added. Since I'm useless with valgrind and other leak tracing techniques I usually use intuition and a thorough code review to track these things down:

static PyObject *
SpiDev_xfer(SpiDevObject *self, PyObject *args)
{
    uint16_t ii, len;
    int status;
    uint16_t delay_usecs = 0;
    uint32_t speed_hz = 0;
    uint8_t bits_per_word = 0;
    PyObject *obj;
    PyObject *seq;
    struct spi_ioc_transfer xfer;
    memset(&xfer, 0, sizeof(xfer));
    uint8_t *txbuf, *rxbuf;
    char    wrmsg_text[4096];

    if (!PyArg_ParseTuple(args, "O|IHB:xfer", &obj, &speed_hz, &delay_usecs, &bits_per_word))
        return NULL;

    seq = PySequence_Fast(obj, "expected a sequence");       // seq = New reference
    if (!seq) {
        PyErr_SetString(PyExc_TypeError, wrmsg_list0);
        return NULL;
    }

    len = PySequence_Fast_GET_SIZE(seq);
    if (len <= 0) {
        Py_DECREF(seq);                                                       // seq dec'd
        PyErr_SetString(PyExc_TypeError, wrmsg_list0);
        return NULL;
    }

    if (len > SPIDEV_MAXPATH) {
        snprintf(wrmsg_text, sizeof(wrmsg_text) - 1, wrmsg_listmax, SPIDEV_MAXPATH);
        PyErr_SetString(PyExc_OverflowError, wrmsg_text);
        Py_DECREF(seq);                                                       // seq dec'd
        return NULL;
    }

    txbuf = malloc(sizeof(__u8) * len);
    rxbuf = malloc(sizeof(__u8) * len);

    for (ii = 0; ii < len; ii++) {
        PyObject *val = PySequence_Fast_GET_ITEM(seq, ii);
        {
            if (PyLong_Check(val)) {
                txbuf[ii] = (__u8)PyLong_AS_LONG(val);
            } else {
                snprintf(wrmsg_text, sizeof(wrmsg_text) - 1, wrmsg_val, val);
                PyErr_SetString(PyExc_TypeError, wrmsg_text);
                free(txbuf);
                free(rxbuf);
                Py_DECREF(seq);
                return NULL;
            }
        }
    }

        // If the input object is a tuple, then "seq" will also be a tuple
        // Copy it to a list here for convenience since the code below expects it to be mutable
    if (PyTuple_Check(obj)) {
        Py_DECREF(seq);
        seq = PySequence_List(obj);
    }

    xfer.tx_buf = (unsigned long)txbuf;
    xfer.rx_buf = (unsigned long)rxbuf;
    xfer.len = len;
    xfer.delay_usecs = delay_usecs;
    xfer.speed_hz = speed_hz ? speed_hz : self->max_speed_hz;
    xfer.bits_per_word = bits_per_word ? bits_per_word : self->bits_per_word;
#ifdef SPI_IOC_WR_MODE32
    xfer.tx_nbits = 0;
#endif
#ifdef SPI_IOC_RD_MODE32
    xfer.rx_nbits = 0;
#endif

    status = ioctl(self->fd, SPI_IOC_MESSAGE(1), &xfer);
    if (status < 0) {
        PyErr_SetFromErrno(PyExc_IOError);
        free(txbuf);
        free(rxbuf);
        Py_DECREF(seq);
        return NULL;
    }

    for (ii = 0; ii < len; ii++) {
        PyObject *val = PyLong_FromLong((long)rxbuf[ii]);  // val = New reference
        PySequence_SetItem(seq, ii, val);                              // list item gets new reference, does not steal
        Py_DECREF(val);                                                        // PySequence_SetItem does not steal reference, must Py_DECREF(val)
    }

    // WA:
    // in CS_HIGH mode CS isn't pulled to low after transfer, but after read
    // reading 0 bytes doesnt matter but brings cs down
    // tomdean:
    // Stop generating an extra CS except in mode CS_HOGH
    if (self->mode & SPI_CS_HIGH) status = read(self->fd, &rxbuf[0], 0);

    free(txbuf);
    free(rxbuf);

        // In cases where the input "obj" is a tuple, a tuple is returned
    if (PyTuple_Check(obj)) {
                // Gymnatics to decrement the reference count for seq, but still convert it to a tuple
                // It's possible that this is failing and we could probably stand to rewrite this to something simpler

                // Is this creating a new reference and then immediately shadowing it with seq? giving us a leak?
        PyObject *old = seq;
        seq = PySequence_Tuple(seq);
        Py_DECREF(old);
    }

    return seq;
}

The PyTuple_Check code branch looks deeply suspicious to me, and I'd be tempted to rewrite it into something like:

    if (PyTuple_Check(obj)) {
        PyObject *new;
        new = PySequence_Tuple(seq);
        Py_DECREF(seq);
                return new;
    }

Since I prefer to return early than try to mutate the return value.

I will run your code and see if I can hit the same error, then try my hunches and see what happens.

HowardALandman commented 4 years ago

$ python3 -m pip show spidev says I have version 3.4 on all machines.

I don't think tuple-vs-list matters. In my original it was a list and the bug still happened (inside of list_ass_item()). I changed it to a tuple to see if that would fix it; it didn't.

Gadgetoid commented 4 years ago

I'm currently on batch = -57400001 which I assume is 57m calls. Running at 80MHz but wiggling the pins into thin air so it doesn't matter. Will see what I run into at the 107.4M mark.

Edit: now at batch = -118600001 - I deeply suspect this has already been fixed.

HowardALandman commented 4 years ago

Nod, I upgraded to 3.5 and on the RPi4 it's passed 227M with no problem, and 118M on the RPi3B+. I'll let both machines run for a bit longer. If they both look good, I'll run my original full program on the RPi3B+ with the TDC7201 attached, but that'll take a day. And bump the tdc7201 module's setup.py to require spidev>=3.5.

Gadgetoid commented 4 years ago

Thanks and good luck! It's handy to have someone else helping bash the rough edges off this library.

HowardALandman commented 4 years ago

Full program has run ~200M full measurement cycles without hitting this error, so I'm marking it fixed on my own software. :-)

Gadgetoid commented 4 years ago

Excellent! Thanks for testing so thoroughly. It's good to have some extra confidence on this, since I appreciate this library might have a.... few users 😆

HowardALandman commented 4 years ago

Just to follow up ... my latest stress test has been running for over a week, and has executed over 1.6 billion measurement cycles with no sign of this bug. I am still seeing a few hardware errors, about 1 per 213 thousand measurements, but I don't think that's spidev's fault. And, considering that a few months ago it was 1 per thousand, I think I can live with those numbers for now. :-) Let me know if you ever want a beta version tested.

doceme / py-spidev

Dies with free(): invalid pointer after ~107.4M calls to xfer() #107