Closed David00 closed 4 years ago
Since this isn't a Python issue, I am unable to catch the problem with Python's error handling features. For anyone reading this in the future, a workaround is to configure your application as a service so it can be restarted automatically after a crash:
https://medium.com/@benmorel/creating-a-linux-service-with-systemd-611b5c8b91d6
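As a rough sketch of the approach from that article (the unit name, user, and paths below are placeholders, not taken from the link), a minimal systemd unit that restarts a crashed Python service might look like:

```ini
# /etc/systemd/system/spi-reader.service  (hypothetical name and paths)
[Unit]
Description=SPI reader service
After=network.target

[Service]
Type=simple
User=pi
WorkingDirectory=/home/pi/spitest
ExecStart=/home/pi/spitest/venv/bin/python3 yourscript.py
# Restart automatically if the process dies with a non-zero exit or signal
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now spi-reader.service`; systemd will then relaunch the script a few seconds after any crash.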
I see a `python3[...]: free(): invalid pointer` message too when my service restarts every 60 hours or so. Since I'm running it as a service that restarts automatically, it's no big issue, but it would be nice if a fix were available.
It is a security feature in most C standard libraries to abort execution immediately if `free` is called on an invalid address. If you have the time, you can compile the Python extension with ASAN (AddressSanitizer); this should give a detailed crash report.
Why are you so certain that spidev is the root cause of all this? I can see that you are using some other libraries as well. This influx thing stood out particularly to me. As you are only using `xfer2`, I took a short look at its implementation and could not find any obvious bug leading to the described behaviour. So if you really want to trace this bug down, you should compile influx with ASAN as well. I think spidev is not guilty :) If this is too much work, or you are inexperienced in this, it might be sufficient to run Python under valgrind, e.g. `PYTHONMALLOC=malloc valgrind python ./yourscript.py`
Thanks KoWu. I have not heard of ASAN but I would be interested in recompiling the libraries to get some more info. I'll probably try with valgrind first (which I'm not familiar with either, but Google is quite awesome in that regard).
I suspected spidev first because I have had doubts about the performance of my hardware (a Raspberry Pi 3B+), and of the few libraries I'm using, spidev is doing the most work. My project reads from the SPI interface about 12 thousand times per second. Also, I did not suspect the influxdb library as the cause because it builds on the `requests` and `socket` libraries, which are much, much bigger projects and have been field-tested on a much larger scale than spidev. I think my suspicion is reasonably placed, but since I opened this issue, I am committed to trying to pinpoint the problem wherever it lies.
I didn't know how to continue troubleshooting without a Python traceback, so thank you for providing me with some ideas on how to dig deeper. I'll report back with some findings after exploring the valgrind option.
Confirmed that this issue exists, either in spidev itself or in something it relies upon, using the following code running under Python 3 against the released spidev 3.4.
I've also had multiple user reports of a project that relies on spidev periodically crashing (plus prior encounters with the same issue that I thought were fixed in 3.4):
```python
import spidev
import time

bus = spidev.SpiDev(0, 0)
bus.mode = 0
bus.lsbfirst = False
bus.max_speed_hz = 80 * 1000000

bytes_total = 0
transfers_total = 0
last_update = time.time()

while True:
    bus.xfer([0b10101010] * 4096)
    bytes_total += 4096
    transfers_total += 1
    if time.time() - last_update >= 5.0:
        print(time.time(), bytes_total, transfers_total)
        last_update = time.time()
```
After 1.04 million writes, it fails with `free(): invalid pointer`. This is approximately 21 minutes of continuous testing time.
I'm confused, because I rigorously tested spidev 3.4 due to prior encounters with this issue. In this case, however, my isolated tests encountered a segmentation fault in spidev.
See: https://github.com/pimoroni/mopidy-pidi/issues/3 And: https://github.com/pimoroni/st7735-python/pull/7 (which mentions testing to 1.8 million display updates, which is considerably in excess of the conditions above)
Either my prior testing was insufficient (the scant mention of 1.8 million cycles seems to suggest otherwise) or something has broken/regressed in SPI elsewhere in the Raspberry Pi kernel-space that is causing this problem.
Edit: Probably coincidence but 1040000*4096 bytes is suspiciously close (in relative terms) to 2^32.
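The arithmetic behind that hunch is easy to check directly:

```python
# Rough check of the 2^32 coincidence noted above
bytes_sent = 1_040_000 * 4096  # approx. transfers observed before the crash
limit = 2 ** 32                # the 32-bit boundary, 4 GiB

print(bytes_sent)           # 4259840000
print(limit)                # 4294967296
print(bytes_sent / limit)   # ~0.992, within about 1% of 2^32
```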
I am currently running a test using pure C ioctl access to the SPI bus to see if I see any related failure at >1 million writes. This is an especially frustrating problem to track down since it seems to consistently take >20 minutes of full-speed bashing the SPI bus to encounter the error. Right now I'm at 1.07 million and counting, so it would seem like the issue is potentially with py-spidev not being defensive enough.
I'm not too hot on this level of C/C++ debugging, but I did run valgrind against my minimal example script and ran into this burst of errors before the big finale:
```
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864A48: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864B10: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864B1C: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Use of uninitialised value of size 4
==20337== at 0x4864B24: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864B28: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864B44: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864B58: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864B70: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864B84: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864BB0: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864BDC: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864BE4: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864BFC: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864C14: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864C28: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864C44: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
==20337== Conditional jump or move depends on uninitialised value(s)
==20337== at 0x4864C54: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so)
==20337==
10561.25 - 4,260,139,008b 1,040,073
10621.26 - 4,284,350,464b 1,045,984
==20337== Invalid free() / delete / delete[] / realloc()
==20337== at 0x4848D14: free (vg_replace_malloc.c:538)
==20337== by 0x1B4F3F: ??? (in /home/pi/spitest/venv/bin/python3)
==20337== Address 0x435174 is in the BSS segment of /home/pi/spitest/venv/bin/python3
```
The notable excerpt being:
```
10621.26 - 4,284,350,464b 1,045,984
==20337== Invalid free() / delete / delete[] / realloc()
==20337== at 0x4848D14: free (vg_replace_malloc.c:538)
==20337== by 0x1B4F3F: ??? (in /home/pi/spitest/venv/bin/python3)
==20337== Address 0x435174 is in the BSS segment of /home/pi/spitest/venv/bin/python3
==20337==
```
Interestingly (and again, this might be coincidence), this appears to be right at the point at which the number of bytes transmitted advances from < 2^32 to >= 2^32.
This output is not, however, sufficient for me to reach any conclusion as to where the fault might be occurring. But since I have a Pi 4, this points to a potential memory leak, perhaps?
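For anyone trying to confirm a suspected leak like this from pure Python (without recompiling anything), the standard-library `tracemalloc` module can show whether allocations grow without bound. This is only a sketch: `leaky_xfer` below is a hypothetical stand-in for a leaking transfer call, not the real `bus.xfer`:

```python
import tracemalloc

_hoard = []

def leaky_xfer(data):
    # Hypothetical stand-in for a leaking C call: each invocation
    # keeps a copy of the buffer alive forever.
    _hoard.append(list(data))

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

for _ in range(1_000):
    leaky_xfer([0b10101010] * 4096)

after, _ = tracemalloc.get_traced_memory()
print(f"traced allocations grew by {after - before} bytes over 1,000 calls")
```

One caveat: `tracemalloc` only sees allocations routed through Python's memory allocator, so a C extension calling `malloc` directly would not show up; for that case valgrind or ASAN, as suggested earlier in the thread, remains the right tool.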
Now that I'm reasonably sure we're leaking obscene amounts of RAM, I have whittled the code down and eyeballed it to see where that might be happening. `malloc` and `free` calls seem to match up, but I've noticed some issues around Python reference counting. Specifically here:
As near as I can tell, this code is missing a call to `Py_DECREF(val)` after assigning the sequence item to the list.
I'm now running tests against this change and will prep a PR.
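For anyone following along: the distinction matters because some CPython list-setting APIs "steal" a reference and some do not. `PyList_SET_ITEM` takes ownership of the value, whereas `PySequence_SetItem` adds its own reference and leaves the caller's reference outstanding, so the caller must `Py_DECREF` it or the object is never freed. The semantics can be illustrated from Python itself with `sys.getrefcount` (a sketch of the general mechanism, not the actual py-spidev code):

```python
import sys

val = object()
lst = [None]

before = sys.getrefcount(val)
lst[0] = val  # like PySequence_SetItem: the list takes its OWN reference
after = sys.getrefcount(val)

# The list now holds one additional reference. In C extension code, the
# caller would still own its original reference from Py_BuildValue and
# must Py_DECREF it, or the object leaks on every transfer.
print(after - before)  # 1
```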
Thank you for the PR @Gadgetoid! I compiled this library from master using this commit and loaded it into my Python environment for testing. My application would typically crash after about 2 days, so I'll keep an eye on it and report back in several days.
Thanks @David00 for reporting this and thanks @Gadgetoid for diving in with the analysis and fix. I really appreciate it.
No worries! This manifested as one of our plugins crashing Mopidy for a number of users, so it shot up my priority list.
Doing some more sleuthing and raising a few more PRs that we can hopefully wrap up into a nice hardened release.
Hi, I just wanted to follow up and say that @Gadgetoid's commit seems to have resolved the crashing problem. My application has been running steady for over 4 days now without encountering the issue!
@doceme, are you able to release a new version with this fix on PyPi?
@David00, thanks for the feedback. Version 3.5 has been released on PyPI.
Version 3.5 seems to have solved my problem too! It has been running for 7 days now, where before it crashed every 3 days. Thanks!
Hi,
My project, which consists of an infinite loop that reads from the SPI interface and performs calculations on the data, crashes after several days of running. I've looked extensively for a way to troubleshoot the problem, but I've hit a dead end, hence why I'm opening this issue.
To be clear, the crash is not a Python traceback. It appears to be a C-level error, and it simply says `free(): invalid pointer`.
Other instances of this issue that I've seen online usually include a memory address - the error I'm getting does not. The above output is the only text sent to my terminal before my program stops running.
My project is here if you want to inspect how I'm using the spidev library.