adafruit / Adafruit_CircuitPython_DHT

CircuitPython support for DHT11 and DHT22 type temperature/humidity devices
MIT License
179 stars 62 forks source link

After some time of regular querying (~hours), libgpiod_pulsein starts using 100% CPU and subsequent readings fail #50

Closed michaellass closed 2 years ago

michaellass commented 3 years ago

Reported by @darton in https://github.com/adafruit/Adafruit_CircuitPython_DHT/pull/47#issuecomment-689438687 and https://github.com/adafruit/Adafruit_Blinka/issues/210#issuecomment-689427753:

The problem with 100% usage of libgpiod_pulsein CPU on Raspberry Pi with DHT22 still exists.

However, sometimes you have to wait several hours for it to appear and permanently prevent reading data from the DHT22 sensor.

After today's update to the version below, the script runs 8 hours without any problems.

Raspbian GNU/Linux 10 (buster) Raspberry Pi 3 Model B Plus Rev 1.3 Python 3.7.3 Adafruit-Blinka-5.3.4 Adafruit-PlatformDetect-2.16 Adafruit-PureIO-1.1.5 RPi.GPIO-0.7.0 adafruit-circuitpython-dht-3.5.1 pyftdi-0.51.2 pyserial-3.4 pyusb-1.0.2 rpi-ws281x-4.2.4 sysv-ipc-1.0.1

But after 8 hours, the problem came back.

michaellass commented 3 years ago

@darton Did I describe your issue correctly with the title of this bug report? I have never tested such a use-case. If I understand correctly, something like the following should reproduce your issue?

#!/usr/bin/env python3

import adafruit_dht
import board
import sys
import time

pin = board.D2 # adjust to correct PIN
dht_device = adafruit_dht.DHT22(pin)

# With Adafruit_CircuitPython_DHT 3.5.1 the CPU will now just idle until it's actually used

# During this sleep libgpiod_pulsein will start taking 100% CPU
time.sleep(12 * 3600) # Sleep for 12 hours

# ... and subsequent readings will fail:
try:
    temperature = dht_device.temperature
    humidity = dht_device.humidity
except RuntimeError:
    print("Error while getting sensor data")
    sys.exit(1)

print("Temp: {}, Humid: {}", temperature, humidity)
darton commented 3 years ago

I read the sensor every 5 seconds. Problems begin after a few hours, but sometimes much earlier, e.g. 30 minutes, 1 hour.

darton commented 3 years ago

The DHT22 sensor is connected to the Raspberry Pi using a 50 cm shielded cable

michaellass commented 3 years ago

Thanks for the clarification. Unfortunately that means that I probably won't be able to reproduce your issue here. I am querying my sensor every minute for months now without any similar issue. Possible differences between our setups:

Are you willing to invest some effort into debugging this? If so, please follow the following steps and post the output created by the final gdb command:

apt install gdb git build-essential libgpiod-dev

git clone https://github.com/adafruit/libgpiod_pulsein.git
cd libgpiod_pulsein/src

# modify Makefile: Add " -g" at the end of line 2 (defining CFLAGS). Then run:
make

# While your sensor polling script is running, figure out where libgpiod_pulsein is installed:
ls -lad /proc/$(pidof libgpiod_pulsein)/exe
# result may be sth like /home/youruser/.local/lib/python3.7/site-packages/adafruit_blinka/microcontroller/bcm283x/pulseio/libgpiod_pulsein
# or something in /usr/local/...

# Stop your sensor polling script. Then:
cp libgpiod_pulsein WHEREVER_YOUR_libgpiod_pulsein_IS_LOCATED

# Start your polling script again. Wait until it starts to consume 100% CPU and stops returning any sensible information. Then run:
gdb attach $(pidof libgpiod_pulsein) -ex "thread apply all bt" -ex "detach" -ex "quit"

The output of the final gdb command should be very helpful in debugging this. Maybe run the last command multiple times to see if it gives you different outputs. Then all of those might be of interest.

darton commented 3 years ago

sudo gdb attach $(pidof libgpiod_pulsein) -ex "thread apply all bt" -ex "detach" -ex "quit" GNU gdb (Raspbian 8.2.1-2) 8.2.1 Copyright (C) 2018 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "arm-linux-gnueabihf". Type "show configuration" for configuration details. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/. Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/.

For help, type "help". Type "apropos word" to search for commands related to "word"... attach: No such file or directory. Attaching to process 19266 [New LWP 19267] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1". __libc_msgrcv (msgflg=1, msgtyp=1, msgsz=4095, msgp=0x7ed47bbc, msqid=163840) at ../sysdeps/unix/sysv/linux/msgrcv.c:28 28 ../sysdeps/unix/sysv/linux/msgrcv.c: No such file or directory.

Thread 2 (Thread 0x76d93460 (LWP 19267)):

0 __lll_lock_wait (futex=futex@entry=0x24084 , private=0) at lowlevellock.c:46

1 0x76eeaf44 in __GI___pthread_mutex_lock (mutex=0x24084 ) at pthread_mutex_lock.c:80

2 0x000116dc in polling_thread_runner ()

3 0x76f77968 in __pointer_chk_guard () from /lib/ld-linux-armhf.so.3

Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0x76f76450 (LWP 19266)):

0 __libc_msgrcv (msgflg=1, msgtyp=1, msgsz=4095, msgp=0x7ed47bbc, msqid=163840) at ../sysdeps/unix/sysv/linux/msgrcv.c:28

1 __libc_msgrcv (msqid=163840, msgp=0x7ed47bbc, msgsz=4095, msgtyp=1, msgflg=0) at ../sysdeps/unix/sysv/linux/msgrcv.c:24

2 0x00010f28 in main ()

Detaching from program: /home/pi/.local/lib/python3.7/site-packages/adafruit_blinka/microcontroller/bcm283x/pulseio/libgpiod_pulsein, process 19266 [Inferior 1 (process 19266) detached]

darton commented 3 years ago

19266 pi 20 0 10324 428 316 R 100.0 0.0 81:55.23 libgpiod_pulsei

michaellass commented 3 years ago

Thread 2 (Thread 0x76d93460 (LWP 19267)):

0 __lll_lock_wait (futex=futex@entry=0x24084 , private=0) at lowlevellock.c:46

1 0x76eeaf44 in __GI___pthread_mutex_lock (mutex=0x24084 ) at pthread_mutex_lock.c:80

2 0x000116dc in polling_thread_runner ()

Looks like some of the steps above did not really work out as we are not getting the exact code position in libgpiod_pulsein.c here. However, It's very likely this one: https://github.com/adafruit/libgpiod_pulsein/blob/master/src/libgpiod_pulsein.c#L477. If not, it's https://github.com/adafruit/libgpiod_pulsein/blob/master/src/libgpiod_pulsein.c#L525.

3 0x76f77968 in __pointer_chk_guard () from /lib/ld-linux-armhf.so.3

Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0x76f76450 (LWP 19266)):

0 __libc_msgrcv (msgflg=1, msgtyp=1, msgsz=4095, msgp=0x7ed47bbc, msqid=163840) at ../sysdeps/unix/sysv/linux/msgrcv.c:28

1 __libc_msgrcv (msqid=163840, msgp=0x7ed47bbc, msgsz=4095, msgtyp=1, msgflg=0) at ../sysdeps/unix/sysv/linux/msgrcv.c:24

2 0x00010f28 in main ()

Apart from the missing line numbers this looks exactly how it should look like:

To be honest I'm lacking a good explanation for the CPU consumption you are seeing. It could be that the message queue becomes invalid, such that msgrcv becomes non-blocking. I have to think about this and will try to come up with some debugging strategy...

michaellass commented 3 years ago

To check whether the msgrcv is the culprit here, could you please run the following commands whenever your script is misbehaving (i.e., showing 100% CPU)?

sudo apt install strace
strace -f -p $(pidof libgpiod_pulsein)

You can simply abort the strace after a short amount of time with Ctrl+C. Normally, the output should look something like this:

strace: Process 731 attached with 2 threads
[pid   732] futex(0x24084, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid   731] msgrcv(0,

I would assume that in your case you get lots of lines printed in a short amount of time. Can you please post your output here?

Krakatomi commented 3 years ago

Hi. I got this: [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 4402] msgrcv(0, 0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) ^C0xbeef8bac, 4095, 1, 0) = -1 EINVAL (Invalid argument) strace: Process 4402 detached strace: Process 4403 detached

michaellass commented 3 years ago

@Kraka23 Thanks, that helps a lot!

Krakatomi commented 3 years ago

Your welcome!. You are helping us a lot too!

michaellass commented 3 years ago

I still cannot provoke this behavior, so I need some help again. Could you please run the following commands while libgpiod_pulsein is stuck at 100% CPU?

cat /proc/$(pidof libgpiod_pulsein)/cmdline
strace -f -p $(pidof libgpiod_pulsein) # exactly like in your last post but it is important to re-run it here
ipcs -q
lsipc

A couple of technical details on where we are right now:

Krakatomi commented 3 years ago

No problem. Here we go:

_cat /proc/$(pidof libgpiod_pulsein)/cmdline_ ==> /home/pi/.local/lib/python3.7/site-packages/adafruit_blinka/microcontroller/bcm283x/pulseio/libgpiod_pulsein--pulses81--queue323-igpioc

_strace -f -p $(pidof libgpiodpulsein) ==> [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 7505] msgrcv(0, 0xbeb44bbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) ^Cstrace: Process 7505 detached strace: Process 7506 detached

ipcs -q ==> ------ Message Queues -------- key msqid owner perms used-bytes messages

lsipc ==> RESOURCE DESCRIPTION LIMIT USED USE% MSGMNI Number of message queues 32000 0 0.00% MSGMAX Max size of message (bytes) 8192 - - MSGMNB Default max size of queue (bytes) 16384 - - SHMMNI Shared memory segments 4096 0 0.00% SHMALL Shared memory pages 4278190079 0 0.00% SHMMAX Max size of shared memory segment (bytes) 4278190079 - - SHMMIN Min size of shared memory segment (bytes) 1 - - SEMMNI Number of semaphore identifiers 32000 0 0.00% SEMMNS Total number of semaphores 1024000000 0 0.00% SEMMSL Max semaphores per semaphore set. 32000 - - SEMOPM Max number of operations per semop(2) 500 - - SEMVMX Semaphore max value 32767 - -

Hope that help.

michaellass commented 3 years ago

ipcs -q ==> ------ Message Queues -------- key msqid owner perms used-bytes messages

So the message queue is definitely gone. The best thing we can do in this case in libgpiod_pulsein is to terminate because there is no way to re-establish the communication to the python module. I will send a pull request for this later.

What's still a mystery to me is why the message queue disappears without libgpiod_pulsein being terminated by the python module. Could you describe how exactly you query the sensor? I.e., is it one long-running python script where you initiliaze the sensor once and then query regularly or is it some different setup?

darton commented 3 years ago

From observing the work of the DHT22 sensor over the last few years, I would allow myself to conclude that the DHT22 is a poor sensor and does not like too long work, then it becomes very unstable.

michaellass commented 3 years ago

From observing the work of the DHT22 sensor over the last few years, I would allow myself to conclude that the DHT22 is a poor sensor and does not like too long work, then it becomes very unstable.

The problem with the DHT11 and DHT22 is that they use a very timing sensitive protocol to communicate. That works pretty well with microcontrollers but it requires quite some effort to get it working reliably on a CPU with varying clock frequencies and an operating systems like Linux. However, as it currently looks like this bug is a pure software issue and actually not related to the sensor itself.

Have you tried the straceand ipcs commands as well? For now I just assumed that you and @Kraka23 are facing the exact same bug.

darton commented 3 years ago

[pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) [pid 14193] msgrcv(8, 0x7ef9bbbc, 4095, 1, 0) = -1 EINVAL (Invalid argument) ^C[pid 14193] msgrcv(8, strace: Process 14193 detached <detached ...> strace: Process 14194 detached

cat /proc/$(pidof libgpiod_pulsein)/cmdline /home/pi/.local/lib/python3.7/site-packages/adafruit_blinka/microcontroller/bcm283x/pulseio/libgpiod_pulsein--pulses81--queue11753-igpiochip017

ipcs -q

------ Message Queues -------- key msqid owner perms used-bytes messages

pi@rpims-z2:~ $ lsipc RESOURCE DESCRIPTION LIMIT USED USE% MSGMNI Number of message queues 32000 0 0.00% MSGMAX Max size of message (bytes) 8192 - - MSGMNB Default max size of queue (bytes) 16384 - - SHMMNI Shared memory segments 4096 2 0.05% SHMALL Shared memory pages 4278190079 89 0.00% SHMMAX Max size of shared memory segment (bytes) 4278190079 - - SHMMIN Min size of shared memory segment (bytes) 1 - - SEMMNI Number of semaphore identifiers 32000 0 0.00% SEMMNS Total number of semaphores 1024000000 0 0.00% SEMMSL Max semaphores per semaphore set. 32000 - - SEMOPM Max number of operations per semop(2) 500 - - SEMVMX Semaphore max value 32767 - -

Krakatomi commented 3 years ago

ipcs -q ==> ------ Message Queues -------- key msqid owner perms used-bytes messages

So the message queue is definitely gone. The best thing we can do in this case in libgpiod_pulsein is to terminate because there is no way to re-establish the communication to the python module. I will send a pull request for this later.

What's still a mystery to me is why the message queue disappears without libgpiod_pulsein being terminated by the python module. Could you describe how exactly you query the sensor? I.e., is it one long-running python script where you initiliaze the sensor once and then query regularly or is it some different setup?

Hello again. The script run at start with systemctl. Here it is:

`

!/usr/bin/env python3

import time import board import adafruit_dht import time import sys import datetime from influxdb import InfluxDBClient

host = "0.0.0.0" # Configure InfluxDB connection variables port = 8086 # default port user = "" # the user/password created for the pi, with write access password = "" dbname = "metricas" # the database we created earlier interval = 120 # Sample period in seconds

client = InfluxDBClient(host, port, user, password, dbname) # Create the InfluxDB client object

sensor = adafruit_dht.DHT22(board.D24) # Enter the sensor details

measurement = "dht22" location = "living"

while True: try:

    # humidity, temperature = Adafruit_DHT.read_retry(sensor, sensor_gpio)   # Read the sensor using the configured driver and gpio
    temperature = sensor.temperature
    humidity = sensor.humidity
    #iso = time.ctime()
    iso = datetime.datetime.utcnow()
    # Print for debugging, uncomment the below line
    #print("[%s] Temp: %s, Humidity: %s" % (iso, temperature, humidity))
    # Create the JSON data structure
    data = [
    {
      "measurement": measurement,
          "tags": {
              "location": location,
          },
          "time": iso,
          "fields": {
              "temperatura" : temperature,
              "humedad": humidity
          }
      }
    ]
    # Send the JSON data to InfluxDB
    client.write_points(data)
    # Wait until it's time to query again...
    time.sleep(interval)

#except KeyboardInterrupt:
except:
    pass

`

And this is my first script that was working ok until stuck without notice and then update/change the modules and start with this problem (100% CPU).

` import time import sys import datetime import Adafruit_DHT from influxdb import InfluxDBClient

host = "0.0.0.0" # Configure InfluxDB connection variables port = 8086 # default port user = "" # the user/password created for the pi, with write access password = "" dbname = "metricas" # the database we created earlier interval = 120 # Sample period in seconds

client = InfluxDBClient(host, port, user, password, dbname) # Create the InfluxDB client object

sensor = Adafruit_DHT.DHT22 # Enter the sensor details sensor_gpio = 24

measurement = "dht22" location = "living"

while True: try:

Read the sensor using the configured driver and gpio

    humidity, temperature = Adafruit_DHT.read_retry(sensor, sensor_gpio)
    #iso = time.ctime()
    iso = datetime.datetime.utcnow()
    # Print for debugging, uncomment the below line
    #print("[%s] Temp: %s, Humidity: %s" % (iso, temperature, humidity))
    # Create the JSON data structure
    data = [
    {
      "measurement": measurement,
          "tags": {
              "location": location,
          },
          "time": iso,
          "fields": {
              "temperatura" : temperature,
              "humedad": humidity
          }
      }
    ]
    # Send the JSON data to InfluxDB
    client.write_points(data)
    # Wait until it's time to query again...
    time.sleep(interval)

#except KeyboardInterrupt:
except:
    pass

`

Hope that help. I can´t fix the format. Sorry.

michaellass commented 3 years ago

What's still a mystery to me is why the message queue disappears without libgpiod_pulsein being terminated by the python module. Could you describe how exactly you query the sensor? I.e., is it one long-running python script where you initiliaze the sensor once and then query regularly or is it some different setup?

Hello again. The script run at start with systemctl. Here it is: [...]

This explains why we do not see any errors from the Python side:

  1. The script explicitly hides them using except: pass,
  2. output from systemd services running python often do not immediately show up in the journal due to Python's output buffering.

Could you please do the following?

  1. Add import traceback on the top of your script and replace the except: pass by:
    except KeyboardInterrupt:
        break
    except:
        traceback.print_exc()
  2. Either run your script manually or, if you rely on systemd do the following:
    1. Modify your systemd service file, such that it runs /usr/bin/python3 -u /path/to/your/script.py. The -u will make sure that any output from Python is immediately written to the journal.
    2. Afterwards please do a sudo systemctl daemon-reload and restart your service.
    3. As soon as the problem reappears, have a look into the journal by running sudo journalctl --unit=name_of_your_service.

You will very likely be confronted with exceptions from time to time ("Checksum did not validate" or "A full buffer was not returned."). That is normal. Of interest is what the output looks like when the script starts to misbehave again, i.e., the very first errors shown before only errors occur.

Hope that help. I can´t fix the format. Sorry.

No worries, it's easy enough to reconstruct. For the future: You can surround multi-line code with three instead of a single backtick, then the format is kept.

Krakatomi commented 3 years ago

Hello!. I have your info.

-- Logs begin at Mon 2020-10-26 01:56:38 -03, end at Mon 2020-10-26 21:41:51 -03. --
Oct 26 21:03:05 rpi4 systemd[1]: Started Recolecta datos de sensor dht22.
Oct 26 21:07:07 rpi4 python3[6407]: Traceback (most recent call last):
Oct 26 21:07:07 rpi4 python3[6407]:   File "/mnt/squid-usb/scripts/dht22.py", line 37, in <module>
Oct 26 21:07:07 rpi4 python3[6407]:     temperature = sensor.temperature
Oct 26 21:07:07 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 253, in temperature
Oct 26 21:07:07 rpi4 python3[6407]:     self.measure()
Oct 26 21:07:07 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 198, in measure
Oct 26 21:07:07 rpi4 python3[6407]:     pulses = self._get_pulses_pulseio()
Oct 26 21:07:07 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 129, in _get_pulses_pulseio
Oct 26 21:07:07 rpi4 python3[6407]:     self.pulse_in.clear()
Oct 26 21:07:07 rpi4 python3[6407]:   File "/home/pi/.local/lib/python3.7/site-packages/adafruit_blinka/microcontroller/bcm283x/pulseio/PulseIn.py", line 140, in clear
Oct 26 21:07:07 rpi4 python3[6407]:     self._mq.send("c", True, type=1)
Oct 26 21:07:07 rpi4 python3[6407]: OSError: [Errno 22] Invalid argument
Oct 26 21:09:07 rpi4 python3[6407]: Traceback (most recent call last):
Oct 26 21:09:07 rpi4 python3[6407]:   File "/mnt/squid-usb/scripts/dht22.py", line 37, in <module>
Oct 26 21:09:07 rpi4 python3[6407]:     temperature = sensor.temperature
Oct 26 21:09:07 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 253, in temperature
Oct 26 21:09:07 rpi4 python3[6407]:     self.measure()
Oct 26 21:09:07 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 198, in measure
Oct 26 21:09:07 rpi4 python3[6407]:     pulses = self._get_pulses_pulseio()
Oct 26 21:09:07 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 129, in _get_pulses_pulseio
Oct 26 21:09:07 rpi4 python3[6407]:     self.pulse_in.clear()
Oct 26 21:09:07 rpi4 python3[6407]:   File "/home/pi/.local/lib/python3.7/site-packages/adafruit_blinka/microcontroller/bcm283x/pulseio/PulseIn.py", line 140, in clear
Oct 26 21:09:07 rpi4 python3[6407]:     self._mq.send("c", True, type=1)
Oct 26 21:09:07 rpi4 python3[6407]: OSError: [Errno 22] Invalid argument
Oct 26 21:11:10 rpi4 python3[6407]: Traceback (most recent call last):
Oct 26 21:11:10 rpi4 python3[6407]:   File "/mnt/squid-usb/scripts/dht22.py", line 37, in <module>
Oct 26 21:11:10 rpi4 python3[6407]:     temperature = sensor.temperature
Oct 26 21:11:10 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 253, in temperature
Oct 26 21:11:10 rpi4 python3[6407]:     self.measure()
Oct 26 21:11:10 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 198, in measure
Oct 26 21:11:10 rpi4 python3[6407]:     pulses = self._get_pulses_pulseio()
Oct 26 21:11:10 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 129, in _get_pulses_pulseio
Oct 26 21:11:10 rpi4 python3[6407]:     self.pulse_in.clear()
Oct 26 21:11:10 rpi4 python3[6407]:   File "/home/pi/.local/lib/python3.7/site-packages/adafruit_blinka/microcontroller/bcm283x/pulseio/PulseIn.py", line 140, in clear
Oct 26 21:11:10 rpi4 python3[6407]:     self._mq.send("c", True, type=1)
Oct 26 21:11:10 rpi4 python3[6407]: OSError: [Errno 22] Invalid argument
Oct 26 21:13:10 rpi4 python3[6407]: Traceback (most recent call last):
Oct 26 21:13:10 rpi4 python3[6407]:   File "/mnt/squid-usb/scripts/dht22.py", line 37, in <module>
Oct 26 21:13:10 rpi4 python3[6407]:     temperature = sensor.temperature
Oct 26 21:13:10 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 253, in temperature
Oct 26 21:13:10 rpi4 python3[6407]:     self.measure()
Oct 26 21:13:10 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 198, in measure
Oct 26 21:13:10 rpi4 python3[6407]:     pulses = self._get_pulses_pulseio()
Oct 26 21:13:10 rpi4 python3[6407]:   File "/usr/local/lib/python3.7/dist-packages/adafruit_dht.py", line 129, in _get_pulses_pulseio
Oct 26 21:13:10 rpi4 python3[6407]:     self.pulse_in.clear()
Oct 26 21:13:10 rpi4 python3[6407]:   File "/home/pi/.local/lib/python3.7/site-packages/adafruit_blinka/microcontroller/bcm283x/pulseio/PulseIn.py", line 140, in clear
Oct 26 21:13:10 rpi4 python3[6407]:     self._mq.send("c", True, type=1)
Oct 26 21:13:10 rpi4 python3[6407]: OSError: [Errno 22] Invalid argument

I hope that help!. Thanks for the format tips and thank you again for your effort.

michaellass commented 3 years ago

Thank you for sticking with me here and providing all these information. I think I now know what is going on:

What I will do:

What you can do:

@Kraka23 and @darton : Can you confirm that the described setup matches your case and that either using a different user or modifying the logind.conf fixes the issue for you?

Krakatomi commented 3 years ago

Great @michaellass . Thank you so much. I apply the first option and create a system user to run the script. So far, so good!.

darton commented 3 years ago
    Exception in thread <function get_dht_data at 0x766998a0>:

Traceback (most recent call last): File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner self.run() File "/usr/lib/python3.7/threading.py", line 865, in run self._target(*self._args, **self._kwargs) File "./rpims.py", line 221, in get_dht_data temperature = dhtDevice.temperature File "/home/pi/.local/lib/python3.7/site-packages/adafruit_dht.py", line 253, in temperature self.measure() File "/home/pi/.local/lib/python3.7/site-packages/adafruit_dht.py", line 198, in measure pulses = self._get_pulses_pulseio() File "/home/pi/.local/lib/python3.7/site-packages/adafruit_dht.py", line 137, in _get_pulses_pulseio pulses.append(self.pulse_in.popleft()) OverflowError: unsigned short is greater than maximum

combrs commented 3 years ago

Thanks all for clarification this issue, especially @michaellass I try to use sensorReporter project and got this 100% CPU issue also. And yes, I was lazy and run systemctl service also as user pi %) maybe you can add some diagnostic messages about (not recommended) running as user that have interactive shell as user shell configured, for example

michaellass commented 3 years ago

There is actually a pull request open to print an error message and quit libgpiod_pulsein in this case instead of just hogging the CPU: https://github.com/adafruit/libgpiod_pulsein/pull/2

Hopefully people will find this report and the solution quickly when searching for that error message.

scaidermern commented 3 years ago

What you can do:

  • Either one of the following to fix the issue:

    [...]

    • Modify /etc/systemd/logind.conf, set RemoveIPC=no and remove the # in the corresponding line.

I run into the same problem but the above workaround doesn't work for me.

I let my script start automatically on system boot via cron by creating a tmux session which runs this script. This works fine, however after some time (and I guess after some logins with the same user that runs the script) this known problem appears. libgpiod_pulseio uses 100% CPU and the following exception occurs when calling dht_device.temperature:

  File "/home/pi/.local/lib/python3.7/site-packages/adafruit_blinka/microcontroller/bcm283x/pulseio/PulseIn.py", line 143, in clear
    self._mq.send("c", True, type=1)
OSError: [Errno 22] Invalid argument

Unfortunately it looks like the next read of dht_device.temperature returns a cached value. So when catching and ignoring the above exception everything looks fine and the device still returns temperature and humidity values, however they just come from the cache.

michaellass commented 3 years ago

In my view the more clean solution is the first one, i.e.

Create a system user for your daemon (adduser --system) and use that for your daemon. logind won't cleanup message queues that belong to system users. Also you typically don't login and logout of them.

However, the workaround of setting RemoveIPC=no in logind.conf should work as well. After that change you probably need to either reboot the system or issue systemctl restart systemd-logind to restart logind.

That dht_device.temperature gives you a cached value is not unexpected. In general, you should not ignore Exceptions that you do not really expect. In this case I suggest ignoring RuntimeError but nothing else.

scaidermern commented 3 years ago

In my view the more clean solution is the first one, i.e.

Create a system user for your daemon (adduser --system) and use that for your daemon. logind won't cleanup message queues that belong to system users. Also you typically don't login and logout of them.

Yeah, I guess so.

However, the workaround of setting RemoveIPC=no in logind.conf should work as well. After that change you probably need to either reboot the system or issue systemctl restart systemd-logind to restart logind.

That's what I did (a reboot).

That dht_device.temperature gives you a cached value is not unexpected. In general, you should not ignore Exceptions that you do not really expect. In this case I suggest ignoring RuntimeError but nothing else.

Agreed.

michaellass commented 2 years ago

I'll close this issue. The phenomenon is caused by systemd/logind killing user's message queues when they log off. This is expected behavior. Making this obvious to users and avoiding the 100% CPU hog would have been addressed by https://github.com/adafruit/libgpiod_pulsein/pull/2 which never was approved.

darton commented 2 years ago

"makermelissa merged commit a91d387 into adafruit:master"