Strange readings - Githubissues

Mr-HaleYa commented 4 years ago

I am using esp32 ttgo sim 800l (https://rb.gy/ogywnq) Dfrobot dht22 (https://www.dfrobot.com/product-1102.html)

my code

DHTNEW DHT(18); DHT.setWaitForReading(true);

float DHTtempBuffer[7];
float DHThumidityBuffer[7];
int DHTReadings = 7;             // number of DHT readings before choosing a median

void getDHT(){
  for (int i = 0 ; i < DHTReadings; i++) {
    DHT.read();                                     // Gets a reading from the DHT sensor
    DHThumidityBuffer[i] = DHT.getHumidity();
    DHTtempBuffer[i] = DHT.getTemperature();
  }
  sortData(DHThumidityBuffer, DHTReadings);                     //sorts array low to high
  DHThumidity = DHThumidityBuffer[(DHTReadings - 1) / 2];       //gets median of the array
  sortData(DHTtempBuffer, DHTReadings);                         //sorts array low to high
  DHTtemperature = DHTtempBuffer[(DHTReadings - 1) / 2];        //gets median of the array
}

So my problem is that even tho I use a filtering algorithm to get rid of highs and lows by taking 7 readings then choosing a median I will occasionally get some massive dips in temperature (and spikes in humidity as a result). my temperature in my room stays about 24°C and the dips are -12 and 12 °C so half the temp and half * -1. I have no clue why it is doing this... It runs every ~2 mins and when the bad readings start it does it consistently for a few runs before correcting itself.

Nothing moves nothing is touched nothing changes... It just goes wack... Could it be a library error? I don't see how it could be but you're a smart man @RobTillaart so you may see something I'm missing. I have also thought of the potential that its a dud sensor. I hope not bc i ordered 20 more before testing the one I'm using. I had the thought that maybe adding more reading (like 15 instead of 7) so there is more to data to filter but it looks like all the readings it's taking in those fault zones are bad...

Would appreciate any suggestion - Regards Hale

RobTillaart commented 4 years ago

Have one ESP32 + DHT22 + I2C 16x2 LCD ready, running this sketch ==> DHTNEW branch with data pin held HIGH between reads.

DHT_endless.zip

lets see in a few hours. so far 400++ good reads.

update ~2:45 hrs > 10K calls, still working > 52K calls still working

Mr-HaleYa commented 4 years ago

My DHT11 went most of the day and all night without a single error. Its very unusual and I'm not exactly sure why only the dht22 is having the problem for me

update: DHT 11 has been running for 2 days with no errors but the DHT22 has crashed twice and once recovered gives -12, 12 then goes back to normal readings. using 2 different ESPs and both are running the exact same sketch. Even tried switching them to see if one may be faulty to no avail.

Mr-HaleYa commented 4 years ago

When I was just taking single readings I had no problems so I doubt you will. It's only once I started doing multiple readings in a loop pass that I started running into errors.

You should first try and replicate my errors before testing solutions because if you can't do that then we have a whole different problem.

RobTillaart commented 4 years ago

DHT11 vs DHT22 The timing of both sensors are different internally, that is reason enough to see differences in behavior. DHT11 exports no floating point so I expect (== assumption) that the inner processor does not use it or use less of it. This makes code slightly simpler / faster. On the other hand I also expect that the inner processor of the DHT11 has a lower frequency as it takes longer to wake up (== assumption).

Bit shift error The 12 -12 error comes from a bit-shift which I have analysed and explained. The cause is that the data line stays higher longer than it should. That is definitely a hardware issue. The threshold is adjusted (adjustable compile time) in the library to handle it as well as possible but workarounds do never take away the root cause. This root cause can be a capacitance problem or a pull up resister too small problem, or ...? Need to see at least the signal on a scope to make a educated guess. The library now reports the bit shift error and that is about what the library reasonably can do.

Replication of your errors I do not have the same DHT22 board as you have and even if I did there are two sets of hardware, quite identical but never 100%. - different batches or suppliers can behave differently. https://www.dfrobot.com/product-1102.html shows there are 2 components (resistor? capacitor?) on the board but I do not know their value. My sensor uses no pull up as I use relative short lines. My test-hardware is powered over the USB of my laptop which provides a stable voltage, that is also a difference between our test setup.

My goal is to see if I can replicate the errors while using the library as intended. The sketch I use samples my DHT22 every second so I got one read that is a real read and two that are prevented by the library as they are too close in time. I can remove the delay from the sketch to do more frequent reads but these will all be answered "from last read cache" by the library anyway so the library will not communicate more often with the sensor.

Only thing I notice is that the timing of the reads differ between 4720 -5074 micros, so that is about 5% variation. This seems quite some variation, cause unknown, might be related to low level sensor read (ADC) inside the sensor and partly how the software is polling the data pin for bits.

Lessons learned All the hours of analysis on this issue learned me:

The bit shift error. The root cause is hardware related but not clear. It is detected by the checksum however it is typical enough to add as a separate error. A reread by the user code is simplest, smallest and most robust solution. [solved]
Function setType(), main use is for debugging, e.g. force to auto detect type again or to ensure a type. Might be removed for the 2.2 version although when not used the compiler would not link the code involved. [enhancement?]
Crash after X hours, very hard to reproduce, one can only wait for it. Similar crashes are mentioned at least a few dozen times on the internet, not only for ESP but also for other platforms. No explicit information how to generate such crash and no information how to prevent such crash was found. Only one post that stated one had to design the hardware correctly. It only mentioned one could add a capacitor over the GND VCC to stabilize / filter the power supply. [not solved] Note: It is a known problem that interrupts can and will disrupt the low level communication. To prevent interrupts during read the library can disable interrupts with setDisableIRQ(bool b ) It is not known if disabling interrupts prevents the crash, as it is very hard to reproduce these crashes.
Data pin free state == HIGH. The DHTNEW library (2.1 and before) does not take care of that. A quick code check of other DHT libraries showed that none of them did. It seems not needed, however as all datasheets mention this explicitly I will include an update for this in the library. [solved, PR pending] Note, the "floating data pin" [4] might be (part of) the cause of crash [3] above, but I have no evidence for that.
Power Pin, A work around the crash seems to be to switch of the sensor and restart it. The library could support a power pin enhancement so that the library can switch on / off the power of the sensor. Besides a possible workaround for the crash it has an added value in low power modes like battery operated sampling, or just to save (minimal) power. [issue #13 under investigation]

You should first try and replicate my errors before testing solutions because if you can't do that then we have a whole different problem.

So far I cannot reproduce the crash.

#include <dhtnew.h>
DHTNEW mySensor(18);

void setup()
{
  mySensor.setWaitForReading (true);
  Serial.begin(9600);

  Serial.println("\n1. Type detection test, first run might take longer to determine type");
  Serial.println("STAT\tHUMI\tTEMP\tTIME\tTYPE");
}

void loop()
{
  getDHT();
  Serial.println();
}

float DHTtempBuffer[15];
float DHThumidityBuffer[15];
int DHTReadings = 15;                              // number of DHT readings before choosing a median

void getDHT(){
  for (int i = 0 ; i < DHTReadings; i++) {
    Serial.print(mySensor.getType()); Serial.print("\t");
    int status = mySensor.read();                                     // Gets a reading from the DHT sensor
    Serial.print(status); Serial.print("\t");
    DHThumidityBuffer[i] = mySensor.getHumidity();
    DHTtempBuffer[i] = mySensor.getTemperature();
    Serial.print(DHThumidityBuffer[i]); Serial.print("\t"); Serial.println(DHTtempBuffer[i]);
  }
}

Did this script of yours show the crashes? If so, I can start it this evening.

RobTillaart commented 4 years ago

Ran my tests for 24 hours without any error so I will merge the PR for Data pin free state == HIGH There is no guarantee it solves the sudden crash but it did not generate one and maybe more important, it conforms to datasheets of DHT11, DHT21, DHT22

RobTillaart commented 4 years ago

Now over 142 K reads(~40 hrs) without a single error. I will now load your code to do a continuous read to see if that causes crashes. Only if we could find a scenario to trigger the crashes.

Protocol Keeping the data pin an input between reads (as the previous master did) is not correct looking at figure 2 of the DHT11 datasheet. At the end of the communication the MCU pulls the line explicitly HIGH. This can be concluded from the color of the line. Text states it is the PULLUP resistor. Electrically there is a small difference as the currents will flow slightly differently.

Generating a read failure I started running your test sketch with the minor addition that I print the variable i to see where in the for loop the code is. It makes a read after every "READ_DELAY" (2250) milliseconds because of the statement

mySensor.setWaitForReading (true);

which cause the library to block. That is by design for this sketch,

If I pull the data line LOW by "grounding", the sensor becomes unreadable:

11  22  0   39.30   25.00
12  22  -2  -999.00 -999.00
13  22  -2  -999.00 -999.00
14  22  0   39.50   25.00

status -2 == TIMEOUT The reading recovers as soon as I remove "ground" from data line.
TIMEOUT makes 100% sense as the MCU does not see the sensor pulling the data line HIGH again.

This is the closest I can come to trigger a crash,

So it would be interesting to measure the voltage of the data line when the sensor crashes. Can you measure with your next crash?

Mr-HaleYa commented 4 years ago

I don't know how to measure the voltage on the lines when it is installed inside of a box mounted on the side of a house outside ran off a solar power and it crashes at 5 in the morning when I am asleep... If I look at my chart it seems to crash every 10 hours almost like clockwork sometimes it's a couple hours early sometimes it's couple hours late but it seems to be on average about 10 hours. (I never seem to catch it during the 5 p.m. 😅)

Just as a side note our team decided to move forward with a dht11 in the product instead of a 22 due to the instability so they are no longer researching the 22 problem and we are moving forward with the 11 as it has been completely stable for the last 3 days.

However if you would like to continue investigating this I will stand with you and do as many tests as I can since I have a decent amount of free time because of the global situation. I can rig up a test station in the morning on some equipment indoors and stick a multimeter in it but I don't know when it's going to crash and if all catch it, that's the problem. I need like a video camera and a tripod and unlimited storage LOL.

Another note, we just got a new main board did uses a Sim 7000g instead of a Sim 800l (processor is an esp32 the sim is just a modem for Cellular connection) I have wrote all the ode and got it working but as I said before I'm currently using a dht11 if you would like me to try a dht22 on it to see if it is just that other board that I was using that is throwing these weird errors I can try it. I also have an esp32s that I could try with my code as well. (slightly suspicious of the board possibly being the cause of all this as it's not from a mainstream company) since we're just shooting in the dark now, who knows if they just have some internal errors with their circuitry that could be causing all of this

RobTillaart commented 4 years ago

Almost every 10 hours, mmmm, let me do some math.

Measurement every approx 2.250 second is ~26.6 per minute is 1600 per hour is 16000 per 10 hours. In itself no magic number but pretty close to max int16-t 16384. So could it be an overflow bug inside the sensor that causes the crash? Just don't know, could be a coincidence.

Measurement A way to measure the data pin during crash is to connect it to an analogPin that samples after every read. Only if it measures an out of expected range voltage you print time and voltage. An analogread takes 1 msec or less so it will be fast enough.

I will continue the test with your code as I want to improve the library, not that it must be able to correct all possible errors but at least detect them and give user the means to make most of the sensor.

3 days with a DHT11 is better but still far from dozens of years it probably have to work. In 24/7 projects I always think of ways to make a controlled restart possible, per sensor or actuator and sometimes even complete restart by means of a watchdog timer. If you have a pin left you could control the VCC powerpin of the DHT and switch it off /on every 1000 reads or so.

That is a bit the way NASA makes Mars rovers working for a decade. The cannot check with a voltmeter either ...

RobTillaart commented 4 years ago

Got about 150 timeout errors in a burst after approx 17000 measurements.

As there are multiple places where TO can occur I need to give them all an unique error code so I can see where in the protocol it fails. [Tomorrow]

There after it is running stable again for 4000 samples or so.

Strange it did not fail for 142000 reads in my code.... To be continued

RobTillaart commented 4 years ago

@Mr-HaleYa

No exact time measurements, => 1st order math indicates a read every 2.25 second (as it should)

35849 measurements 190 errors within 242 reads:

157 time out in multiple bursts around 17000
33 bit shift errors around 17000 too; 32 occurring after a time out read 1 occurring after a single good read (after multiple timeout reads)

After this burst none were seen.

Notes

As the errors occurred in a sort of burst it can be an cascading effect. The first error introduces the second etc. A burst could imply that the sensor has ample time to recover correctly once errors occur.
Found an important difference between your code and my endless loop. My endless program takes in practice 3000 millis between reads and yours takes DHTLIB_DHT22_READ_DELAY between reads. This latter is defined by you to 2250 millis after testing. This value seems very reasonable as the datasheet states 2000. Still it might be just on the edge for the sensor in some cases. I also found some mentions of a drifting millis() clock on both the ESP8266 and ESP32, both fixed. I make DHTLIB_DHT22_READ_DELAY 2500 instead == twice the DHT11 value (respect factor 2 of the datasheets)

new test Started a new test run with the following changes to the library / etst sketch.

DHTLIB_DHT22_READ_DELAY set to 2500
4 different DHTLIB_ERROR_TIMEOUT [A..D] for every possible timeout.
added a total counter for measurements in the sketch

If the DHTLIB_DHT22_READ_DELAY was too small we should not see errors again. If errors still occur, we will know which timeout loop in the handshake fails the total counter just makes math easier.

to be continued

Mr-HaleYa commented 4 years ago

I put a delay(3000) in my for loop just to see if that was the problem I even did delay(10000) for a run and still got the errors. it's not a timing issue in regards to delay between readings cuz even with waiting up to 10 seconds between readings I got the exact same patterns....

RobTillaart commented 4 years ago

Where exactly were your delay() calls , between calls to getDHT() or between calls to DHT.read()

I keep my test running as it will show which timeout fails (if any), no more ideas how to tackle

Except switching the power pin OFF ON after every read, or after every error, but that is a workaround, no understanding of the cause.

RobTillaart commented 4 years ago

@Mr-HaleYa

results test two

31110 measurements 101 errors within 102 reads:

86 time out in single burst 14373 - 14473
14 bit shift errors in same burst 1 bit shift error isolated 14324 ==> ~ 2 minutes before the burst.

After this burst none were seen.

Analysis What is very interesting to note is that with a delay of 2.250 millis it occurred after approx 17000 reads I stated before. In fact the number 17000 should be corrected with factor 15/16 th as the printout had an em empty lines after every 15 reads. 17000*15/160=15938

time of bursts? so the burst occurred after:

test one : 15938 x 2250 = 35.860.500 millis (99.80 %)
test two : 14373 x 2500 = 35.932.500 millis (100.00 %)

Note: these numbers are incredibly close, coincidence? Could be. Worth to give more thoughts to magic numbers around 36.000.000 milliseconds, [TODO]

error reduction? Note: the increase of the READ_DELAY with 10% reduced the errors

157 -> 86 time out errors = -46%
33 -> 15 bit shift errors = -55 %
ratio (157/33 = 4.8 vs 86/15 = 5.7) same order of magnitude.

So the 250 millis (~10%) extra time reduced ~50% of the errors. Of course it is only two runs so not statistically significant. However this reduction is perfect in line with my test sketch that uses 3000 millis intervals that had no errors.

And now? To verify the above a test three could be run with READ_DELAY of 2000 (datasheet spec) From the above we would expect

one or more bursts around 36.000.000 / 2000 == 18.000 reads this would confirm the magic timing
total errors - 380 (320 time out and 65 bit shifts) ==> linear increase of errors
total errors > 500 ==> quadratic or exponential increase of errors

A 4th test could be run with READ_DELAY of 2750 which should show

errors after ~13000 reads
total errors - linear (~40) - quadratic exponential (~0)

So thinking and testing ahead. I go for the datasheet timing.

RobTillaart commented 4 years ago

Appendum

All time out errors were all in the READY response of the sensor (2nd timeout loop, before the bits are clocked in) So sensor does not respond to START with a HIGH in time.

Timing diagram or START - READY signals from DHT11/21/22 datasheet

[TODO] Investigate timing specs of that part in detail.

RobTillaart commented 4 years ago

before starting long test, time to experiment.

Investigating READ_DELAY Setting the READ_DELAY to 100 (way too little) gives a pattern of 3 failed reads and 1 good one.

13:11:56.455 -> 1869    8   22  
13:11:56.548 -> 50  <-- LOW (DHT sends out response signal ...)
13:11:56.548 -> 78  <-- HIGH (DHT pulls up voltage ...)
13:11:56.548 -> 0   41.70   25.00

13:11:56.548 -> 1870    9   22  
13:11:56.642 -> 1       <-- DHT keeps data pin HIGH --> not ready!
13:11:56.642 -> 1071  <-- DHT keeps data pin HIGH --> not ready!
13:11:56.642 -> -21 -999.00 -999.00

13:11:56.642 -> 1871    10  22  
13:11:56.736 -> 1
13:11:56.736 -> 1071
13:11:56.736 -> -21 -999.00 -999.00

13:11:56.736 -> 1872    11  22  
13:11:56.829 -> 1
13:11:56.829 -> -64465  (16 bit wrap => 1070)
13:11:56.829 -> -21 -999.00 -999.00

13:11:56.829 -> 1873    12  22  
13:11:56.970 -> 56
13:11:56.970 -> 78
13:11:56.970 -> 0   41.60   25.00

so the time between 2 good reads is about 400 milliseconds, way lower than the 2000 millis from the datasheet. During further testing I could go as low as ~381 millis between good reads.

Setting the time to READ_DELAY to 500 (4x lower than datasheet advices) gives 1000++ correct readings in a row.

Setting READ_DELAY to lower values caused many TIME OUTS as the sensor is not ready.

Conclusions From these tests it is clear that the READ_DELAY of 2000 in the datasheet should be more than enough. In practice the sensor can be read faster than that. Furthermore it is clear that if the sensor is not ready the 2nd loops always times out. When the sensor is not ready it must be given more time to reset, how much is to be investigated, need the duration of the failing burst.

Think about a function calibrateReadDelay() ==> issue #20
Rename Timeout 2 to DHTLIB_SENSOR_NOT_READY.
Measure duration of a failing burst. ==> redo long test asap.

RobTillaart commented 4 years ago

Merged differentiation of error codes in master ==> adds to debug ability. version is now 0.2.2

Mr-HaleYa commented 4 years ago

Wow.... you are good 👍 so It was a delay issue, just not one that could be seen and/or changed easily. Do you recommend downloading the 0.2.2 and push an update to my devices or should I wait until you have the Timeout 2 fixed?

RobTillaart commented 4 years ago

Upgrading can be done. . 0.2.2 version does not solve the issue but gives better / detailed error information in the return value of read(). That is worth the upgrade. It means that if your code does an error value check it will fail on TIMEOUT. That value has been replaced by 4 others, and the BIT SHIFT error has been added. Check the examples.

The other changes do not affect the runtime code behaviour.

Factual this is a (minor) breaking interface change so it would be better to tag it 0.3.0. With next update I will get numbering right. For now I focus on the SENSOR NOT READY timing to see if and how it can be handled in a robust way.

RobTillaart commented 4 years ago

@Mr-HaleYa

Some analysis of test 3

End 16144   01:20:16.109
Start   15878   01:10:19.493
==============================
fail    266 00:09:56.616 = 596.616 millis = 2251.4 per call

so a failed calls are just as far apart as the good read() calls. The burst took about 10 minutes so not endless but long enough to be not acceptable.

        222     DHTLIB_ERROR_SENSOR_NOT_READY
        37      DHTLIB_ERROR_BIT_SHIFT
        7       DHTLIB_OK
        ----
        266

15878 * 2250 = 35.725.500 again very close to 36.000.000 (99.23%) 36.000.000 == 10 hours in milliseconds.

But what it means or what it is caused by? no clue

A google "ESP32 10 hours" only showed a WIFI related timeout. As there is no WIFI in the test sketch this is considered coincidence (for now)

However it looks like we can expect failures after this amount of millis(), so we have some sort of not understood way to reproduce the error.

Analyzing the good reads in the burst "Always look at the bright side of life", Monty Python In the middle of the burst some good reads are found, apparently the sensor recovers but in a struggling way. Can we learn something from their timing? Let us see:

TIME        COUNT   I   TY  RV  HUM TEMP

01:10:17.211 -> 15877   6   22  0   44.70   25.00   (last good)
burst
01:11:20.276 -> 15905   4   22  0   44.80   25.00
01:11:58.526 -> 15922   6   22  0   44.80   25.00
01:16:51.207 -> 16052   1   22  0   44.80   25.00
01:17:18.255 -> 16064   13  22  0   44.80   25.00
01:17:52.018 -> 16079   13  22  0   44.80   25.00
01:19:01.783 -> 16110   14  22  0   44.90   25.00
01:20:04.824 -> 16138   12  22  0   44.90   25.00
/burst
01:20:18.345 -> 16144   3   22  0   44.90   25.00   (first good after burst)

The intervals between the good calls look random,

    #calls       millis     
    28       63.039     
    17       38.274
    130     292.682     ~5 minutes
    12       27.017
    15       33.771
    31       69.793
    28       63.039
    6        13.508
========================================
average 34       75.140

It is hard to see any pattern here that gives any clue away. Pragmatic interpretation could mean that we must pause at least for ~75 seconds (OK, 90) after a failing call. Worst case we must wait for at least 5 minutes, in fact 10 to skip the whole burst.

If DHTLIB_ERROR_BIT_SHIFT is counted as a good read (it is not but almost) then we see a maximum of 27 calls between good calls which still is about minute. The average is 6 calls apart which is 15 seconds which sounds more acceptable.

Preliminary conclusion

The cause of the problem is not found yet, but detection of it works very well.
DHTLIB_SENSOR_NOT_READY can be detected faster as the code falls through TIMEOUT loop 1 when the data pin is kept HIGH. [improvement]
If the DHT22 sensor is not ready it needs time to recover. It is not known how much, the next test should focus on finding if such time can be found. I will adjust the test code so it increases the READ_DELAY after a failing read()
DHTLIB_ERROR_BIT_SHIFT occurs almost always in combination (after) with DHTLIB_SENSOR_NOT_READY. Ratio is roughly about 1 in 6. This implies if the latter error is solved we might also run out of the bit shift error.

made 33750 reads but after 16144 no more errors are found.

Mr-HaleYa commented 4 years ago

Have you been able to replicate my -12, 12 error or the -999. You say timeout but what even where my errors? -12, 12 are the bit shift and -999 is the timeout, correct?

Also why does this not happen with the dht11? Are they that different?

RobTillaart commented 4 years ago

Yes, not in terms of the exact values 12 , -12

The library 0.2.2 recognizes the bit shift error pattern and it overwrites the HUM and TEMP with -999 as the values 12-12 (or whatever is generated by shift) are incorrect.

My gut feeling says the BIT SHIFT is in a sensor that is not ready exact in time but almost, and its timing just fits within how the protocol is implemented.

I also see the -999 after 10 hrs of reading.

DHT11 vs DHT22 The DHT11 is definitely different, (here my list of assumptions)

DHT11 wake up time is 18 times longer so I expect it has a lower clock frequency,
DHT11 does no negative temperatures, or decimals so I can imagine it has no float math on board, in fact it would only need two lookup tables of ADC reading to HUM and TEMP, so 200 bytes at max.
DHT11 accuracy is 5% RH and 2C so an 8 bit ADC would be sufficient.
DHT22 with its larger range -40 .. 80C @ 0.1 would need an 11 bit ADC at least.
Given the timing of the protocol, the DHT11 should be at least a 1 Mhz processor, DHT22 probably higher to handle float math and/or more bits ADC.

So yes they are different for sure, the only commonality is the way they send out the 5 bytes that represent the values. Even the interpretation of those 5 is different.

The good thing that the DHT11 does not show the error gives quite some confidence that the error is not caused by the library or the ESP32 hardware.

Still open is why my 3 seconds delay worked without failures. Need more analysis but what? Time to think...

RobTillaart commented 4 years ago

Test was still running and I got a Second burst

13:40:16.227 -> 35839   3   22  -4  -999.00 -999.00
....
13:49:54.922 -> 36096   5   22  -4  -999.00 -999.00

Quick numbers, burst look same order of magnitude

End 36097   13:49:57.157
Start   35839   13:40:16.227
=====================================
    258 00:09:40.930

    194     DHTLIB_ERROR_SENSOR_NOT_READY
    34      DHTLIB_ERROR_BIT_SHIFT
    30  DHTLIB_OK
    ====
    258

time between bursts 19961 * 2250 = 44.920.250 millis

Start2  35839   13:40:16.227
Start1  15878   01:10:19.493
=======================
    19961   12:29:57....

no new insight yet...

RobTillaart commented 4 years ago

Started a testrun at 16:00 local time with the readDelay adjusted to 400
(yes that short, could make testing much faster)

Runs now for more than 3 hrs 28000++ calls to read() without errors.

19:13:37.745 -> 28531   0   22  0   41.10   23.00
19:13:38.140 -> 28532   1   22  0   41.10   23.00
19:13:38.543 -> 28533   2   22  0   41.10   23.10
19:13:38.947 -> 28534   3   22  0   41.10   23.10
19:13:39.387 -> 28535   4   22  0   41.10   23.00
19:13:39.761 -> 28536   5   22  0   41.10   23.00
19:13:40.183 -> 28537   6   22  0   41.10   23.00
19:13:40.558 -> 28538   7   22  0   41.10   23.00
19:13:40.980 -> 28539   8   22  0   41.10   23.00
19:13:41.402 -> 28540   9   22  0   41.10   23.00
19:13:41.777 -> 28541   10  22  0   41.10   23.00
19:13:42.194 -> 28542   11  22  0   41.10   23.10
19:13:42.597 -> 28543   12  22  0   41.10   23.00
19:13:43.005 -> 28544   13  22  0   41.10   23.00
19:13:43.409 -> 28545   14  22  0   41.10   23.00

It looks like that the SENSOR_NOT_READY error is not triggered by the amount of calls to the sensor. To be continued...

RobTillaart commented 4 years ago

Test 4 started yesterday at 16:00 with READ_DELAY of 400

It is now 12:20 next day so it has run for over 20 hours doing 181.000++ reads.

Only "one burst" of errors occurred, with intermittent good reads

Time frame

01:56:15.384 -> 88327   6   22  -4  -999.00 -999.00
....
02:03:38.849 -> 89429   13  22  -4  -999.00 -999.00

The test started at 16:00, the burst started at 01:56, so we again find a crash 9:56 ~10 hours after starting the test. 88327 * 400 = 35.330.800 so again near 36M again.

Duration

End 89429   02:03:38.849
Start   88327   01:56:15.384
=====================================
    1102    00:07:23.465

    648     DHTLIB_ERROR_SENSOR_NOT_READY
    92      DHTLIB_ERROR_BIT_SHIFT
    362 DHTLIB_OK
    ====
    1102

The burst of failing reads is shorter than the previous one but still 7+ minutes, way too long. Note that there were relative more good reads during the burst. Also interesting is that the ratio between NOT_READY and BIT_SHIFT is in the same order of magnitude.

Although there were more good reads in between, I still consider this a single burst of failure as readings were not stable during this 7:23 minute period.

Conclusions so far

So after ~10hrs something happens that causes reads to crash. Whatever the cause, it is seems safe to conclude the crashes are

time ~10hr triggered (3x measured within ~2%, multiple estimates same order)
not triggered by the amount of read() calls
not triggerd by the READ_DELAY which must of course not be too small.

Long duration tests with DHT11 instead of DHT22 do not show this crash after ~10hrs. So we can (carefully) conclude that the ESP32 does probably not cause the failure. It is more probable that the root cause must be found in a timing issue in the DHT22 itself. Unfortunately we cannot look inside (the code of) the DHT22.

This investigation has not found the cause. It did however find points to improve the library although these were not related to the problem as the problem still occurs. The test started yesterday still runs to see if a second burst occurs like in the previous.

I will do more tests and if there are new insights I will report them here.

RobTillaart commented 4 years ago

And yes I got a Second burst

14:26:04.722 -> 199643  7   22  -3  -999.00 -999.00
....
14:33:28.974 -> 200747  1   22  -4  -999.00 -999.00

Quick numbers, burst look same order of magnitude Similar ratio between SENSOR_NOT_READY and BIT_SHIFT

End 200747  14:33:28.974
Start   199643  14:26:04.722
=====================================
    1104    00:07:24.252

    668     DHTLIB_ERROR_SENSOR_NOT_READY
    103     DHTLIB_ERROR_BIT_SHIFT
    333 DHTLIB_OK
    ====
    1104

Time between bursts 110.214 * 400 = 44.085.600 millis 44M does not ring any bell as special number.

Approx same time difference as between the 1st and 2nd burst in run 3 7 minutes on 12:22 hours = ~1% difference.

Start2  199643  14:26:04.722
Start1  89429   02:03:38.849
=======================
    110214  12:22:26....

The test run continues to see if there is a 3rd burst at same interval.
Expected at 14:30 + 12:30 = around 03:00 tomorrow....

Mr-HaleYa commented 4 years ago

I'm glad you were able to replicate my errors so at least I know I'm not the only one LOL. You said long-term test with the dht11. I have had one running for 7 days and it hasn't thrown a single error, but as soon as I put a dht22 in there it throws errors... So yes it must be some kind of internal timing failure inside the dht22 itself.

since there is no way to directly alter the code inside of the dht22 and even if there were, common users probably would not have the capabilities to do it, to believe this problem is even fixable or is it just a bug by nature of poor design?

RobTillaart commented 4 years ago

Hi, The variation in times measured seems to me the variation of the internal clock. Quite sure it has no crystal on board. As the timing of the burst seems - within some margin - quite repeatable we may conclude it is systematic.

As the amount of calls the internal processor needs to handle it is quite sure not a call related memory leak as that would pop up after the same # calls. It could be a time related memory leak but we have two duration's 10 hrs (36 M) and 12+ hrs (44M). And I cannot match those on typical magic numbers of a processor which are powers of 2 mostly.

What is approx a power of 2 is the difference 44M - 36M = 8M which is 2^23, and I happen to know that the mantissa of IEE754 floating math has 23 bits. Sort of mantissa overflow? But that seems to far out. And we do not get a failure every 2h20m.

The DHT22 has higher precision, and could use float math for the temp sensor and the humidity sensor. As the ranges are larger and precision is better, its internal code is at least not the same and more complex than the DHT11.

As the wakeup delay of the DHT11 is roughly 16x the wakeup delay of the DHT22 it could be that the internal clocks differ by a factor 16. If they both have the same bug the DHT11 could have failed only after 10 hrs x 16 = 6 days and 16 hours. As you tested for 7 days it should have popped up.

What is also intriguing is that the bursts take 7-10 minutes and then the sensor stabilizes again for at least 10 hrs as far as we know.

A thought that passed my mind is that the sensor has some internal "calibration" routine that is executed every X hours. Such calibration could make a sensor less responsive for some time. But that is a (big) assumption too.

In short, the bug is too repeatable for me to be a bug by nature so I can only assume it is a bug (or feature with side effects) in the internal code of the sensor but I cannot prove it.

RobTillaart commented 4 years ago

@Mr-HaleYa

And yes!, got the third burst at the predicted time! So pretty sure to conclude it is systematic.

This time a single BIT_SHIFT error was seen 15 minutes before the BIG burst.

2:40:14.845 -> 308628   2   22  -3  -999.00 -999.00   BIT_SHIFT (single)
(15 minutes only good)

02:55:39.630 -> 310917  11  22  -4  -999.00 -999.00
...
03:03:22.356 -> 312066  5   22  -4  -999.00 -999.00

Quick numbers, burst look same order of magnitude Similar ratio between SENSOR_NOT_READY and BIT_SHIFT

End 312066  03:03:22.356
Start   310917  02:55:39.630
=====================================
    1149    00:07:42.276

    575     DHTLIB_ERROR_SENSOR_NOT_READY
    110     DHTLIB_ERROR_BIT_SHIFT
    464 DHTLIB_OK
    ====
    1149

Time between bursts 111319 * 400 = 44.527.600 millis Same order of magnitude as the previous period between bursts.

Start3  310917  02:55:39.630
Start2  199643  14:26:04.722
=======================
    111319  12:29:35....

So we have now three intervals between bursts

    INTERVAL    SECONDS     DURATION
run3    12:29:57    44997       9:40 = 582
run4,1  12:22:26    44546       7:24 = 444
run4.2  12:29:35    44975       7:42 = 462
=======================================================
Average 12:24:56    44696       8:16 = 496

(rough math) Burst Interval = 12:24:56 +- 1% Burst duration = 00:08:16 +- 20%

Note: I recall that duration's in first bursts are also in the 7-10 minutes range. I have logged 00:09:56 and 00:07:23 in above notes. If these are included the average burst duration is: 8:25 +-18% so not that different. No new insight, only confirming numbers.

Still cannot explain why the first burst is after ~10:00 hrs and not after 12:25 hrs.

Also noticed that the burst looks like a sort of wave

ramp up - number of fails increase
a hard core - no good reads at all (~240 in this burst)
ramp down - number of fails decrease (should make a histogram per 30 seconds to visualize this)

Conclusion The occurrence of a burst seems predictable. Next burst is expected at 03:00 + 12:25 = 15:25

Moon ? bug by nature? I googled the time 12:25 and came across this Frequency of tides Because the Earth rotates through two tidal “bulges” every lunar day, coastal areas experience two high and two low tides every 24 hours and 50 minutes. High tides occur 12 hours and 25 minutes apart.

Moon calculator The moon rose at 1:20 ; highest at 5:33 and set at 9:55 so no specific position of the moon during the burst. So the time commonality looks like a coincidence. Unfortunately, otherwise we would have found a "moon phase detector" which would be pretty cool :)

coincidences are events we accept without questioning them further

Mr-HaleYa commented 4 years ago

@RobTillaart

I was looking at some forums seeing if I could find anyone having a similar issue and it appears almost everyone has a similar issue where it just stops reading. I couldn't find anyone talking about the 10-hour shut off then recover then 12-hour shut off then recover like we are.

I did find some people say put a decoupling capacitor between the VIN and ground and that can fix the error the other users were talking about, something about you can get a ripple from when the Wi-Fi kicks in on the board.

since we're just grasping at straws now could it possibly be the internal clock of the esp32 doing something at 10 hours when it boots and then 12 hours and that could be causing the interesting readings because of the ripple similar to what Wi-Fi kicking on can cause? maybe you should throw up a test station with an 8266 to see if we get the same results on that. Quick way to test that.

a moon phase detector would have been the coolest thing ever for such a cheap sensor LOL. very disappointed now... 😂

RobTillaart commented 4 years ago

mmmm some thinking If the capacitor in/on the DHT22 is different than the capacitor in the DHT11, then it would be possible that effect is caused by the ESP32 (timing) and that one cap "solves" it and the other doesn't. Another long run with extra cap could test it.

Do not have ESP8266 nearby, would need to order them. I could run a long test on an UNO to see if the 12hr or 10hr crashes appear there too. If not the finger would point more to the ESP32?

Currently I have other tasks to attend so it might take a few days before I'll be back.

RobTillaart commented 4 years ago

@Mr-HaleYa FYI Burst came at 15:26:05 - 15:33:33 = 00:07:28 => Confirms timing
649 NOT_READY vs 86 BIT_SHIFT in burst size 1087 no time to elaborate (saved the log for later if time permits)

RobTillaart commented 4 years ago

Last night burst at expected time - expected duration - expected ratio (no new insight)

03:55:10.198 -> 533513  7   22  -4  -999.00 -999.00
...
04:02:42.609 -> 534637  6   22  -3  -999.00 -999.00

duration 00:07:32.411 interval around 12.5 hr burst size: 1124 665 NOT_READY 86 BIT_SHIFT

Going to test with UNO, if that shows the burst of failures too, the prime suspect will be DHT22 sensor otherwise the ESP32 still is an option.

started UNO test: 09:12:51 started with 14 x SENSOR_NOT_READY --- stopped that test to redo the test with same DHT22 sensor I used for ESP32

started UNO test2: 10:10:02 started with 14 x SENSOR_NOT_READY (no start up delay first second ...)

RobTillaart commented 4 years ago

mmm

it misses bits

10:13:50.618 -> 556 0 11 0 3.70 22.70

should be around ~50% humidity.. adjust timing between reads to datasheet specs (2000 millis) now reads seem OK.

Timing of an UNO is apparently different enough to make a difference.

update - 700 reads looks stable now

RobTillaart commented 4 years ago

It is just past 08:00, UNO has run 39204 reads from the same DHT22 with 2000 millisec intervals. 78000 seconds = 21: 40 minutes

=> zero failures So not seen the crash around 10 hrs and not seen a crash after 12:25 hours. OK for the last one we should wait another 2 hours as it appeared after 22:25 after startup. to be continued

Next test will be an ESP32 24 hr run with explicit interrupts disabled.

Mr-HaleYa commented 4 years ago

so, so far it looks like I was right???? about it being specifically bc of esp32

RobTillaart commented 4 years ago

Right in which statement? I cannot prove any cause yet as long as I cannot reproduce the crash at "any" moment. There is a reproducible scenario now but no root cause identified.

Yes, a failure occurs when reading the sensor too fast (within 350 ms) but faster than 1000 / 2000 is specified as being incorrect in the datasheet.

RobTillaart commented 4 years ago

UNO + DHT22

17:42:54.216 -> 56375 4 22 0 57.60 24.30

56375 reads every 2000 millis, = 112750 seconds = 31:19:10 ZERO fails

The failure moment is not affected by the interval chosen as we proved before with the ESP32 test. The DHT22 is the same, the library is the same, only change is the processor. So this makes the ESP32 the next "suspect" of the burst of failing reads.

I will now start a test with an ESP32 24 hr run with explicit interrupts disabled.


// Mr-HaleYa testscript

#include <dhtnew.h>
DHTNEW mySensor(16);

uint32_t count = 0;

void setup()
{
  Serial.begin(115200);

  mySensor.setWaitForReading(true);
  mySensor.setReadDelay(2000);
  mySensor.setDisableIRQ(true);

  Serial.println("\n1. Type detection test, first run might take longer to determine type");
  Serial.println("CNT\tI\tTYPE\tSTAT\tHUMI\tTEMP");
}

void loop()
{
  getDHT();
  Serial.println();
}

float DHTtempBuffer[15];
float DHThumidityBuffer[15];
int DHTReadings = 15;                              // number of DHT readings before choosing a median

void getDHT() {
  for (int i = 0 ; i < DHTReadings; i++) {
    count++;
    Serial.print(count);
    Serial.print("\t");
    Serial.print(i);
    Serial.print("\t");
    Serial.print(mySensor.getType());
    Serial.print("\t");
    int status = mySensor.read();                                     // Gets a reading from the DHT sensor
    Serial.print(status);
    Serial.print("\t");
    DHThumidityBuffer[i] = mySensor.getHumidity();
    DHTtempBuffer[i] = mySensor.getTemperature();
    Serial.print(DHThumidityBuffer[i]);
    Serial.print("\t");
    Serial.println(DHTtempBuffer[i]);
  }
}

RobTillaart commented 4 years ago

Started

17:57:59.309 -> 1. Type detection test, first run might take longer to determine type
17:57:59.309 -> CNT I   TYPE    STAT    HUMI    TEMP
17:57:59.309 -> 1   0   0   0   55.20   25.20
17:57:59.309 -> 2   1   22  0   55.10   25.20
17:58:01.336 -> 3   2   22  0   55.10   25.10
17:58:03.338 -> 4   3   22  0   54.90   25.20
17:58:05.318 -> 5   4   22  0   54.80   25.10
17:58:07.335 -> 6   5   22  0   54.70   25.20

Mr-HaleYa commented 4 years ago

So this makes the ESP32 the next "suspect" of the burst of failing reads.

This is what I meant by it appears that I am right. My suspicion was that the esp32 is causing it.

RobTillaart commented 4 years ago

This is what I meant by it appears that I am right. My suspicion was that the esp32 is causing it.

I cannot prove it is the ESP32, as the DHT11 test you did work for more than 7 days. I am open for suggestions how to test this hypothesis.

RobTillaart commented 4 years ago

@Mr-HaleYa

Disabling the interrupts did not solve the DHT22 reading failures

03:51:04.386 -> 17759   13  22  -4  -999.00 -999.00
...
03:57:42.800 -> 17958   2   22  -4  -999.00 -999.00

Burst duration: 8.38 same range The ratio between SENSOR_NOT_READY and BIT_SHIFT is 4 to 1 This is different from other measurements about 6 to 1, improvement in theory as bit shifts can partial be recovered from but no help in solving the issue.

End 17958   03:57:42.800
Start   17759   03:51:04.386
=====================================
    199 00:08:38.414

    130     DHTLIB_ERROR_SENSOR_NOT_READY
    34      DHTLIB_ERROR_BIT_SHIFT
    33  DHTLIB_OK
    ====
    199

Time to bursts 17759 * 2000 = 35.518.000 millis Same order of magnitude as the previous first bursts.

Conclusion The problem is hard and no new insights are made so far. Test will run to see if the next failure is around 03:55 + 12:25 = 16:20

Mr-HaleYa commented 4 years ago

I cannot prove it is the ESP32, as the DHT11 test you did work for more than 7 days.

I did not do a full 7 days (slight exaggeration...) I got through about 4 1/2 I think and then I rebooted it because I had to do some maintenance so if you were saying the error should have occurred on the 6th day, I never got there. I will have one up and running for 7 days though once I get all of my problems sorted out.

I am open for suggestions how to test this hypothesis.

it only seems to happen when using an esp32 with the dht22 so if we try more boards such as an 8266 an Uno a mega a nano and if none of them give problems but the 32 then there has to be a conclusion that the 32 is causing the error. right? (Even if we don't explicitly know why)

RobTillaart commented 4 years ago

Analyzing the bug takes already a lot of my time and I will not test other boards as these will not give me information as far as I can see.

From my head I have never seen the error in combination with the MEGA and NANO, that is no proof for this version of the lib etc. I recall that a similar problem with the DHT22 existed with a fore runner of this lib - DHTstable - in combination with a 8266. few years ago. I cannot recall the solution but if the bug/solution was in the library it would have migrated into DHTnew. I might need to check if I can find that.

You are not right, you mix up correlation and causation. See - https://www.tylervigen.com/spurious-correlations

If we see the problem only occurring in the combination of DHT22 and ESP32 we cannot conclude which one causes it. We only see the correlation. (that is the point we are now) If we don't see the problem occurring on any other platform, yes that makes the ESP32 more suspect, but it is no proof for causation. For what it is worth it still can be the library as well.

Most important step to make is a way to trigger the error explicitly at any time.

RobTillaart commented 4 years ago

Bit of searching and reading found back the 8266 discussion

https://github.com/RobTillaart/Arduino/issues/102 Solution was a WIFI setting if I read correctly. As your test script does not use WIFI, could this be the cause?

from that thread

Sorry Rob, my last comment was not so clear. To try to clarify, I don't use a new version of Adafruit library or a new version of WiFi library. By code version 0.2 on Github, I was referring to the last version of my own code which uses the ESP8266 Wifi in station mode only (STA). By default, the ESP8266 runs WiFi in AP+STA mode. My previous code version was using WiFi in default mode (so AP+STA). My current code version (0.2) uses WiFi in STA mode only. I assume this is the reason I don't get no more NAN. My try with your own library was also with the version 0.2 of my own code.

ADAfruit DHT22

Checked latest ADAfruit library - https://github.com/adafruit/DHT-sensor-library/blob/master/DHT.cpp (line numbers may vary in future)

line 244: yield() call before actual read to handle WIFI interrupts might relate to the WIFI 10 hr failure mentioned above -> can be included.
line 261: wake up delay is 10% longer for all sensor types. -> can be included.
Line 275: delayMicroseconds(pullTime); waits default 55 usec - DHTnew does 40. -> can be included.
Line 281: InterruptLock lock; is another way to prevent interrupts. Important difference is that the ADAfruit code disables interrupts always but for a shorter period. -> can be changed
ADAfruit time out loops differ slightly, after about 1 millisecond. -> Mine are longer so not interesting.
ADAfruit has complete different way of bit reading. -> Error occurs before this part of protocol so not interesting.

I will add the timing/ interrupt related points [1..4] and do a 24 hr test run when time permits...

RobTillaart commented 4 years ago

@Mr-HaleYa

Results of test with interrupts disabled. As predicted the failure burst came at 16:20

16:20:16.428 -> 40189   3   22  -4  -999.00 -999.00
...
16:26:46.817 -> 40384   3   22  -3  -999.00 -999.00

Burst duration: 6:30 lowest so far but same order of magnitude The ratio between SENSOR_NOT_READY and BIT_SHIFT is 4.6 to 1 This is lower than other measurements ~6 to 1.

End 40384   16:26:46.817
Start   40189   16:20:16.428
=====================================
    195 00:06:30.389

    129     DHTLIB_ERROR_SENSOR_NOT_READY
    28      DHTLIB_ERROR_BIT_SHIFT
    38  DHTLIB_OK
    ====
    195

Interval since previous burst 40189 - 17958 = 22231 * 2000 = 44.462.000

Conclusions

burst pattern confirmed
ratio pattern (more or less) confirmed
Disable interrupt flag does NOT prevent crashes. This is a new information. However it gives me no insight yet about the cause.

Time for next test with patches mentioned above

Mr-HaleYa commented 4 years ago

So I'm using an esp32 with a cellular modem on it and I use this cellular modem to send my data to my web server but I do not use the Wi-Fi on it at all. I'm curious if I still need to disable Wi-Fi because as soon as the chip boots up it turns the Wi-Fi modem on, you just don't use it so it's a low-power mode. Same goes for the Bluetooth module, I wonder if that needs disabled as well.

RobTillaart commented 4 years ago

@Mr-HaleYa

Test with extended timing + extra yield() call ran for 12 hours and ZERO failures so far. Test is now halfway its 24 hrs run and this looks promising as it skipped the 10 hr read failure. Question is which of the 4 changes made the difference? Or is it the combination of some/all?

So it looks like disabling wifi / bluetooth is not needed.

Mr-HaleYa commented 4 years ago

That's promising! What exactly did you change? Just added 1 and 4? Can you post a snippet.

RobTillaart commented 4 years ago

@Mr-HaleYa I pushed my test branch so you can see the deltas they include 1,2,3 and 4 of the list above + some changes I made for version 0..3.0 e.g. to setReadDelay().

https://github.com/RobTillaart/DHTNew/tree/test_timing

I want to understand which timing changes did the trick, or was it the call to yield()? Two additional test runs are needed to see when it fails again.

The other changes I will revert as I want to keep implementation in line with the datasheet as much as possible.

Mr-HaleYa commented 4 years ago

@RobTillaart It is saying your test branch was updated 6 days ago... can you confirm that you pushed it.

RobTillaart / DHTNew

Strange readings #11