buelowp / sunset

Calculate Sunrise and Sunset and the phase of the moon accurately based on date and geographic position
GNU General Public License v2.0
127 stars 36 forks source link

Some latitudes can cause hangs #16

Closed Glichy closed 4 years ago

Glichy commented 4 years ago

First found by trying to test the new functions with Alert, Nunavut, Canada (82.31, -62.18). It kept causing it (NodeMCU 1.0, Arduino framework) to hang and be reset by the software watchdog. Tried to disable the software watchdog to see what would happen, but it just causes the hardware watchdog to trigger the reset. I then tested it with this (after going back to 1.0.11 and with a bigger change in lat first):

#include <Arduino.h>
#include <SunSet.h>

SunSet sun;
double sunrise;
double lat = 67.9; // 67.9 or -69.5

void setup()
{
  Serial.begin(115200);
  Serial.println();
}

void loop()
{
  Serial.printf("%f\t", lat);
  sun.setPosition(lat, 0, 0);
  sun.setCurrentDate(2020, 5, 26);
  sunrise = sun.calcSunrise();
  Serial.println(sunrise);
  lat += .01; // += or -=
}

After increasing then decreasing lat , this is what i got:

67.980000       6.99
67.990000       3.99
68.000000
Soft WDT reset
-69.570000      708.34
-69.580000      712.67
-69.590000      nan
-69.600000      nan
-69.610000      nan
-69.620000      nan
-69.630000      nan
-69.640000      nan
-69.650000      nan
-69.660000      nan
-69.670000
Soft WDT reset

I'm guessing it happens because those areas either have daylight throughout the day or none at all. Is it possible to make it fail in a graceful way?

buelowp commented 4 years ago

Technically, it's failing just fine, and you should be able to handle an unexpected return. I suspect that there isn't a value that would work there which is why it returns NaN, and simply returning another value in it's place isn't technically correct and may be more confusing. I'll need to figure out how to best handle this.

buelowp commented 4 years ago

Ok, I've been banging on this a bit today. The library seems to not be able to do accurate math on latitudes above a certain point. Though which point that is, I can't say. Using Point Barrow Alaska as a test

#define LATITUDE_PB     71.3875
#define LONGITUDE_PB    156.4811
#define TIMEZONE_PB     -8

I can't get a single value to be correct no matter the date. EG, March 21 2020 has a negative return value. Since I don't know how the math and the earth type calculations work, there isn't anything I can do to fix it correctly. Seems someone else figured out a solution, Google knows how to calculate it correctly. This library is based on code from 16 years ago that was tossed onto the internet and forgotten about.

For very large latitudes, you are going to have to be able to check for NAN and negative return values and discard them. If that includes your latitude of concern, I apologize. I just don't know enough about the internals of the math to be able to do this more correctly for those cases.

EDIT I think the reason you are getting hangs is that the 8266 isn't an especially strong 32 bit floating point processor. It does work, but the library clearly goes off into the weeds for very high latitudes which probably puts the 8266 into a state it can't recover from. I'm going to update the README to include a statement that this library doesn't work at very high latitudes. Also, I'm going to work on testing it to see what value it starts to fail at, but the complication is that I can't check every tenth of a degree for correct behavior, so it will be a check for nan or negative returns when I would expect positive returns. Then just reduce the max latitude by a tiny bit to make sure.

Sorry again, I wish I could do more for this one.

buelowp commented 4 years ago

OK, I just committed a change to the branch for fractional timezones. I've tested this for correctness a few ways from 0 to 90 latitudes. I am going to make checking for NaN a requirement for this library as 0 is technically a real time the sun could set depending on where you are in the world and your timezone. The values tested were from 0 to 90 incrementing by .1, and some spot checks showed that the actual value is correct, including it defaulting to NaN when the value wouldn't make sense as a return value.

I don't have a hardware debugger or an 8266 to test with to figure out why the NodeMCU doesn't work all the time there though. I plan to test it against a Particle Photon for extreme ranges when I get a chance this week, but that is the only micro I have access to to see what I can do. This doesn't crash on an RPi or Ubuntu for all possible ranges, so I suspect it's a FP issue with limited resource devices. That's a guess, but it's all I have right now.

Glichy commented 4 years ago

Though which point that is, I can't say.

It doesn't seem to be a fixed point, i changed the date in the test to 2050, 12, 14 and the results were different:

67.590000       705.73
67.600000       709.44
67.610000       nan
67.620000       nan
67.630000       nan
67.640000
Soft WDT reset
-65.950000      3.68
-65.960000      0.52
-65.970000
Soft WDT reset

the 8266 isn't an especially strong 32 bit floating point processor

It's probably not that good compared to others, considering it doesn't even have an FPU, but it takes around 3ms to do calcSunrise() in that loop, or 1.6ms if i set it to 160Mhz.

I also checked if it fails in the same places while at 160Mhz and it does.

buelowp commented 4 years ago

I've been doing some reading on the ESP devices, and the Floating Point performance (it does have a dedicated FPU) is terrible. I still have no idea why it fails, again, no JTAG or an ESP device to test against. But you may be up against a HW issue in this case. I do plan to test against an NXP M4 this week to see if I get similar results to the ESP or if it just works. I'll update when I can do that.

Note that at least the SW WDT is just a timer, so if an operation takes too long, it will trigger. It also will not fire if a processor lockup occurs. That I've tested. How long a timeout do you set? I would bet if the SW timer is going off, then something is taking too long either in my library or in your code, though what I don't know.

https://blog.classycode.com/esp32-floating-point-performance-6e9f6f567a69 https://www.esp32.com/viewtopic.php?f=14&t=800

Glichy commented 4 years ago

Both of those links are about the ESP32, not the 8266, the results i got when searching specifically about the 8266 always say it doesn't have one.

I've never set the timeout value, but this site seems to suggest its between 1.6 and 3.2 seconds and the hardware timer is 8.2 seconds: https://www.sigmdel.ca/michel/program/esp8266/arduino/watchdogs_en.html

I will try to debug it later and maybe also run the test on a STM32F103.

buelowp commented 4 years ago

Yes, I know. But the ESP-IDF documents state it also has the same FPU as the ESP32 I believe. I may have read that wrong. If it doesn't have an FPU, then the 8266 may not be a good choice.

I have abused the 8266 for SW timeouts, but never gotten one to fire. I thought you had to enable it explicitly. My mistake if not. SW watchdogs are usually must enable API's, not default because it's kinda confusing otherwise.

Glichy commented 4 years ago

If it doesn't have an FPU, then the 8266 may not be a good choice.

My use case doesn't really need to do multiple things at once and isn't time critical, so it's probably fine. I guess i will find out eventually.

Both watchdogs are on by default and only the software one can be disabled.

buelowp commented 4 years ago

I tested this against an M4 Particle Photon. It's the only other micro I have for a few weeks. I get no hangs, and I have no other problems with timing. I did the test you have above, it works up to 89.9 degrees without issue.

Glichy commented 4 years ago

Tested it on the F103 (no FPU either) and it worked just fine from -90 to 90. Something odd happens where the 8266 fails, calcSunrise() suddenly starts computing faster:

67.98   6.99    3281 us
67.99   3.99    3281 us
68.00   nan 1920 us
68.01   nan 1921 us
68.02   nan 1917 us
-69.69  nan 1918 us
-69.68  nan 1920 us
-69.67  nan 1920 us
-69.66  nan 3210 us
-69.65  nan 3210 us
-69.64  nan 3212 us
-69.63  nan 3212 us
-69.62  nan 3207 us
-69.61  nan 3211 us
-69.60  nan 3207 us
-69.59  nan 3211 us
-69.58  712.67  3313 us
-69.57  708.34  3322 us

The code used:

#include <Arduino.h>
#include <SunSet.h>

SunSet sun;
double sunrise;
double lat = -90;
uint32_t start;
uint32_t end;

void setup()
{
  Serial.begin(115200);
  Serial.println();
}

void loop()
{
  sun.setPosition(lat, 0, 0);
  sun.setCurrentDate(2020, 5, 26);
  start = micros();
  sunrise = sun.calcSunrise();
  end = micros();
  Serial.print(lat);
  Serial.print("\t");
  Serial.print(sunrise);
  Serial.print("\t");
  Serial.print(end - start);
  Serial.println(" us");
  if (lat > 90.001)
  {
    LowPower_standby();
  }
  lat += .01;
}

Really odd how the 8266 hangs on something that should complete faster than the ones that work. Do you think it's possible that it is an issue in the 8266 version of the Arduino framework?

I have also attached the results of the F103 going through the whole range, in case you or someone else is interested in them. They were gotten using version 1.0.11 but i checked if it still works using 1.1.0 just before commenting this and it does. capture bluepill_f103c8_128k.zip

buelowp commented 4 years ago

I think I'm going to close this as wontfix. I have updated the README to include some references to why the ESP8266 is probably a poor choice for this library, and I honestly have no mechanism to fix this for a non FPU capable environment. It's just not built to work in those cases.

If you disagree, let me know, otherwise, I'll close it tomorrow sometime.

Thanks for all the help.

Glichy commented 4 years ago

Would you be willing to keep it open until those ESP32 you mentioned are available for testing? I'm genuinely curious about the results on it, since if it runs fine on the F103 and it doesn't have an FPU, the issue might be somewhere else.

Otherwise, can you give me a couple more days? I have 2 more tests in mind but one of them requires learning several things i never dealt with.

Glichy commented 4 years ago

Good news!

I went ahead with the first test, which was basically copy-pasting the library into the project and adding println to the start of each function to see where it hangs. It led me to calcGeomMeanLongSun , added more prints there and it shows it hangs before the first while loop. It was trying to do it with a nan value and something doesn't like that. (Maybe it's the conversion to int?)

So i fixed it by checking if t is nan at the beginning of the function and returning nan if it is. After staring at the while loops for a while i noticed they could just be replaced with fmod, which also happens to fix the issue. They also seem to increase performance a little bit and can be used together.

Here are several serial captures showing the results ( Which also show the 8266 seems to be slower when the Wifi is off for some reason?): capture ESP8266.zip And a second test on the F103 to see if it still works: capture bluepill_f103c8_128k full isnan_fmod.zip

Sorry about the multiple pull requests, still learning how this works. The main reason it's not just the one with both fixes is, i only checked a few values to see if they match and didn't check if they match past the second decimal.

buelowp commented 4 years ago

Tested and included. Created release 1.1.3 with this change and a few others. I'm still listing the 8266 as a not valid device just because without the FPU, it's possible it won't work for everyone. I'm glad it works for you now.