SuperHouse / esp-open-rtos

Open source FreeRTOS-based ESP8266 software framework
BSD 3-Clause "New" or "Revised" License
1.53k stars 491 forks source link

Fix lwip timebase - select(), maybe other things #123

Open projectgus opened 8 years ago

projectgus commented 8 years ago

(This is a good candidate for a "first bug" if someone is interested in contributing but unsure where to start.)

Bug was noted in #91 when using select() - the timeout argument to select appears to take RTOS ticks instead of seconds.

The implementation is in lwip/sys_arch.c and the bug may be as simple as the sys_now() function returning tick count instead of tick count.

However, before applying a fix someone should verify this and the other time-related functions in sys_arch.c.

After apply the fix, check the codebase for uses of select() or other time-sensitive LWIP functions.

RLuckom commented 8 years ago

Happy to start working on this. Sorry for the long note--I'm not super familiar with this stuff and I'm being extra careful to explain myself to make sure I understand correctly.

I noticed that http://git.savannah.gnu.org/cgit/lwip.git/tree/doc/sys_arch.txt#n202 explicitly states that sys_now() is supposed to return miliseconds.

Looking at the rest of that file, there are the following time-related functions [edited to remove earlier mistakes]:

sys_arch_sem_wait is also supposed to take a timeout param in milliseconds, and does so.

sys_arch_mbox_fetch is the same.

I didn't see any other functions in sys_arch that dealt with timeouts.

Looking closer at uses of sys_now, I found:

  1. lwip/lwip/config/lwipopts.h defines static inline uint32_t sys_now(void) as NOW()/(TIMER_CLK_FREQ/1000);. I don't see where the NOW() macro is defined, but assuming it is the same as tick count, that is returning ms. This function has a different type than the one in sys.arch.c, and I believe that in most places in lwip, the function from sys_arch.c, not this one, is being used.
  2. lwip/lwip/src/core/timers.c uses sys_now for most or all of its time and timediff needs. It uses sys_now to initialise timeouts_last_time (and reset it in restart_timeouts), and in sys_check_timeouts it is used to calculate the diff when determining if timeouts have expired.
  3. It's used in lwip/lwip/src/core/snmp/mib2.c in the macro SNMP_GET_SYSUPTIME(sysuptime) (sysuptime = (sys_now() / 10)), which would be reporting uptime in hundredths of a second if sys_now reported ms.
  4. used to build a timestamp in lwip/lwip/src/core/tcp_out.c
  5. in lwip/lwip/src/api/api_lib.c and lwip/lwip/src/api/api_msg.c sys_now is used to compare the amount of time a connection has been writing data against its timeout.
  6. in lwip/lwip/espressif/netio.c it's also used to set timestamps and check timeouts
  7. log lines in lwip/lwip/espressif/ping.c explicitly call sys_now units ms.

I've submitted https://github.com/SuperHouse/esp-open-rtos/pull/134 to start the ball rolling on a fix.

projectgus commented 8 years ago

I'm not certain I can reproduce this bug. Using the current master (4b39a0) I modified the SNTP example as follows, to use LWIP select() instead of delay():

diff --git a/examples/sntp/sntp_example.c b/examples/sntp/sntp_example.c
index 242489b..61292ca 100644
--- a/examples/sntp/sntp_example.c
+++ b/examples/sntp/sntp_example.c
@@ -54,7 +54,14 @@ void sntp_tsk(void *pvParameters)

        /* Print date and time each 5 seconds */
        while(1) {
-               vTaskDelayMs(5000);
+
+        struct timeval tv = { .tv_sec = 5 };
+        uint32_t start = xTaskGetTickCount();
+        printf("Pausing 5(?) seconds...\n");
+        select(0, NULL, NULL, NULL, &tv);
+        uint32_t end = xTaskGetTickCount();
+        printf("Took %d ticks.\n", end - start);
+
                time_t ts = time(NULL);
                printf("TIME: %s", ctime(&ts));
        }

The output suggests it's sleeping for 5 seconds each time I call select:

TIME: Sat May 28 06:00:13 2016
Pausing 5(?) seconds...
Took 500 ticks.
TIME: Sat May 28 06:00:18 2016
Pausing 5(?) seconds...
Took 500 ticks.

Verified with a stopwatch that each pause took 5 seconds, also.

Looking again at esp_mqtt_read() in MQTTESP8266.c, it takes an argument of milliseconds and then passes it to the tv_sec field of a timeval - which makes me think the actual bug was sleeping for X seconds instead of X milliseconds.

However, #134 also looks like a correct fix so we should probably merge that anyhow!