Unexpected behaviour with poll and tickless scheduler

g2gps commented 9 months ago

Hello,

I'm using poll in conjunction with timerfds and the POLLIN event to schedule actions at different rates. I'm finding that when CONFIG_SCHED_TICKLESS is used, poll is returning with only a subset of my file descriptors in revents field, even though I'm specifying the timeout to occur at the same time.

Here's a working example which demonstrates the issue, in which I:

Setup 8 timer file descriptors to have their first expiry after 3 seconds, and then expire at a 1 second rate.
Poll continuously for POLLIN events.
Report the expired events and print which fds expired as a hex value.

#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <poll.h>
#include <sys/timerfd.h>
#include <stdint.h>
#include <unistd.h>

#define TIMERFD_COUNT 8

static int fds[TIMERFD_COUNT] = {-1};

int create_timers(void)
{
  struct itimerspec its;

  clock_gettime(CLOCK_MONOTONIC, &its.it_value);

  // Set initial alarm for 3 seconds in the future
  its.it_value.tv_sec += 3;
  its.it_value.tv_nsec = 0;

  // Set interval to 1 second
  its.it_interval.tv_sec = 1;
  its.it_interval.tv_nsec = 0;

  for(int i = 0; i < TIMERFD_COUNT; i++)
  {
    fds[i] = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);
    if(fds[i] < 0)
    {
      return fds[i];
    }

    if (timerfd_settime(fds[i], TFD_TIMER_ABSTIME, &its, NULL) < 0)
    {
      return errno;
    }
  }

  return 0;
}

void cleanup(void)
{
  for(int i = 0; i < TIMERFD_COUNT; i++)
  {
    if(fds[i] > 0)
    {
      close(fds[i]);
    }
  }
}

void timer_read(int fd)
{
  uint64_t expirations;
  read(fd, &expirations, sizeof(expirations));
}

size_t wait(void)
{
  struct pollfd pfds[TIMERFD_COUNT];
  const int timeout_ms = 10000;
  int ret;
  size_t mask = 0;

  for(int i = 0; i < TIMERFD_COUNT; i++)
  {
    pfds[i].fd = fds[i];
    pfds[i].events = POLLIN;
  }

  ret = poll(pfds, TIMERFD_COUNT, timeout_ms);
  if(ret < 0)
  {
    exit(EXIT_FAILURE);
  }

  for (int i = 0; i < TIMERFD_COUNT; i++)
  {
    if (pfds[i].revents == POLLIN)
    {
      mask |= (1 << i);
      timer_read(pfds[i].fd);
    }
  }
  return mask;
}

int main(int argc, char *argv[])
{
  size_t mask;
  int ret = create_timers();
  atexit(cleanup);

  if(ret < 0)
  {
    printf("Cannot create timers: %d\n", ret);
    exit(EXIT_FAILURE);
  }

  printf("Start poll\n");
  while(1)
  {
    mask = wait();
    printf("mask 0x%zX\n", mask);
  }

  exit(EXIT_SUCCESS);
}

Expected output.

I've tried the following example on the following platforms:

Linux: with gcc poll_test_main.c -o poll_test && ./poll_test
sim:nsh with CONFIG_TIMER_FD added.
qemu-rv:knsh32
arty_a7:knsh

In each case I get:

nsh> poll
Start poll
mask 0xFF
mask 0xFF
mask 0xFF
mask 0xFF
mask 0xFF
mask 0xFF
mask 0xFF

Which is what I am expecting.

However, when I use a tickless scheduler, such as:

qemu-armv8a:nsh_smp_tickless
arty_a7:knsh-tickless

I get the following:

nsh> poll
Start poll
mask 0x80
mask 0x70
mask 0xF
mask 0x80
mask 0x70
mask 0xF
mask 0x80
mask 0x70
mask 0xF
mask 0x80
mask 0x78
mask 0x7

In this case poll needs to be called 2-3 times for all expiry events to be returned. It doesn't seem to miss any events, they just are not all returned at the same time.

I haven't looked into this too much yet, but I feel like the underlying cause may also be what is causing the issue in #9840.

If there's any interest, I can add this example program, as well as any additions needed to the mentioned default configurations in our NuttX fork, to ease testing.

I would appreciate and feedback or suggestions.

g2gps commented 9 months ago

I was able to test this out on the ESP32-C3 devkitc, same issue:

ESP-ROM:esp32c3-api1-20210207
Build:Feb  7 2021
rst:0x1 (POWERON),boot:0xc (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fcd5820,len:0x16c0
load:0x403cc710,len:0x87c
load:0x403ce710,len:0x2f54
entry 0x403cc710
I (30) boot: ESP-IDF v5.1-dev-3972-g1559b6309f 2nd stage bootloader
I (30) boot: compile time Mar 15 2023 12:14:06
I (31) boot: chip revision: v0.3
I (35) boot.esp32c3: SPI Speed      : 80MHz
I (40) boot.esp32c3: SPI Mode       : DIO
I (44) boot.esp32c3: SPI Flash Size : 4MB
I (49) boot: Enabling RNG early entropy source...
I (54) boot: Partition Table:
I (58) boot: ## Label            Usage          Type ST Offset   Length
I (65) boot:  0 factory          factory app      00 00 00010000 00100000
I (73) boot: End of partition table
I (77) esp_image: segment 0: paddr=00010020 vaddr=3c020020 size=0a41ch ( 42012) map
I (92) esp_image: segment 1: paddr=0001a444 vaddr=3fc83630 size=001c0h (   448) load
I (94) esp_image: segment 2: paddr=0001a60c vaddr=40380000 size=01a84h (  6788) load
I (103) esp_image: segment 3: paddr=0001c098 vaddr=00000000 size=04060h ( 16480) 
I (113) esp_image: segment 4: paddr=00020100 vaddr=42000100 size=1bb98h (113560) map
I (137) boot: Loaded app from partition at offset 0x10000
I (137) boot: Disabling RNG early entropy source...

NuttShell (NSH) NuttX-10.2.0
nsh> poll
Start poll
mask 0xFF
mask 0x1
mask 0xFE
mask 0x1
mask 0x2
mask 0xFC
mask 0x1
mask 0x2
mask 0x4
mask 0xF8
mask 0x1
mask 0x2
mask 0x4
mask 0x8
mask 0xF0
mask 0x1
mask 0x2
mask 0x4
mask 0x8
mask 0x10
mask 0x20
mask 0xC0
mask 0x1
mask 0x2
mask 0x4
mask 0x8
mask 0x10
mask 0x20
mask 0x40
mask 0x80
mask 0x1
mask 0x2
mask 0x4
mask 0x8
mask 0x10
mask 0x20
mask 0x40
mask 0x80

In this case the call to poll produces all events. Subsequent calls return less events.

acassis commented 9 months ago

@g2gps are you using Tickless mode with RR ? Try to use CONFIG_RR_INTERVAL=200 for example and see if it improves the results. I think that in tickless mode each event itself is forcing the context switch and you are getting the preliminary mask instead of the complete bitmap set mask.

g2gps commented 9 months ago

Thanks acassis, I just tried this out with the esp32c3-devkit:tickles configuration with,

CONFIG_RR_INTERVAL=0 (The default for this configuration)
CONFIG_RR_INTERVAL=20
CONFIG_RR_INTERVAL=200

In each case, I get a similar result.

I was having a look at vfs/fs_timerfd.c in timerfd_settime:

int timerfd_settime(int fd, int flags,
                    FAR const struct itimerspec *new_value,
                    FAR struct itimerspec *old_value)
{
  FAR struct timerfd_priv_s *dev;
  FAR struct file *filep;
  irqstate_t intflags;
  sclock_t delay;
  int ret;

  /* Some sanity checks */

  if (!new_value)
    {
      ret = -EFAULT;
      goto errout;
    }

  if ((flags & ~TFD_TIMER_ABSTIME) != 0)
    {
      ret = -EINVAL;
      goto errout;
    }

  /* Get file pointer by file descriptor */

  ret = fs_getfilep(fd, &filep);
  if (ret < 0)
    {
      goto errout;
    }

  /* Check fd come from us */

  if (!filep->f_inode || filep->f_inode->u.i_ops != &g_timerfd_fops)
    {
      ret = -EINVAL;
      goto errout;
    }

  dev = (FAR struct timerfd_priv_s *)filep->f_priv;

  /* Disable interrupts here to ensure that expiration counter is accessed
   * atomicaly.
   */

  intflags = enter_critical_section();

  if (old_value)
    {
      /* Get the number of ticks before the underlying watchdog expires */

      delay = wd_gettime(&dev->wdog);

      /* Convert that to a struct timespec and return it */

      clock_ticks2time(delay, &old_value->it_value);
      clock_ticks2time(dev->delay, &old_value->it_interval);
    }

  /* Disarm the timer (in case the timer was already armed when
   * timerfd_settime() is called).
   */

  wd_cancel(&dev->wdog);

  /* Clear expiration counter */

  dev->counter = 0;

  /* If the it_value member of value is zero, the timer will not be
   * re-armed
   */

  if (new_value->it_value.tv_sec <= 0 && new_value->it_value.tv_nsec <= 0)
    {
      leave_critical_section(intflags);
      return OK;
    }

  /* Setup up any repetitive timer */

  clock_time2ticks(&new_value->it_interval, &delay);
  dev->delay = delay;

  /* We need to disable timer interrupts through the following section so
   * that the system timer is stable.
   */

  /* Check if abstime is selected */

  if ((flags & TFD_TIMER_ABSTIME) != 0)
    {
      /* Calculate a delay corresponding to the absolute time in 'value' */

      clock_abstime2ticks(dev->clock, &new_value->it_value, &delay);
    }
  else
    {
      /* Calculate a delay assuming that 'value' holds the relative time
       * to wait.  We have internal knowledge that clock_time2ticks always
       * returns success.
       */

      clock_time2ticks(&new_value->it_value, &delay);
    }

  /* If the time is in the past or now, then set up the next interval
   * instead (assuming a repetitive timer).
   */

  if (delay <= 0)
    {
      delay = dev->delay;
    }

  /* Then start the watchdog */

  ret = wd_start(&dev->wdog, delay, timerfd_timeout, (wdparm_t)dev);
  if (ret < 0)
    {
      leave_critical_section(intflags);
      goto errout;
    }

  leave_critical_section(intflags);
  return OK;

errout:
  set_errno(-ret);
  return ERROR;
}

There would be finite time between when the number of ticks for the absolute time expiry is calculated, and when it's added to the list of activate watchdogs, in wd_start. Given that in tickless mode, the underlying timebase is till running, could this not cause the watchdog timeout tick to be sqewed?

rtucker85 commented 9 months ago

Hi @g2gps, @acassis

I was able to get my hands on an stm32f4discovery and test it on that platform. Same issue. See below:

nsh> poll
Start poll
mask 0xE0
mask 0x1F
mask 0xE0
mask 0x1F
mask 0xE0
mask 0x1F
mask 0xE0
mask 0x1F
mask 0xE0
mask 0x1F

acassis commented 9 months ago

Thanks acassis, I just tried this out with the esp32c3-devkit:tickles configuration with,

* CONFIG_RR_INTERVAL=0 (The default for this configuration)

* CONFIG_RR_INTERVAL=20

* CONFIG_RR_INTERVAL=200

In each case, I get a similar result.

I was having a look at vfs/fs_timerfd.c in timerfd_settime:

int timerfd_settime(int fd, int flags,
                    FAR const struct itimerspec *new_value,
                    FAR struct itimerspec *old_value)
{
  FAR struct timerfd_priv_s *dev;
  FAR struct file *filep;
  irqstate_t intflags;
  sclock_t delay;
  int ret;

  /* Some sanity checks */

  if (!new_value)
    {
      ret = -EFAULT;
      goto errout;
    }

  if ((flags & ~TFD_TIMER_ABSTIME) != 0)
    {
      ret = -EINVAL;
      goto errout;
    }

  /* Get file pointer by file descriptor */

  ret = fs_getfilep(fd, &filep);
  if (ret < 0)
    {
      goto errout;
    }

  /* Check fd come from us */

  if (!filep->f_inode || filep->f_inode->u.i_ops != &g_timerfd_fops)
    {
      ret = -EINVAL;
      goto errout;
    }

  dev = (FAR struct timerfd_priv_s *)filep->f_priv;

  /* Disable interrupts here to ensure that expiration counter is accessed
   * atomicaly.
   */

  intflags = enter_critical_section();

  if (old_value)
    {
      /* Get the number of ticks before the underlying watchdog expires */

      delay = wd_gettime(&dev->wdog);

      /* Convert that to a struct timespec and return it */

      clock_ticks2time(delay, &old_value->it_value);
      clock_ticks2time(dev->delay, &old_value->it_interval);
    }

  /* Disarm the timer (in case the timer was already armed when
   * timerfd_settime() is called).
   */

  wd_cancel(&dev->wdog);

  /* Clear expiration counter */

  dev->counter = 0;

  /* If the it_value member of value is zero, the timer will not be
   * re-armed
   */

  if (new_value->it_value.tv_sec <= 0 && new_value->it_value.tv_nsec <= 0)
    {
      leave_critical_section(intflags);
      return OK;
    }

  /* Setup up any repetitive timer */

  clock_time2ticks(&new_value->it_interval, &delay);
  dev->delay = delay;

  /* We need to disable timer interrupts through the following section so
   * that the system timer is stable.
   */

  /* Check if abstime is selected */

  if ((flags & TFD_TIMER_ABSTIME) != 0)
    {
      /* Calculate a delay corresponding to the absolute time in 'value' */

      clock_abstime2ticks(dev->clock, &new_value->it_value, &delay);
    }
  else
    {
      /* Calculate a delay assuming that 'value' holds the relative time
       * to wait.  We have internal knowledge that clock_time2ticks always
       * returns success.
       */

      clock_time2ticks(&new_value->it_value, &delay);
    }

  /* If the time is in the past or now, then set up the next interval
   * instead (assuming a repetitive timer).
   */

  if (delay <= 0)
    {
      delay = dev->delay;
    }

  /* Then start the watchdog */

  ret = wd_start(&dev->wdog, delay, timerfd_timeout, (wdparm_t)dev);
  if (ret < 0)
    {
      leave_critical_section(intflags);
      goto errout;
    }

  leave_critical_section(intflags);
  return OK;

errout:
  set_errno(-ret);
  return ERROR;
}

There would be finite time between when the number of ticks for the absolute time expiry is calculated, and when it's added to the list of activate watchdogs, in wd_start. Given that in tickless mode, the underlying timebase is till running, could this not cause the watchdog timeout tick to be sqewed?

Yes, I think it could be the root cause!

@xiaoxiang781216 @patacongo @phfbertoleti any idea or suggestions how to fix it?

acassis commented 9 months ago

Hi @g2gps, @acassis

I was able to get my hands on an stm32f4discovery and test it on that platform. Same issue. See below:
nsh> poll
Start poll
mask 0xE0
mask 0x1F
mask 0xE0
mask 0x1F
mask 0xE0
mask 0x1F
mask 0xE0
mask 0x1F
mask 0xE0
mask 0x1F

@rtucker85 thank you for confirming!

anchao commented 4 months ago

@g2gps Have you tested this issue on latest master? I can't reproduce on qemu-armv8a:nsh_smp_tickless

g2gps commented 4 months ago

I retested this today with NuttX main.

The issue is not present with qemu-armv8a:nsh_smp_tickless. Although I still have the same failed result with the esp32c3-devkit:tickless and Litex based configuration.

If I checkout NuttX at 533ccb184736587b0915f928e2de79ccaf82d223, which was the current revision when this issue was reported, the test fails on qemu-armv8a:nsh_smp_tickless as well.

I'm going to have a quick look through to see if I can find any obvious changes specific to the arm64 port, which may have resolved this. Perhaps someone else has some background here?

g2gps commented 4 months ago

I've added a branch to our fork of nuttx-apps (https://github.com/motec-research/nuttx-apps/tree/issues/11189) and nuttx (https://github.com/motec-research/incubator-nuttx/tree/issues/11189). With the example program and configuration added. Hopefully it makes trying this out a bit easier.

g2gps commented 3 months ago

I've found the cause of the issue. Both esp32c3 and the Litex port use the simple interval timer for tickless events. If I change the tickless implementation to support CONFIG_SCHED_TICKLESS_ALARM, poll works as expected.

I can push a PR for the Litex port once I get the implementation tidied up. However, I assume that this issue would still exist for other ports.

apache / nuttx

Unexpected behaviour with poll and tickless scheduler #11189