ClusterLabs / libqb

libqb is a library providing high performance logging, tracing, ipc, and poll.
http://clusterlabs.github.io/libqb/
GNU Lesser General Public License v2.1
165 stars 97 forks source link

posix_fallocate can be interrupted by a signal #451

Closed shastah closed 2 years ago

shastah commented 2 years ago

Every now and then, use of any Pacemaker cmdline tool fails and results in Pacemaker logging errors:

  (pcmk__new_client)        debug: New IPC client 3efdbecf-c2d9-44bc-b4a6-9bcd48021ba1 for PID 27492 with uid 0 and gid 0
  (handle_new_connection)   debug: IPC credentials authenticated (/dev/shm/qb-7271-27492-12-hfPbKY/qb)
  (qb_ipcs_shm_connect)     debug: connecting to client [27492]
  (qb_rb_open_2)    debug: shm size:524301; real_size:528384; rb->word_size:132096
  (qb_rb_open_2)    debug: shm size:524301; real_size:528384; rb->word_size:132096
  (qb_sys_mmap_file_open)   error: couldn't allocate file /dev/shm/qb-7271-27492-12-hfPbKY/qb-event-cib_rw-data: Interrupted system call (4)
  (qb_rb_open_2)    error: couldn't create file for mmap
  (qb_ipcs_shm_rb_open)     error: qb_rb_open:/dev/shm/qb-7271-27492-12-hfPbKY/qb-event-cib_rw: Interrupted system call (4)
  (qb_rb_close_helper)      debug: Free'ing ringbuffer: /dev/shm/qb-7271-27492-12-hfPbKY/qb-response-cib_rw-header
  (qb_rb_close_helper)      debug: Free'ing ringbuffer: /dev/shm/qb-7271-27492-12-hfPbKY/qb-request-cib_rw-header
  (qb_ipcs_shm_connect)     error: shm connection FAILED: Interrupted system call (4)
  (handle_new_connection)   error: Error in connection setup (/dev/shm/qb-7271-27492-12-hfPbKY/qb): Interrupted system call (4)

The way we worked around it is with a silly "retry" logic in qb_sys_mmap_file_open(). Not sure if this is the best solution, but it proved to be "good enough":

--- a/lib/unix.c    2020-07-29 08:37:35.000000000 +0200
+++ b/lib/unix.c    2021-01-28 21:01:43.984432610 +0100
@@ -79,6 +79,9 @@
    ssize_t written;
    char *buffer = NULL;
    char *is_absolute = strchr(file, '/');
+#ifdef HAVE_POSIX_FALLOCATE
+   int32_t fallocate_retry = 5;
+#endif

    if (is_absolute) {
        (void)strlcpy(path, file, PATH_MAX);
@@ -116,12 +119,22 @@
    }

 #ifdef HAVE_POSIX_FALLOCATE
-   if ((res = posix_fallocate(fd, 0, bytes)) != 0) {
-       errno = res;
-       res = -1 * res;
-       qb_util_perror(LOG_ERR, "couldn't allocate file %s", path);
-       goto unlink_exit;
-   }
+   /* posix_fallocate(3) can be interrupted by a signal,
+      so retry few times before giving up */
+   do {
+       fallocate_retry--;
+       res = posix_fallocate(fd, 0, bytes);
+       if (res == EINTR) {
+           qb_util_log(LOG_DEBUG, "got EINTR trying to allocate file %s, retrying...", path);
+           continue;
+       } else if (res != 0) {
+           errno = res;
+           res = -1 * res;
+           qb_util_perror(LOG_ERR, "couldn't allocate file %s", path);
+           goto unlink_exit;
+       }
+       break;
+   } while (fallocate_retry > 0);
 #else
    if (file_flags & O_CREAT) {
        long page_size = sysconf(_SC_PAGESIZE);
chrissie-c commented 2 years ago

I had a careful think about this and at first I was against the idea, but as this function is called several times for each connection and there is a similar loop around write below it, I think it's quite reasonable.

Is it possible to send me a PR with the patch please?