ClusterLabs / libqb

libqb is a library providing high performance logging, tracing, ipc, and poll.
http://clusterlabs.github.io/libqb/
GNU Lesser General Public License v2.1

Retry if posix_fallocate is interrupted with EINTR #453

Closed. shastah closed this 2 years ago

shastah commented 2 years ago

Every now and then Pacemaker reports errors:

  (pcmk__new_client)        debug: New IPC client 3efdbecf-c2d9-44bc-b4a6-9bcd48021ba1 for PID 27492 with uid 0 and gid 0
  (handle_new_connection)   debug: IPC credentials authenticated (/dev/shm/qb-7271-27492-12-hfPbKY/qb)
  (qb_ipcs_shm_connect)     debug: connecting to client [27492]
  (qb_rb_open_2)    debug: shm size:524301; real_size:528384; rb->word_size:132096
  (qb_rb_open_2)    debug: shm size:524301; real_size:528384; rb->word_size:132096
  (qb_sys_mmap_file_open)   error: couldn't allocate file /dev/shm/qb-7271-27492-12-hfPbKY/qb-event-cib_rw-data: Interrupted system call (4)
  (qb_rb_open_2)    error: couldn't create file for mmap
  (qb_ipcs_shm_rb_open)     error: qb_rb_open:/dev/shm/qb-7271-27492-12-hfPbKY/qb-event-cib_rw: Interrupted system call (4)
  (qb_rb_close_helper)      debug: Free'ing ringbuffer: /dev/shm/qb-7271-27492-12-hfPbKY/qb-response-cib_rw-header
  (qb_rb_close_helper)      debug: Free'ing ringbuffer: /dev/shm/qb-7271-27492-12-hfPbKY/qb-request-cib_rw-header
  (qb_ipcs_shm_connect)     error: shm connection FAILED: Interrupted system call (4)
  (handle_new_connection)   error: Error in connection setup (/dev/shm/qb-7271-27492-12-hfPbKY/qb): Interrupted system call (4)

While this could probably be addressed in Pacemaker code, a simple retry loop for the case where posix_fallocate(3) returns EINTR seems to be a decent workaround.
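
For illustration, the retry idea can be sketched as below. This is a minimal sketch, not the actual patch: the names fallocate_retry, fallocate_fn, FALLOCATE_MAX_RETRIES and both stubs are hypothetical, and the cap of 5 attempts is an arbitrary assumption. Note that posix_fallocate(3) returns the error number directly rather than setting errno, so the loop tests the return value.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <errno.h>
#include <fcntl.h>      /* posix_fallocate() */
#include <sys/types.h>  /* off_t */

/* Hypothetical cap on attempts; not taken from libqb. */
#define FALLOCATE_MAX_RETRIES 5

/* Same signature as posix_fallocate(), so a stub can stand in for it. */
typedef int (*fallocate_fn)(int fd, off_t offset, off_t len);

/*
 * Call fn() and repeat while it reports EINTR, up to
 * FALLOCATE_MAX_RETRIES attempts in total. Returns 0 on success or the
 * last error number, mirroring posix_fallocate().
 */
static int fallocate_retry(fallocate_fn fn, int fd, off_t offset, off_t len)
{
    int rc;
    int attempts = 0;

    do {
        rc = fn(fd, offset, len);
    } while (rc == EINTR && ++attempts < FALLOCATE_MAX_RETRIES);

    return rc;
}

/* Illustration-only stubs that simulate an interrupted call. */
static int stub_calls;

/* Fails with EINTR on the first two calls, then succeeds. */
static int stub_eintr_twice(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    return (++stub_calls <= 2) ? EINTR : 0;
}

/* Always fails with EINTR, to exercise the retry cap. */
static int stub_eintr_always(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    ++stub_calls;
    return EINTR;
}
```

In real use the call would be fallocate_retry(posix_fallocate, fd, 0, size); the stubs exist only so the loop can be exercised without having to deliver signals.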

Fixes: #451

Signed-off-by: Jakub Jankowski <shasta@toxcorp.com>

knet-ci-bot commented 2 years ago

Can one of the admins verify this patch?

gao-yan commented 2 years ago

Quite a few users are encountering the same issue, where requests to the pacemaker-based daemon occasionally fail on IPC.

Retrying in this situation seems sensible.

One noticeable difference from the !HAVE_POSIX_FALLOCATE case below: if the situation keeps occurring, that code keeps retrying write() indefinitely rather than only up to a limited number of times. So the question is whether it makes sense to do the same with posix_fallocate().
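
For comparison, an unbounded EINTR retry around write() typically has the shape below. This is a generic sketch, not the libqb fallback code; the helper name write_all_retry is hypothetical. Unlike posix_fallocate(), write() reports EINTR through errno.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */
#include <unistd.h>     /* write() */

/*
 * Write all len bytes of buf to fd. An interrupted write() is retried
 * with no upper limit; any other error ends the loop.
 * Returns 0 on success or a negative errno value on failure.
 */
static int write_all_retry(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);

        if (n < 0) {
            if (errno == EINTR) {
                continue;   /* interrupted by a signal: try again */
            }
            return -errno;  /* real failure: give up */
        }
        done += (size_t)n;  /* short write: continue with the rest */
    }
    return 0;
}
```

The trade-off is the one raised above: an unbounded loop always gets through if the interruptions are transient, while a capped loop guarantees the caller gets control back even if signals keep arriving.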

chrissie-c commented 2 years ago

test this please