TritonDataCenter / pg_prefaulter

Faults pages into PostgreSQL shared_buffers or filesystem caches in advance of WAL apply
Apache License 2.0

prefaulter's child process hangs onto xlog files after database is shut down #40

Closed davepacheco closed 7 years ago

davepacheco commented 7 years ago

We've found another case where the prefaulter implicitly hangs onto references to a database's filesystem after the database itself has shut down. This is a problem for Manatee because Manatee attempts to unmount and remount the filesystem as part of starting the database (itself a separate problem), but it can't unmount it while the prefaulter is holding these references. This is similar to the problem in #13, but a different way it can happen.

In this case, there was only one process holding open a file on the database's dataset:

[root@fa0e1075 (postgres) ~]$ fuser -c /manatee/pg
/manatee/pg:     8115o

It's a pg_xlogdump process forked by the prefaulter:

[root@HA99RHND2 (eu-central-1b) ~]# ptree 8115
892   zsched
  975   /sbin/init
    7127  /opt/pg_prefaulter/pg_prefaulter run --config=/opt/pg_prefaulter/pg_prefau
      8115  /opt/postgresql/current/bin/pg_xlogdump -f /manatee/pg/data/pg_xlog/0000

It's not holding a file via its working directory:

[root@fa0e1075 (postgres) ~]$ pwdx 8115
8115:   /root

But rather it seems to be one of its open fds:

[root@HA99RHND2 (eu-central-1b) ~]# pfiles 8115
8115:   /opt/postgresql/current/bin/pg_xlogdump -f /manatee/pg/data/pg_xlog/00
  Current rlimit: 65536 file descriptors
   0: S_IFCHR mode:0666 dev:561,8 ino:3001421048 uid:0 gid:3 rdev:38,2
      O_RDONLY|O_LARGEFILE
      /zones/fa0e1075-8c65-4e2c-a062-e23013e40247/root/dev/null
      offset:0
   1: S_IFIFO mode:0000 dev:558,0 ino:29756267 uid:0 gid:0 rdev:0,0
      O_RDWR
   2: S_IFIFO mode:0000 dev:558,0 ino:29756268 uid:0 gid:0 rdev:0,0
      O_RDWR
   3: S_IFREG mode:0600 dev:90,65658 ino:65565 uid:907 gid:0 size:16777216
      O_RDONLY|O_LARGEFILE
      offset:14827520

Unfortunately we don't have the path for fd 3, but it is almost certainly the WAL file that was given as an argument to the program:

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# pargs core.8115 
core 'core.8115' of 8115:   /opt/postgresql/current/bin/pg_xlogdump -f /manatee/pg/data/pg_xlog/00000001000
argv[0]: /opt/postgresql/current/bin/pg_xlogdump
argv[1]: -f
argv[2]: /manatee/pg/data/pg_xlog/000000010000118600000084

The inode number reported by stat on that file matches the one that pfiles reported:

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# stat /zones/fa0e1075-8c65-4e2c-a062-e23013e40247/root/manatee/pg/data/pg_xlog/000000010000118600000084
  File: `/zones/fa0e1075-8c65-4e2c-a062-e23013e40247/root/manatee/pg/data/pg_xlog/000000010000118600000084'
  Size: 16777216    Blocks: 14213      IO Block: 8192   regular file
Device: 169007ah/23658618d  Inode: 65565       Links: 1
Access: (0600/-rw-------)  Uid: (  907/ UNKNOWN)   Gid: (    0/    root)
Access: 2017-10-13 16:34:10.812840776 +0000
Modify: 2017-10-13 16:34:44.373763334 +0000
Change: 2017-10-13 16:34:44.373763334 +0000
 Birth: -
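
For reference, this inode comparison is easy to script. Here is a minimal Go sketch, using the path and inode number observed above (illustrative only, not part of the prefaulter):

package main

import (
	"fmt"
	"os"
	"syscall"
)

// Compare a file's inode against the inode number that pfiles(1) reported
// for the open fd. Path and inode are the ones from this ticket.
func main() {
	const path = "/manatee/pg/data/pg_xlog/000000010000118600000084"
	const pfilesIno = uint64(65565)

	fi, err := os.Stat(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	ino := fi.Sys().(*syscall.Stat_t).Ino
	fmt.Printf("stat inode %d, pfiles inode %d, match: %v\n",
		ino, pfilesIno, ino == pfilesIno)
}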

The "-f" flag appears to block when it reaches the end of valid WAL data. And it does look like we're at the end:

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# ls -l /zones/fa0e1075-8c65-4e2c-a062-e23013e40247/root/manatee/pg/data/pg_xlog/ | tail
-rw-------   1 907      root     16777216 Oct 13 16:33 00000001000011860000007D
-rw-------   1 907      root     16777216 Oct 13 16:33 00000001000011860000007E
-rw-------   1 907      root     16777216 Oct 13 16:33 00000001000011860000007F
-rw-------   1 907      root     16777216 Oct 13 16:33 000000010000118600000080
-rw-------   1 907      root     16777216 Oct 13 16:33 000000010000118600000081
-rw-------   1 907      root     16777216 Oct 13 16:33 000000010000118600000082
-rw-------   1 907      root     16777216 Oct 13 16:34 000000010000118600000083
-rw-------   1 907      root     16777216 Oct 13 16:34 000000010000118600000084
-rw-------   1 907      root     16777216 Oct 13 05:46 000000010000118600000085
drwx------   2 907      907        20002 Oct 13 16:35 archive_status

There's one more file, but it has a much older mtime -- it looks like that might be a recycled file. It's not valid at all:

[root@fa0e1075 (postgres) /manatee/pg/data/pg_xlog]$ pg_xlogdump 000000010000118600000085 
pg_xlogdump: FATAL:  could not find a valid record after 1186/85000000

and xxd shows that file "84" ends in a long run of zeros, which suggests everything past that point is invalid:

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# xxd -a 000000010000118600000084 | tail
...
0e22000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0fffff0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
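
The same tail-of-zeros check can be scripted. A small Go sketch, assuming the file and the 0xe22000 offset observed above (illustrative only):

package main

import (
	"fmt"
	"io"
	"os"
)

// Check whether a WAL segment contains only zero bytes from a given offset
// onward. File name and offset are the ones observed in this ticket.
func main() {
	const path = "000000010000118600000084"
	const offset = 0xe22000

	f, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	allZero := true
	buf := make([]byte, 8192)
	for {
		n, err := f.Read(buf)
		for _, b := range buf[:n] {
			if b != 0 {
				allZero = false
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	fmt.Printf("all zeros from 0x%x to EOF: %v\n", offset, allZero)
}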

Offset 0xe22000 in the xxd output is 14819328 in decimal, just 8K (14827520 - 14819328 = 8192 bytes) behind where the pg_xlogdump process's fd was in the pfiles output. And pg_xlogdump is in a loop sleeping from main():

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# pstack 8115
8115:   /opt/postgresql/current/bin/pg_xlogdump -f /manatee/pg/data/pg_xlog/00
 ffffbf7fff25532a pollsys  (ffffbf7fffdfd6a0, 0, ffffbf7fffdfd790, 0)
 ffffbf7fff1ea27b pselect (0, 0, 0, 0, ffffbf7fffdfd790, 0) + 26b
 ffffbf7fff1ea62a select (0, 0, 0, 0, ffffbf7fffdfd7f0) + 5a
 00000000004102cb pg_usleep () + 92
 000000000040853c main () + a5e
 000000000040603c _start () + 6c
[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# truss -p 8115
pollsys(0xFFFFBF7FFFDFD6A0, 0, 0xFFFFBF7FFFDFD790, 0x00000000) = 0
lseek(3, 0, SEEK_SET)               = 0
read(3, "93D007\001\0\0\0\0\0\084".., 8192) = 8192
lseek(3, 14811136, SEEK_SET)            = 14811136
read(3, "93D005\001\0\0\0\0\0E284".., 8192) = 8192
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 8192) = 8192

which is clearly what it does when it reaches the end of the WAL file:

    for (;;)
    {
            /* try to read the next record */
            record = XLogReadRecord(xlogreader_state, first_record, &errormsg);
            if (!record)
            {
                    if (!config.follow || private.endptr_reached)
                            break;
                    else
                    {
                            pg_usleep(1000000L);    /* 1 second */
                            continue;
                    }
            }

On timing, this process started at 16:34:11:

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# ps -opid,stime,args  -p 8115
  PID    STIME COMMAND
 8115 16:34:11 /opt/postgresql/current/bin/pg_xlogdump -f /manatee/pg/data/pg_xlog/00000001000

and was still running until we killed it, well after 16:58Z. The manatee-sitter log shows that we brought down PostgreSQL at 16:40 and started it again shortly after that.

I have saved the relevant artifacts from this incident. I will tar all of these up and upload them to "/thoth/stor/tickets/pg_prefaulter#40" in the JPC Manta.

davepacheco commented 7 years ago

I've also marked the core files for pick-up by thoth. They should be tagged with ticket "pg_prefaulter#40".

sean- commented 7 years ago

Relevant snippet of ptree(1):

        9218  ./pg_prefaulter run --config=pg_prefaulter-follower.toml --retry-db-init
          10065 /opt/local/bin/pg_xlogdump -f /var/pgsql/go/src/github.com/joyent/pg_prefaulter

Terminate the follower:

$ pargs 10065
pargs: Couldn't determine locale of target process.
pargs: Some strings may not be displayed properly.
10065:  /opt/local/bin/pg_xlogdump -f /var/pgsql/go/src/github.com/joyent/pg_prefaulter
argv[0]: /opt/local/bin/pg_xlogdump
argv[1]: -f
argv[2]: /var/pgsql/go/src/github.com/joyent/pg_prefaulter/.pgdata_follower/pg_xlog/000000010000000000000017

ptree(1) reports no child processes. Watching ptree(1) in a tight loop shows that pg_xlogdump(1) is terminated after the expected 1s delay.

The prefaulter spawns new pg_xlogdump(1) commands as expected and terminates them within 1s. Tests were performed under high write load (pgbench(1)) and while PostgreSQL was idle.
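
For context on how that termination can be enforced, a minimal Go sketch that ties a follow-mode pg_xlogdump to a cancellable context; this is illustrative only and not necessarily how pg_prefaulter implements it (paths are taken from the output above):

package main

import (
	"context"
	"os/exec"
	"time"
)

// Sketch: run pg_xlogdump -f under a context so that cancelling the context
// (for example, when the prefaulter shuts down) kills the child and releases
// its open descriptor on the WAL segment.
func main() {
	ctx, cancel := context.WithCancel(context.Background())

	cmd := exec.CommandContext(ctx,
		"/opt/local/bin/pg_xlogdump", "-f",
		"/var/pgsql/go/src/github.com/joyent/pg_prefaulter/.pgdata_follower/pg_xlog/000000010000000000000017")

	// Simulate shutdown: cancel after one second, matching the delay
	// observed in the test above.
	go func() {
		time.Sleep(1 * time.Second)
		cancel()
	}()

	// CommandContext kills the child process when ctx is cancelled; Run
	// returns once the child has exited and its fds are closed.
	_ = cmd.Run()
}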