cvmfs / cvmfs

The CernVM File System
http://cernvm.cern.ch/portal/filesystem
BSD 3-Clause "New" or "Revised" License
291 stars 131 forks source link

Race in symlink cache invalidation #3626

Open mharvey-jt opened 3 months ago

mharvey-jt commented 3 months ago

there's a race between cvmfs sending an expiry for a symlink and it beginning to return the new value. This means that a readlink() can end up retrieving the old value which then sticks around indefinitely in the kernel cache. The window seems to be a few hundred ms. Reproducer:

compile and run:

#include <stdio.h>
#include <unistd.h>
#include <string.h>
void main(void){

  char buf[1024];
  char ref[1024];
  ref[0]=0;
 while(1) {
  memset(buf,0,1024); 
  ssize_t l = readlink("/cvmfs/repo/LINK", buf, 1024);
  if(l !=strlen(buf) ) {
    printf("Length mismatch %d != %d\n", l, strlen(buf));
  }
  if(strcmp(buf, ref) || strlen(buf)!=strlen(ref)) {
   printf("%s\n", buf);
   strcpy(ref,buf);
  } 
  sleep(1); // REMOVE ME

 }
}

then mutate the symlink /cvmfs/repo/LINK. you should see invalidation working as expected. Now remove the sleep(1) and try again - you should see the reported link value not change as expected. This can be further verified by instrumenting cvmfs_readlink() to log what is being returned.

HereThereBeDragons commented 3 months ago

I can confirm there is odd behavior. I extended matts script to include timestamp, revision and a loop counter. revision is read before readlink and printed if it is different to the previous one. (if wanted can share the script)

what you see below is the output when the sleep is removed and symlink caching is active. (in all other premutations of those 2 options it works correctly)

server was publishing a new revision (changing the symlink) every 13 sec, the client reloads with minimum TTL of 1min.

the observed behavior is different to the first issue #2949 when symlink caching is enabled but expire did not exist.

for #2949 new symlinks paths would be cut short if the old symlink paths were shorter. this new issue here observes the opposite: new symlinks have the correct new symlink path length but show the old symlink path.

investigations are continued.. might be something due to the inode caching...

Fri Jul  5 13:51:44 2024
 0  revision 688
Fri Jul  5 13:51:44 2024
 0  /cvmfs/my2nd.test.repo/symlinktest/fileLLLLLLLLLLLLLLLLLLLL.txt

Fri Jul  5 13:52:30 2024
 2752452    revision 692
Fri Jul  5 13:52:30 2024
 2752452    /cvmfs/my2nd.test.repo/symlinktest/fileLL

Fri Jul  5 13:53:30 2024
 6416399    /cvmfs/my2nd.test.repo/symlinktest/so.txt
Fri Jul  5 13:53:30 2024
 6416400    revision 697

Fri Jul  5 13:54:30 2024
 10043739   /cvmfs/my2nd.test.repo/symlinktest/myfile.txt
Fri Jul  5 13:54:30 2024
 10043740   revision 701
Fri Jul  5 13:54:30 2024
 10043740   /cvmfs/my2nd.test.repo/symlinktest/myfile.t

Fri Jul  5 13:55:30 2024
 13674448   /cvmfs/my2nd.test.repo/symlinktest/file.txt
Fri Jul  5 13:55:30 2024
 13674449   revision 706

Fri Jul  5 13:56:30 2024
 17306478   /cvmfs/my2nd.test.repo/symlinktest/fileLLLLLLLLLLLLLLLLLLLL.txt
Fri Jul  5 13:56:30 2024
 17306479   revision 710
Fri Jul  5 13:56:30 2024
 17306479   /cvmfs/my2nd.test.repo/symlinktest/fileLL
<end of program that changes the symlink>

<manual reload>
Fri Jul  5 14:00:49 2024
 32921132   /cvmfs/my2nd.test.repo/symlinktest/so.txt
HereThereBeDragons commented 3 months ago

Some further investigation reveal the problem: When loading a new catalog revision the following steps happen:

  1. There is a new revision
  2. Turn of kernel caching (dentry timeout = 0)
  3. Invalidate kernel caches (dentry + inodes) << readlink here will create problems
  4. Apply new revision
  5. Turn caching back on

The problem is when a readlink request comes between 3. and 4., then the inode kernel cache will be filled with the old inode. dentry will not be stored due to timeout = 0. As such the result is that the symlink can get in a broken state where the symlink points to the old filename but is truncated to the new symlink length.

There are possible approaches, but need a bit thinking:

HereThereBeDragons commented 2 months ago

so i did some more research.. and sent a mail to fuse-devel. thread to follow: https://sourceforge.net/p/fuse/mailman/message/58806569/

i will just copy the important part of the mail here and refer from the email to here as this here allows for nicer formatting, and all but the introduction to cvmfs is also relevant here.

TL;DR Symlink corruption (new value, old length) can occur if the target of a symlink changes; symlink inode stays the same; expire is used for dentries and inval for inodes. It is most likely a kernel issue as userland returns the correct new symlink.

Why it is able to exit early is not clear to me.


Before continuing to describe the problem, here a short overview of our file system:

CernVM-FS [2] is a snapshot-based read-only, distributed file system. The file system does not have genuine inodes but "cvmfs inodes" that are artificially issued by the fuse module to allow keep track of and serve correctly open files during different generations of snapshots (an open file from snapshot A should be still working on the snapshot A data, while a new request should work on the current snapshot D).

When a new snapshot becomes available the following steps are performed:

  1. Stop Kernel caching (new dentries timeout = 0)
  2. Asynchronously evict kernel caches (inval inodes, expire dentries - we have trackers for it to know what the kernel has in its caches)
  3. Apply new snapshot
    • this is a critical section and will stop any execution of critical sections within all posix functions (open, readlink, getatt, lookup, read, ...)
    • mutex to stop and drain readlink requests (wait for drain, and have an extra stop within readlink before the critical section)
    • evict inodes again
    • apply new snapshot
    • restart readlink requests
    • end critical section
  4. Turn back on kernel caching

During all those steps, requests from the user can and will come in. As such, we do not want to fail lookup/open/readlink requests if they are valid requests on the new snapshot (even if the inode is "old"). For open/read etc we have mechanics to figure out if to work on stale cvmfs inodes or replace them with new ones.


While the event itself is randomly happening, we have a reproducer for our file system that can reliably reproduce it within a couple of minutes. It performs as quickly as possible a readlink on my-symlink while new snapshots are applied that change the target my-symlink points to. The targets have different length and point to valid files.

The mutex around the readlink decreases the likelyhood of returning a corrupted symlink but does not fully solve it. Adding sleeps did not change anything.

Note:

Debugging results

1) This problem only occurs if readlink-calls are performed in a very tight loop during application of a new snapshot. (e.g. when adding a sleep(1) inside the loop i do not seem to be able to trigger this issue) 2) On the userland everything seems to work correctly:

For the logs:

Good log

   08/15/24 11:35:34.959253 fuse_reverse_inval_inode:    func cvmfs2             ino 270 offset 0 len 0
   08/15/24 11:35:34.959258 fuse_invalidate_attr:        func cvmfs2             inode-inode 270 inode-size 22

<< stop caching here>>
   08/15/24 11:35:35.984689 fuse_dentry_settime:         func cvmfs2             d_time 0 ino 270 new_time 0
   08/15/24 11:35:35.984784 fuse_reverse_inval_inode:    func cvmfs2             ino 270 offset 0 len 0
   08/15/24 11:35:35.984791 fuse_invalidate_attr:        func cvmfs2             inode-inode 270 inode-size 22
   08/15/24 11:35:35.984855 fuse_reverse_inval_inode:    func cvmfs2             ino 270 offset 0 len 0
   08/15/24 11:35:35.984859 fuse_invalidate_attr:        func cvmfs2             inode-inode 270 inode-size 22
   08/15/24 11:35:36.086263 fuse_dentry_revalidate:      func corrupt_symlink    d_time 0 ino 270 size 22 flags 16448
   08/15/24 11:35:36.086284 fuse_dentry_revalidate:      func corrupt_symlink    d_time 0 ino 270 size 22 flags 16384
   08/15/24 11:35:36.086566 fuse_invalid_attr:           func corrupt_symlink    ino 270 size 16 blksize 4096

   08/15/24 11:35:36.086579 fuse_change_attributes:  func corrupt_symlink \
                                                      inode: ino 270, size 22, version.counter 0, state 0 \
                                                      fuse_attr: ino 270, size 16, blksize 4096, \
                                                      fuse_statx: ino 0, size 0, blksize 0, \
                                                      fuse_inode: ino 0, orig_ino 0, attr_version 0 state 0 \
                                                      attr_valid 98, attr_version 0

   08/15/24 11:35:36.086584 fuse_change_attributes_common: func corrupt_symlink 
                                                            inode: ino 270, i_size 22, \
                                                            fuse_attr: ino 270, size 16, blksize 4096, \
                                                            fuse_statx: ino 0, size 0, blksize 0, \
                                                            attr_valid 0, cache_mask 0

<< restart caching here >>
   08/15/24 11:35:36.086588 fuse_dentry_settime:      func corrupt_symlink  d_time 0 ino 270 new_time 68641802
   08/15/24 11:35:36.086598 fuse_get_link:            func corrupt_symlink  dentry-inode 270 dentry-size 16 inode-inode 270 inode-size 16
   08/15/24 11:35:36.086612 fuse_readlink_page:       func corrupt_symlink  ino 0 inode 270 size 16

Current time: 08/15/24 09:35:36.086833735 UTC
SYMLINK CHANGED:    1402844 size 16 symlinktest/1a1b
<< correct value and size >>

   08/15/24 11:35:36.087078 fuse_dentry_revalidate:   func corrupt_symlink  d_time 0 ino 270 size 16 flags 16448
   08/15/24 11:35:36.087085 fuse_get_link:            func corrupt_symlink  dentry-inode 270 dentry-size 16 inode-inode 270 inode-size 16

Bad log - old length, new symlink too long

   08/15/24 11:35:11.883895 fuse_get_link:            func corrupt_symlink    dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
   08/15/24 11:35:11.884372 fuse_dentry_revalidate:   func corrupt_symlink    d_time 0 ino 270 size 22 flags 16384

<< stop caching here>>
   08/15/24 11:35:11.884238 fuse_dentry_settime:      func cvmfs2             d_time 0 ino 270 new_time 0
   08/15/24 11:35:11.884334 fuse_reverse_inval_inode: func cvmfs2             ino 270 offset 0 len 0
   08/15/24 11:35:11.884338 fuse_invalidate_attr:     func cvmfs2             inode-inode 270 inode-size 22
   08/15/24 11:35:11.880132 fuse_dentry_revalidate:   func corrupt_symlink    d_time 0 ino 270 size 22 flags 16448
   08/15/24 11:35:11.880142 fuse_get_link:            func corrupt_symlink    dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
   08/15/24 11:35:11.884041 fuse_dentry_revalidate:   func corrupt_symlink    d_time 0 ino 270 size 22 flags 16448
   08/15/24 11:35:11.884052 fuse_get_link:            func corrupt_symlink    dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
   08/15/24 11:35:11.884169 fuse_dentry_revalidate:   func corrupt_symlink    d_time 0 ino 270 size 22 flags 16448
   08/15/24 11:35:11.884174 fuse_get_link:            func corrupt_symlink    dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
   08/15/24 11:35:12.904539 fuse_reverse_inval_inode: func cvmfs2             ino 270 offset 0 len 0
   08/15/24 11:35:12.904554 fuse_invalidate_attr:     func cvmfs2             inode-inode 270 inode-size 22
   08/15/24 11:35:12.904592 fuse_reverse_inval_inode: func cvmfs2             ino 270 offset 0 len 0
   08/15/24 11:35:12.904595 fuse_invalidate_attr:     func cvmfs2             inode-inode 270 inode-size 22
   08/15/24 11:35:13.004875 fuse_invalid_attr:        func corrupt_symlink    ino 270 size 36 blksize 4096

   08/15/24 11:35:13.004891 fuse_change_attributes:  func corrupt_symlink \
                                                      inode: ino 270, size 22, version.counter 0, state 0 \
                                                      fuse_attr: ino 270, size 36, blksize 4096, \
                                                      fuse_statx: ino 0, size 0, blksize 0, \
                                                      fuse_inode: ino 0, orig_ino 0, attr_version 0, state 0 \
                                                      attr_valid 74, attr_version 0

<<<< fuse_change_attributes_common is missing >>>>

<< restart caching here >>
   08/15/24 11:35:13.004897 fuse_dentry_settime:      func corrupt_symlink    d_time 0 ino 270 new_time 68636032
   08/15/24 11:35:13.004907 fuse_get_link:            func corrupt_symlink    dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
   08/15/24 11:35:13.004920 fuse_readlink_page:       func corrupt_symlink    ino 0 inode 270 size 22

Current time: 08/15/24 11:35:13.005221851 UTC
SYMLINK CHANGED:    936856  size 22 symlinktest/10bbbbbbbb
<< wrong value: symlink should point to symlinktest/10bbbbbbbbbb10cccccccccc with length 36 >>

   08/15/24 11:35:13.007225 fuse_dentry_revalidate:   func corrupt_symlink     d_time 0 ino 270 size 22 flags 16448
   08/15/24 11:35:13.007244 fuse_get_link:            func corrupt_symlink     dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
   08/15/24 11:35:13.007363 fuse_dentry_revalidate:   func corrupt_symlink     d_time 0 ino 270 size 22 flags 16448

Bad log - old length, new symlink too short (user will not see that mismatch due to null terminator)

   08/15/24 11:35:00.338840 fuse_get_link:            func corrupt_symlink    dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
   08/15/24 11:35:00.338888 fuse_dentry_revalidate:   func corrupt_symlink    d_time 0 ino 270 size 22 flags 16448
   08/15/24 11:35:00.338891 fuse_get_link:            func corrupt_symlink    dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22

<< stop caching here>>
   08/15/24 11:35:00.339018 fuse_dentry_settime:      func cvmfs2             d_time 0 ino 270 new_time 0
   08/15/24 11:35:00.339076 fuse_reverse_inval_inode: func cvmfs2             ino 270 offset 0 len 0
   08/15/24 11:35:00.339078 fuse_invalidate_attr:     func cvmfs2             inode-inode 270 inode-size 22
   08/15/24 11:35:00.339110 fuse_dentry_revalidate:   func corrupt_symlink    d_time 0 ino 270 size 22 flags 16384
   08/15/24 11:35:01.364445 fuse_reverse_inval_inode: func cvmfs2             ino 270 offset 0 len 0
   08/15/24 11:35:01.364457 fuse_invalidate_attr:     func cvmfs2             inode-inode 270 inode-size 22
   08/15/24 11:35:01.364484 fuse_reverse_inval_inode: func cvmfs2             ino 270 offset 0 len 0
   08/15/24 11:35:01.364488 fuse_invalidate_attr:     func cvmfs2             inode-inode 270 inode-size 22
   08/15/24 11:35:01.464772 fuse_invalid_attr:        func corrupt_symlink    ino 270 size 14 blksize 4096

   08/15/24 11:35:01.464796 fuse_change_attributes:  func corrupt_symlink \
                                                      inode: ino 270, size 22, version.counter 0, state 0, \
                                                      fuse_attr: ino 270, size 14, blksize 4096, \
                                                      fuse_statx: ino 0, size 0, blksize 0, \
                                                      fuse_inode: ino 0, orig_ino 0, attr_version 0, state 0 \
                                                      attr_valid 65, attr_version 0

<<<< fuse_change_attributes_common is missing >>>>

<< restart caching here >>
   08/15/24 11:35:01.464803 fuse_dentry_settime:      func corrupt_symlink    d_time 0 ino 270 new_time 68633147
   08/15/24 11:35:01.464815 fuse_get_link:            func corrupt_symlink    dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
   08/15/24 11:35:01.464836 fuse_readlink_page:       func corrupt_symlink    ino 0 inode 270 size 22

Current time: 08/15/24 11:35:01.465255963 UTC
SYMLINK CHANGED:    703167  size 14 symlinktest/fg
<< correct value but kernel works with too large size 22 instead of 14 >>

   08/15/24 11:35:01.466162 fuse_dentry_revalidate:   func corrupt_symlink    d_time 0 ino 270 size 22 flags 16448
   08/15/24 11:35:01.466174 fuse_get_link:            func corrupt_symlink    dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
   08/15/24 11:35:01.466447 fuse_dentry_revalidate:   func corrupt_symlink    d_time 0 ino 270 size 22 flags 16448

My understanding from the logs

Looking at the good and bad logs, the only difference is fuse_change_attributes_common() not being called from fuse_change_attributes(). This is only possible if the following branch is being taken:

if ((attr_version != 0 && fi->attr_version > attr_version) ||
        test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
        spin_unlock(&fi->lock);
        return;
    }

However, looking at bpftrace (see code being used below [B]), I do not understand how this branch can be taken as in the logs it is stated: attr_version == 0 and &fi->state == 0 with struct fuse_inode *fi = get_fuse_inode(inode);

I also have looked where set_bit(FUSE_I_SIZE_UNSTABLE, is called and attached to the kprobes of it. But none of them got triggered.

The complete logs and bpftrace script can be found here: https://cernbox.cern.ch/s/EBVedOKe6ng4O0H

[0] https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/0158a70b-3d78-ce18-f38f-b8563d9efcdd@cern.ch/ [1] https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/fbb2f999-6b37-5875-deca-989a75ce3b5b@spawn.link/ [2] https://github.com/cvmfs/cvmfs

[A]

/*
 * Check whether the dentry is still valid
 *
 * If the entry validity timeout has expired and the dentry is
 * positive, try to redo the lookup.  If the lookup results in a
 * different inode, then let the VFS invalidate the dentry and redo
 * the lookup once more.  If the lookup results in the same inode,
 * then refresh the attributes, timeouts and mark the dentry valid.
 */
static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)

[B] (bpftrace version v0.14.0, ran on platform 6.5.0-26-generic #26~22.04.1-Ubuntu)

kprobe:fuse_change_attributes
/ ((struct inode *) arg0)->i_ino == 270 / {
  $x = nsecs;
  printf("%s fuse_change_attributes:\tfunc %s inode_ino %ld inode_size %ld, inode_version.counter %ld, inode_state %lu fuse_attr ino %ld, fuse_attr size, fuse_attr %ld blksize %ld, fuse_statx ino %ld, fuse_statx size %ld fuse_statx blksize %ld, fuse_inode ino %ld,  fuse_inode orig_ino %lu, fuse_inode attr_version %lu fuse_inode state %ld -- attr_valid %lu, attr_version %lu\n",
          strftime("%D %T.%f", $x), comm,
          ((struct inode *) arg0)->i_ino,
          ((struct inode *) arg0)->i_size,
          ((struct inode *) arg0)->i_version.counter,
          ((struct inode *) arg0)->i_state,
          ((struct fuse_attr*)arg1)->ino,
          ((struct fuse_attr*)arg1)->size,
          ((struct fuse_attr*)arg1)->blksize,
          ((struct fuse_statx*)arg2)->ino,
          ((struct fuse_statx*)arg2)->size,
          ((struct fuse_statx*)arg2)->blksize,
          ((struct fuse_inode *)((struct inode *) arg0)->i_private)->nodeid,
          ((struct fuse_inode *)((struct inode *) arg0)->i_private)->orig_ino,
          ((struct fuse_inode *)((struct inode *) arg0)->i_private)->attr_version,
          ((struct fuse_inode *)((struct inode *) arg0)->i_private)->state,
          arg3, arg4
        );
}