Open mharvey-jt opened 3 months ago
I can confirm there is odd behavior. I extended matts script to include timestamp, revision and a loop counter. revision is read before readlink and printed if it is different to the previous one. (if wanted can share the script)
what you see below is the output when the sleep is removed and symlink caching is active. (in all other premutations of those 2 options it works correctly)
server was publishing a new revision (changing the symlink) every 13 sec, the client reloads with minimum TTL of 1min.
the observed behavior is different to the first issue #2949 when symlink caching is enabled but expire did not exist.
for #2949 new symlinks paths would be cut short if the old symlink paths were shorter. this new issue here observes the opposite: new symlinks have the correct new symlink path length but show the old symlink path.
investigations are continued.. might be something due to the inode caching...
Fri Jul 5 13:51:44 2024
0 revision 688
Fri Jul 5 13:51:44 2024
0 /cvmfs/my2nd.test.repo/symlinktest/fileLLLLLLLLLLLLLLLLLLLL.txt
Fri Jul 5 13:52:30 2024
2752452 revision 692
Fri Jul 5 13:52:30 2024
2752452 /cvmfs/my2nd.test.repo/symlinktest/fileLL
Fri Jul 5 13:53:30 2024
6416399 /cvmfs/my2nd.test.repo/symlinktest/so.txt
Fri Jul 5 13:53:30 2024
6416400 revision 697
Fri Jul 5 13:54:30 2024
10043739 /cvmfs/my2nd.test.repo/symlinktest/myfile.txt
Fri Jul 5 13:54:30 2024
10043740 revision 701
Fri Jul 5 13:54:30 2024
10043740 /cvmfs/my2nd.test.repo/symlinktest/myfile.t
Fri Jul 5 13:55:30 2024
13674448 /cvmfs/my2nd.test.repo/symlinktest/file.txt
Fri Jul 5 13:55:30 2024
13674449 revision 706
Fri Jul 5 13:56:30 2024
17306478 /cvmfs/my2nd.test.repo/symlinktest/fileLLLLLLLLLLLLLLLLLLLL.txt
Fri Jul 5 13:56:30 2024
17306479 revision 710
Fri Jul 5 13:56:30 2024
17306479 /cvmfs/my2nd.test.repo/symlinktest/fileLL
<end of program that changes the symlink>
<manual reload>
Fri Jul 5 14:00:49 2024
32921132 /cvmfs/my2nd.test.repo/symlinktest/so.txt
Some further investigation reveal the problem: When loading a new catalog revision the following steps happen:
readlink
here will create problemsThe problem is when a readlink
request comes between 3. and 4., then the inode kernel cache will be filled with the old inode. dentry will not be stored due to timeout = 0. As such the result is that the symlink can get in a broken state where the symlink points to the old filename but is truncated to the new symlink length.
There are possible approaches, but need a bit thinking:
cvmfs_readlink
while the catalog update is in progress. this might have problems as the upload function waits for requests to be drainedso i did some more research.. and sent a mail to fuse-devel. thread to follow: https://sourceforge.net/p/fuse/mailman/message/58806569/
i will just copy the important part of the mail here and refer from the email to here as this here allows for nicer formatting, and all but the introduction to cvmfs is also relevant here.
TL;DR Symlink corruption (new value, old length) can occur if the target of a symlink changes; symlink inode stays the same; expire is used for dentries and inval for inodes. It is most likely a kernel issue as userland returns the correct new symlink.
fuse_change_attributes()
and fuse_change_attributes_common()
are called. fuse_change_attributes()
exits before fuse_change_attributes_common()
is called and as such the length stays the old one.Why it is able to exit early is not clear to me.
Before continuing to describe the problem, here a short overview of our file system:
CernVM-FS [2] is a snapshot-based read-only, distributed file system. The file system does not have genuine inodes but "cvmfs inodes" that are artificially issued by the fuse module to allow keep track of and serve correctly open files during different generations of snapshots (an open file from snapshot A should be still working on the snapshot A data, while a new request should work on the current snapshot D).
When a new snapshot becomes available the following steps are performed:
open, readlink, getatt, lookup, read,
...)During all those steps, requests from the user can and will come in. As such, we do not want to fail lookup/open/readlink
requests if they are valid requests on the new snapshot (even if the inode is "old"). For open/read etc we have mechanics to figure out if to work on stale cvmfs inodes or replace them with new ones.
While the event itself is randomly happening, we have a reproducer for our file system that can reliably reproduce it within a couple of minutes. It performs as quickly as possible a readlink on my-symlink
while new snapshots are applied that change the target my-symlink
points to. The targets have different length and point to valid files.
The mutex around the readlink decreases the likelyhood of returning a corrupted symlink but does not fully solve it. Adding sleeps did not change anything.
Note:
1) This problem only occurs if readlink-calls are performed in a very tight loop during application of a new snapshot. (e.g. when adding a sleep(1)
inside the loop i do not seem to be able to trigger this issue)
2) On the userland everything seems to work correctly:
cvmfs_readlink()
and the subsequent call to cvmfs_getattr()
return the correct new value and new length of the symlink. --> But this is ignored. (This can also be seen in fuse_change_attributes()
parameter fuse_attr
having the correct values)
3) It seems to be a kernel space issue:fuse_dentry_revalidate
(see the description below [A] )fuse_dentry_revalidate()
calls fuse_change_attributes()
fuse_change_attributes()
parmeters include inode
which has the current value, and fuse_attr
that contains the new values (with correct new length)fuse_change_attributes()
calls fuse_change_attributes_common()
which updates most of the attribute values of inode
to the values given in attr
. Afterwards fuse_change_attributes()
continues to update the length.
4) In corrupted symlink cases fuse_change_attributes()
exits before fuse_change_attributes_common()
is calledFor the logs:
<< some comment >>
SYMLINK CHANGED
output is from the user program corrupt_symlink
what the readlink()
returns 08/15/24 11:35:34.959253 fuse_reverse_inval_inode: func cvmfs2 ino 270 offset 0 len 0
08/15/24 11:35:34.959258 fuse_invalidate_attr: func cvmfs2 inode-inode 270 inode-size 22
<< stop caching here>>
08/15/24 11:35:35.984689 fuse_dentry_settime: func cvmfs2 d_time 0 ino 270 new_time 0
08/15/24 11:35:35.984784 fuse_reverse_inval_inode: func cvmfs2 ino 270 offset 0 len 0
08/15/24 11:35:35.984791 fuse_invalidate_attr: func cvmfs2 inode-inode 270 inode-size 22
08/15/24 11:35:35.984855 fuse_reverse_inval_inode: func cvmfs2 ino 270 offset 0 len 0
08/15/24 11:35:35.984859 fuse_invalidate_attr: func cvmfs2 inode-inode 270 inode-size 22
08/15/24 11:35:36.086263 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16448
08/15/24 11:35:36.086284 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16384
08/15/24 11:35:36.086566 fuse_invalid_attr: func corrupt_symlink ino 270 size 16 blksize 4096
08/15/24 11:35:36.086579 fuse_change_attributes: func corrupt_symlink \
inode: ino 270, size 22, version.counter 0, state 0 \
fuse_attr: ino 270, size 16, blksize 4096, \
fuse_statx: ino 0, size 0, blksize 0, \
fuse_inode: ino 0, orig_ino 0, attr_version 0 state 0 \
attr_valid 98, attr_version 0
08/15/24 11:35:36.086584 fuse_change_attributes_common: func corrupt_symlink
inode: ino 270, i_size 22, \
fuse_attr: ino 270, size 16, blksize 4096, \
fuse_statx: ino 0, size 0, blksize 0, \
attr_valid 0, cache_mask 0
<< restart caching here >>
08/15/24 11:35:36.086588 fuse_dentry_settime: func corrupt_symlink d_time 0 ino 270 new_time 68641802
08/15/24 11:35:36.086598 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 16 inode-inode 270 inode-size 16
08/15/24 11:35:36.086612 fuse_readlink_page: func corrupt_symlink ino 0 inode 270 size 16
Current time: 08/15/24 09:35:36.086833735 UTC
SYMLINK CHANGED: 1402844 size 16 symlinktest/1a1b
<< correct value and size >>
08/15/24 11:35:36.087078 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 16 flags 16448
08/15/24 11:35:36.087085 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 16 inode-inode 270 inode-size 16
08/15/24 11:35:11.883895 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
08/15/24 11:35:11.884372 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16384
<< stop caching here>>
08/15/24 11:35:11.884238 fuse_dentry_settime: func cvmfs2 d_time 0 ino 270 new_time 0
08/15/24 11:35:11.884334 fuse_reverse_inval_inode: func cvmfs2 ino 270 offset 0 len 0
08/15/24 11:35:11.884338 fuse_invalidate_attr: func cvmfs2 inode-inode 270 inode-size 22
08/15/24 11:35:11.880132 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16448
08/15/24 11:35:11.880142 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
08/15/24 11:35:11.884041 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16448
08/15/24 11:35:11.884052 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
08/15/24 11:35:11.884169 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16448
08/15/24 11:35:11.884174 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
08/15/24 11:35:12.904539 fuse_reverse_inval_inode: func cvmfs2 ino 270 offset 0 len 0
08/15/24 11:35:12.904554 fuse_invalidate_attr: func cvmfs2 inode-inode 270 inode-size 22
08/15/24 11:35:12.904592 fuse_reverse_inval_inode: func cvmfs2 ino 270 offset 0 len 0
08/15/24 11:35:12.904595 fuse_invalidate_attr: func cvmfs2 inode-inode 270 inode-size 22
08/15/24 11:35:13.004875 fuse_invalid_attr: func corrupt_symlink ino 270 size 36 blksize 4096
08/15/24 11:35:13.004891 fuse_change_attributes: func corrupt_symlink \
inode: ino 270, size 22, version.counter 0, state 0 \
fuse_attr: ino 270, size 36, blksize 4096, \
fuse_statx: ino 0, size 0, blksize 0, \
fuse_inode: ino 0, orig_ino 0, attr_version 0, state 0 \
attr_valid 74, attr_version 0
<<<< fuse_change_attributes_common is missing >>>>
<< restart caching here >>
08/15/24 11:35:13.004897 fuse_dentry_settime: func corrupt_symlink d_time 0 ino 270 new_time 68636032
08/15/24 11:35:13.004907 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
08/15/24 11:35:13.004920 fuse_readlink_page: func corrupt_symlink ino 0 inode 270 size 22
Current time: 08/15/24 11:35:13.005221851 UTC
SYMLINK CHANGED: 936856 size 22 symlinktest/10bbbbbbbb
<< wrong value: symlink should point to symlinktest/10bbbbbbbbbb10cccccccccc with length 36 >>
08/15/24 11:35:13.007225 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16448
08/15/24 11:35:13.007244 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
08/15/24 11:35:13.007363 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16448
08/15/24 11:35:00.338840 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
08/15/24 11:35:00.338888 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16448
08/15/24 11:35:00.338891 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
<< stop caching here>>
08/15/24 11:35:00.339018 fuse_dentry_settime: func cvmfs2 d_time 0 ino 270 new_time 0
08/15/24 11:35:00.339076 fuse_reverse_inval_inode: func cvmfs2 ino 270 offset 0 len 0
08/15/24 11:35:00.339078 fuse_invalidate_attr: func cvmfs2 inode-inode 270 inode-size 22
08/15/24 11:35:00.339110 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16384
08/15/24 11:35:01.364445 fuse_reverse_inval_inode: func cvmfs2 ino 270 offset 0 len 0
08/15/24 11:35:01.364457 fuse_invalidate_attr: func cvmfs2 inode-inode 270 inode-size 22
08/15/24 11:35:01.364484 fuse_reverse_inval_inode: func cvmfs2 ino 270 offset 0 len 0
08/15/24 11:35:01.364488 fuse_invalidate_attr: func cvmfs2 inode-inode 270 inode-size 22
08/15/24 11:35:01.464772 fuse_invalid_attr: func corrupt_symlink ino 270 size 14 blksize 4096
08/15/24 11:35:01.464796 fuse_change_attributes: func corrupt_symlink \
inode: ino 270, size 22, version.counter 0, state 0, \
fuse_attr: ino 270, size 14, blksize 4096, \
fuse_statx: ino 0, size 0, blksize 0, \
fuse_inode: ino 0, orig_ino 0, attr_version 0, state 0 \
attr_valid 65, attr_version 0
<<<< fuse_change_attributes_common is missing >>>>
<< restart caching here >>
08/15/24 11:35:01.464803 fuse_dentry_settime: func corrupt_symlink d_time 0 ino 270 new_time 68633147
08/15/24 11:35:01.464815 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
08/15/24 11:35:01.464836 fuse_readlink_page: func corrupt_symlink ino 0 inode 270 size 22
Current time: 08/15/24 11:35:01.465255963 UTC
SYMLINK CHANGED: 703167 size 14 symlinktest/fg
<< correct value but kernel works with too large size 22 instead of 14 >>
08/15/24 11:35:01.466162 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16448
08/15/24 11:35:01.466174 fuse_get_link: func corrupt_symlink dentry-inode 270 dentry-size 22 inode-inode 270 inode-size 22
08/15/24 11:35:01.466447 fuse_dentry_revalidate: func corrupt_symlink d_time 0 ino 270 size 22 flags 16448
Looking at the good and bad logs, the only difference is fuse_change_attributes_common()
not being called from fuse_change_attributes()
. This is only possible if the following branch is being taken:
if ((attr_version != 0 && fi->attr_version > attr_version) ||
test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
spin_unlock(&fi->lock);
return;
}
However, looking at bpftrace (see code being used below [B]), I do not understand how this branch can be taken as in the logs it is stated:
attr_version == 0
and &fi->state == 0
with struct fuse_inode *fi = get_fuse_inode(inode);
I also have looked where set_bit(FUSE_I_SIZE_UNSTABLE,
is called and attached to the kprobes of it. But none of them got triggered.
fuse_do_setattr
fuse_perform_write
fuse_file_fallocate
__fuse_copy_file_range
The complete logs and bpftrace script can be found here: https://cernbox.cern.ch/s/EBVedOKe6ng4O0H
[0] https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/0158a70b-3d78-ce18-f38f-b8563d9efcdd@cern.ch/ [1] https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/fbb2f999-6b37-5875-deca-989a75ce3b5b@spawn.link/ [2] https://github.com/cvmfs/cvmfs
[A]
/*
* Check whether the dentry is still valid
*
* If the entry validity timeout has expired and the dentry is
* positive, try to redo the lookup. If the lookup results in a
* different inode, then let the VFS invalidate the dentry and redo
* the lookup once more. If the lookup results in the same inode,
* then refresh the attributes, timeouts and mark the dentry valid.
*/
static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
[B] (bpftrace version v0.14.0, ran on platform 6.5.0-26-generic #26~22.04.1-Ubuntu
)
kprobe:fuse_change_attributes
/ ((struct inode *) arg0)->i_ino == 270 / {
$x = nsecs;
printf("%s fuse_change_attributes:\tfunc %s inode_ino %ld inode_size %ld, inode_version.counter %ld, inode_state %lu fuse_attr ino %ld, fuse_attr size, fuse_attr %ld blksize %ld, fuse_statx ino %ld, fuse_statx size %ld fuse_statx blksize %ld, fuse_inode ino %ld, fuse_inode orig_ino %lu, fuse_inode attr_version %lu fuse_inode state %ld -- attr_valid %lu, attr_version %lu\n",
strftime("%D %T.%f", $x), comm,
((struct inode *) arg0)->i_ino,
((struct inode *) arg0)->i_size,
((struct inode *) arg0)->i_version.counter,
((struct inode *) arg0)->i_state,
((struct fuse_attr*)arg1)->ino,
((struct fuse_attr*)arg1)->size,
((struct fuse_attr*)arg1)->blksize,
((struct fuse_statx*)arg2)->ino,
((struct fuse_statx*)arg2)->size,
((struct fuse_statx*)arg2)->blksize,
((struct fuse_inode *)((struct inode *) arg0)->i_private)->nodeid,
((struct fuse_inode *)((struct inode *) arg0)->i_private)->orig_ino,
((struct fuse_inode *)((struct inode *) arg0)->i_private)->attr_version,
((struct fuse_inode *)((struct inode *) arg0)->i_private)->state,
arg3, arg4
);
}
there's a race between cvmfs sending an expiry for a symlink and it beginning to return the new value. This means that a readlink() can end up retrieving the old value which then sticks around indefinitely in the kernel cache. The window seems to be a few hundred ms. Reproducer:
compile and run:
then mutate the symlink
/cvmfs/repo/LINK
. you should see invalidation working as expected. Now remove thesleep(1)
and try again - you should see the reported link value not change as expected. This can be further verified by instrumentingcvmfs_readlink()
to log what is being returned.