chaos / diod

Distributed I/O Daemon - a 9P file server
GNU General Public License v2.0

disappearing 9p mount - troubleshooting ideas #42

Closed borkd closed 6 years ago

borkd commented 6 years ago

I recognize the kernel module is ultimately responsible for the filesystem mount, but can you share a suggested way to troubleshoot client issues, where diodmount/9p mountpoints started to disappear on some machines?

The amount of traffic processed does not seem to play a role here. On some machines things work fine for a random amount of time, then all further operations return Input/output error. On others, running the same kernel, the issue is never triggered.
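To pin down when a client flips into the Input/output error state, a small probe run periodically against the mountpoint can help. This is only a minimal sketch: the function name `probe_mount` and the status strings are made up for illustration, and it assumes the failure surfaces as `EIO` on ordinary directory operations, as described above.

```python
import errno
import os


def probe_mount(path):
    """Classify the state of a directory as 'ok', 'io_error', 'missing' or 'other:<name>'."""
    try:
        # Touch the directory; on a broken mount this is where EIO would be expected.
        os.listdir(path)
        os.statvfs(path)
    except OSError as exc:
        if exc.errno == errno.EIO:
            return "io_error"
        if exc.errno == errno.ENOENT:
            return "missing"
        # Any other failure is reported by its symbolic errno name.
        return "other:" + errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"
```

Logging the result with a timestamp from cron or a loop gives a rough time window to line up against kernel logs and packet captures.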

Background: Recently some of the clients running started losing mountpoints. I suspect factors external to diod/9p play a role, Meltdown and Spectre mitigation being one, but I'd like to figure out whether others are encountering this issue as well, and if not, the best way to easily reproduce it for kernel folks.

Server: diod server on RH 6.6, built from 1.0.24 release using ./configure --enable-rdmatrans --with-ncurses

Clients:

Timeline:

Thanks!

justcsdr commented 6 years ago

From what I have noted, this appeared after a kernel upgrade. They have changed something in the kernel module.

borkd commented 6 years ago

I have updated the original issue with more details. @justcsdr - which kernel version was broken for you?

borkd commented 6 years ago

As I expected, the issue was external to diod and to the 9p module that came with the kernel. Packet dumps and Wireshark came in handy.

garlick commented 6 years ago

If you have a moment, it might be nice to give a brief explanation here of what you found out, just in case other people encounter similar problems with symptoms initially pointing to diod/9P.

justcsdr commented 6 years ago

Sorry about the confusion (it was a few months ago). It was not the mount point that disappeared, but files. The names of the files were still visible in the directory listing, but the data was not accessible until a re-mount. From what I remember it started around version 4.14 of the Linux kernel. In the meantime I have switched off the diod server because I didn't need it any more, so I don't know if the problem is still present. There are still a few open bugs for the 9p kernel module from more than a year ago (two of them were opened by me, and there are patches to correct them there), but the maintainer is quiet.

https://bugzilla.kernel.org/buglist.cgi?bug_status=__open__&component=v9fs&list_id=961899&product=File%20System

borkd commented 6 years ago

@garlick - the problem was both trivial and obscure at the same time and would likely not happen in a typical deployment. My systems are stateless, and their hostnames and network configuration are auto-generated based on a certain set of parameters. Duplicate data in upstream infrastructure led to hash conflicts and thus temporary IP overlaps, BAM. Data validation in semi-autonomous decentralized systems is both essential and tricky.
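For illustration, the failure mode can be sketched roughly like this. The actual generation scheme is not described here, so `derive_ip`, the `10.0.0.0/24` network and the parameter tuples are all hypothetical; the point is only that duplicate input records hash to the same address, and that a cheap validation pass can flag it before the config is handed out.

```python
import hashlib
import ipaddress
from collections import defaultdict


def derive_ip(params, network="10.0.0.0/24"):
    """Hypothetical scheme: hash a tuple of parameters into a host address."""
    net = ipaddress.ip_network(network)
    digest = hashlib.sha256("|".join(params).encode()).digest()
    usable = net.num_addresses - 2  # skip the network and broadcast addresses
    offset = int.from_bytes(digest[:8], "big") % usable + 1
    return net[offset]


def find_ip_overlaps(records):
    """Return {ip: [names]} for every derived address claimed by more than one record."""
    by_ip = defaultdict(list)
    for name, params in records.items():
        by_ip[derive_ip(params)].append(name)
    return {ip: names for ip, names in by_ip.items() if len(names) > 1}
```

Two records carrying identical parameters, for example `{"a": ("rack1", "slot3"), "b": ("rack1", "slot3")}`, end up under the same address in the result, which is the kind of overlap that went unnoticed upstream.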

@justcsdr - Thanks for your response. FWIW - I've been using diod with filesystems with deep and wide directory trees with lots of files, and https://bugzilla.kernel.org/show_bug.cgi?id=195663 never manifested itself. All clients consume 9p mounted data natively rather than via Samba re-exports.