Excessive-seeming disk writes when idle

dstu commented 11 months ago

Describe the bug

When Nheko is sitting idle and connected to a low-traffic server (unfederated, no one's saying anything or sending reactions or other events I can discern in any rooms I'm in), I see I/O activity that I do not expect.

My hard drive access light won't stop blinking
My network monitor shows steady usage of about 15 k/s downstream and 5 k/s upstream.
iotop reports writes in the range of 3-5 MB/s.

To Reproduce

Launch Nheko.
Connect to server.
Launch iotop, strace, use a disk monitoring applet, ...
Look for ongoing I/O.

What happened?

If I attach strace, I see that Nheko is in some sort of tight I/O. (See log output field for a snippet.) Operations I can readily discern which keep happening:

Opening and reading all of $HOME/.cache/nheko/nheko/curl_alt_svc_cache.txt. This file consists of comments and appears to be a stub.
Writing a few kilobytes to $HOME/.local/share/nheko/nheko/407374753a777a72647a2e636f6f6c/data.mdb.
Polling/notifying via eventfd.

Does (1) need to happen? Maybe it's the result of using libcurl to monitor for network events. Seems weird, but I'm not an expert on libcurl or Nheko internals.

(2) seems weird. The contents of data.mdb keep changing (verified by looking at checksums), but its size does not. Is there some metadata that gets updated on every iteration of a polling loop? And why is the same file open via two different file descriptors? I also see $HOME/.local/share/nheko/nheko/407374753a777a72647a2e636f6f6c/lock.mdb, which suggests that it should not be.

(3) doesn't seem unusual to me.

Expected behavior

No excessive writes or network activity when Nheko is idle.

Screenshots

No response

Version

0.11.3-2+b4

Operating system

Linux

Installation method

Some repository (AUR, homebrew, distribution repository, PPA, etc)

Qt version

No response

C++ compiler

No response

Desktop Environment

X11 (just Window Maker, no fancy DE)

Did you use profiles?

[ ] Profiles used?

Relevant log output

Output from strace:

openat(AT_FDCWD, "/home/stu/.cache/nheko/nheko/curl_alt_svc_cache.txt", O_RDONLY) = 8
newfstatat(8, "", {st_mode=S_IFREG|0640, st_size=117, ...}, AT_EMPTY_PATH) = 0
read(8, "# Your alt-svc cache. https://cu"..., 4096) = 117
read(8, "", 4096)                       = 0
close(8)                                = 0
write(58, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x5602acae1dc0, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=35, events=POLLIN}, {fd=52, events=POLLIN}], 4, 1450) = 1 ([{fd=6, revents=POLLIN}])
read(6, "\1\0\0\0\0\0\0\0", 8)          = 8
pwrite64(62, "\370\v\0\0\0\0\0\0\0\0\2\0X\0\214\1\214\17\210\6\240\5\214\1\350\2,\5\\\3\30\17"..., 4096, 12550144) = 4096
pwrite64(62, "\6\f\0\0\0\0\0\0\0\0\2\0\34\0z\f\332\17\226\fz\f\336\16\332\f\334\rd\5\210\6"..., 4096, 12607488) = 4096
pwrite64(62, "\33\f\0\0\0\0\0\0\0\0\2\0Z\0\206\5\302\16`\vt\10\306\f\344\7(\10\252\t\206\5"..., 4096, 12693504) = 4096
pwrite64(62, "5\f\0\0\0\0\0\0\0\0\1\0\30\0\310\17\370\17\350\17\330\17\310\17\332\f\334\rd\5\210\6"..., 4096, 12800000) = 4096
pwrite64(62, "9\f\0\0\0\0\0\0\0\0\2\0*\0\300\v\300\v\250\17P\17\370\16\240\16H\16\360\r\230\r"..., 4096, 12816384) = 4096
pwrite64(62, "\3570\0\0\0\0\0\0\0\0\2\0f\0\220\1\250\17P\17\370\16\240\16H\16\360\r\230\r@\r"..., 4096, 51310592) = 4096
pwrite64(62, "&2\0\0\0\0\0\0\0\0\1\0000\0\"\r\370\17\"\r&\17\312\17\310\16<\16Z\17\230\r"..., 4096, 52584448) = 4096
pwrite64(62, "12\0\0\0\0\0\0\0\0\2\0R\0\10\5\250\17P\17\0\17\250\16P\16\370\r\240\rH\r"..., 4096, 52629504) = 4096
fdatasync(62)                           = 0
pwrite64(63, "\0\0\0\0\10\0\0\0\0\20\0\0\10\0\2\0\1\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0"..., 120, 4128) = 120
write(6, "\1\0\0\0\0\0\0\0", 8)         = 8
write(6, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=35, events=POLLIN}, {fd=52, events=POLLIN}], 4, 0) = 1 ([{fd=6, revents=POLLIN}])
read(6, "\2\0\0\0\0\0\0\0", 8)          = 8
openat(AT_FDCWD, "/home/stu/.cache/nheko/nheko/curl_alt_svc_cache.txt", O_RDONLY) = 8

(...and it repeats...)

File descriptors according to /proc:

* 62 and 63: /home/stu/.local/share/nheko/nheko/407374753a777a72647a2e636f6f6c/data.mdb
* 6: anon_inode:[eventfd]
* 58: anon_inode:[eventfd]

Backtrace

No response

deepbluev7 commented 11 months ago

The curl_alt_svc_cache is written by curl, there is not really a way for us to disable that without disabling http/3 support. You can open an upstream curl issue though, I think?

The 2 file descriptors for lmdb are normal, one is for writing and one is for reading. The high disk load from that is because on every sync we need to persist the new sync token before we sync again. Otherwise some messages might become undecryptable. Now why you see so many syncs, I can't say, you need to check your sync content for that. You probably get a lot of presence events?

dstu commented 11 months ago

Thanks for taking a look.

The server I'm connected to is pretty sleepy (unfederated, about 10 or 20 people connected/active at any given time, frequent long stretches where no one is saying anything in any rooms). I suppose it's possible that someone could be toggling between online and offline. (Maybe they have a crummy Internet connection or flaky client.) But the rate at which I was seeing writes come in was a lot higher than what I would expect, even if someone were compulsively flipping through apps on a cell phone and causing their Matrix client to flip their online status rapidly.

To complicate things further, these writes aren't happening right now. If I notice that something seems to trigger them, I'll share what it is.

Is there an easy way to get debugging traces with built-in logging to explain sync events (perhaps with an environment variable or suitable command line parameter)? If so, I can make a habit of running Nheko under that configuration.

I can also get debugging symbols and take a look in gdb (although this is not a familiar codebase, so I may not find much). I don't have a lot of time for debugging during the week, unfortunately, but this might be reasonably straightforward.

deepbluev7 commented 11 months ago

No, when I added support for dumping the sync, it caused quite some CPU usage even when turned off, so I removed it again. I usually just patch my Nheko to write it out when I need it.

lionkmp commented 10 months ago

I have similar experiences, but I haven't done such extensive I/O research on it. When nheko is running, my HDD is constantly active (I can hear it, because it's a HDD), also the LED is blinking and the "conky" app shows disk activity. As soon as I close nheko, the system calms down. This is a Debian based distro, with no systemd. IceWM window manager.

Nheko Version: 0.11.3-2

I'm using it on normal use case, not such limited as reported above. I'm in various chat rooms, and I have various 1-1 contacts on it. Still, I'd be very happy if this could be improved e.g. by using a memory cache and less disk writes. For me the only solution is to start nheko when I really need the Matrix - otherwise the HDD activity is too annoying. So I look in for messages, answer them and quit.

Thanks in advance if any improvement could be done for this. Maybe majority of the users do not see it because of using SSD and there is no "noise"?

YokaiRick commented 10 months ago

Mh i also see this in iotop. i did ignore it thou until today. and luckily someone already opened an issue, thanks :3

dstu commented 9 months ago

A possible data point: I had someone tell me today that my presence indicator on our server was flickering between away and present. ("a moment ago it was just "raaaaaaaave"")

I was logged in with Nheko at the time, but the machine that was logged in was idle. I was also logged in on my phone (Element Android client) and in a browser tab on a laptop (Element Web client). The flickering presence indicator happened when the browser tab was the most recent client I had used. The browser tab was in the background. (I was on another workspace, so the browser was effectively minimized/not in focus.)

I make two tentative inferences from this:

Sometimes, someone's Matrix presence can flicker rapidly. This is consistent with previous speculation that there may be a rapidly toggling presence indicator in some Matrix environments, which could trigger this bug because there is disk activity every time there's a presence status change.
There could be some interaction between Nheko and other clients that triggers fast repeated presence status changes. For example: Nheko is away. Another client changes my account's status from "present" to "away." This status change is sent to Nheko, which wakes up and changes the account status from "away" to "present." But the client that thinks I should be away changes the status back to "away," which triggers Nheko again, and so on as the two clients continue fighting over my away status.

Does anyone else who is experiencing this issue habitually connect with another client at the same time and use it while Nheko is idle? Has anyone who shares a Matrix channel with you ever noticed your away status flickering rapidly?

lionkmp commented 9 months ago

@dstu I do not use multiple clients at same time. I'm using Matrix on PC only, mainly when booting Windows I use Element, when booting Linux I use NHeko. Not at same time, since it's one computer. I'm not aware that my "status" would flicker, but I don't know. Thanks for your research, I hope there will be some mitigation to this problem. I also checked "iotop" and noticed NHeko is at top.

Nheko-Reborn / nheko