Open simonbcn opened 2 years ago
Some more information: I have hit this issue four times in the last month. When it happens, I have to long-press the power key to shut down, because even the shutdown freezes. ps -f also freezes. After killing qBittorrent with SIGKILL, ps -f works again, but qBittorrent is left defunct, a new qBittorrent instance cannot be opened, and shutdown still freezes.
Exactly the same problem. I had to press Alt + Print Screen (SysRq) + R E I S U B to reboot the system.
Could be related: #6937
It's possible.
I have read that issue and what is not clear to me is how to change the libtorrent values close_file_interval and disk_io_read_mode. In the Advanced section of qBittorrent preferences I cannot find these values.
How many torrents do you have and how old is the settings profile? Did you have similar issues with a fresh profile and only a very small amount of torrents? Just wondering if the issue I've encountered could be similar to this or distinct
Around 2431 torrents. I don't know how old my profile is, how do I see that? The configuration file is dated today but it is much older.
I'm not sure how to check its date if it's not on any of the files, but it sure sounds old enough given the amount of torrents. I lately restored my old profile that I cleaned up due to issues I had generally with qBittorrent in order to check some files of older torrents, and also noticed it was lagging heavily, especially on startup (but not instantly, it could be around a second after opening). Switching back to the new almost-fresh profile made it work as it should on the other hand. My amount of torrents was much smaller though, it could be around 100 or so, perhaps too many of them in queued/unfinished state? Worth to note I'm on Windows though, which is why I'm unsure if it could be the same thing or not. But try renaming the old profile (not sure if on Linux the config and the stuff like BT_backup is stored in the same dir, but if not, then all of them) and have just like 1-3 torrents in the new one with default settings and see if the freezes occur there as well, if they do then the issue would be distinct
I have gone back to using version 1.x of libtorrent, I don't really need the features of version 2.x. When libtorrent 2.x manages memory better, I will try it again.
I have similar issues too, so I've reverted to version 4.3.9, which runs in a Docker container, although I would prefer to run natively. My profile is really old, so old that it was already running on Ubuntu 18.04. On top of that, I am running 2500+ torrents, and if I use the newest version of qbittorrent(nox) it really chugs sometimes. This happens at a consistent interval and gets worse over time. Memory use does not get worse, but the virtual memory requested is insane and higher than anything my system could have: in the range of 1024 GB, which is more than even the boot disk holds. I've been looking at the settings file but, to my knowledge, it does not have any issues. If anyone wants the config I could send it, but not publicly.
I can also confirm that I hit it with libtorrent-rasterbar-2.0.7 and net-p2p/qbittorrent-4.4.5-r1. My qbittorrent profile has a handful of years for sure.
The problem is not the qbittorrent profile or its age, since the problem is not in qbittorrent. The problem is caused by libtorrent 2.x because of the way it manages memory and/or resources.
I wonder if it's really a libtorrent problem or if libtorrent is just a trigger for it. When it happens, the qbittorrent process goes defunct, 'ps aux' hangs indefinitely reading /proc/PID_OF_QBT/cmdline, and killing it by PID is the only way to recover; however, it then becomes a zombie and hangs around as qbittorrent <defunct>. That sounds awfully like we have some striking issue in the Linux kernel and libtorrent just manages to get there; otherwise it shouldn't ever require a full system reboot to recover from it.
There is an interesting discussion in this qBittorrent thread: https://github.com/qbittorrent/qBittorrent/issues/17545#issuecomment-1225777480
I am not entirely sure if it's related, but since rebuilding qbittorrent with 1.2 series libtorrent I no longer have this issue, and I no longer have null pointer popping out in my kernel log either. The issue could be related to triggering in-kernel fault after all
Oct 11 04:28:23 [kernel] [111115.499140] BUG: kernel NULL pointer dereference, address: 00000000000000d6
Oct 11 04:28:23 [kernel] [111115.499476] #PF: supervisor read access in kernel mode
Oct 11 04:28:23 [kernel] [111115.499744] #PF: error_code(0x0000) - not-present page
Oct 11 04:28:23 [kernel] [111115.499999] PGD 0 P4D 0
Oct 11 04:28:23 [kernel] [111115.500251] Oops: 0000 [#1] PREEMPT SMP NOPTI
Oct 11 04:28:23 [kernel] [111115.500505] CPU: 7 PID: 16002 Comm: Disk Tainted: G O 6.0.0 #1
Oct 11 04:28:23 [kernel] [111115.500767] Hardware name: To Be Filled By O.E.M. B550 Taichi/B550 Taichi, BIOS P2.20 09/29/2022
Oct 11 04:28:23 [kernel] [111115.501031] RIP: 0010:__filemap_get_folio+0x9c/0x440
Oct 11 04:28:23 [kernel] [111115.501300] Code: 10 e8 a8 a1 5e 00 48 89 c3 48 3d 02 04 00 00 74 e2 48 3d 06 04 00 00 74 da 48 85 c0 0f 84 d2 02 00 00 a8 01 0f 85 d4 02 00 00 <8b> 40 34 85 c0 74 c2 8d 50 01 f0 0f b1 53 34 75 f2 48 8b 54 24 28
Oct 11 04:28:23 [kernel] [111115.501595] RSP: 0000:ffff9699c751bc88 EFLAGS: 00010246
Oct 11 04:28:23 [kernel] [111115.501888] RAX: 00000000000000a2 RBX: 00000000000000a2 RCX: 0000000000000002
Oct 11 04:28:23 [kernel] [111115.502186] RDX: 000000000000002c RSI: ffff8b6ff30c9488 RDI: ffff9699c751bc98
Oct 11 04:28:23 [kernel] [111115.502482] RBP: 0000000000000000 R08: 000000000000f96f R09: 000000000000f970
Oct 11 04:28:23 [kernel] [111115.502780] R10: ffffffffffffffc0 R11: 0000000000000000 R12: 0000000000000000
Oct 11 04:28:23 [kernel] [111115.503080] R13: ffff8b76b840a2b0 R14: 000000000000f96f R15: ffff8b6f06fd6be0
Oct 11 04:28:23 [kernel] [111115.503380] FS: 00007fe6853ff6c0(0000) GS:ffff8b7dfe3c0000(0000) knlGS:0000000000000000
Oct 11 04:28:23 [kernel] [111115.503685] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 04:28:23 [kernel] [111115.503989] CR2: 00000000000000d6 CR3: 0000000104934000 CR4: 0000000000750ee0
Oct 11 04:28:23 [kernel] [111115.504295] PKRU: 55555554
Oct 11 04:28:23 [kernel] [111115.504600] Call Trace:
Oct 11 04:28:23 [kernel] [111115.504906] <TASK>
Oct 11 04:28:23 [kernel] [111115.505208] filemap_fault+0x5a/0x990
Oct 11 04:28:23 [kernel] [111115.505515] __do_fault+0x36/0x130
Oct 11 04:28:23 [kernel] [111115.505820] do_fault+0x1da/0x440
Oct 11 04:28:23 [kernel] [111115.506123] __handle_mm_fault+0x64d/0xe40
Oct 11 04:28:23 [kernel] [111115.506428] handle_mm_fault+0xae/0x290
Oct 11 04:28:23 [kernel] [111115.506732] do_user_addr_fault+0x1cd/0x690
Oct 11 04:28:23 [kernel] [111115.507037] exc_page_fault+0x70/0x170
Oct 11 04:28:23 [kernel] [111115.507343] asm_exc_page_fault+0x22/0x30
Oct 11 04:28:23 [kernel] [111115.507647] RIP: 0033:0x7fe78358b409
Oct 11 04:28:23 [kernel] [111115.507947] Code: 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 83 fa 20 72 27 <c5> fe 6f 06 48 83 fa 40 0f 87 a9 00 00 00 c5 fe 6f 4c 16 e0 c5 fe
Oct 11 04:28:23 [kernel] [111115.508275] RSP: 002b:00007fe6853fd428 EFLAGS: 00010202
Oct 11 04:28:23 [kernel] [111115.508603] RAX: 00007fe7440063a0 RBX: 0000000000000000 RCX: 00007fe6853fd5d0
Oct 11 04:28:23 [kernel] [111115.508931] RDX: 0000000000004000 RSI: 00007fdf0ef6f1a7 RDI: 00007fe7440063a0
Oct 11 04:28:23 [kernel] [111115.509259] RBP: 00007fe6853fd448 R08: 0000000000000003 R09: 0000000000000000
Oct 11 04:28:23 [kernel] [111115.509587] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
Oct 11 04:28:23 [kernel] [111115.509915] R13: 00007fe774537050 R14: 0000000000000025 R15: 00007fe744001910
Oct 11 04:28:23 [kernel] [111115.510244] </TASK>
I have this same issue on deluge.
deluge: 2.1.1 x64 (running in docker linuxserver/deluge)
libtorrent: 2.0.8.0
OS: Unraid (Slackware Linux) 6.11.2
The Unraid community has been tracking this issue here, where many of us, using both deluge and qBittorrent, experienced this same issue.
Note that this issue was only recently reported with the newest series of the Unraid distribution (i.e. Linux kernel). From: v5.15.46 [Worked without issue] To: v5.19.9 (CVE-2022-21123 CVE-2022-21123 CVE-2022-21125 CVE-2022-21166) [BUG: kernel NULL pointer dereference]
Seems like it indeed is kernel bug that is getting triggered with series 2 libtorrent, the same null pointer dereferences errors as my boxes reported above.
Same problem: 6.0.9 gentoo-kernel, qbittorrent 4.4.5, libtorrent-rasterbar 2.0.6. dmesg shows a kernel NULL pointer dereference and filemap_fault. On the same system and software versions but with libtorrent-rasterbar 1.2.18, everything works as intended.
... Reposting what I posted on deluge forum, in response to JesterEE's reported findings:
I found some hints that transparent hugepages (THP) in the kernel are the culprit, and that disabling them is a functioning workaround.
I think you can test disabling them through a kernel command-line parameter or a tmpfiles.d approach.
https://bugzilla.kernel.org/show_bug.cgi?id=216646
More info about disabling THP here: https://docs.kernel.org/admin-guide/mm/transhuge.html https://access.redhat.com/solutions/46111 https://wiki.archlinux.org/title/Redis#Warning_about_Transparent_Huge_Pages_(THP)
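For anyone trying the workaround, here is a minimal sketch of how one might check which THP mode is currently active. It relies on the sysfs file described in the kernel documentation linked above, which lists the available modes with the active one in square brackets. The helper names are illustrative only, not part of any tool mentioned in this thread.

```python
from pathlib import Path

# Sysfs knob described in the kernel's transhuge documentation; the active
# mode is shown in square brackets, e.g. "always [madvise] never".
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from the sysfs file contents."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def current_thp_mode(path: Path = THP_ENABLED) -> str:
    """Read and parse the active THP mode on the running system."""
    return parse_thp_mode(path.read_text())


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print("THP mode:", current_thp_mode())
    else:
        print("THP sysfs entry not found (kernel possibly built without THP)")
```

According to the same kernel documentation, writing never to that file as root disables THP until the next reboot, and the transparent_hugepage=never boot parameter makes the setting persistent.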
I have it deactivated and qBittorrent is already working correctly.
But the question is: whose fault is this malfunction? libtorrent, qBittorrent/Deluge or the Linux Kernel?
Looks like issue is gone after kernel >= 6.0.12
Not for me.
Are you guys using encryption? I've found disabling queues fixes the issue for me: https://wiki.archlinux.org/title/Dm-crypt/Specialties#Disable_workqueue_for_increased_solid_state_drive_(SSD)_performance
No encryption, but I noticed something. It deadlocks hard, when adding a large torrent in qBit. But if you manage to kill qBit, and restart it, it restarts and works just fine.
I am using a simple patch to prevent the kernel NULL pointer dereference. It is not a fix; it only gets rid of the symptoms. Torrent is working and I do not have to restart it.
--- linux-gentoo/mm/filemap.c 2022-12-14 00:00:00.000 +0200
+++ linux-gentoo-old/mm/filemap.c 2022-12-06 00:00:00.000 +0200
@@ -1850,22 +1850,24 @@
 static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
 	struct folio *folio;
 	rcu_read_lock();
 repeat:
 	xas_reset(&xas);
 	folio = xas_load(&xas);
 	if (xas_retry(&xas, folio))
 		goto repeat;
+	if (WARN_ON(unlikely(xa_is_internal(folio))))
+		goto repeat;
 	/*
 	 * A shadow entry of a recently evicted page, or a swap entry from
 	 * shmem/tmpfs. Return it without attempting to raise page count.
 	 */
 	if (!folio || xa_is_value(folio))
 		goto out;
 	if (!folio_try_get_rcu(folio))
 		goto repeat;
 	if (unlikely(folio != xas_reload(&xas))) {
Are you guys using encryption? I've found disabling queues fixes the issue for me: https://wiki.archlinux.org/title/Dm-crypt/Specialties#Disable_workqueue_for_increased_solid_state_drive_(SSD)_performance
I am using disk encryption. Hit issue with disabled queues.
The problem is in the kernel page cache and occurs before any read of the page.
@corvus1 could you reproduce if you set aio_threads=1?
where do you put it?
Preferences / Advanced / libtorrent Section
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I still have this problem. https://bugzilla.kernel.org/show_bug.cgi?id=216646 The kernel bug tracker thread suggests it may also relate to the filesystem you use (in my case, xfs).
This is probably true because in my case I was also using XFS and had the same problem. Now I use ZFS and it works fine.
https://bugzilla.kernel.org/show_bug.cgi?id=216646#c18 says that the kernel bug was not reproducible in BTRFS, but I was using BTRFS when facing the issue here.
I have the exact issue discussed here. Deluge 2.1.1 in Docker, running on kernel version 6.1.0. Each time the issue has occurred my stack trace is consistent with the others shared here. The bug takes time to occur for me and I'm not sure what specifically the triggering mechanism is. At the time I killed the deluge instance this time, the virtual memory usage was around 800G.
As for the filesystem: I am using XFS for initial downloads (on my scratch disk) and then stuff is moved to ZFS for seeding. So it's hard to say which filesystem, if any, is the culprit.
For what it's worth this is a high performance setup on a 5Gbps connection. Looking at the timestamps in the kernel logs, the error occurred for me during a large download of several big torrents at once. It was probably quite busy at the time; I would guess that the vast majority of work was dealing with the XFS writes.
I believe it was just recently fixed upstream, but the fix has not yet been pulled into any released kernel, so you need to apply the patch yourself. -- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=151dff099e8e6d9c8efcc75ad0ad3b8eead58704
For future visibility, the content of the patch is below:
From 151dff099e8e6d9c8efcc75ad0ad3b8eead58704 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 10 Mar 2023 08:00:23 +0100
Subject: mm-return-an-err_ptr-from-__filemap_get_folio-fix
fix null-pointer deref
Link: https://lkml.kernel.org/r/20230310070023.GA13563@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Link: https://lkml.kernel.org/r/20230310043137.GA1624890@u2004
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/swap_state.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c7160070b9daa..b76a65ac28b31 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -390,6 +390,8 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 	struct swap_info_struct *si;
 	struct folio *folio = filemap_get_entry(mapping, index);
+	if (!folio)
+		return ERR_PTR(-ENOENT);
 	if (!xa_is_value(folio))
 		return folio;
 	if (!shmem_mapping(mapping))
The above short patch is actually a late addendum to a much larger patch that's still in the works. We'll definitely need the larger patch since it covers more bases and specifically the issue that is causing the stack traces posted so far.
Poking around at my own stack trace it looks like filemap_fault calls __filemap_get_folio, which then calls mapping_get_entry which returns null, then calls xa_is_value and deref'd that null as its first argument. I think it was reworked in the above patch set to avoid this behavior but I cannot say for sure without digging deeper.
This patch will not fix this issue.
filemap_get_entry() sometimes returns an internal xarray value. In my case it is 0x02; it is not NULL and not a valid pointer.
https://github.com/arvidn/libtorrent/issues/6952#issuecomment-1356760012 works for me.
I stand corrected. Nonetheless, it is good that this has been acknowledged upstream and that work on a fix is under way; eventually it will arrive in mainline.
FWIW, I forwarded this issue over to Christoph Hellwig in case he has a moment to consider its relevance to his patch.
@shanechko is correct
I just spent some more time digging into this and it looks to me that there is some code in the kernel that uses the xarray advanced API without fully checking/resolving internal members. The result is that an internal sibling member (which is basically just an index that points to another slot in the array) is being used as a pointer, which of course gets deref'd shortly after and causes a kaboom.
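To make this concrete, here is a small user-space sketch that applies the xarray entry-tagging rules to the register values from the oops posted earlier in this thread. The three helpers restate, in Python, the bit tests from the kernel's include/linux/xarray.h. Reading the result as a sibling entry is an interpretation that assumes a kernel built with multi-index support, not something the log states directly.

```python
# User-space restatement of the entry-tagging helpers in the kernel's
# include/linux/xarray.h, applied to values from the oops in this thread.


def xa_is_value(entry: int) -> bool:
    """Value entries have the lowest bit set."""
    return (entry & 1) == 1


def xa_is_internal(entry: int) -> bool:
    """Internal entries have the two lowest bits equal to 0b10."""
    return (entry & 3) == 2


def xa_to_internal(entry: int) -> int:
    """Strip the tag bits; for a sibling entry this is the slot offset."""
    return entry >> 2


RAX = 0xA2           # entry returned by the lookup (RAX/RBX in the oops)
FIELD_OFFSET = 0x34  # displacement in the faulting instruction "8b 40 34"
CR2 = 0xD6           # faulting address reported by the kernel

if __name__ == "__main__":
    print("internal entry:", xa_is_internal(RAX))
    print("value entry:", xa_is_value(RAX))
    print("slot offset:", xa_to_internal(RAX))
    print("fault address matches:", RAX + FIELD_OFFSET == CR2)
```

Under these rules 0xa2 is an internal entry rather than a value entry or a real pointer, and adding the 0x34 displacement to it gives exactly the 0xd6 fault address in the log. The 0x02 value reported above passes the same internal-entry test.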
The solution is to fix the xarray advanced callers (i.e., callers of xas_load) so they deal with sibling entries properly. I don't have the environment spun up to build and test my own fix but I've sent it over to the kernel devs to see what they think.
It is possible that the xarray getting into a state where multi-indexes/sibling members are used is unusual, perhaps a path activated by using XFS as many here have done.
I believe the patch @shanechko made may work, but most likely the "right" thing to do with sibling members is to use the xas_for_each_conflict iterator as explained in the docs here: https://docs.kernel.org/core-api/xarray.html#multi-index-entries
filemap_get_entry() sometime returns internal xarray value. In my case it is 0x02
Verify that the memory address belongs to this process, done. Are other bad values other than 0x02 and 0x00 possible?
some code in the kernel that uses the xarray advanced API without fully checking/resolving internal members
Is that why it is so fast (≧ω≦)/
It has been stable for me after updating to kernel 6.5.0-rc1, can anyone else update and check it out?
I deployed 6.5.0-rc2 and 2.x series libtorrent yesterday; so far it is stable. Waiting a few more days to see whether it triggers again.
What would be the commit that fixed the issue in the kernel? It is not "fix null-pointer deref" from above, right?
BTW, this only affects some non-standard filesystems, not ext4?
What would be the commit that fixed the issue in the kernel? It is not "fix null-pointer deref" from above, right?
Yes. Kernel 6.5 adds some locks for process virtual memory mapping to prevent races like this.
BTW, this only affects some non-standard filesystems, not ext4?
I think not. This issue is caused by memory management: a race in the page mapping code.
I was able to reproduce the issue on 6.5-rc3. The proposed patch from Matthew Wilcox is here: https://bugzilla.kernel.org/attachment.cgi?id=304705&action=diff&collapsed=&headers=1&format=raw (from https://bugzilla.kernel.org/show_bug.cgi?id=217441)
It looks like it does the job for me. Can someone else who is impacted also check whether it resolves the problem?
Fix in upstream kernel commit 31cd0859dc1fe345ebacc319b72a03e9cf80cf41 Version 6.5.3
Fix in upstream kernel commit
The "upstream" is always Linus's tree - latest release being 6.6-rc1 at the moment ;) Commit ID there is cbc02854331edc6dc22d8b77b6e22e38ebc7dd51: https://github.com/torvalds/linux/commit/cbc02854331edc6dc22d8b77b6e22e38ebc7dd51
Also fixed in stable releases:
6.5.3 - commit 31cd0859dc1fe345ebacc319b72a03e9cf80cf41
6.4.16 - commit ec5cf2977cff6b0081793e73d1af7f1a165959b8
And backported to longterm 6.1:
6.1.53 - commit 4a3e0d51c3e10ef699fc201542593f4bbcd39fd3
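As a rough aid, the version numbers listed above can be turned into a quick check of the running kernel. This is only a sketch: the function name is made up for illustration, it knows only the series named in this thread, and it cannot see distribution backports, so it returns None for anything it cannot decide.

```python
import platform
import re

# First fixed point release per stable series, as reported in this thread.
FIRST_FIXED = {(6, 5): 3, (6, 4): 16, (6, 1): 53}
# The mainline commit was reported as present in 6.6-rc1.
MAINLINE_FIXED = (6, 6)


def has_reported_fix(release: str):
    """Return True/False when the thread's data decides it, else None."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        return None
    major, minor = int(m.group(1)), int(m.group(2))
    patch = int(m.group(3) or 0)
    if (major, minor) >= MAINLINE_FIXED:
        return True
    if (major, minor) in FIRST_FIXED:
        return patch >= FIRST_FIXED[(major, minor)]
    return None  # series not mentioned in the thread


if __name__ == "__main__":
    release = platform.release()
    print(release, "->", has_reported_fix(release))
```

A None result means the thread gives no data for that series; it does not mean the kernel is affected.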
bumping for progress
I have not faced this issue for a month since kernel 6.5.3.
Me neither
qbittorrent: 4.4.3.1 x64
OS: Arch Linux
Qt: 6.3.0 and 5.15.4
libtorrent-rasterbar: 2.0.6
I reported it first in the qBittorrent bugtracker but the problem only happens with version 2.x of libtorrent, if I use version 1.x it works without problems. So I'm not sure where the problem really lies: in the program or in the library.
I have the computer always on with qBittorrent running. Every now and then qBittorrent freezes. I have to restart the computer because I can't even kill the process.