Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

bees is polling but not returning #78

Open Massimo-B opened 5 years ago

Massimo-B commented 5 years ago

Hi,

On one of my USB btrfs drives, bees has been running successfully for many days, but now it always ends up in a polling state and can no longer be terminated or killed, while still using 25% CPU (one of 4 cores, I guess). I'm using space_cache=v2 on that btrfs, and it turned out there was a kernel trace in the syslog when mounting, which somehow froze the mount. I have run a btrfs check and deleted the space cache. Now the check is clean and btrfs scrub has finished without errors. The btrfs seems to be sane.
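
(For reference, clearing the v2 space cache and re-checking can be done roughly like this; a sketch only, with the device and mount point as placeholders:

    # filesystem must be unmounted; /dev/sdX and /mnt/usb are placeholders
    btrfs check --clear-space-cache v2 /dev/sdX   # drop the free space tree (space_cache=v2)
    btrfs check /dev/sdX                          # verify metadata is consistent
    mount /dev/sdX /mnt/usb
    btrfs scrub start -B /mnt/usb                 # -B waits for the scrub to finish
)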

2018-10-02 13:46:59 32289.22505<6> crawl_1899:        dst 40K [0x724ee000..0x724f8000] {0x84d2cc000z4000} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:46:59 32289.22505<6> crawl_1899: scan: 56K 0x566da000 [+++d++++++++++] 0x566e8000 snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:46:59 32289.22506<6> crawl_1899: copy: 4K [0x56820000..0x56821000] fd = 3554 'snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi'
2018-10-02 13:46:59 32289.22492<6> crawl_1899: dedup: src 4K [0x73447000..0x73448000] {0xb2e7b5000zf000} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:46:59 32289.22492<6> crawl_1899:        dst 4K [0x5683b000..0x5683c000] {0x84b8c2000z7000} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:46:59 32289.22492<6> crawl_1899: copy: 28K [0x56834000..0x5683b000] fd = 3554 'snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi'
2018-10-02 13:46:59 32289.22505<6> crawl_1899: dedup: src 4K [0x17ce5d000..0x17ce5e000] {0x13d73566000z5000} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:46:59 32289.22505<6> crawl_1899:        dst 4K [0x56841000..0x56842000] {0x84d2f5000z3000} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:46:59 32289.22505<6> crawl_1899: dedup: src 4K [0x71c79000..0x71c7a000] {0x74a92a000z1000} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:46:59 32289.22505<6> crawl_1899:        dst 4K [0x56845000..0x56846000] {0x84d2f5000z7000} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:46:59 32289.22505<6> crawl_1899: copy: 12K [0x5683e000..0x56841000] fd = 3554 'snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi'
2018-10-02 13:47:00 32289.22506<6> crawl_1899: scan: 68K 0x56810000 [+++++ddd++++++dd+] 0x56821000 snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:47:00 32289.22507<6> crawl_1899: copy: 12K [0x566ae000..0x566b1000] fd = 3554 'snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi'
2018-10-02 13:47:00 32289.22506<6> crawl_1899: dedup: src 4K [0x178971000..0x178972000] {0x13d7428a000z9000} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:47:00 32289.22506<6> crawl_1899:        dst 4K [0x568c1000..0x568c2000] {0x84d311000z5000} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:47:00 32289.22506<6> crawl_1899: copy: 20K [0x568bc000..0x568c1000] fd = 3554 'snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi'
2018-10-02 13:47:01 32289.22492<6> crawl_1899: dedup: src 28K [0x68000..0x6f000] {0x3714ef000z0} #281 (deleted)
2018-10-02 13:47:01 32289.22492<6> crawl_1899:        dst 28K [0x56834000..0x5683b000] {0x84b8c2000z0} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:47:01 32289.22492<6> crawl_1899: copy: 4K [0x5683c000..0x5683d000] fd = 3554 'snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi'
2018-10-02 13:47:01 32289.22507<6> crawl_1899: scan: 36K 0x566a8000 [++dd+d+++] 0x566b1000 snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:47:01 32289.22506<6> crawl_1899: dedup: src 20K [0x67000..0x6c000] {0x371943000z0} #280 (deleted)
2018-10-02 13:47:01 32289.22506<6> crawl_1899:        dst 20K [0x568bc000..0x568c1000] {0x84d311000z0} snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:47:01 32289.22506<6> crawl_1899: copy: 24K [0x568c2000..0x568c8000] fd = 3554 'snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi'
2018-10-02 13:47:01 32289.22492<6> crawl_1899: scan: 36K 0x56834000 [+++++++d+] 0x5683d000 snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
2018-10-02 13:49:22 32289.32295<7> crawl_transid: Polling 196.448s for next 10 transid RateEstimator { count = 1059175, raw = 33.98 / 667.595, ratio = 33.98 / 667.621, rate = 0.0508971, duration(1) = 19.6475, seconds_for(1) = 19.6211 }
2018-10-02 13:52:38 32289.32295<7> crawl_transid: Polling 254.871s for next 10 transid RateEstimator { count = 1059175, raw = 33.6402 / 857.395, ratio = 33.6402 / 857.396, rate = 0.0392353, duration(1) = 25.4872, seconds_for(1) = 25.486 }
2018-10-02 13:53:21 32289.32308<6> crawl_writeback: Saving crawl state

2018-10-02 13:56:53 32289.32295<7> crawl_transid: Polling 331.401s for next 10 transid RateEstimator { count = 1059175, raw = 33.3038 / 1103.69, ratio = 33.3038 / 1103.69, rate = 0.0301748, duration(1) = 33.1402, seconds_for(1) = 33.1394 }

2018-10-02 14:02:24 32289.32295<7> crawl_transid: Polling 431.915s for next 10 transid RateEstimator { count = 1059175, raw = 32.9708 / 1424.06, ratio = 32.9708 / 1424.06, rate = 0.0231526, duration(1) = 43.1916, seconds_for(1) = 43.1908 }

2018-10-02 14:09:36 32289.32295<7> crawl_transid: Polling 564.239s for next 10 transid RateEstimator { count = 1059175, raw = 32.6411 / 1841.74, ratio = 32.6411 / 1841.74, rate = 0.017723, duration(1) = 56.4239, seconds_for(1) = 56.4231 }
^C^C^C^C^C^C^C^C

^C^C^C

And today is 2018-10-04, so there has been no output for 2 days.

kakra commented 5 years ago

It's probably stuck in an uninterruptible state (status "D" in top). You should try to get a stack trace somehow. If your kernel is configured for it, you should be able to do cat /proc/$(pidof bees)/stack. Otherwise try alt+sysrq+w and look at dmesg afterwards.
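
If reaching alt+sysrq+w on the keyboard is awkward, the same dump can be triggered from a shell; a minimal sketch (needs root, and sysrq must not be disabled by the distro):

    echo 1 > /proc/sys/kernel/sysrq   # temporarily allow all SysRq functions
    echo w > /proc/sysrq-trigger      # dump stacks of blocked (D-state) tasks
    dmesg | tail -n 100               # the traces end up in the kernel log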

Massimo-B commented 5 years ago

No, I don't have /proc/<pid>/stack enabled here. The bees process looks like this in ps:

root     32289  106  0.0      0     0 pts/14   Zl   Okt02 2964:08 [bees] <defunct>
kakra commented 5 years ago

So it's actually not running anymore (it's a zombie). You cannot kill what's already dead. ;-)

Look in pstree:

  1. What is the parent process? pstree -ps $(pidof bees)
  2. Which state is it in? ps u --pid PIDS_OF_PARENTS
Massimo-B commented 5 years ago
  1. As I was able to kill beesd, the parent is now init(1).
  2. The init process is fine: root 1 0.0 0.0 4376 1436 ? Ss Okt01 0:04 init [3]

However, in pstree I see more PIDs as children of that bees process that I hadn't seen in ps. These should be the threads, but I can't kill any of them either.

Zygo commented 5 years ago

it turned out that there was some kernel trace in the syslog when mounting, and freezing the mount somehow

Can you post the stack trace from the mount?

kakra commented 5 years ago

@Massimo-B A zombie process will not be cleaned up until its parent has read the exit status. Actually, in your case the beesd script was locked up, probably during the umount process. When you killed it, the zombie process was reparented to pid 1 (pid 1 is the final process reaper during shutdown). You may still have a umount process hanging around.
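
One quick way to check for that, for example (a stuck umount would typically be in D state):

    ps -eo pid,stat,wchan:32,cmd | grep '[u]mount'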

Massimo-B commented 5 years ago

Can you post the stack trace from the mount?

It was without symbols or names, so not very useful. I can rebuild my kernel with symbols if that will help you next time. I don't get that stack trace anymore since I cleared the space cache and ran btrfs check. Mounting that btrfs looks completely sane now. It's just bees that locks up in this loop every time I try again after a reboot.

Zygo commented 5 years ago

Repeatable lockups? Try looking in /run/bees/$UUID.status (or run bees with the environment variable BEESSTATUS=/run/bees.status), and see what the threads are doing when it locks up.
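
If the service wrapper isn't involved, a rough sketch of doing that by hand (the mount point path is a placeholder):

    # run bees directly with a status file, then watch the thread list while it hangs
    BEESSTATUS=/run/bees.status bees /mnt/usb-btrfs &
    watch -n 5 cat /run/bees.status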

kakra commented 5 years ago

Did you have autodefrag running while the VM was active?

Massimo-B commented 5 years ago

No, I have no autodefrag set on any mount since I started with bees.

Here is the BEESSTATUS of my zombie bees process:

# cat /run/bees/5e6644f0-9b64-47c4-b773-529c52fd0dc2.status 
TOTAL:
    addr_block=235280 addr_compressed_offset=86238 addr_eof_e=11222 addr_from_fd=2424 addr_from_root_fd=12852 addr_hole=8 addr_magic=8 addr_ms=1816 addr_uncompressed=149042
    adjust_compressed_offset_correct=164 adjust_exact=4 adjust_exact_correct=4 adjust_hit=168 adjust_offset_high=5106 adjust_offset_hit=164 adjust_offset_low=7483 adjust_try=12757
    block_bytes=890665006 block_hash=220578 block_ms=339339 block_read=223294 block_zero=374
    chase_corrected=59 chase_hit=168 chase_no_data=12589 chase_try=12852 chase_uncorrected=109 chase_wrong_addr=95
    crawl_create=344 crawl_empty=1 crawl_gen_high=51366 crawl_gen_low=687 crawl_hole=243 crawl_inline=623 crawl_items=132467 crawl_ms=26982 crawl_nondata=61015 crawl_push=18533 crawl_restart=1 crawl_scan=17920 crawl_search=180
    dedup_bytes=2965504 dedup_copy=1724416 dedup_hit=160 dedup_ms=8200 dedup_try=160 dedup_unique_bytes=2424832
    exception_caught=7
    hash_already=129035 hash_bump=77833 hash_erase=6 hash_evict=11606 hash_extent_in=32 hash_extent_out=59 hash_front=83 hash_insert=89474 hash_lookup=219770 hash_toxic=22
    inserted_block=218097 inserted_clobbered=68
    matched_0=218956 matched_1_or_more=97 matched_2_or_more=1 matched_3_or_more=1
    open_clear=2 open_file=11973 open_hit=11973 open_ino_ms=48421 open_lookup_ok=11973 open_root_ms=183
    pairbackward_hit=64 pairbackward_miss=73 pairbackward_ms=10429 pairbackward_same=1 pairbackward_stop=160 pairbackward_try=139
    pairforward_eof_first=46 pairforward_hit=500 pairforward_miss=57 pairforward_ms=2033 pairforward_same=1 pairforward_stop=160 pairforward_toxic_hash=2 pairforward_try=606
    replacedst_dedup_hit=86 replacedst_grown=46 replacedst_same=7 replacedst_try=97
    replacesrc_dedup_hit=74 replacesrc_grown=57 replacesrc_try=80
    resolve_fail=9 resolve_ms=100056 resolve_ok=216 resolve_toxic=3
    root_clear=2 root_found=8 root_ok=8 root_parent_open_ok=8 root_parent_open_try=8
    scan_block=219600 scan_bump=86 scan_dup_block=351 scan_dup_hit=86 scan_dup_miss=3 scan_extent=17881 scan_forward=17881 scan_found=206907 scan_hash_hit=86 scan_hash_miss=3 scan_hash_preinsert=219600 scan_lookup=219226 scan_no_rewrite=15 scan_push_front=86 scan_reinsert=412 scan_resolve_hit=99 scan_resolve_zero=3 scan_rewrite=80 scan_toxic_hash=168 scan_toxic_match=4 scan_twice=206632
    scanf_extent=17870 scanf_extent_ms=1255183 scanf_total=17877 scanf_total_ms=1540892
    sync_count=81 sync_ms=181042
    tmp_aligned=82 tmp_block=421 tmp_bytes=1724416 tmp_copy=81 tmp_create=4 tmp_create_ms=796 tmp_resize=82 tmp_resize_ms=3355
RATES:
    addr_block=113.258 addr_compressed_offset=41.513 addr_eof_e=5.402 addr_from_fd=1.167 addr_from_root_fd=6.187 addr_hole=0.004 addr_magic=0.004 addr_ms=0.875 addr_uncompressed=71.745
    adjust_compressed_offset_correct=0.079 adjust_exact=0.002 adjust_exact_correct=0.002 adjust_hit=0.081 adjust_offset_high=2.458 adjust_offset_hit=0.079 adjust_offset_low=3.603 adjust_try=6.141
    block_bytes=428740 block_hash=106.18 block_ms=163.349 block_read=107.488 block_zero=0.181
    chase_corrected=0.029 chase_hit=0.081 chase_no_data=6.06 chase_try=6.187 chase_uncorrected=0.053 chase_wrong_addr=0.046
    crawl_create=0.166 crawl_empty=0.001 crawl_gen_high=24.727 crawl_gen_low=0.331 crawl_hole=0.117 crawl_inline=0.3 crawl_items=63.766 crawl_ms=12.989 crawl_nondata=29.371 crawl_push=8.922 crawl_restart=0.001 crawl_scan=8.627 crawl_search=0.087
    dedup_bytes=1427.51 dedup_copy=830.085 dedup_hit=0.078 dedup_ms=3.948 dedup_try=0.078 dedup_unique_bytes=1167.24
    exception_caught=0.004
    hash_already=62.114 hash_bump=37.467 hash_erase=0.003 hash_evict=5.587 hash_extent_in=0.016 hash_extent_out=0.029 hash_front=0.04 hash_insert=43.071 hash_lookup=105.791 hash_toxic=0.011
    inserted_block=104.986 inserted_clobbered=0.033
    matched_0=105.4 matched_1_or_more=0.047 matched_2_or_more=0.001 matched_3_or_more=0.001
    open_clear=0.001 open_file=5.764 open_hit=5.764 open_ino_ms=23.309 open_lookup_ok=5.764 open_root_ms=0.089
    pairbackward_hit=0.031 pairbackward_miss=0.036 pairbackward_ms=5.021 pairbackward_same=0.001 pairbackward_stop=0.078 pairbackward_try=0.067
    pairforward_eof_first=0.023 pairforward_hit=0.241 pairforward_miss=0.028 pairforward_ms=0.979 pairforward_same=0.001 pairforward_stop=0.078 pairforward_toxic_hash=0.001 pairforward_try=0.292
    replacedst_dedup_hit=0.042 replacedst_grown=0.023 replacedst_same=0.004 replacedst_try=0.047
    replacesrc_dedup_hit=0.036 replacesrc_grown=0.028 replacesrc_try=0.039
    resolve_fail=0.005 resolve_ms=48.165 resolve_ok=0.104 resolve_toxic=0.002
    root_clear=0.001 root_found=0.004 root_ok=0.004 root_parent_open_ok=0.004 root_parent_open_try=0.004
    scan_block=105.71 scan_bump=0.042 scan_dup_block=0.169 scan_dup_hit=0.042 scan_dup_miss=0.002 scan_extent=8.608 scan_forward=8.608 scan_found=99.6 scan_hash_hit=0.042 scan_hash_miss=0.002 scan_hash_preinsert=105.71 scan_lookup=105.53 scan_no_rewrite=0.008 scan_push_front=0.042 scan_reinsert=0.199 scan_resolve_hit=0.048 scan_resolve_zero=0.002 scan_rewrite=0.039 scan_toxic_hash=0.081 scan_toxic_match=0.002 scan_twice=99.467
    scanf_extent=8.603 scanf_extent_ms=604.209 scanf_total=8.606 scanf_total_ms=741.741
    sync_count=0.039 sync_ms=87.149
    tmp_aligned=0.04 tmp_block=0.203 tmp_bytes=830.085 tmp_copy=0.039 tmp_create=0.002 tmp_create_ms=0.384 tmp_resize=0.04 tmp_resize_ms=1.616
THREADS (work queue 40 tasks):
    tid 22492: crawl_1899: [1548.55s] Extending matching range: BeesRangePair: 4K src[0x774ee000..0x774ef000] dst[0x568e2000..0x568e3000]
src = 3554 snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
dst = 3554 snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
    tid 22505: crawl_1899: [1551.07s] Rewriting bfr 12K [0x5683e000..0x56841000] fd = 3554 'snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi'
    tid 22506: crawl_1899: [1548.92s] syncing #280 (deleted)
    tid 22507: crawl_1899: [1548.77s] resolving 0x13d70b21000z1000 matched BeesBlockData { 4K 0x568dd000 fd = 3554 'snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi', address = 0x7c0bbf000z1000, hash = 0xb7acf19adbaaa78b, data[4096] }
    tid 32289: bees: [2077.4s] main
    tid 32295: crawl_transid: [193.918s] waiting 564.239s for next 10 transid RateEstimator { count = 1059175, raw = 32.6411 / 1841.74, ratio = 32.6411 / 2035.65, rate = 0.0160347, duration(1) = 62.3649, seconds_for(1) = 1 }
    tid 32296: status: writing status to file '/run/bees//5e6644f0-9b64-47c4-b773-529c52fd0dc2.status'
    tid 32308: crawl_writeback: [1169.12s] closing beescrawl.dat.tmp in .beeshome
    tid 32309: hash_writeback: [1226.65s] idle after writing 32 of 32 extents
    tid 32310: hash_prefetch: [2024.71s] idle 3600s
Zygo commented 5 years ago

This could be the slow backrefs issue:

tid 22506: crawl_1899: [1548.92s] syncing #280 (deleted)
tid 22507: crawl_1899: [1548.77s] resolving 0x13d70b21000z1000 matched BeesBlockData { 4K 0x568dd000 fd = 3554 'snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi', address = 0x7c0bbf000z1000, hash = 0xb7acf19adbaaa78b, data[4096] }

except that doesn't explain why this one is stuck doing a read:

tid 22492: crawl_1899: [1548.55s] Extending matching range: BeesRangePair: 4K src[0x774ee000..0x774ef000] dst[0x568e2000..0x568e3000]
src = 3554 snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi
dst = 3554 snapshots/VirtualMachines/VirtualMachines.20171130T104151+0100/OpenSuse LEAP/OpenSuse LEAP.vdi

and it also doesn't explain why bees's toxic extent detector didn't stop processing this file before it got to this point.

Without the kernel stack trace I can't confirm whether it is a slow backref (in which case you just wait until it finishes, and bees blacklists the extent so it never goes back there) or some other bug.

Zygo commented 5 years ago

Followup question: in the bees log, are there any lines with the words 'WORKAROUND', 'PERFORMANCE', or 'toxic' on them?
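
For example, something like this against wherever the stdout ends up (the log path is a placeholder):

    grep -E 'WORKAROUND|PERFORMANCE|toxic' /var/log/bees.log | tail -n 50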

Zygo commented 5 years ago

Followup question 2: how long does

    btrfs ins log 0x13d70b21000 /path/to/your/fs

take to run (without running bees)? What is the size of the .vdi file?
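
(For reference, 'ins log' is the abbreviated form of 'inspect-internal logical-resolve'; timing it could look roughly like this, with the mount point as a placeholder:

    time btrfs inspect-internal logical-resolve 0x13d70b21000 /mnt/usb-btrfs
)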

Massimo-B commented 5 years ago

Followup question: in the bees log, are there any lines with the words 'WORKAROUND', 'PERFORMANCE', or 'toxic' on them?

I'm not on that machine right now, but yes, I have seen a bunch of those 3 keywords on stdout, on both bees instances I was running. Is that a bad thing? I got around the issue by just deleting those snapshots of the Virtual Machines, as kakra mentioned. On my main btrfs, where those Virtual Machines are located, I copied all the files to new files before starting a bees run. kakra said it was all because I had autodefrag enabled while running the Virtual Machines. I have disabled autodefrag now.

But without autodefrag I fear running into heavy fragmentation, at least within a few months or a year. It would be nice if bees could do some optimization regarding fragmentation when it moves data around.

I can run the 'btrfs ins log...' test next week.

kakra commented 5 years ago

@Massimo-B wrote:

I got around the issue by just deleting those snapshots of the Virtual Machines as kakra mentioned.

Let's not break media here... I mentioned this on IRC. ;-)

About your concerns with fragmentation: it hasn't been a problem here yet. But you're free to defrag the virtualbox image once in a while.
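
For an occasional manual pass, a sketch (the path is a placeholder; note that defragmenting can unshare extents across snapshots and reflinks, so deduplicated space may grow back):

    btrfs filesystem defragment -t 32M '/mnt/data/VirtualMachines/OpenSuse LEAP.vdi'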

Zygo commented 5 years ago

The 'btrfs ins log...' test I can provide next week.

This might not reveal anything after the snapshot is deleted; however, just fixing the problem by deleting the snapshot also smells like the slow backrefs bug.

have seen a bunch of those 3 keywords in the stdout, on both bees I was running. Is that a bad thing?

Normally, no. bees detects when slowness happens and records which extents it was working on, so it avoids returning to a toxic extent. Very common block values are hardcoded into bees so they are avoided as toxic immediately.

If we see a few of these with longer and longer runtimes just before a hang, it can mean bees is failing to detect an emerging toxic extent despite evidence that one is emerging. I'd like to know what those look like in case there is some new heuristic I can add to avoid them.

Massimo-B commented 5 years ago

It happened again: bees is not terminating with Ctrl+C and has already turned into a zombie:

# pstree -lhp |grep bees
        │                    ├─python2(27206)─┬─bash(27511)───bash(27529)───sudo(30617)───bash(30621)───beesd(23873)───bees(23917)─┬─{bees}(26834)
        │                    │                │                                                                                    ├─{bees}(26835)
        │                    │                │                                                                                    └─{bees}(28792)

# cat /proc/23917/status
Name:   bees
State:  Z (zombie)
Tgid:   23917
Ngid:   0
Pid:    23917
PPid:   23873
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 0
Groups: 0 1 2 3 4 6 10 11 26 27 4729 4730 
NStgid: 23917
NSpid:  23917
NSpgid: 23873
NSsid:  27511
Threads:    4
SigQ:   2/63847
SigPnd: 0000000000000000
ShdPnd: 0000000000000002
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000180000000
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp:    0
Speculation_Store_Bypass:   vulnerable
Cpus_allowed:   ff
Cpus_allowed_list:  0-7
voluntary_ctxt_switches:    8
nonvoluntary_ctxt_switches: 3

/proc/23917/stack is empty.

Zygo commented 5 years ago

We really need those kernel stacks:

ps xar | while read -r x y; do ps "$x"; head -50 --verbose /proc/"$x"/task/*/stack; done | tee lockup-stacks.txt

The pstree output is typical for kernel bugs: one or more of bees's threads get stuck somewhere in the kernel, and never return (or take a very long time). It's never the main thread (the one you get from /proc/<pid>/stack) because that thread doesn't do anything.
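
A narrower variant aimed only at the bees threads, for example (assuming per-task kernel stacks are available in /proc):

    # print the kernel stack of every task (thread) belonging to the bees process
    for s in /proc/$(pidof bees)/task/*/stack; do
        echo "== $s"
        cat "$s"
    done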