fujiapple852 / trippy

A network diagnostic tool
https://trippy.cli.rs
Apache License 2.0

Trippy BSOD on NetBSD when resizing window #276

Closed fujiapple852 closed 1 year ago

fujiapple852 commented 2 years ago

Starting trippy (0.6.0-dev) and resizing the window will result in a BSOD with the error:

IO error: Interrupted system call (os error 4)
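
For context, `os error 4` is the raw errno value `EINTR` on NetBSD (as well as on Linux and macOS). The following is a minimal sketch, not taken from the trippy source, of how Rust's standard library classifies that errno; the helper name `is_interrupted` is made up for illustration.

```rust
use std::io;

/// Hypothetical helper: true if `err` represents an interrupted system call.
fn is_interrupted(err: &io::Error) -> bool {
    err.kind() == io::ErrorKind::Interrupted
}

fn main() {
    // Assumption: errno 4 is EINTR on the target platform
    // (this holds on NetBSD, Linux and macOS).
    let err = io::Error::from_raw_os_error(4);
    println!("{err}");
    println!("interrupted: {}", is_interrupted(&err));
}
```

An error of this kind usually means a blocking call was cut short by a signal rather than that the call itself failed.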
fujiapple852 commented 2 years ago

Also occurs with the existing 0.5.0 release and so not a new regression. @0323pin do you recall if you ever saw this issue?

0323pin commented 2 years ago

@fujiapple852 I don't, because I do not resize windows; I use leftwm (a tiling window manager).

But, I can reproduce the error with 0.5.0 when I do a window resize.

fujiapple852 commented 2 years ago

Thank you @0323pin! Given it isn't a regression bug I'll proceed with the 0.6.0 release and investigate this with a view to fixing in 0.7.0.

0323pin commented 2 years ago

No worries, @fujiapple852 Weird that this happens, though.

0323pin commented 2 years ago

@fujiapple852 Just so you know, I've updated the package already but didn't have the time to merge it. Most probably late this evening or early tomorrow morning.

fujiapple852 commented 1 year ago

Added #552 and #153 to assist with diagnosing this issue.

fujiapple852 commented 1 year ago

Best guess is this relates to not handling EINTR properly.

https://unix.stackexchange.com/questions/509375/what-is-interrupted-system-call
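
To illustrate what "handling EINTR" typically means: a terminal resize delivers a signal (`SIGWINCH`) to the process, and a blocking system call that is in flight at that moment may return `EINTR` instead of completing. A common remedy is to retry the call. The thread does not say which call in trippy is affected or how it was ultimately fixed, so the following is only a generic sketch; `retry_on_interrupt` is a hypothetical helper, not a trippy API.

```rust
use std::io;

/// Hypothetical helper (not from the trippy source): re-run `op` for as long
/// as it fails with `ErrorKind::Interrupted`, and return any other result.
fn retry_on_interrupt<T>(mut op: impl FnMut() -> io::Result<T>) -> io::Result<T> {
    loop {
        match op() {
            Err(err) if err.kind() == io::ErrorKind::Interrupted => continue,
            other => return other,
        }
    }
}

fn main() -> io::Result<()> {
    // Simulate a call that is interrupted twice (e.g. by a signal) before succeeding.
    let mut attempts = 0;
    let value = retry_on_interrupt(|| {
        attempts += 1;
        if attempts < 3 {
            Err(io::Error::from(io::ErrorKind::Interrupted))
        } else {
            Ok(42)
        }
    })?;
    println!("succeeded after {attempts} attempts with value {value}");
    Ok(())
}
```

Errors other than `Interrupted` are passed straight through, so genuine failures still surface to the caller.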

0323pin commented 1 year ago

Sorry for the slow reply, been AFK for a few days.

I've asked internally, if we are missing something obvious here.

fujiapple852 commented 1 year ago

Thanks @0323pin, there is nothing needed right now, ~I've managed to get an AWS NetBSD instance working again and so I should be able to debug this now.~

Edit: I spoke too soon, I'm unable to build the latest master (or even the previous 0.7.0) on my AWS environment. I can install Rust (1.64, a bit old but should be ok) but it fails to build some core Rust packages.

error: could not compile `syn`

Caused by:
  process didn't exit successfully: `rustc --crate-name syn --edition=2018 /root/.cargo/registry/src/github.com-1ecc6299db9ec823/syn-1.0.109/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="clone-impls"' --cfg 'feature="default"' --cfg 'feature="derive"' --cfg 'feature="extra-traits"' --cfg 'feature="full"' --cfg 'feature="parsing"' --cfg 'feature="printing"' --cfg 'feature="proc-macro"' --cfg 'feature="quote"' --cfg 'feature="visit"' --cfg 'feature="visit-mut"' -C metadata=e75730d19c6f40fa -C extra-filename=-e75730d19c6f40fa --out-dir /root/trippy/target/debug/deps -L dependency=/root/trippy/target/debug/deps --extern proc_macro2=/root/trippy/target/debug/deps/libproc_macro2-2836222ef2fb938b.rmeta --extern quote=/root/trippy/target/debug/deps/libquote-fd31c030121713c2.rmeta --extern unicode_ident=/root/trippy/target/debug/deps/libunicode_ident-7347f110e8245f8d.rmeta --cap-lints allow --cfg syn_disable_nightly_tests` (signal: 9, SIGKILL: kill)

I'm using the public AMI ami-041f8cb5cca00f023 which describes itself as NetBSD 9 arm64 2021-07-01a and NetBSD/evbarm-aarch64 9. It is also described as arm64 but in reality it is an evbarm.

It appears to be NetBSD 9.2_STABLE, as that is what pkgin tells me and so I've configured it to pull packages from http://ftp.netbsd.org/pub/pkgsrc/packages/NetBSD/aarch64/9.2/All. I do see a evbppc package repo but not evbarm. There is also https://wiki.netbsd.org/ports/evbarm/

Aside: I can pkgin install trippy and run it without issue, I just can't build it!

0323pin commented 1 year ago

I'm unable to build the latest master (or even the previous 0.7.0) ...

If you want, I can do a test build on bare-metal x86_64.

fujiapple852 commented 1 year ago

@0323pin that would be useful, thanks. I'm going to release 0.8.0 soon so it would be good to confirm everything still works before I do.

Apart from checking that it builds from master, I was also hoping to run against a branch with some new logging enabled to capture more details about the failure above. Would you be able to do that if it's not too much trouble?

git checkout feat-error-context
cargo build
sudo target/debug/trip example.com -m silent -v --log-span-events full > trippy.log

When running, resize the window and it should fail with the Interrupted system call error, then send me the log. Note that the above will generate a lot of output, so I'd only run it for a few seconds before resizing the window.

0323pin commented 1 year ago

I was also hoping to run against a branch with some new logging enabled to capture more details about the failure above. Would you be able to do that if it's not too much trouble?

Yes, I can do this but, I've already built from the 5b5ca30033fee48c2389723a6641a14b21a8a1b2

~> trip -V
trip 0.8.0-dev
~> uname -v
NetBSD 10.99.4 (GENERIC) #0: Fri May 12 13:29:41 UTC 2023  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC
~> pkgin list | grep rust
rust-1.69.0          Safe, concurrent, practical language

[screenshot: 2023-05-14-130643_1366x768_scrot]

Resizing the window causes the expected failure. I'll try to find the time to build with the logging feature enabled later today. But, at least you know it builds and runs.

0323pin commented 1 year ago

target/debug/trip example.com -m silent -v --log-span-events full > trippy.log with a window resize sent by e-mail.

Hopefully it comes through, it's 19 MB. Let me know if it doesn't get to you and I'll host it on git.

fujiapple852 commented 1 year ago

Thanks, I got the file.

From the trace it doesn't look like it crashed, so I guess it only crashes when you resize when the TUI is running and the backend tracing is running, which the current tracing code doesn't support.

To fix this I think I'll need a NetBSD env I can use to debug this directly (also I tried a VM in VirtualBox without much luck). What would be a good forum to ask for help with this?

Anyway, good news that the latest code builds ok, so I can proceed with the release.

0323pin commented 1 year ago

I guess it only crashes when you resize when the TUI is running and the backend tracing is running, which the current tracing code doesn't support.

Yeah, I tried that also but got "verbose option is not available in TUI mode", or something along those lines.

To fix this I think I'll need a NetBSD env ... I tried a VM in VirtualBox without much luck

I've never used VirtualBox, I always use QEMU

What would be a good forum to ask for help for this?

www.unitedbsd.com. Check this thread: https://www.unitedbsd.com/d/348-how-to-use-qemu-to-run-netbsd-91/5. Yes, I'm active there :)

EDIT: 0.8.0 merged, https://mail-index.netbsd.org/pkgsrc-changes/2023/05/15/msg274784.html

fujiapple852 commented 1 year ago

@0323pin I was eventually able to get NetBSD running locally on macOS (Intel Mac) using qemu.

I installed qemu:

brew install qemu

I downloaded:

http://nycdn.netbsd.org/pub/NetBSD-daily/netbsd-9/202308261920Z/images/NetBSD-9.3_STABLE-amd64.iso

I created the VM with:

qemu-img create virtualmachine.img 10G

For the initial install I ran:

qemu-system-x86_64 -boot d -cdrom ~/Downloads/NetBSD-9.3_STABLE-amd64.iso -enable-kvm -m 3G -hda virtualmachine.img

...and followed the prompts (picked defaults for most things, enabled sshd and ntpd).

After installation, I'm running it with (I've tried a few variations on this to try and speed things up):

qemu-system-x86_64 -m 3G -M q35 -cpu host -smp 4 -hda virtualmachine.img -accel hvf

I then installed pkgin by uncommenting the PKG_PATH line in .profile. I had to change from the default https://cdn.netbsd.org to http://ftp.netbsd.org to get it to work.

I don't know why but pkg_add & pkgin run really slowly (multi-minute to search for or install a package). Not sure if it is a qemu issue or not, everything else seems to run at a sensible speed.

I then installed trippy version 0.8.0:

pkgin install trippy

It runs, though I'm obviously missing some setup as it looks like the following:

[screenshot: Screenshot 2023-08-31 at 11 50 35 PM]

I don't see any ICMP traffic being received, which I guess is some qemu config I need to tweak. A standard traceroute seems to have the same problem so I suspect it is config.

Because I'm working directly in the console I don't have any way to trigger the bug with a window "resize", I guess I need some kind of graphical environment?

I wasn't able to ssh into the VM from my Mac host, so I'm just working in the qemu window that pops up.

0323pin commented 1 year ago

Hi, great that you have managed to install it :)

I don't know why but pkg_add & pkgin run really slowly (multi-minute to search for or install a package).

I've never experienced this. Ok, it's not as fast as xbps but, it's not slower than apt. Location/mirror?

Because I'm working directly in the console I don't have any way to trigger the bug with a window "resize", I guess I need some kind of graphical environment?

There are two window managers (kind of anyway, twm and the default ctwm) in the base install, as well as Xorg and three shells.

0323pin commented 1 year ago

@fujiapple852 Are you on Matrix? I'm really short of time today but we could set up a time to chat through your issues.

0323pin commented 1 year ago

@fujiapple852 On a second thought ...

If your intention is only to debug the window-resize issue on a disposable qemu-vm, which you do not intend to keep, NetBSD has everything you need on the base install.

Xorg is part of base and ctwm the default WM, so if you just run startx from the tty, you will be on a graphical env. A bare-bones one with a white xterm (create .Xresources if you want to define other colors) but nevertheless a graphical env 😄

Sorry, if I can't give you proper .xinitrc and .Xresources files but, I haven't used modified defaults in ages. These days, I'm using alacritty built from git-HEAD and not xterm (no need for .Xresources) and elvish also built from git-HEAD as my default shell, configured to start a graphical env straight from login.

If you, by any chance, run into issues with .Xauthority (not able to start the X-server) make sure your /etc/hosts is properly configured. It should contain the proper machine name (the name you gave your host during install) and your DNS domain and it should look like this:

#   $NetBSD: hosts,v 1.9 2013/11/24 07:20:01 dholland Exp $
#
# Host name database.
#
# This file contains addresses and aliases for local hosts whose names
# need to be resolvable during system boot; typically this includes only
# the address and FQDN for this machine's hostname.
#
# By default this file is consulted before DNS, so adding additional
# material here that then becomes out of date can lead to confusion.
# See nsswitch.conf(5).
#
::1         mybox.my.domain mybox
000.0.0.0       mybox.my.domain mybox
#
# RFC 1918 specifies that these networks are "internal":
# 10.0.0.0        -   10.255.255.255  (10/8 prefix)
# 172.16.0.0      -   172.31.255.255  (172.16/12 prefix)
# 192.168.0.0     -   192.168.255.255 (192.168/16 prefix)

Note: I posted this on a forum a long time ago, so I've hidden the numbers; 000.0.0.0 is actually something else. Don't touch the default, just fix the domain name and hostname.

Now, you should be able to resize your terminal and reproduce the issue.

fujiapple852 commented 1 year ago

Are you on Matrix?

@0323pin I am now! @fujiapple852:matrix.org

fujiapple852 commented 1 year ago

If your intention is only to debug the window-resize issue on a disposable qemu-vm, which you do not intend to keep, NetBSD has everything you need on the base install.

Yeh, disposable is fine. I'd like to be able to fire up trippy on NetBSD and run a basic test before each release like I used to do on a cloud env.

Xorg is part of base and ctwm the default WM, so if you just run startx from the tty, you will be on a graphical env

Wow, that just...worked :)

From the graphical env I was able to start trippy, resize the window and observe the crash. Nice!

c-git commented 1 year ago

Yeh, disposable is fine. I'd like to be able to fire up trippy on NetBSD and run a basic test before each release like I used to do on a cloud env.

Is there an easy way to add this to the pre-release CI?

c-git commented 1 year ago

I also joined matrix @one.----:matrix.org

0323pin commented 1 year ago

Yeh, disposable is fine. I'd like to be able to fire up trippy on NetBSD and run a basic test before each release like I used to do on a cloud env.

Is there an easy way to add this to the pre-release CI?

I know nothing about CIs but, there's now support for NetBSD in https://cirrus-ci.com/build/6221284932583424

fujiapple852 commented 1 year ago

@0323pin provisional fix available in https://github.com/fujiapple852/trippy/pull/670

It seems to fix it in my qemu-vm environment, would you care to try it?

One caveat here is that I've had to temporarily downgrade clap as the latest version requires Rust 1.70 and the latest version available on NetBSD 9.3 appears to be 1.69 (the latest Rust available is 1.72). Once NetBSD bumps to 1.70 I can merge this fix, which is fine as I don't plan a trippy release for a while anyway. If needed I could backport the fix to 0.8.x and release it, but I think that may be overkill.
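
The constraint above boils down to a version comparison between the packaged toolchain and the minimum Rust version a dependency needs. A minimal sketch, assuming plain `major.minor[.patch]` version strings; both helper names are hypothetical and not part of cargo, clap or trippy:

```rust
/// Hypothetical helper: parse "major.minor[.patch]" into a comparable tuple.
/// Pre-release suffixes such as "-beta" are not handled and yield `None`.
fn parse_version(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.trim().split('.').map(|p| p.parse::<u32>().ok());
    let major = parts.next()??;
    let minor = parts.next()??;
    // A missing patch component is treated as 0.
    let patch = parts.next().unwrap_or(Some(0))?;
    Some((major, minor, patch))
}

/// Hypothetical helper: does the available toolchain satisfy the required version?
fn meets_msrv(available: &str, required: &str) -> Option<bool> {
    Some(parse_version(available)? >= parse_version(required)?)
}

fn main() {
    // Versions mentioned in this thread: 1.69 was packaged, 1.70 was required.
    println!("{:?}", meets_msrv("1.69.0", "1.70"));
    println!("{:?}", meets_msrv("1.71.1", "1.70"));
}
```

Tuples compare lexicographically in Rust, which is what makes the `>=` on `(major, minor, patch)` behave like a version comparison here.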

fujiapple852 commented 1 year ago

I know nothing about CIs but, there's now support for NetBSD in https://cirrus-ci.com/build/6221284932583424

That's awesome!

Now I wonder if it allows raw sockets and ICMP (GH actions do not...)

0323pin commented 1 year ago

Thanks! I'll take it for a spin after the weekend. Sorry, family visit.

Did you see my comment on the Rust version in your branch commit?

fujiapple852 commented 1 year ago

@0323pin yes I did, that's fine. No rush here, we've waited this long (over a year!) it can wait a few more weeks :)

0323pin commented 1 year ago

@fujiapple852 It works 😄

[screenshot: 2023-09-02-173905_1366x768_scrot]

Built with Rust-1.71.1, resize without crash.

fujiapple852 commented 1 year ago

Thanks! The fix has been merged and will go into the 0.9.0 release.