fish-shell / fish-shell

The user-friendly command line shell.
https://fishshell.com
Other
26.21k stars 1.92k forks source link

test failure on armel architecture #9584

Closed cdluminate closed 1 year ago

cdluminate commented 1 year ago

fish version: 3.6.0
architecture: armel
builddlog: https://buildd.debian.org/status/fetch.php?pkg=fish&arch=armel&ver=3.6.0-1&stamp=1675800903&raw=0

The following tests FAILED:
      2 - dir_iter (Failed)
      9 - highlighting (Failed)
    170 - redirect.fish (Failed)
    212 - bind.py (Failed)

However, the build for 3.5.1 is always fine. I went through the diff between 3.5.1 and 3.6.0 -- there does not seem to be any arm specific code. So I'm confused.

faho commented 1 year ago

First of all: Try rerunning these.

212 - bind.py (Failed)

That's gonna be a fluke - the expect tests have always been extremely flaky on the typical build machines because they expect a high degree of interactivity (i.e. low latency).

170 - redirect.fish (Failed)

This one crashes somewhere between line 28 and 40

2 - dir_iter (Failed)

This one creates a few files and then checks them for the expected properties. E.g. it creates a directory and then checks if it's a directory, to exercise the "dir_iter" type.

This one is reasonably new (I think it's new in 3.6.0), but the checks aren't anything to write home about:

https://github.com/fish-shell/fish-shell/blob/af833a700d35e2d2f735e7fffca6bc9415aa01fb/src/wutil.cpp#L62-L114

We do stat(2), and then we extract the type information from the st_mode field, or we use readdir(2) and do it with the d_type field, if that exists.

I don't think there's any type-fuckery here, on my system the manpage says d_type is an unsigned char, we use uint8_t, that should line up (char being 8 bits because posix).

9 - highlighting (Failed)

This crashes in some way tokenizing, impossible to tell how from the information it gives.


Yeah, this is one of those where I'm gonna ask you if it's possible your test machine has gremlins, or if you could try with a newer compiler - this one still complains about gcc 7 changes.

We do have arm CI machines ("armv7-32bit" and "arm64", and of course macOS), and I regularly run the tests on a raspberry pi, so this would have to be a weird special case down to older or different arm hardware.

cdluminate commented 1 year ago

FYI: all the historical buildlogs can be found here: https://buildd.debian.org/status/package.php?p=fish Just click the all (6) for the armel architecture and you will find the 6 buildlogs (in Maybe-Failed).

cdluminate commented 1 year ago

armel is a very old arm baseline that does not even have hardware float point support (yes ... even older than armv7-32bit which is armhf in debian). The compiler is GCC-12 even though it complains about gcc7 changes.

I realized that all the failed tests seem to be dependent on IO operations, this can be flaky on very old hardware. May I assume it is safe to ignore these several failures armel for now, as we are close to the freeze of the next debian stable?

faho commented 1 year ago

What the fudge, it crashes in bind.py?

[N] prompt 12>echo MORE-TEXT-IS-NICEterminate called after throwing an instance of 'std::out_of_range' what(): basic_string::at: __n (which is 10297543) >= this->size() (which is 22)

Yeah, that could be related to the highlighting crash.

May I assume it is safe to ignore these several failures armel for now, as we are close to the freeze of the next debian stable?

I can't tell you it's "safe", as I have no idea if fish runs here or not.

I can tell you that I don't particularly care about this hardware and I know it works on newer. I also don't believe we have the capability to reproduce this.

cdluminate commented 1 year ago

Thank you for the timely answer. We can ship with fish 3.5.1 for the next debian stable. The release team won't be happy if I want to discard a release architecture, i.e., armel shortly before the freeze in order to let fish 3.6.0 migrate. But when the freeze period is over, I think it makes full sense to drop this old architecture since nobody so far is interested in it.

faho commented 1 year ago

You mean you want to ship fish 3.5.1 for armel or for all architectures?

cdluminate commented 1 year ago

I meant 3.5.1 for all architectures (due to armel being the sole blocker). Debian's migration policy requires that a version in unstable (fish 3.6.0) must compile on all release architectures (including amd64, arm64, armhf, armel, etc) before it can migrate into debian testing (which is going to be the next stable).

faho commented 1 year ago

Okay, I'm sorry, I'm not gonna sugarcoat it: That policy sucks.

This is hurting all of your users to placate some theoretical users on armel. I have never heard of this being an actual practical bug for anyone.

Fish 3.6.0 is a big and important release with 780 commits - ~4.7% of all commits ever made to fish.

I would like to ask you to reconsider. Dropping fish for armel seems vastly preferable than shipping a version that's already out of date.

cdluminate commented 1 year ago

I understand what you meant. Let me try to find some workaround or discuss it with the release team.

mqudsi commented 1 year ago

I've been tracking the highlight test failure for a few months since it was first introduced in our OBS builds. It was shortly after I fixed a different armv6l regression caused by a compiler bug and I didn't feel like digging out qemu all over again and waiting for hours for it to compile and test at the time. For what it was worth, the test failures were introduced very late in the 3.6.0 dev cycle.

workingjubilee commented 1 year ago

Does this repro when the armel fish is built with clang++?

cdluminate commented 1 year ago

FYI, This failure can be reproduced with qemu. And, after switching the compiler to clang++14, the original issues are gone, but along with another set of failures:

The following tests FAILED:
    120 - fds.fish (Failed)
    177 - sigint.fish (Failed)
    242 - torn_escapes.py (Failed)
Errors while running CTest

shrug.

mqudsi commented 1 year ago

I wanted to try and fix these for 3.6.1 but it turns out they're not just just compiler quirks. I'm ignoring any issues with fish 3.6.0 (because that ship has sailed) and focusing on fish 3.6.1. The following tests fail w/ gcc 8.3.0 on armel under qemu:

As far as I can tell, these are all real failures and fail from when they were introduced - I'm testing under qemu though.

zanchey commented 1 year ago

I have tried to acquire some armel hardware but the lack of a Nokia charger has stymied my efforts.

misuzu commented 1 year ago

Some tests also fail on armv7l-linux for some reason: https://hydra.armv7l.xyz/build/5228/nixlog/1

98% tests passed, 5 tests failed out of 245

Total Test time (real) =  67.69 sec

The following tests FAILED:
        211 - abbrs.py (Failed)
        215 - commandline.py (Failed)
        217 - complete.py (Failed)
        234 - read.py (Failed)
        245 - wildcard_tab.py (Failed)
Errors while running CTest
mqudsi commented 1 year ago

All tests passed for me on emulated vexpress armv7l as of a couple weeks ago with both clang and gcc on the Integration_3.6.1 branch. What branch were you testing and with which compiler?

misuzu commented 1 year ago

fish 3.6.0 & gcc 12.2.0, log with relevant info is here: https://hydra.armv7l.xyz/build/5228/nixlog/1

mqudsi commented 1 year ago

All your failures are because the en_US.UTF-8 locale is not installed. That's a requirement for the test suite.

misuzu commented 1 year ago

It fails in the same way even with locales: https://paste.debian.net/1274523/

mqudsi commented 1 year ago

@zanchey I was really confused how the tests were failing even when rolling back to old versions even though I remember OBS being clear at some times. I turns out that at least two test failures arise from chrooting via qemu instead of fully emulating the hardware (as the same tests fail when chrooting into a armv7l qemu but pass when fully emulating armv7 via vexpress chipset with qemu):

120 - fds.fish (Failed)
177 - sigint.fish (Failed)

That only leaves noshebang.fish

misuzu commented 1 year ago

It fails in the same way even with locales: https://paste.debian.net/1274523/

Tested with gcc 10.4.0 - same issue. With clang 11.1.0 all tests passed though.

mqudsi commented 1 year ago

@misuzu can you build the 3.6.1 branch? Because that builds and passes all tests for me with GCC 8.3.0 under Linux 5.10.0/deb10.16-armmp-lpae.

faho commented 1 year ago

@misuzu: Could you run these tests under gdb?

It might be tricky to get the pexpect to run so it captures fish, but given that these things crash on a lot of commandline interactions it should be easy to reproduce manually - just run an interactive fish with gdb attached, press a few buttons and it'll probably crash in a couple of prompts.

E.g. here's what complete.py does:

> complete -x -c monster -a truck
> complete -x -c monster -a energy=
> monster t # and now press TAB and then "!"

and I am reasonably sure that third part isn't even fully reached. That's a "within minutes of use" crash.

That only leaves noshebang.fish

@mqudsi I would have absolutely zero qualms to disable posix_spawn up to glibc 2.30 or something. It's a performance enhancement.

mqudsi commented 1 year ago

@faho yeah, I thought about it. But I don't think it's a broken glibc issue, 2.30 isn't that old. Ubuntu 18.04 LTS is on 2.31 and there are no issues there. I don't mind disabling it for armel, though. But let me see if that actually fixes it first.

misuzu commented 1 year ago

@misuzu: Could you run these tests under gdb?

% /nix/store/51bxnr1smv4alz9pc5abwz060vp6xmkk-gdb-13.1/bin/gdb /nix/store/n5dfwr5n0cg47v6r2zcqj783pz0ssbss-fish-3.6.0/bin/fish
GNU gdb (GDB) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "armv7l-unknown-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /nix/store/n5dfwr5n0cg47v6r2zcqj783pz0ssbss-fish-3.6.0/bin/fish...
(No debugging symbols found in /nix/store/n5dfwr5n0cg47v6r2zcqj783pz0ssbss-fish-3.6.0/bin/fish)
(gdb) set print thread-events off
(gdb) run
Starting program: /nix/store/n5dfwr5n0cg47v6r2zcqj783pz0ssbss-fish-3.6.0/bin/fish
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/xqqcl5rrcb327lql7c2s7b5i6mwidy7v-glibc-2.35-224/lib/libthread_db.so.1".
[Detaching after vfork from child process 1388912]
Welcome to fish, the friendly interactive shell
Type help for instructions on how to use fish
[Detaching after vfork from child process 1388914]
[Detaching after vfork from child process 1388915]
[Detaching after vfork from child process 1388916]
[Detaching after vfork from child process 1388917]
[Detaching after vfork from child process 1388918]
misuzu@mia ~/nixpkgs ((bf209075))> complete -x -c monster -a truck
[Detaching after vfork from child process 1389041]
[Detaching after vfork from child process 1389043]
[Detaching after vfork from child process 1389044]
[Detaching after vfork from child process 1389045]
misuzu@mia ~/nixpkgs ((bf209075))> complete -x -c monster -a energy=
[Detaching after vfork from child process 1389097]
[Detaching after vfork from child process 1389099]
[Detaching after vfork from child process 1389100]
[Detaching after vfork from child process 1389101]
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 4294888284) > this->size() (which is 14)

Thread 1 "fish" received signal SIGABRT, Aborted.
0xf7ba13e4 in __pthread_kill_implementation () from /nix/store/xqqcl5rrcb327lql7c2s7b5i6mwidy7v-glibc-2.35-224/lib/libc.so.6
(gdb)

(fish 3.6.0 & gcc 12.2.0)

What else should I do?

misuzu commented 1 year ago

@misuzu can you build the 3.6.1 branch? Because that builds and passes all tests for me with GCC 8.3.0 under Linux 5.10.0/deb10.16-armmp-lpae.

3.6.1 fails with gcc 12.2.0. I'll test with gcc 8.5.0 when it's done compiling.

mqudsi commented 1 year ago

in gdb, after the crash execute bt

misuzu commented 1 year ago

in gdb, after the crash execute bt

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 4294888284) > this->size() (which is 14)

Thread 1 "fish" received signal SIGABRT, Aborted.
0xf7ba13e4 in __pthread_kill_implementation () from /nix/store/xqqcl5rrcb327lql7c2s7b5i6mwidy7v-glibc-2.35-224/lib/libc.so.6
(gdb) bt
#0  0xf7ba13e4 in __pthread_kill_implementation () from /nix/store/xqqcl5rrcb327lql7c2s7b5i6mwidy7v-glibc-2.35-224/lib/libc.so.6
#1  0xf7b5a748 in raise () from /nix/store/xqqcl5rrcb327lql7c2s7b5i6mwidy7v-glibc-2.35-224/lib/libc.so.6
#2  0xf7b44694 in abort () from /nix/store/xqqcl5rrcb327lql7c2s7b5i6mwidy7v-glibc-2.35-224/lib/libc.so.6
#3  0xf7db2414 in __gnu_cxx::__verbose_terminate_handler() () from /nix/store/wmmd30793ziiazzmqz9n2rz0r8xi3c7q-gcc-12.2.0-lib/lib/libstdc++.so.6
#4  0xf7daff58 in __cxxabiv1::__terminate(void (*)()) () from /nix/store/wmmd30793ziiazzmqz9n2rz0r8xi3c7q-gcc-12.2.0-lib/lib/libstdc++.so.6
#5  0xf7daffe4 in std::terminate() () from /nix/store/wmmd30793ziiazzmqz9n2rz0r8xi3c7q-gcc-12.2.0-lib/lib/libstdc++.so.6
#6  0xf7db0338 in __cxa_throw () from /nix/store/wmmd30793ziiazzmqz9n2rz0r8xi3c7q-gcc-12.2.0-lib/lib/libstdc++.so.6
#7  0xf7dda878 in std::__throw_out_of_range_fmt(char const*, ...) () from /nix/store/wmmd30793ziiazzmqz9n2rz0r8xi3c7q-gcc-12.2.0-lib/lib/libstdc++.so.6
#8  0x000b6b08 in std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >::_M_check (__s=0x189f84 "basic_string::substr", __pos=<optimized out>, this=0x22dbd4)
    at /nix/store/qy42m80lr7q3was7ss0xlwrfrx8q2585-gcc-12.2.0/include/c++/12.2.0/bits/basic_string.h:382
#9  std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >::substr (__n=<optimized out>, __pos=<optimized out>, this=0x22dbd4)
    at /nix/store/qy42m80lr7q3was7ss0xlwrfrx8q2585-gcc-12.2.0/include/c++/12.2.0/bits/basic_string.h:3124
#10 editable_line_t::push_edit (this=this@entry=0x22dbd4, edit=..., allow_coalesce=<optimized out>) at /build/fish-3.6.0/src/reader.cpp:295
#11 0x000b8b5c in reader_data_t::insert_string (this=0x22db70, el=0x22dbd4, str=...) at /build/fish-3.6.0/src/reader.cpp:487
#12 0x000c2404 in reader_data_t::insert_char (c=<optimized out>, el=0x22dbd4, this=0x22db70) at /build/fish-3.6.0/src/reader.cpp:845
#13 reader_data_t::readline[abi:cxx11](int) (this=this@entry=0x22db70, nchars_or_0=nchars_or_0@entry=0) at /build/fish-3.6.0/src/reader.cpp:4508
#14 0x000c2ae8 in read_i (parser=...) at /build/fish-3.6.0/src/reader.cpp:3201
#15 reader_read (parser=..., fd=fd@entry=0, io=...) at /build/fish-3.6.0/src/reader.cpp:4749
#16 0x00019568 in main (argc=<optimized out>, argv=0xfffed338) at /build/fish-3.6.0/src/fish.cpp:576
(gdb)
misuzu commented 1 year ago

@misuzu can you build the 3.6.1 branch? Because that builds and passes all tests for me with GCC 8.3.0 under Linux 5.10.0/deb10.16-armmp-lpae.

3.6.1 fails with gcc 12.2.0. I'll test with gcc 8.5.0 when it's done compiling.

With gcc 8.5.0 everything is fine on 3.6.0 and a1f79b3accbc19a985454ae2ee4b303a3b8b8254

faho commented 1 year ago

cc @krobelus, who's done the edit/undo stuff.

This points to (adjusted by one line to match current master):

https://github.com/fish-shell/fish-shell/blob/14d6b1c3dea59b8a8c3836f4d270f1d5c33c57db/src/reader.cpp#L296

as the source of this particular crash:

terminate called after throwing an instance of 'std::out_of_range' what(): basic_string::substr: __pos (which is 4294888284) > this->size() (which is 14)

If I'm reading it right, edit.offset is unset (it's near 32-bit-max).

Could we harden this by inserting a length/offset check and just refuse to do anything? How would this even happen?

mqudsi commented 1 year ago

So after I patched allow_posix_spawn() to return false, noshebang.fish passed. Except after that point, I couldn't get the test to fail even after reverting my patch. My best guess is that I ran into a qemu oddity, though I don't know if that means I should apply the armv6 no posix spawn patch or not.

mqudsi commented 1 year ago

How would this even happen?

My guess is a compiler bug (but why don't I get this w/ gcc 8.3.0?) caused by passing edit_t by value. It would be interesting to see if it disappears if you change that to pass by ptr or reference.

faho commented 1 year ago

Okay I'm going to close this one for lack of activitity and interest (and because it'll be completely irrelevant once the C++ is replaced)