Olical / conjure

Interactive evaluation for Neovim (Clojure, Fennel, Janet, Racket, Hy, MIT Scheme, Guile, Python and more!)
https://conjure.oli.me.uk
The Unlicense

v4.7.0 crashes neovim on ARM64 #124

Closed meinside closed 3 years ago

meinside commented 3 years ago

Hi, after updating to v4.7.0, neovim crashes nearly 9 times out of 10 with the following message:

nvim: lj_record.c:119: rec_check_slots: Assertion `(((((tr)) & (IRT_TYPE<<24)) == ((IRT_FUNC)<<24)))' failed.
[1]    6019 abort      nvim FILENAME

After the crash, I can see a node process left running:

6021 ?        Ssl    0:00 /opt/node/bin/node --no-warnings /home/username/.local/share/nvim/plugged/coc.nvim/build/index.js

but I'm not sure if it is related to coc.nvim or not.

File types don't matter.

It works okay on older versions (like v4.6.0 or v4.5.0), but crashes like this on v4.7.0.

Is there any way I can check more about this issue?

Thanks anyway for your great work :-)

Olical commented 3 years ago

Oh wow, I'm sorry you're experiencing this, I've never seen anything like it! It's probably worth trying out the develop branch, just in case. There were some issues with Fennel + LuaJIT recently that got patched out, I'm not 100% sure if any of those issues are present in the version of Fennel bundled with Conjure 4.7.0.

I've never tried running Conjure on ARM, not that I do anything low enough level for that to be significant, but still.

Is it crashing on launch? Or after a while of editing? And do you have to be in a Clojure file (or any file that triggers Conjure) for it to happen?

meinside commented 3 years ago

It crashes on launch, with any type of file (even when no file is given).

Interestingly, I have two Raspberry Pis, and the crash happens only on the 64-bit Raspberry Pi OS; the 32-bit one has no such problem.

I'll try the develop branch and let you know :-)

Olical commented 3 years ago

Another thing you can try: cd into the Conjure repo and run this with LuaJIT installed:

for i in {1..100}; do lua -e "require 'lua.conjure.aniseed.deps.fennel'"; done

If that's failing sometimes (which I had a while back in LuaJIT but with another error) then it'll prove it's something to do with Fennel at least.
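
Because the loop above prints nothing when every run succeeds, a clean result can look the same as a loop that never ran. A minimal sketch of a variant that reports a failure count might look like this. `count_failures` is a hypothetical helper name (not part of Conjure), and the usage comment assumes the same module path as the command above and a `luajit` binary on the PATH.

```shell
# Hypothetical helper: run a command repeatedly and print how many runs
# exited with a non-zero status (a crash via abort() also counts).
# Usage: count_failures <runs> <command> [args...]
count_failures() (
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Output is discarded; only the exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
)

# Example, run from the root of the Conjure checkout:
#   count_failures 100 luajit -e "require 'lua.conjure.aniseed.deps.fennel'"
```

A printed count of 0 corresponds to the "no output" case of the original loop; anything above 0 would point at the Fennel module load itself.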

meinside commented 3 years ago

I tried running your script several times, but I couldn't see any output. (So no problem with it?)

I also tested with the develop branch, but no improvement so far.

Olical commented 3 years ago

Yep, if there's no output it's all good, which kinda rules out LuaJIT + Fennel issues. I was hoping that'd be the issue :sweat_smile:

I've got a bunch of work to do today, but I'll compare the diff between versions over the weekend / when I get some time for OSS, I have no idea what could cause this, but I'm sure we can find it.

Maybe it's time for a git bisect!

meinside commented 3 years ago

No need to hurry at all! I will stick with previous versions.

Please let me know anytime when you need a tester :-)

meinside commented 3 years ago

Bisected on my machine, 04c45bf16f6cef4e574aa1278f34d05949af610e seems to be the first bad commit.
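
For anyone repeating the bisect, here is a rough sketch of how it could be automated for an intermittent crash. It rests on several assumptions: the release tags are named `v4.7.0` and `v4.6.0`, the crash reproduces with a headless launch, and the checkout being bisected is the one Neovim actually loads. `survives` and the file name `bisect-helper.sh` are hypothetical. The helper exists because `git bisect run` treats exit codes 1 to 127 (except 125) as "bad" and aborts on anything else, while a process killed by `abort()` usually shows up in the shell as status 134.

```shell
# Hypothetical helper: exit 0 if the command succeeds on every run,
# exit 1 as soon as one run fails (including a crash with status > 127).
# Usage: survives <runs> <command> [args...]
survives() (
  runs=$1
  shift
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1 || exit 1
    i=$((i + 1))
  done
  exit 0
)

# Assuming this file is saved as "$HOME/bisect-helper.sh", then inside the
# Conjure checkout (bad revision first, good revision second):
#   git bisect start v4.7.0 v4.6.0
#   git bisect run sh -c '. "$HOME/bisect-helper.sh"; survives 20 nvim --headless +qa'
#   git bisect reset
```

Repeating the launch matters here because the crash was reported as happening roughly 9 times out of 10, so a single clean start would not be enough to call a commit good.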

Olical commented 3 years ago

Oh that's really interesting! I feel like it'll be the Fennel compiler somehow but I'm not sure yet, maybe I can run an ARM machine in AWS or in a container to try and reproduce it :thinking: which LuaJIT version did you try my repeated requires with?

Because ideally we'd be using the same LuaJIT as the one embedded in Neovim.

$ lua -v
LuaJIT 2.1.0-beta3 -- Copyright (C) 2005-2020 Mike Pall. http://luajit.org/

$ nvim -v
NVIM v0.4.4
Build type: Release
LuaJIT 2.1.0-beta3
Compilation: 
Compiled by nixbld

Features: +acl +iconv +tui
See ":help feature-compile"

   system vimrc file: "$VIM/sysinit.vim"
  fall-back for $VIM: "
/nix/store/iq47sm00ykdbpfjm38bwv6xw84glmf7d-neovim-unwrapped-0.4.4/share/nvim
"

Run :checkhealth for more info

meinside commented 3 years ago

Sorry I'm late.

Here go the versions:

$ lua -v
Lua 5.1.5  Copyright (C) 1994-2012 Lua.org, PUC-Rio

$ luajit -v
LuaJIT 2.1.0-beta3 -- Copyright (C) 2005-2017 Mike Pall. http://luajit.org/

$ nvim -v
NVIM v0.5.0-784-gc6ccdda26
Build type: RelWithDebInfo
LuaJIT 2.1.0-beta3
Compilation: /usr/bin/cc -O2 -g -Og -g -Wall -Wextra -pedantic -Wno-unused-parameter -Wstrict-prototypes -std=gnu99 -Wshadow -Wconversion -Wmissing-prototypes -Wimplicit-fallthrough -Wvla -fstack-protector-strong -fno-common -fdiagnostics-color=always -DINCLUDE_GENERATED_DECLARATIONS -D_GNU_SOURCE -DNVIM_MSGPACK_HAS_FLOAT32 -DNVIM_UNIBI_HAS_VAR_FROM -DMIN_LOG_LEVEL=3 -I/tmp/nvim/build/config -I/tmp/nvim/src -I/tmp/nvim/.deps/usr/include -I/usr/include -I/tmp/nvim/build/src/nvim/auto -I/tmp/nvim/build/include
Compiled by meinside@mymachine

Features: +acl +iconv +tui
See ":help feature-compile"

   system vimrc file: "$VIM/sysinit.vim"
  fall-back for $VIM: "/usr/local/share/nvim"

Run :checkhealth for more info

$

There's also no output with for i in {1..100}; do luajit -e "require 'lua.conjure.aniseed.deps.fennel'"; done.
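
The listings above show that a plain `lua` can be PUC-Rio Lua rather than LuaJIT, so it may help to check which interpreter a test actually ran on. The following is only a sketch: `luajit_version` is a hypothetical helper that pulls the `LuaJIT <version>` token out of `-v` style output. Matching version strings do not prove the builds are identical (both listings in this thread say 2.1.0-beta3 but carry different copyright years).

```shell
# Hypothetical helper: read "-v" style output on stdin and print the first
# "LuaJIT <version>" token found, or nothing if the input never mentions LuaJIT.
luajit_version() {
  sed -n 's/.*\(LuaJIT [^ ]*\).*/\1/p' | head -n 1
}

# Example comparison (assumes nvim, luajit and lua are on the PATH):
#   nvim -v   | luajit_version
#   luajit -v | luajit_version
#   lua -v 2>&1 | luajit_version   # empty output means this is not LuaJIT
```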

Olical commented 3 years ago

I still have nothing to go on here :sob: I was googling around for parts of this error and found an email thread from 2010 saying that it was fixed on LuaJIT's master branch. They were using GDB to try and find the point where LuaJIT's internal state got broken, and it was something to do with vararg functions.

I don't have high hopes of solving this right now since it feels like a LuaJIT bug :disappointed:

Olical commented 3 years ago

If I can bisect my way down to a minimal reproduction to work out which file is doing this then we might stand a chance :thinking:

I need to compare 4.5.0 and 4.7.0 too; I forgot that it used to work for you. That's an important fact.

Olical commented 3 years ago

This is a long shot, but does v4.9.0 work okay? I'm kinda guessing not, but it'd be nice to see if the recent changes to Fennel magically avoid this bug.

At the moment I'm still at a loss, I'm not doing anything too clever here, LuaJIT just has a bug somewhere and I don't know which chunk of Lua is causing it to fail :disappointed: it may also be to do with Neovim's LuaJIT interop code, as the call flows in and out of C -> LuaJIT -> C -> LuaJIT etc.

meinside commented 3 years ago

bcdaf37 doesn't make much difference :'( Also tested with the nightly version of nvim, but no gain.

I don't think this is a major issue, so please feel free to set it aside for a while :-) I'll let you know through this issue whenever I see any change or progress.

harryvederci commented 3 years ago

@meinside did you manage to find a workaround? I just upgraded from v4.3.1 to v4.10.0, and I'm having the same issue on my Chromebook.

v4.6.0 is still an upgrade for me, so I'll work with that version for now.

Olical commented 3 years ago

As far as I know it's a LuaJIT bug that's extremely subtle. I think it's some code in the Fennel compiler that's doing this but I can't be sure, it's just the largest file so I guess it's the most likely spot.

The only time I saw something like this before, the fix was to have a function inside the Fennel compiler written in a slightly different way. There was no real fix; it was just shuffling the code around until LuaJIT was happy :cry:

So I think it'll be the same here, but I don't know where, so I have a LOT of Lua where one part of it may be tripping up LuaJIT, but in a non-obvious way.

I feel like the only way to tackle this is with some very clever valgrind usage, which scares me :sweat_smile:

Was it crashing as you opened Neovim? As in just as you executed nvim without opening any files?

harryvederci commented 3 years ago

Thanks for looking into this!

Installed v4.10.0 again to reproduce the issue. Some findings:

meinside commented 3 years ago

It's both good and sad to see another one with the same problem :-)

Has anyone tested it with the newly released M1 MacBooks? I don't have one, but if ARM64 is the problematic environment, it could also happen there.

harryvederci commented 3 years ago

I just installed Conjure v4.14.1 and it seems like everything is working fine again! (Both on a Chromebook and my Raspberry Pi 4.)

@meinside can you confirm?

PS: I didn't try any versions after v4.10.0, so it could be that this was already fixed before.

Olical commented 3 years ago

I'm not sure if that's a good or bad thing since I haven't done anything specific :grimacing: I can only imagine the Lua that's being generated / run now doesn't hit the same LuaJIT bug. For now, YAY. I just hope it doesn't float back in... maybe some CI for this could be good :thinking:

meinside commented 3 years ago

@harryvederci Thanks for pinging me, but it's still the same for me. Which version of nvim are you using?

meinside commented 3 years ago

Yeah! It's working okay with the nightly version of nvim!

Olical commented 3 years ago

So this is as I suspected I guess, some combinations of Lua code hit a bug in LuaJIT. All I can do (#138) is introduce ARM CI to catch this early in the future, then maybe I can work out what commits cause it. I doubt the root cause is fixable easily, but we can at least know when it's slunk back into the develop branch via CI.

Going to close this since it's ephemeral and just add tests that will hopefully catch it for the future! Thank you for the diligent reporting, I hope it doesn't cause you issues in the future and you can happily Conjure without interruption :smile: