Bogdanp / koyo

A web development toolkit for Racket.
https://koyoweb.org

argon2id-hasher: unhandled error: illegal instruction. Some debugging context lost #35

Closed CastixGitHub closed 3 years ago

CastixGitHub commented 3 years ago

Hello, I just wanted to take a look at this project and I got this error while trying to sign up.

I am using argon2 20190702-3 from the Arch Linux repositories and Racket 8.2 CS. I have no idea how to debug this: which instruction is illegal? But is it using the argon2 shared library from the OS? (Yes, and it uses places: https://docs.racket-lang.org/reference/places.html)

Anyway, I see that argon2 should be one of the available hashers (https://koyoweb.org/password-hashing/index.html). Is there a wishlist of other hashers? Maybe a pure Racket implementation would be the perfect fallback. I see that https://pkgs.racket-lang.org/package/crypto-lib has various KDF implementations, such as argon2id, so if the koyo user sets the same parameters as the current implementation, there will be no need for a rehashing migration.

But does crypto-lib use the shared library from the OS?

These are the default parameters, but they can be configured (where? I don't see them in koyo/config or in proj/config. Oh, OK, dynamic.rkt. Oh wow, #:user config:db-username is beautiful there).

;; from koyo/hasher/argon2id.rkt
                   #:parallelism [parallelism (processor-count)]
                   #:iterations [iterations 256]
                   #:memory [memory 2048]))
;; overridden by the configuration in dynamic.rkt to always use only 2 (threads?)
 [hasher (make-argon2id-hasher-factory
           #:parallelism 2
           #:iterations 256
           #:memory 2048)]

Hey, but wait: koyo/hasher/argon2id-place.rkt requires crypto/argon2! How is it used? Why does it use places then? It uses pwhash and pwhash-verify, but why are they tied to a place channel? Is it a security measure? Spawning a new Racket VM just for that? It's the same process anyway.

I can't profile it right now: it waits indefinitely and no answer is given to the client. Uhm... so (exn-message e) is "illegal instruction. Some debugging context lost". Oh, now I get what "illegal instruction" refers to: it means at the hardware level. Is it a Racket bug!?

OK, what can I do now? Should I use valgrind? I don't see any output after the server starts, and after the place starts it hangs (launched with valgrind raco chief start, I see memcheck and then the server logs). So there are no memory issues, I suppose. What am I trying to fix, then?

Looking at similar issues in the Racket GitHub, only two of them seem related to this (and not even that closely), and they are tagged as 'unexplained'.

Is it debuggable with gdb? What's the gdbdump Racket package? Is there a debug helper for such situations? I'm pretty stuck now.

;; lscpu
Architecture:           x86_64
  CPU op-mode(s):       32-bit, 64-bit
  Address sizes:        36 bits physical, 48 bits virtual
  Byte Order:           Little Endian
CPU(s):                 4
  Model name:           Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz

Thank you for reading

Bogdanp commented 3 years ago

[...] but is it using the argon2 shared library from the os?

Yes, but not in this case. The standard blueprint, which is used by default when you create an app, depends on my libargon2 package by default. That provides a distribution of libargon2 for Windows, macOS and Linux. However, the version for Linux is built on Debian so it may not work on other distributions due to differences in dynamic libraries.

What you can do is remove the dependency on libargon2 from your project and run raco pkg remove libargon2 libargon2-x86_64-linux. Once you do that, crypto-lib should try to use your system libargon2, which should work fine.

[...] how is it used? why it uses places then?

All Racket code within a place runs in a single OS thread, which means all web server handlers run in the same OS thread. Since PBKDFs are designed to be compute & memory intensive and "slow", running them on the main place would block other Racket code from running for their duration, which would make the server temporarily unresponsive for other people while someone tries to log in or sign up.

CastixGitHub commented 3 years ago

OK, raco pkg remove --force libargon2 libargon2-x86_64-linux works as you said! Thank you.

About making a permanent fix... (especially since you won't notice the problem until you actually use the hasher)

$ file ~/.local/share/racket/8.2/pkgs/libargon2-x86_64-linux/libargon2.so
/home/castix/.local/share/racket/8.2/pkgs/libargon2-x86_64-linux/libargon2.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=cfc112848b59ea289e163912f33f6260918cf2dd, stripped
$ file /usr/lib/libargon2.so.1
/usr/lib/libargon2.so.1: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=0614a9b6263447832d30020ba63ca1a713858411, stripped

I tried to force copy the library from the system into the Racket packages with $ cp /usr/lib/libargon2.so.1 ~/.local/share/racket/8.2/pkgs/libargon2-x86_64-linux/libargon2.so but the issue is still there.

I see that racket-libargon2 doesn't add any magic beyond building libargon2 in Docker, deploying the package to the Racket package index, and copying the built .so file when installed with raco.

then... how do I examine the differences?

Am I supposed to add Arch Linux to your libargon2 package? Well, no: the target is x86_64-linux and Debian is just used as the base OS for building it. It should just work... what's the illegal instruction?

Thank you also for the explanations of Racket places and web threading.

Bogdanp commented 3 years ago

I tried to force copy the sys library from the system to racket packages [...]

You'd have to copy it to ~/.local/share/racket/8.2/lib/libargon2.so, not to the package. When the package is installed, the shared library is copied from it into that lib/ folder. That folder is where Racket searches for shared libraries before looking in the system folders.
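The lookup order described above can be checked by hand. As a rough sketch, assuming only the two paths mentioned in this thread, the shell function below prints the first candidate that exists. first_existing_file is a hypothetical helper, not part of Racket or koyo; it only mirrors the order described here and does not ask Racket which file it actually loaded.

```shell
#!/bin/sh
# first_existing_file PATH...
# Hypothetical helper: prints the first argument that exists on disk and
# succeeds; fails (status 1) when none of the candidates exist.
first_existing_file() {
  for candidate in "$@"; do
    if [ -e "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Example call, using the paths from this thread (per-user lib/ folder first,
# then the system library):
#   first_existing_file "$HOME/.local/share/racket/8.2/lib/libargon2.so" \
#                       /usr/lib/libargon2.so.1
```

If the first path is printed, the copy in the per-user lib/ folder is the one that would shadow the system library.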

then... how do I examine the differences?

My guess is you have a newer version of glibc than the library was built with and there was a backwards-incompatible change in the meantime (the most recent version I see in the Arch repos is 2.33, whereas the lib was built with 2.28, and breaking changes have been recorded between those versions). The "illegal instruction" error is probably a red herring masking the real problem (a symbol lookup failure or something along those lines). If you want to debug this, you can write a C program, dynamically link it with the .so from the package and run it in a debugger. It would probably be helpful to build your own .so using the same debian:10 Docker image, but without stripping the output and with debug symbols turned on.

The fix here is probably for me to change the standard blueprint not to include libargon2 by default on Linux and mention in the README that the user should install argon2 from their package manager and that the Racket package is available but will likely only work on Debian derivatives.

CastixGitHub commented 3 years ago

Thank you so much! I finally understood the issue: I am using an Ivy Bridge processor, which doesn't have AVX2 support (https://en.wikipedia.org/wiki/Advanced_Vector_Extensions). To check for it, run lscpu | grep avx2; if you see output, your CPU is OK. I found it through gdb:

#0  0x00007fd881eb8028 in ?? () from /home/castix/.local/share/racket/8.2/lib/libargon2.so
#1  0x00007fd881eb75d5 in ?? () from /home/castix/.local/share/racket/8.2/lib/libargon2.so
#2  0x00007fd881eb7543 in ?? () from /home/castix/.local/share/racket/8.2/lib/libargon2.so
#3  0x00007fd881eb62f7 in argon2_ctx () from /home/castix/.local/share/racket/8.2/lib/libargon2.so
#4  0x00007fd881eb6442 in argon2_hash () from /home/castix/.local/share/racket/8.2/lib/libargon2.so
#5  0x00007fd881eb662f in argon2id_hash_encoded ()
   from /home/castix/.local/share/racket/8.2/lib/libargon2.so
#6  0x0000000050093204 in ?? ()
#7  0x0000000000000010 in ?? ()
#8  0x0000000000000020 in ?? ()
#9  0x0000000044212668 in ?? ()
#10 0x0000000000000063 in ?? ()
#11 0x00005620362be680 in ?? ()
#12 0x00005620353c9c03 in S_call_help ()
#13 0x00005620353c9d52 in Scall0 ()
#14 0x00005620353caf55 in ?? ()
#15 0x00007fd881d39259 in start_thread () from /usr/lib/libpthread.so.0
#16 0x00007fd881c625e3 in clone () from /usr/lib/libc.so.6

(gdb) x/i $pc
=> 0x7fd881eb8028:  vpxor  (%rsp),%ymm5,%ymm0

vpxor with 256-bit ymm operands, as here, is an AVX2 instruction.
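The lscpu | grep avx2 check mentioned earlier can also be wrapped in a small function, for example to guard a script. This is only a sketch, assuming a Linux system where /proc/cpuinfo lists the CPU flags; has_cpu_flag is a made-up name, and the optional second argument exists just so the check can be pointed at a saved copy of the file.

```shell
#!/bin/sh
# has_cpu_flag FLAG [CPUINFO_FILE]
# Hypothetical helper: succeeds (status 0) when FLAG appears as a whole word
# on the first "flags" line of CPUINFO_FILE (default: /proc/cpuinfo).
has_cpu_flag() {
  flag=$1
  cpuinfo=${2:-/proc/cpuinfo}
  # -w matches whole words only, so asking for "avx" does not match "avx2".
  grep -m1 '^flags' "$cpuinfo" | grep -qw -- "$flag"
}

# Example call:
#   if has_cpu_flag avx2; then echo "AVX2 available"; else echo "no AVX2"; fi
```

On an Ivy Bridge machine like the one in this thread, the avx flag is expected to be present while avx2 is not.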

To the future me: I attached gdb to the already running process with gdb -p 140247 (that's a big number... is something flappy on my system? A week of uptime). I also had to run echo 0 > /proc/sys/kernel/yama/ptrace_scope as root to remove the protection that blocked gdb, and I used LD_PRELOAD to be sure which libargon2.so was used by raco.
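For reference, those steps could be scripted roughly as follows. This is a sketch under assumptions, not the exact commands I ran: ptrace_attach_allowed and show_crash_site are made-up names, and the gdb invocation assumes that continuing the attached process and then triggering the sign-up makes gdb stop at the SIGILL, after which the remaining commands should print the backtrace and the faulting instruction.

```shell
#!/bin/sh
# ptrace_attach_allowed [SCOPE_FILE]
# Hypothetical helper: succeeds when the Yama ptrace_scope setting is 0, or
# when the file cannot be read (for example, Yama is not enabled).
ptrace_attach_allowed() {
  scope_file=${1:-/proc/sys/kernel/yama/ptrace_scope}
  if [ ! -r "$scope_file" ]; then
    return 0
  fi
  [ "$(cat "$scope_file")" = "0" ]
}

# show_crash_site PID
# Attaches gdb to PID, lets the process run until a signal stops it, then
# prints the backtrace and the instruction at the program counter.
show_crash_site() {
  pid=$1
  if ! ptrace_attach_allowed; then
    echo "ptrace_scope is not 0; as root, run: echo 0 > /proc/sys/kernel/yama/ptrace_scope" >&2
    return 1
  fi
  gdb -p "$pid" -batch -ex continue -ex bt -ex 'x/i $pc'
}
```

After starting show_crash_site with the server's PID, trigger the sign-up in the browser so the hasher runs and the signal is raised.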

So it's not that the Debian build doesn't work on Arch; it doesn't work on my CPU. I'm happy to take the binary from the OS, and you keep your optimized one? :)

bonus:

"When the package is installed, the shared library is copied from it into that lib/ folder." Uh, this means the .so files take up double the space (36KB in this case). And what's the purpose of libs.rktd there?

bonus2: is it possible to do something to notify the user's browser when the thread that should reply is stuck because of SIGILL or other signals? Could doing this be an introduction to the project?

I see it logs the exception, so the signal is already handled, and the logger call should be in the same thread that replies. It loops in hasher-start, so in the define-system macro there are middlewares (wrong word? components!) where the hasher is obtained from a factory that is implemented as a contract that evaluates to the structure of the component. Well done, thank you. But there is no 500 page in the common pages blueprint, so what about koyo/error, which displays the beautiful traceback? current-production-error-page, that's it. Now, how can I hack on koyo installed through raco if it probably gets compiled... I also cloned the git repo, so there is an option for raco: --link.

Now it seems I can't just use (redirect-to (reverse-uri 'current-production-error-page)) after the unhandled error is logged, because that's not how it works. Normal cases are already handled automagically by wrap-errors in the app stack blueprint, but this one is lost as well as odd. (Maybe for another issue, properly described. This text was struck out because I was definitely on the wrong path, and you already told me so with "all web server handlers run in the same OS thread".)

Edit: You can mark as closed if you wish

Bogdanp commented 3 years ago

I am using an Ivy Bridge processor that doesn't have AVX2 support

I never would have guessed that that was the problem!

is it possible to do something to notify the user's browser when the thread that should reply is stuck by SIGILL and other signals?

I don't think a place can recover from these sorts of errors. IIRC, at the ChezScheme level the SIGILL signal is handled, an error is reported and then it aborts; though since the main place doesn't crash, it must only stop the place thread rather than abort. Maybe the main place could detect that the hashing place died with place-dead-evt and report a better error to the console. For example, in try-start-hasher-place! (argon2id.rkt), after starting the place we could do something like

        (define ch (hasher-start))
        (thread
         (lambda ()
           (sync (place-dead-evt ch))
           (when (argon2id-hasher-ch h)
             (log-error "hasher place died"))))
...

Let me know if you want to give that a try. If not, I'll do it when I have some more free time.

Bogdanp commented 3 years ago

Would you mind giving 9a3ec7c88b6782e107d87633fa26f45b47bc5939 a try to see if it reports an error when the SIGILL is triggered?

CastixGitHub commented 3 years ago

[2021-08-25 09:47:28.281] [   25887] [   info] server: listening on 127.0.0.1:8000
[2021-08-25 09:47:28.283] [   25877] [   info] runner: application process started with pid 25887
illegal instruction.  Some debugging context lost
  context...:
   /usr/share/racket/collects/racket/contract/private/arrow-val-first.rkt:489:18
   .../racket/place.rkt:41:31: main

I don't see any big difference. The exception is trapped anyway in argon2id-place.rkt, which keeps looping for new events; the place doesn't get killed. What about having the libargon2 Racket package check the CPU flags and raise an installation error if AVX2 is not available? Or just turning off that compiler flag in the Docker builder so nobody experiences this? Thank you again.

Bogdanp commented 3 years ago

Thanks for trying it out! I didn't realize the place wasn't being killed. That makes sense now. I'll see what I can do about the libargon2 package next week.