facebookexperimental / hermit

Hermit launches Linux x86_64 programs in a special, hermetically isolated sandbox to control their execution. Hermit translates normal, nondeterministic behavior into deterministic, repeatable behavior. This can be used for various applications, including replay debugging, reproducible artifacts, chaos-mode concurrency testing, and bug analysis.

Support go hello world at least #29

Closed MarcoPolo closed 1 year ago

MarcoPolo commented 1 year ago

Describe the bug: When running a simple "Hello World" Go program, the program crashes before doing anything.


To Reproduce: Build a hello world Go program (example: https://go.dev/play/) and run it with hermit.
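For concreteness, a minimal program along these lines (roughly the go.dev/play default) reproduces it:

package main

import "fmt"

func main() {
        fmt.Println("Hello, World!")
}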

Expected behavior: It should run.

Environment

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
stepping : 1
microcode : 0xb000040
cpu MHz : 2300.092
cache size : 46080 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_stale_data
bogomips : 4600.12
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Additional context: Trace log generated by hermit --log=trace --log-file=FOO run: https://gist.github.com/MarcoPolo/b58ddebb1655830a770fd29a651dec06

Also happens on NixOS.

jasonwhite commented 1 year ago

Thanks for submitting this issue! The bug in the issue title should be fixed as of b6147fc9bb4feb68cacf80731ee8d7a25e46a515, but I don't actually see this bug as the culprit in your log output. I am able to reproduce the Go hello-world failure, but I get a different error:

fatal error: sigaction failed

We don't have much Go code internally, so we neglected to test this. The Go runtime seems to be calling sigaction early on during startup, and we're not handling it properly.

We're looking into it :)

MarcoPolo commented 1 year ago

Ah, a red herring! Thanks! Looking forward to playing with hermit.

rrnewton commented 1 year ago

Here's an strace -Cf of the Go hello world:

strace: Process 2026443 attached
strace: Process 2026444 attached
strace: Process 2026445 attached
strace: Process 2026446 attached
strace: Process 2026447 attached
Hello world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 27.90    0.001892         157        12         1 futex
 24.45    0.001658          97        17           nanosleep
  9.17    0.000622           5       114           rt_sigaction
  7.37    0.000500          15        33           rt_sigprocmask
  6.47    0.000439          12        36           mmap
  6.31    0.000428          85         5           clone3
  5.03    0.000341          34        10           munmap
  3.52    0.000239          13        18           mprotect
  2.99    0.000203         203         1           execve
  1.27    0.000086           7        12           sigaltstack
  0.97    0.000066          11         6           set_robust_list
  0.93    0.000063           5        12        10 openat
  0.58    0.000039           6         6           gettid
  0.56    0.000038          38         1           tgkill
  0.55    0.000037          37         1           rt_sigreturn
  0.37    0.000025           8         3           brk
  0.29    0.000020           6         3           fcntl
  0.29    0.000020           2        10         9 newfstatat
  0.16    0.000011           5         2           read
  0.16    0.000011           5         2           close
  0.15    0.000010          10         1           getrandom
  0.13    0.000009           9         1         1 access
  0.12    0.000008           8         1           sched_getaffinity
  0.07    0.000005           5         1           getpid
  0.07    0.000005           5         1           prlimit64
  0.04    0.000003           3         1           arch_prctl
  0.04    0.000003           3         1           set_tid_address
  0.00    0.000000           0         1           write
  0.00    0.000000           0         4           pread64
------ ----------- ----------- --------- --------- ------------------
100.00    0.006781          21       316        21 total

It doesn't seem all that bad, and we are already (attempting to) handle rt_sigaction, but the Go runtime must be tickling it in a way we don't properly support. In particular, it makes a LOT of rt_sigaction calls, seemingly wanting to cover its bases on all potential signals.

Some of these are quick fixes; we'd need to knock down the first couple of issues to get a better idea. At some point, I'd like to post a guide on "how to debug/fix compatibility issues" in case anyone else wants to give it a try before we get to it.

rrnewton commented 1 year ago

In particular, about 30 rt_sigaction calls in, we are returning an EINVAL, which the Go runtime doesn't like:

$ hermit --log=info run  hello_world_go_main
...
2022-11-30T14:25:21.255027Z  INFO detcore: DETLOG [syscall][detcore, dtid 3] finish syscall #109: rt_sigaction(16, NULL, 0x7fffffffd110, 8) = Err(Errno(EINVAL))
2022-11-30T14:25:21.255268Z  INFO detcore: DETLOG [syscall][detcore, dtid 3] inbound syscall: rt_sigaction(16, NULL, 0x7fffffffd408, 8) = ?
2022-11-30T14:25:21.255295Z  INFO detcore: DETLOG [syscall][detcore, dtid 3] finish syscall #110: rt_sigaction(16, NULL, 0x7fffffffd408, 8) = Err(Errno(EINVAL))

Those same calls under strace:

rt_sigaction(SIGSTKFLT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGSTKFLT, {sa_handler=0x2ed0e0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7f100ca445a0}, NULL, 8) = 0

This is "Stack fault on coprocessor (unused)". We took the liberty of using this (unused) signal for our own internal purposes. It seems that Go is paranoid and wants to register signal handlers for every possible signal, even unused ones.

I think it's pretty safe for us to just turn this into a "no-op" and then simply never deliver any signals of that kind to the guest. An easy fix.
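To make the situation concrete (a sketch only, not part of the fix): ordinary Go code can also ask for SIGSTKFLT explicitly. Natively the self-sent signal below arrives; under hermit, with the no-op behavior described above, the registration would presumably succeed but the signal would never be delivered:

package main

import (
        "fmt"
        "os"
        "os/signal"
        "syscall"
        "time"
)

func main() {
        ch := make(chan os.Signal, 1)
        // Ask the runtime to relay SIGSTKFLT (signal 16), the signal hermit
        // reserves for its own internal use.
        signal.Notify(ch, syscall.SIGSTKFLT)

        // Send it to ourselves. Natively this is delivered almost immediately.
        syscall.Kill(os.Getpid(), syscall.SIGSTKFLT)

        select {
        case s := <-ch:
                fmt.Println("received", s)
        case <-time.After(time.Second):
                fmt.Println("no SIGSTKFLT delivered")
        }
}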

rrnewton commented 1 year ago

OK, actually that was a one-line fix. When D41615308 (d8f506394736b3b200785874be672d7fae8861fa) lands and gets synced, hello world will work, at least:

$ hermit run  hello_world_go_main
Hello world!

There may be other lurking issues with Go. Other languages we've worked with (Rust, C++) use libc, whereas Go manages its own interactions with the kernel via raw syscalls, so it may do some things differently than libc does.
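To illustrate that point (a sketch, not from this thread): even a plain write in Go goes through the syscall package's own stubs rather than a libc wrapper, so the syscalls hermit intercepts come straight from the Go runtime and standard library:

package main

import "syscall"

func main() {
        // On Linux, syscall.Write issues write(2) directly via Go's own
        // syscall stubs; no libc wrapper is involved.
        msg := []byte("hello from a raw write(2)\n")
        if _, err := syscall.Write(1, msg); err != nil {
                panic(err)
        }
}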

Also, @MarcoPolo, one thing that's kind of an open issue that we're curious about is how hermit run --chaos will work with language-level threads like goroutines. I don't know whether (1) there's a runtime mode that forces every goroutine onto its own OS thread (that would be ideal), or (2) there's enough random work stealing that controlling the runtime RNG is enough to get good coverage of possible schedules.
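For reference on point (1), per-goroutine pinning does exist: runtime.LockOSThread wires the calling goroutine to its current OS thread, and no other goroutine runs on that thread until it unlocks or exits. A minimal sketch, illustrative only and untested under hermit:

package main

import (
        "fmt"
        "runtime"
        "sync"
)

// worker pins itself to an OS thread for its lifetime, so each of these
// goroutines ends up on a dedicated kernel thread.
func worker(id int, wg *sync.WaitGroup) {
        defer wg.Done()
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()
        fmt.Printf("worker %d is pinned to an OS thread\n", id)
}

func main() {
        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
                wg.Add(1)
                go worker(i, &wg)
        }
        wg.Wait()
}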

MarcoPolo commented 1 year ago

Also, @MarcoPolo, one thing that's kind of an open issue that we're curious about is how hermit run --chaos will work with language-level threads like goroutines. I don't know whether (1) there's a runtime mode that forces every goroutine onto its own OS thread (that would be ideal), or (2) there's enough random work stealing that controlling the runtime RNG is enough to get good coverage of possible schedules.

I think that with Go, at least, there may be enough random work stealing that you could explore some of the possible schedules. A very quick experiment: running the program below

package main

import (
        "fmt"
        "sync"
        "time"
)

func execute(id int) {
        fmt.Printf("start id: %d\n", id)
        time.Sleep(time.Second)
        fmt.Printf("end id: %d\n", id)
}

func main() {
        fmt.Println("Started")
        wg := sync.WaitGroup{}
        for i := 0; i < 16; i++ {
                wg.Add(1)
                go func(id int) {
                        execute(id)
                        wg.Done()
                }(i)
        }
        wg.Wait()
        fmt.Println("Finished")
}

with hermit run -e="GOMAXPROCS=2" --chaos --seed=<seed> main | shasum (I made a quick patch to Go to work around the original issue) shows that we do indeed get different executions for different values of the seed. There's some work stealing happening in the Go scheduler that could explain this. And I bet you'd get more coverage with more than 256 goroutines per thread, since beyond that point they get queued up on the global run queue.
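A variant of the same idea (sketch only, untested under hermit): have the program fingerprint the order in which its goroutines get scheduled, so different seeds show up as different hashes without piping the output through shasum:

package main

import (
        "crypto/sha256"
        "fmt"
        "sync"
)

func main() {
        var (
                mu    sync.Mutex
                order []byte // observed scheduling order of goroutine IDs
                wg    sync.WaitGroup
        )
        for i := 0; i < 16; i++ {
                wg.Add(1)
                go func(id int) {
                        defer wg.Done()
                        mu.Lock()
                        order = append(order, byte(id))
                        mu.Unlock()
                }(i)
        }
        wg.Wait()
        // Different --chaos seeds should, ideally, produce different orders
        // and therefore different fingerprints.
        sum := sha256.Sum256(order)
        fmt.Printf("schedule fingerprint: %x\n", sum[:])
}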