cznic / virtual

github.com/cznic/virtual has moved to modernc.org/virtual
https://godoc.org/modernc.org/virtual
BSD 3-Clause "New" or "Revised" License
20 stars 3 forks source link

Windows support: strategy question (linux emulation or "native") #1

Closed steffengy closed 7 years ago

steffengy commented 7 years ago

Not sure if this is the right repository to place this issue, since it also affects ccir, but i feel it might belong a bit more here.

Is a windows implementation planned?

I see two very different ways of implementing it:

Which of those (or maybe an other even better idea?) would you consider the favorable solution?

Other todos:

cznic commented 7 years ago

The IR error still persists, still wondering if it's related to wrong codegen (seems like a uint32 = uint64 convert) or something else.

I believe it is incorrect codegen.

steffengy commented 7 years ago

go test -run TestSqlite > ir.dump: https://gist.github.com/steffengy/0dbe7b52a8332010d627f601b46c5afb

So do you advise first looking into that mentioned GCC-Test-Fault/Panic pasted above? (gcc.c-torture\execute\20041214-1.c)

cznic commented 7 years ago

Thanks, but the memsys5Size function dump ends at PC 0x18 but the fault is at 0x2e, please paste the full dump or at least up to an including PC 0x2e.

So do you advise first looking into that mentioned GCC-Test-Fault/Panic pasted above? (gcc.c-torture\execute\20041214-1.c)

Yes, because the sources of GCCExec tests are often small programs, so they are easier to analyze.

steffengy commented 7 years ago

Sorry somehow my clipboard messed up. Updated the file.

Yeah it's just that it's a fault in a Load64 instruction (while loading the value during readI64) so something that get's an invalid pointer on the stack.

That means to track down the origin I first have to somehow dump all OPs, look at them and try to determine the origin, which likely is also in CCIR, or do you have a better idea?

I could also search for all places that emit Load64 intructions and try to spot some obvious mistake (assuming it's somewhere related to long).

cznic commented 7 years ago

All of the approaches you mention are reasonable. I will try first the IR dump, but it must unfortunately wait a bit. I'm right now pushing the new profiling API to virtual (and its use in sqlite), then I'm leaving the office for today.

steffengy commented 7 years ago

Finally managed to dump the LOG output for 20041214-1.c:

0x00079         load64         0x0
0x0007a         push32         0x8
0x0007b         negindexu32    0x1
0x0007c         store64
0x0007d         load64         0x0       <------ crash

Added the full output also to the GIST link above.

Let me know if you know if that might be related or is something else. I guess the actual problem is negindexu32 --> store64, right?

cznic commented 7 years ago

Here I am. Will report back when/if I find something.

cznic commented 7 years ago

I don't know if I have mentioned the -virtual.trace build tag before. Can be useful in chasing such crashes, but it seems to me you've discovered it already. Another help is in some cases adding -virtualTesting to $ go test in ccir.

steffengy commented 7 years ago

@cznic For the GCC case I just added a println before the actual execution (since it faults in the execution of an opcode, which won't be outputed after afaik, atleastpassing -tags virtual.trace to go test didn't help.)

The problem definitely is that Store64 gets some invalid input. It writes 4337958760 which Load64 uses as pointer. (readI64) Only doing a readI32 in Store64 would result in 42991464, which would be a valid pointer.

The question is just where the input is coming from. Still have to understand quite few OPCodes.

cznic commented 7 years ago

Thanks for more data points. So far no idea about 20041214-1.c. I want to look at the run time stack values.

Please paste $ go test -v -run GCC -re 20041214-1.c -tags virtual.trace, it shows also the top 4 or 5 stack slots, maybe it can help.

FTR: Stack slots are shown as if the item is a pointer. Stack slots are aligned at that size, but if the actutal item is smaller, the higher bits are garbage, but that is most often perfectly okay.

Still have to understand quite few OPCodes.

I feel the pain. After some months fiddling with them I can read a bit ;-)

Also, IR assembly and virtual assembly are two separate beasts, sorry.

The question is just where the input is coming from.

I can't even yet figure out why there's that negindex thing.

cznic commented 7 years ago

Windows

0x00076     push           bp-0x10
0x00077     push           ap-0x18
0x00078     dup64          
0x00079     load64         0x0
0x0007a     push32         0x8
0x0007b     negindexu32    0x1

Linux

0x00076     push           bp-0x10
0x00077     push           ap-0x18
0x00078     dup64          
0x00079     load64         0x0
0x0007a     push64         0x8
0x0007b     negindexu64    0x1

We need to figure out why windows does a push32 0x8 at 0x7a instead of push push64. That's the cause I believe.

steffengy commented 7 years ago

On a second thought I interpret NegIndexU32 as doing some array indexing with size 8 (op.N). So my current assumption is that either that NegIndexU32 is called with the wrong size (8 instead of 4) or the Store64/Load64 should be a Store32/Load32.

go test -v -run GCC -re 20041214-1.c -tags virtual.trace

That doesn't output any stack slots unfortunately (since it's a fatal error: fault) and immediately aborts (accessing an "invalid" pointer causing a SEGFAULT)

Also, IR assembly and virtual assembly are two separate beasts, sorry.

The IR assembly is higher-level though, right?

So type-errors, like I presume here, should be easier to spot? (Except if it's actually some translation of IR -> Virtual, in loader.go)

I can't even yet figure out why there's that negindex thing.

Ye me neither, the dump unfortunately only include a position for IR and not virtual-opcode (as you can see in the GIST file).

We need to figure out why windows does a push32 0x8 at 0x7a instead of push push64

Windows also does a negindexu32 vs negindexu64.

It'd be helpful to somehow link 0x0007d to the respective IR, is that somehow feasible?

cznic commented 7 years ago

0x0007d load64 0x0

is

0x00049 do_form_string:         ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:38:7
0x0004a     variable            &#1, **int8 ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:2
0x0004b     argument            &#2, **int8 ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:29
0x0004c     dup                 **int8  ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:29
0x0004d     load                **int8  ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:29
0x0004e     convert             *int8, *int8    ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:29
0x0004f     const               0x8, uint32 ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:11
0x00050     convert             uint32, uint32  ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:11
0x00051     element             &[-uint32], *int8   ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:29
0x00052     convert             *int8, *int8    ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:29
0x00053     store               *int8   ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:29
0x00054     convert             *int8, **int8   ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:11
0x00055     load<--here             **int8  ; ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:39:11
steffengy commented 7 years ago

Yeah so it should be load64. Sounds like some code assumes sizeof(pointer)=long ?

steffengy commented 7 years ago

Not sure if this helps you but the following line emits the push32 0x8: https://github.com/cznic/virtual/blob/master/loader.go#L1481

EDIT: NVM it's also in the IR-LOG:

  const  0x8, uint32     ; ..\\gcc\testsuite\gcc.c-torture\execute\20041214-1.c:35:2
cznic commented 7 years ago

Yeah so it should be load64.

It is load64. But the address computed on TOS is wrong. Don't yet understand why.

Sounds like some code assumes sizeof(pointer)=long ?

Seems like that, but looking everywhere I cannot find that.

cznic commented 7 years ago

Please try do change this in ccir/libc/builtin.h

from

#define __builtin_va_arg(ap, type) *(type*)(ap -= __roundup(sizeof(type), __SIZEOF_POINTER__))

to

#define __builtin_va_arg(ap, type) *(type*)(ap -= (__UINTPTR_TYPE__)(__roundup(sizeof(type), __SIZEOF_POINTER__)))

Kind of blind shot, Linux still works with it.

cznic commented 7 years ago

The IR assembly is higher-level though, right?

In a certain sense yes. It has type annotations. It avoids resolving offsets, which only the VM should know about, to constants numbers. Jumps targets are symbols, not addresses etc.

steffengy commented 7 years ago

Yep that test now passes. Now 20001203-2.c seems to fail with the same IR error a similar as SQLite before. Added the full output to the GIST.

cznic commented 7 years ago

Yep that test now passes. Now 20001203-2.c seems to fail with the same IR error as SQLite before. Added the full trace to the GIST.

Okay, we have progress ;-)

steffengy commented 7 years ago

Just so I'm not on the wrong track here:

# [55, err]: Verify (A): mismatched types, got int32, expected int64
        create_array_type:0x83:     convert             int64, *int8

that means that insteadof an int64 an int32 was passed as argument, right?

Also a stacktrace where that convert is emitted:

CONVERT:  {*int8 int64 ..\cc\testdata\gcc-6.3.0\gcc\testsuite\gcc.c-torture\execute\20001203-2.c:77:29}
goroutine 18 [running]:
runtime/debug.Stack(0xc042058008, 0xc042437890, 0x2)
        C:/Go/src/runtime/debug/stack.go:24 +0x80
runtime/debug.PrintStack()
        C:/Go/src/runtime/debug/stack.go:16 +0x29
github.com/cznic/ccir.(*c).convert(0xc04204c000, 0x8abca0, 0xc0423ed080, 0x8bb800, 0xc042395a40, 0x8bb800, 0xc0423451f0, 0x70c480, 0xc0423ca6c0)
        D:/projects/Go/gopath/src/github.com/cznic/ccir/ccir.go:1460 +0x2bd
github.com/cznic/ccir.(*c).expression(0xc04204c000, 0x8bb800, 0xc0423451f0, 0xc0423ed080, 0xc042395a40, 0x0)
        D:/projects/Go/gopath/src/github.com/cznic/ccir/ccir.go:1823 +0xd896
github.com/cznic/ccir.(*c).expression(0xc04204c000, 0x8bb800, 0xc0423aa7e0, 0xc0423ed440, 0x0, 0x0)
        D:/projects/Go/gopath/src/github.com/cznic/ccir/ccir.go:2087 +0x6843
github.com/cznic/ccir.(*c).expression(0xc04204c000, 0x0, 0x0, 0xc0423ed500, 0xc0423aa7e0, 0x4a)
        D:/projects/Go/gopath/src/github.com/cznic/ccir/ccir.go:2215 +0xb48f
github.com/cznic/ccir.(*c).expression(0xc04204c000, 0x8bb800, 0xc0423aa230, 0xc0423ed500, 0x8bb800, 0xc0423aa230)
        D:/projects/Go/gopath/src/github.com/cznic/ccir/ccir.go:1812 +0xd0aa
github.com/cznic/ccir.(*c).expressionList(0xc04204c000, 0x8bb800, 0xc0423aa230, 0xc042366f50, 0x8bb800, 0xc0423aa230)
        D:/projects/Go/gopath/src/github.com/cznic/ccir/ccir.go:2275 +0x85
github.com/cznic/ccir.(*c).expressionListOpt(0xc04204c000, 0x8bb800, 0xc0423aa230, 0xc0423645a0, 0x0)
        D:/projects/Go/gopath/src/github.com/cznic/ccir/ccir.go:2300 +0x11d
github.com/cznic/ccir.(*c).expressionStatement(0xc04204c000, 0xc0423ea100, 0x0)
        D:/projects/Go/gopath/src/github.com/cznic/ccir/ccir.go:2304 +0x68
github.com/cznic/ccir.(*c).statement(0xc04204c000, 0xc04243ce88, 0xc042351e00, 0x0)
.......
cznic commented 7 years ago

That means that insteadof an int64 an int32 was passed as argument, right?

The error comes from here

Not as an function argument but as the computed type of the TOS. Function argumens are loaded to the evaluation stack by the IR Argument opcode.

Also a stacktrace where that convert is emitted:

Yep, that proves ccir is wrongly thinking what's on the TOS. Fortunately, IR.Verify sees the error.

Compilers are complicated and this part (ccir) is only few month old. There must be a lot of similar bugs in it waiting to ruin everything ;-)

The line numbers are offset from what I see, please push your tree, thank you.

steffengy commented 7 years ago

The line numbers are offset from what I see, please push your tree, thank you.

Done.

steffengy commented 7 years ago

Is the following assumption correct:

&cc.Expression{
· BinOpType: long long,
· Type: long long,
· Case: 29,
· Expression: &cc.Expression{
· · Type: long long,
· · Case: 30,
· · Expression: &cc.Expression{
· · · Type: char*,
· · · Case: 11,
· · · Expression: &cc.Expression{
· · · · Type: struct obstack*,
· · · · Token: execute\20001203-2.c:77:29: IDENTIFIER "__o1",
· · · },
· · · Token: .execute\20001203-2.c:77:33: ARROW,
· · · Token2: execute\20001203-2.c:77:35: IDENTIFIER "next_free",
· · },
· · Expression2: &cc.Expression{
· · · Type: char*,
· · · Case: 25,
· · · Expression: &cc.Expression{
· · · · Type: int,
· · · · Case: 3,
· · · · Token: execute\20001203-2.c:77:57: INTCONST "0",
· · · },
· · · Token: .execute\20001203-2.c:77:48: '(',
· · · Token2: execute\20001203-2.c:77:55: ')',
· · · TypeName: &cc.TypeName{
· · · · Type: char*,
· · · · AbstractDeclaratorOpt: &cc.AbstractDeclaratorOpt{
· · · · · AbstractDeclarator: &cc.AbstractDeclarator{
· · · · · · Pointer: &cc.Pointer{
· · · · · · · Token: execute\20001203-2.c:77:54: '*',
· · · · · · },
· · · · · },
· · · · },
· · · · SpecifierQualifierList: &cc.SpecifierQualifierList{
· · · · · TypeSpecifier: &cc.TypeSpecifier{
· · · · · · Case: 1,
· · · · · · Token: execute\20001203-2.c:77:49: CHAR "char",
· · · · · },
· · · · },
· · · },
· · },
· · Token:execute\20001203-2.c:77:46: '-',
· },
· Expression2: &cc.Expression{
· · Type: int,
· · Case: 11,
· · Expression: &cc.Expression{
· · · Type: struct obstack*,
· · · Token: 20001203-2.c:78:14: IDENTIFIER "__o1",
· · },
· · Token: execute\20001203-2.c:78:18: ARROW,
· · Token2: execute\20001203-2.c:78:20: IDENTIFIER "alignment_mask",
· },
· Token: execute\20001203-2.c:78:13: '+',
}

func convert tries to convert that expression to *int8 assuming the type is int64 (which seems to be correc) since it's the expression type). Ttherefore a convert int64 *int8 is emitted. Then somehow in IR the type of that expresion is int32 !? This mismatch (the reason of it is what we search) causes the error.

cznic commented 7 years ago

Meet our killer ;-)

steffengy commented 7 years ago

Well nice! That should be long long int for MINGW. A default GCC returns long int. Any idea what the "correct" solution there would be?

cznic commented 7 years ago

This does not break Linux, please try if it fixes Windows:

jnml@4670:~/src/github.com/cznic/ccir$ git status 
HEAD detached at bcbdb5a
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   ccir.go

no changes added to commit (use "git add" and/or "git commit -a")
jnml@4670:~/src/github.com/cznic/ccir$ git diff
diff --git a/ccir.go b/ccir.go
index 49bc039..5f38a7b 100644
--- a/ccir.go
+++ b/ccir.go
@@ -1554,7 +1554,7 @@ func (c *c) binopType(n *cc.Expression) cc.Type {
                }

                if a.Kind() == cc.Ptr && b.Kind() == cc.Ptr && n.Case == 30 { // Expression '-' Expression                          // Case 30
-                       return c.ast.Model.LongType
+                       return n.Type
                }

                return t
jnml@4670:~/src/github.com/cznic/ccir$ 
cznic commented 7 years ago

Looks like that was the only direct reference to cc.Model.LongType in ccir/ccir.go, good for us.

steffengy commented 7 years ago

Yep works (even the SQLite test runs into the next intruction trap VERY NICE! so the GetSystemTime() "syscall" works now)

C4 shows some breakage though:

--- FAIL: TestC4 (0.11s)
        all_test.go:959: [./c4]
                usage: c4 [-s] [-d] file ...
                4.501ms
        all_test.go:973: [./c4 hello.c]
                hello, world
                exit(0) cycle = 9
                2.4994ms
        all_test.go:977: [./c4 -s hello.c]
                1: #include <stdio.h>
                2:
                3: int main()
                4: {
                5:   printf("hello, world\n");
                    ENT  0
                    IMM  42992000
                    PSH
                    PRTF
                    ADJ  1
                6:   return 0;
                    IMM  0
                    LEV
                7: }
                    LEV
                2.5ms
        all_test.go:702: virtual.Exec: PANIC: runtime error: invalid memory address or nil pointer dereference
                testdata\github.com\rswier\c4\c4.c.main(0x2, 0x2880168)
                        testdata\github.com\rswier\c4\c4.c:501:1        0x004ff         load8          0x0      // else if (i == LC)  a = *(char *)a;                                    // load char
                D:\projects\Go\gopath\src\github.com\cznic\ccir\libc\crt0.c._start(0x3, 0x2880160)
                        D:\projects\Go\gopath\src\github.com\cznic\ccir\libc\crt0.c:15:1        0x0000e         callfp
// exit(((int (*)())main) (argc, argv));

https://gist.github.com/steffengy/0dbe7b52a8332010d627f601b46c5afb contains the Full-IR-Dump [quite big]

cznic commented 7 years ago

Yep works (even the SQLite test runs into the next intruction trap VERY NICE! so the GetSystemTime() "syscall" works now)

:100:

C4 shows some breakage though:

C4 wants to be compiled in 32bit mode. I dirty-hacked it to work on both 32 and 64 bit Linux here. Because it has only one type, int, it now has a wrong pointer type (still int).

Try to change that line to read

#define int __PTRDIFF_TYPE__

Here that seems to work. (But I've not tested Linux/386 for some time.)

steffengy commented 7 years ago

That leads to:

     all_test.go:959: [./c4]
                usage: c4 [-s] [-d] file ...
                2.0007ms
        all_test.go:970:
                got
                main() not defined
                exp
                hello, world
                exit(0) cycle = 9
cznic commented 7 years ago

Ignore that test on Windows or delete it completely. It's not really valid C code after all. It was helpful at the early stages. Now it makes only troubles.

steffengy commented 7 years ago

NVM: need to change it at 2 places, fixed.

cznic commented 7 years ago

NVM: need to change it at 2 places, fixed.

Right, now I recall, once again in or somewhere before main(). Good job that you made it work on Windows as well!

cznic commented 7 years ago

Later on, please upload to gist the current TestSQLite output, it will make me happy just to look at it :stuck_out_tongue_winking_eye:

steffengy commented 7 years ago

Yeah will do, first getting as many syscalls done as possible :) We already get something like:

          Usage: ./test DATABASE SQL-STATEMENT
                5.4939ms
                []
cznic commented 7 years ago

Yeah will do, first getting as many syscalls done as possible :)

No worries.

We already get something like:

Tadaaa!

steffengy commented 7 years ago

The offset-anonymous struct members transformation now also seems to work. Fixed it after noticing it it was involved in a strange syscall failure... ReadFile returned ERROR_SUCCESS since the _OVERLAPPED structure was wrongly populated due to wrong offsets.

steffengy commented 7 years ago

Well nearly:

all_test.go:1049: ["./test" "foo" "bar"]
                READ 0x56381, 0x56f07a8, 0x586fdd8, 100, 0FAIL (266) os_win.c:40417: (0) winRead(C:\Users\steff\AppData\Local\Temp\ccir-test-988818927\foo) - Der Vorgang wurde erfolgreich beendet.
                Can't open database: disk I/O error
                8.9981ms

I'll update tomorrow when I cleaned up the local tree, investigated more, etc.

steffengy commented 7 years ago

So after the changes yesterday (fixing anonymous struct member offsets + sizeof) the SQLite Test results in the following:

unexpected fault address 0x340fff8
fatal error: fault
[signal 0xc0000005 code=0x0 addr=0x340fff8 pc=0x645835]

goroutine 18 [running]:
runtime.throw(0x75cbb0, 0x5)
        Go/src/runtime/panic.go:596 +0x9c fp=0xc05fba2130 sp=0xc05fba2110
runtime.sigpanic()
        Go/src/runtime/signal_windows.go:164 +0xff fp=0xc05fba2160 sp=0xc05fba2130
github.com/cznic/virtual.(*cpu).stackTrace(0xc04208cee0, 0x0, 0x0)
        /cznic/virtual/cpu.go:153 +0x631 fp=0xc05fba2378 sp=0xc05fba2160
github.com/cznic/virtual.(*cpu).run.func1(0xc04208cee0)
        /cznic/virtual/cpu.go:227 +0x1c5 fp=0xc05fba2448 sp=0xc05fba2378
runtime.call32(0x0, 0x775d48, 0xc0420485f0, 0x800000008)
        Go/src/runtime/asm_amd64.s:514 +0x4f fp=0xc05fba2478 sp=0xc05fba2448
panic(0x710020, 0x8e3bf0)
        Go/src/runtime/panic.go:489 +0x2dd fp=0xc05fba2510 sp=0xc05fba2478
runtime.panicmem()
        Go/src/runtime/panic.go:63 +0x65 fp=0xc05fba2530 sp=0xc05fba2510
runtime.sigpanic()
        Go/src/runtime/signal_windows.go:161 +0x8b fp=0xc05fba2560 sp=0xc05fba2530
github.com/cznic/virtual.(*cpu).run(0xc04208cee0, 0x2, 0x0, 0x0, 0x0)
        cznic/virtual/cpu.go:1365 +0x3e7a fp=0xc05fba3038 sp=0xc05fba2560
github.com/cznic/virtual.New(0xc04209e000, 0xc059414240, 0x3, 0x3, 0x8af020, 0xc0420a68c0, 0x8af0a0, 0xc0420a6930, 0x8af0a0, 0xc0420a69a0, ...)
        cznic/virtual/virtual.go:73 +0x2de fp=0xc05fba30f8 sp=0xc05fba3038
github.com/cznic/virtual.Exec(0xc04209e000, 0xc059414240, 0x3, 0x3, 0x8af020, 0xc0420a68c0, 0x8af0a0, 0xc0420a6930, 0x8af0a0, 0xc0420a69a0, ...)
        cznic/virtual/virtual.go:84 +0xf0 fp=0xc05fba31b0 sp=0xc05fba30f8
github.com/cznic/ccir.exec(0xc04204dad0, 0xc04209e000, 0xc059414240, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        cznic/ccir/all_test.go:714 +0x5f6 fp=0xc05fba34b0 sp=0xc05fba31b0
github.com/cznic/ccir.TestSqlite(0xc04204dad0)
        cznic/ccir/all_test.go:1051 +0xd89 fp=0xc05fba3fa8 sp=0xc05fba34b0
testing.tRunner(0xc04204dad0, 0x775c70)

Doing the following in cpu.go (L134):

    ip := c.ip - 1
    sp := c.sp
    ap := c.ap

    buff := DumpCodeStr(c.code, 0)
        fmt.Println(string(buff.Bytes())) // Content of the GIST

    for ip < uintptr(len(c.code)) {
        fmt.Printf("IP %x\n", ip) // prints only 0x9f6d

Huge OPCode (10MB) GIST.

Any idea what to look for there?

cznic commented 7 years ago

It looks like we may be not on the same commit. I see this cpu.go:153, but that seems innocent.

I haven't yet tested your CC changes, if there's some wrong offset computation it can crash the VM at many place, I guess. I should write a test or two so we can rule out that place as the source of the problem.

When I was solving similar troubles, I used the -tags virtual.trace to see the "black box" recordings. It calls this method and currently shows only line info. If you comment out (sorry no switch for that) these lines you should additionally get opcodes executed, stack pointer value and top 5 stack values. It may give some insight, please try it.

I think I'll be at home in ~one hour, I'll try to look more at the 10MB gist but it looks like we really need to add func and line info to the virtual code dump for it to be useful in this or any non-trivial size.

cznic commented 7 years ago

Wrt

0x09f60 func
0x09f61     push           ap
0x09f62     sub            sp, 0x8
0x09f63     push64         (ap-0x8)
0x09f64     load64         0x0
0x09f65     load64         0x10
0x09f66     argumentsfp    
0x09f67     push64         (ap-0x8)
0x09f68     push64         (ap-0x10)
0x09f69     push32         (ap-0x18)
0x09f6a     push64         (ap-0x20)
0x09f6b     callfp         
0x09f6c     store32        
0x09f6d     return         

0x09f6b looks like an indirect call of some int(*)(int64,int64,int32,int64), or s/int64/ptr/ anywhere.

0x09f60 looks like int foo(t* arg1, u* arg2) and the function pointer to later call at f6b seems to be obtained by doing some pointer arithtmetic (indexing) with the final load64 at f65 through as fnptr := *(sp+16), which looks like a 2nd item in a table listing function pointers.

cznic commented 7 years ago

Or f60 is maybe void(int *out, t* arg).

steffengy commented 7 years ago

Seems like some imports were moved around (+ the added DumpCode): It's https://github.com/cznic/virtual/blob/504ba1aa68141a67f3933a573a837a351f5cccc1/cpu.go#L150

Don't think commiting virtual makes sense yet since it will anyways break for linux (and we don't really have other stacktrace points that are interesting?), but let me know if you need those minor changes or if it's enough.

table listing function pointers

AFAIK sqlite uses that for syscalls and similar dispatch? could be #define osCharUpperW ((LPWSTR(WINAPI*)(LPWSTR))aSyscall[2].pCurrent)

Currently dumping the new log using -tags virtual.trace with your suggested patches. Going slow @4kb/s. Running for about a couple of minutes, reaching 3MB output size. Will add it to the gist and post a link here once it's done...

steffengy commented 7 years ago
*** Test killed: ran too long (10m0s).

welp, seems like this might take quite a while.

cznic commented 7 years ago

Don't think commiting virtual makes sense yet since it will anyways break for linux (and we don't really have other stacktrace points that are interesting?), but let me know if you need those minor changes or if it's enough.

I'd like to Linux discover breakages early and continuously. I hope that way they can be fixed more easily - if it does not disrupt your workflow too much. Otherwise I can solve the breaks later, no problem.

AFAIK sqlite uses that for syscalls and similar dispatch?

Yes, SQLite uses FP tables in many places.

could be #define osCharUpperW ((LPWSTR(WINAPI*)(LPWSTR))aSyscall[2].pCurrent)

Could be something like that. [afaik not linked during IR so unlikely]

welp, seems like this might quite a while.

Probably not feasible at all time-wise.

steffengy commented 7 years ago

Just having a couple of debug Printf's and stuff inbetween so i prefer to polish that bit atleast. Generally no issue in doing that & commiting it, if that helps (which I doubt in this case?).

Currently at 2MB again, do you have some other proposition/need some additional input?

Else I think just letting it running and hoping it will be done eventually seems okay?

cznic commented 7 years ago

Need -timeout 24h or something. But the trace slows down the VM to maybe only few KIPS or even less. Meaning, if the crash does occur after N million CPU cycles... :-(

Just today I created some profiles, you may want to look at them while you wait for your log ;-) Note: those are made with -profile_rate 1, so every single instruction is recorded and thus MIPS are very low. Normally I see up to 200 MIPS on Intel 4670 CPU.

steffengy commented 7 years ago

Just today I created some profiles, you may want to look at them while you wait for your log ;-) Note: those are made with _profile_rate 1, so every single instruction is recorded and thus MIPS are very low. Normally I see up to 200 MIPS on Intel 4670 CPU.

Maybe I don't see something here but those, while being pretty cool, don't actually help here, do they? (Currently at 2MB again... I hope the end size is similar to the full-dump size so that would be 1/5)

cznic commented 7 years ago

Maybe I don't see something here but those, while being pretty cool, don't actually help here, do they?

They cannot help with ccir/ir/vm bugs. But those will be hunted down and meanwhile optimization will become important. So right, not useful yet except for having the record of the base performance now - for later comparison.

(Currently at 2MB again... I hope the end size is similar to the full-dump size so that would be 1/5)

I'm afraid that's a bit optimistic. The static code dump was ~400k instructions. The trace of every executed instruction can be several order of magnitude bigger. Really, if the crash does not occur early enough, if's not a feasible way to use the trace. It's a pitty the GCC tests did not catch anything. They are often short enough to trace fully.