Closed · susan-garry closed this issue 1 year ago
Thanks for the bug report! I think it might be the guard generator that's overflowing, so I can take a look. However, we should also chat about the generated code: everything is sequential, so I'm not sure why so many groups and invoke statements are generated. The likely cause of the overflow is generating extremely large guards, which results in extremely bad resource usage.
Also, @calebmkim, I've got a new stress test for the sharing pass for you (using DRB1-3123.futil.txt):
% ./target/release/futil -b verilog --log info pange.futil > /dev/null
[INFO calyx::pass_manager] well-formed: 111ms
[INFO calyx::pass_manager] papercut: 49ms
[INFO calyx::pass_manager] canonicalize: 55ms
[INFO calyx::pass_manager] compile-sync: 5ms
[INFO calyx::pass_manager] group2seq: 28ms
[INFO calyx::pass_manager] group2invoke: 22ms
[INFO calyx::pass_manager] inline: 44ms
[INFO calyx::pass_manager] comb-prop: 43ms
[INFO calyx::pass_manager] compile-ref: 37ms
[INFO calyx::pass_manager] infer-share: 3ms
[INFO calyx::pass_manager] cell-share: 153387ms // <- 2.5 minutes!
[INFO calyx::pass_manager] remove-comb-groups: 0ms
[INFO calyx::pass_manager] infer-static-timing: 330ms
[INFO calyx::pass_manager] compile-invoke: 94ms
[INFO calyx::pass_manager] merge-static-par: 68ms
[INFO calyx::pass_manager] static-par-conv: 1630ms
[INFO calyx::pass_manager] dead-group-removal: 40ms
[INFO calyx::pass_manager] collapse-control: 16ms
[INFO calyx::pass_manager] tdcc: 6399ms
[INFO calyx::pass_manager] dead-group-removal: 11ms
[INFO calyx::pass_manager] comb-prop: 9997ms
[INFO calyx::pass_manager] dead-cell-removal: 9044ms
[INFO calyx::pass_manager] go-insertion: 33ms
[INFO calyx::pass_manager] wire-inliner: 1789ms
[INFO calyx::pass_manager] clk-insertion: 66ms
[INFO calyx::pass_manager] reset-insertion: 54ms
[INFO calyx::pass_manager] merge-assigns: 602ms
Okay, according to @susan-garry, y'all previously had programs with the same number of PEs and memories, and the only change is generating invoke statements, which is curious. Some things to note:
Build the compiler in release mode with cargo build --release and change fud to use the release compiler (fud c futil.exec "$pwd/target/release/futil" in the Calyx repo).
Now onto sources of overflow:
Running with --log info (fud e -s futil.flags ' -p all --log info ...'), you get the above log, which shows that none of the compiler passes overflow, which is good.
To get to the overflow quicker, we can run the compiler with optimizations disabled using -p no-opt:
% ./target/release/futil -b verilog --log info -p no-opt pange.futil > /dev/null
[INFO calyx::pass_manager] well-formed: 98ms
[INFO calyx::pass_manager] papercut: 48ms
[INFO calyx::pass_manager] canonicalize: 65ms
[INFO calyx::pass_manager] compile-sync: 5ms
[INFO calyx::pass_manager] compile-ref: 41ms
[INFO calyx::pass_manager] remove-comb-groups: 0ms
[INFO calyx::pass_manager] compile-invoke: 86ms
[INFO calyx::pass_manager] tdcc: 6388ms
[INFO calyx::pass_manager] go-insertion: 61ms
[INFO calyx::pass_manager] wire-inliner: 2021ms
[INFO calyx::pass_manager] clk-insertion: 93ms
[INFO calyx::pass_manager] reset-insertion: 92ms
[INFO calyx::pass_manager] merge-assigns: 979ms
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
zsh: abort ./target/release/futil -b verilog --log info -p no-opt pange.futil > /dev/null
Running the compiler to just print out the Calyx program after compilation doesn't overflow:
% ./target/release/futil --log info -p no-opt pange.futil > out.futil
[INFO calyx::pass_manager] well-formed: 111ms
[INFO calyx::pass_manager] papercut: 47ms
[INFO calyx::pass_manager] canonicalize: 118ms
[INFO calyx::pass_manager] compile-sync: 13ms
[INFO calyx::pass_manager] compile-ref: 42ms
[INFO calyx::pass_manager] remove-comb-groups: 1ms
[INFO calyx::pass_manager] compile-invoke: 89ms
[INFO calyx::pass_manager] tdcc: 6474ms
[INFO calyx::pass_manager] go-insertion: 58ms
[INFO calyx::pass_manager] wire-inliner: 2067ms
[INFO calyx::pass_manager] clk-insertion: 83ms
[INFO calyx::pass_manager] reset-insertion: 87ms
[INFO calyx::pass_manager] merge-assigns: 957ms
Running with lldb, I get this:
% lldb -- ./target/release/futil -b verilog --log info -p no-opt pange.futil
...
Process 69743 stopped
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x7ff7bf700ff8)
frame #0: 0x0000000100295824 futil`core::ptr::drop_in_place$LT$pretty..Doc$LT$pretty..RcDoc$GT$$GT$::hc5ff1d9bc7dd2c56 + 132
futil`core::ptr::drop_in_place$LT$pretty..Doc$LT$pretty..RcDoc$GT$$GT$::hc5ff1d9bc7dd2c56:
-> 0x100295824 <+132>: callq 0x1002957a0 ; <+0>
0x100295829 <+137>: movq 0x8(%rbx), %rdi
0x10029582d <+141>: callq 0x1002d6778 ; symbol stub for: free
0x100295832 <+146>: movq 0x10(%rbx), %rdi
Target 0: (futil) stopped.
(lldb)
RcDoc is the library that vast, our Verilog backend, uses to print out Verilog strings. The EXC_BAD_ACCESS is concerning because it seems to imply a bad memory access, which should not happen in safe Rust; RcDoc probably uses unsafe code that might be misbehaving. Still, the fact that we're getting to pretty printing means that it's probably something wrong in vast or RcDoc.
Also, summoning @EclecticGriffin's Rust powers in case they have better ideas of what could be going wrong
Building with AddressSanitizer:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==72654==ERROR: AddressSanitizer: stack-overflow on address 0x7ff7b67509b8 (pc 0x000109f797df bp 0x7ff7b67511f0 sp 0x7ff7b67509c0 T0)
#0 0x109f797df in __asan_memcpy+0x18f (librustc-nightly_rt.asan.dylib:x86_64+0x467df) (BuildId: 364f88c707f13921b85413add38b47d4240000001000000000070a0000010c00)
#1 0x1097cf2b8 in arrayvec::array_string::ArrayString$LT$A$GT$::new::he839dde011b09996+0xc8 (futil:x86_64+0x10081f2b8) (BuildId: 8c58d4e6d80a312cb8d786fa62c5165932000000200000000100000000000d00)
#2 0x1097ad951 in pretty::RcDoc$LT$A$GT$::as_string::h7398701d6df83bc0+0x121 (futil:x86_64+0x1007fd951) (BuildId: 8c58d4e6d80a312cb8d786fa62c5165932000000200000000100000000000d00)
#3 0x1097b50b2 in vast::subset::pretty_print::_$LT$impl$u20$vast..util..pretty_print..PrettyPrint$u20$for$u20$vast..subset..ast..Expr$GT$::to_doc::h91b5fcf4d36ccf10+0x11d2 (futil:x86_64+0x1008050b2) (BuildId: 8c58d4e6d80a312cb8d786fa62c5165932000000200000000100000000000d00)
#4 0x1097b5ac1 in vast::subset::pretty_print::_$LT$impl$u20$vast..util..pretty_print..PrettyPrint$u20$for$u20$vast..subset..ast..Expr$GT$::to_doc::h91b5fcf4d36ccf10+0x1be1 (futil:x86_64+0x100805ac1) (BuildId: 8c58d4e6d80a312cb8d786fa62c5165932000000200000000100000000000d00)
#5 0x1097b624c in vast::subset::pretty_print::_$LT$impl$u20$vast..util..pretty_print..PrettyPrint$u20$for$u20$vast..subset..ast..Expr$GT$::to_doc::h91b5fcf4d36ccf10+0x236c (futil:x86_64+0x10080624c) (BuildId: 8c58d4e6d80a312cb8d786fa62c5165932000000200000000100000000000d00)
#6 0x1097b624c in vast::subset::pretty_print::_$LT$impl$u20$vast..util..pretty_print..PrettyPrint$u20$for$u20$vast..subset..ast..Expr$GT$::to_doc::h91b5fcf4d36ccf10+0x236c (futil:x86_64+0x10080624c) (BuildId: 8c58d4e6d80a312cb8d786fa62c5165932000000200000000100000000000d00)
#7 0x1097b624c in vast::subset::pretty_print::_$LT$impl$u20$vast..util..pretty_print..PrettyPrint$u20$for$u20$vast..subset..ast..Expr$GT$::to_doc::h91b5fcf4d36ccf10+0x236c (futil:x86_64+0x10080624c) (BuildId: 8c58d4e6d80a312cb8d786fa62c5165932000000200000000100000000000d00)
#8 0x1097b624c in vast::subset::pretty_print::_$LT$impl$u20$vast..util..pretty_print..PrettyPrint$u20$for$u20$vast..subset..ast..Expr$GT$::to_doc::h91b5fcf4d36ccf10+0x236c (futil:x86_64+0x10080624c) (BuildId: 8c58d4e6d80a312cb8d786fa62c5165932000000200000000100000000000d00)
...
#254 0x1097b624c in vast::subset::pretty_print::_$LT$impl$u20$vast..util..pretty_print..PrettyPrint$u20$for$u20$vast..subset..ast..Expr$GT$::to_doc::h91b5fcf4d36ccf10+0x236c (futil:x86_64+0x10080624c) (BuildId: 8c58d4e6d80a312cb8d786fa62c5165932000000200000000100000000000d00)
Looks like a bug in the pretty printing implementation in vast?
Okay, I've reduced the test case and it seems to come from... the cells? You don't even need the control program, just a lot of cells marked with @external.
Here is the reduced file that still overflows. I don't have more cycles today, but the problem is probably fixable in verilog.rs.
Nice work isolating the "big cells list" test case, @rachitnigam! Given that it's a segfault, I share the intuition that it could just be a stack overflow in VAST… I'll see if I have a moment to break out lldb as well.
Just to add on to this: I am getting an identical error when trying to get resource estimates after fully inlining some of the larger Calyx neural networks. Based on this thread, I'm guessing that when I fully inline everything, it adds a bunch of new cells to main, causing an overflow when we try to go from compiled Calyx to Verilog.
Should I try to fix this bug?
The interesting thing is that removing the @external attribute from cells makes the compilation work. I’d have to see if that’s really something or just a consequence of making the generated code smaller.
Okay, I've debugged this some more and the problem is pretty wild: VAST generates really large modules, which the pretty printing module turns into a string. Everything is fine until the pretty printing function goes to return the value. The function then runs drop on the struct holding VAST's representation of the document and blows the stack during the drop process.
That is absolutely bonkers 👏
I tried running fud e --to dat --through verilog -s verilog.data DRB1.data DRB1.futil (after updating calyx), but I still get a similar overflow error. The solution seems to be to rewrite the program to be smaller, but I want to check that I'm not missing something, since @rachitnigam mentioned being able to compile this program to verilog in #1280.
@susan-garry, are you building and running the compiler in release mode? You need to build the compiler using release mode:
cargo build --release
And then change fud to use the release binary:
fud c stages.futil.exec "<calyx repo>/target/release/futil"
You should not get an overflow error anymore. If you do, please open a new issue.
As in the title,
fud e --to interpreter-out -s verilog.data DRB1-3123.data DRB1-3123.futil
produces the expected output, while
fud e --to dat --through verilog -s verilog.data DRB1-3123.data DRB1-3123.futil
and
fud e --to dat --through icarus-verilog -s verilog.data DRB1-3123.data DRB1-3123.futil
produce nearly identical error messages. The details of verilator's error message are in error.txt. I also uploaded the input files in case anyone is interested in reproducing the error.
error.txt DRB1-3123.data.txt DRB1-3123.futil.txt