YosysHQ / sby

SymbiYosys (sby) -- Front-end for Yosys-based formal verification flows
Other
389 stars 73 forks source link

Use an Asynchronous Event Loop in SbyJob (rebased) #108

Open edbordin opened 4 years ago

edbordin commented 4 years ago

This is mostly just a rebased version of #39. On @whitequark's request I have also replaced the use of taskkill on Windows with a console process group that is terminated using a ctrl+break signal. I verified using SysInternals ProcessMonitor that this causes the child processes to exit with code 0xC000013A which means they received the signal.

cc: @cr1901 as the original author

edbordin commented 4 years ago

It might be a couple days before I can dig into this again, but the first issue with missing tools definitely looks like a regression introduced by me. The handling of exit code 127 was added since the original PR was authored and although I tried to incorporate it, it looks like it will need to be adapted.

I haven't run into the second issue but it looks like the atexit handler here might be enough to mitigate it on older versions of python.

I am assuming the CI infrastructure is internal to Symbiotic EDA? I wouldn't blame you for wanting some unit tests for this but wouldn't want to reinvent the wheel. Also some of these errors are probably race conditions and hard to reproduce...

FWIW I was having issues with Yices crashing on OS X with python 3.8 (which uses the same event loop as linux) and here are two example logs:

2020-07-22T11:18:27.3795240Z SBY 11:18:27 [cover] Removing directory 'cover'.
2020-07-22T11:18:27.3796110Z SBY 11:18:27 [cover] Copy 'cover.sv' to 'cover/src/cover.sv'.
2020-07-22T11:18:27.3810530Z SBY 11:18:27 [cover] engine_0: smtbmc
2020-07-22T11:18:27.3846430Z SBY 11:18:27 [cover] base: starting process "cd cover/src; yosys -ql ../model/design.log ../model/design.ys"
2020-07-22T11:18:27.4315680Z SBY 11:18:27 [cover] base: finished (returncode=0)
2020-07-22T11:18:27.4317070Z SBY 11:18:27 [cover] smt2: starting process "cd cover/model; yosys -ql design_smt2.log design_smt2.ys"
2020-07-22T11:18:27.4468120Z SBY 11:18:27 [cover] smt2: finished (returncode=0)
2020-07-22T11:18:27.4469830Z SBY 11:18:27 [cover] engine_0: starting process "cd cover; yosys-smtbmc --presat --unroll -c --noprogress -t 20  --append 0 --dump-vcd engine_0/trace%.vcd --dump-vlogtb engine_0/trace%_tb.v --dump-smtc engine_0/trace%.smtc model/design_smt2.smt2"
2020-07-22T11:18:27.5208620Z SBY 11:18:27 [cover] engine_0: ##   0:00:00  Solver: yices
2020-07-22T11:18:27.5274760Z SBY 11:18:27 [cover] engine_0: Traceback (most recent call last):
2020-07-22T11:18:27.5281100Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/yosys-smtbmc", line 1365, in <module>
2020-07-22T11:18:27.5282940Z SBY 11:18:27 [cover] engine_0: smt_assert_antecedent("(|%s_is| s0)" % (topmod))
2020-07-22T11:18:27.5284390Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/yosys-smtbmc", line 1149, in smt_assert_antecedent
2020-07-22T11:18:27.5284990Z SBY 11:18:27 [cover] engine_0: smt.write("(assert %s)" % expr)
2020-07-22T11:18:27.5285930Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 441, in write
2020-07-22T11:18:27.5286940Z SBY 11:18:27 [cover] engine_0: stmt = self.unparse(self.unroll_stmt(s))
2020-07-22T11:18:27.5288130Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 297, in unroll_stmt
2020-07-22T11:18:27.5288700Z SBY 11:18:27 [cover] engine_0: stmt = [self.unroll_stmt(s) for s in stmt]
2020-07-22T11:18:27.5289650Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 297, in <listcomp>
2020-07-22T11:18:27.5290200Z SBY 11:18:27 [cover] engine_0: stmt = [self.unroll_stmt(s) for s in stmt]
2020-07-22T11:18:27.5291150Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 327, in unroll_stmt
2020-07-22T11:18:27.5291720Z SBY 11:18:27 [cover] engine_0: self.write(self.unparse(decl), unroll=False)
2020-07-22T11:18:27.5292630Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 474, in write
2020-07-22T11:18:27.5293170Z SBY 11:18:27 [cover] engine_0: self.p_write(stmt + "\n", True)
2020-07-22T11:18:27.5294060Z SBY 11:18:27 [cover] engine_0: File "/Users/runner/work/1/s/fpga-toolchain/bin/../share/yosys/python3/smtio.py", line 358, in p_write
2020-07-22T11:18:27.5294600Z SBY 11:18:27 [cover] engine_0: if flush: self.p.stdin.flush()
2020-07-22T11:18:27.5295000Z SBY 11:18:27 [cover] engine_0: BrokenPipeError: [Errno 32] Broken pipe
2020-07-22T11:18:27.5330800Z SBY 11:18:27 [cover] engine_0: finished (returncode=1)
2020-07-22T11:18:27.5332370Z SBY 11:18:27 [cover] ERROR: engine_0: Engine terminated without status.
2020-07-22T11:18:27.5336050Z SBY 11:18:27 [cover] DONE (ERROR, rc=16)
2020-07-22T11:18:27.5449760Z 
2020-07-22T11:18:27.5545270Z ##[error]Bash exited with code '16'.
2020-07-22T11:28:44.8130880Z SBY 11:28:44 [cover] Removing directory 'cover'.
2020-07-22T11:28:44.8131790Z SBY 11:28:44 [cover] Copy 'cover.sv' to 'cover/src/cover.sv'.
2020-07-22T11:28:44.8321440Z SBY 11:28:44 [cover] engine_0: smtbmc
2020-07-22T11:28:44.8639690Z SBY 11:28:44 [cover] base: starting process "cd cover/src; yosys -ql ../model/design.log ../model/design.ys"
2020-07-22T11:28:45.0847470Z SBY 11:28:45 [cover] base: finished (returncode=0)
2020-07-22T11:28:45.0849010Z SBY 11:28:45 [cover] smt2: starting process "cd cover/model; yosys -ql design_smt2.log design_smt2.ys"
2020-07-22T11:28:45.1195680Z SBY 11:28:45 [cover] smt2: finished (returncode=0)
2020-07-22T11:28:45.1197430Z SBY 11:28:45 [cover] engine_0: starting process "cd cover; yosys-smtbmc --presat --unroll -c --noprogress -t 20  --append 0 --dump-vcd engine_0/trace%.vcd --dump-vlogtb engine_0/trace%_tb.v --dump-smtc engine_0/trace%.smtc model/design_smt2.smt2"
2020-07-22T11:28:45.2781990Z SBY 11:28:45 [cover] engine_0: ##   0:00:00  Solver: yices
2020-07-22T11:28:45.2829950Z SBY 11:28:45 [cover] engine_0: ##   0:00:00  Checking cover reachability in step 0..
2020-07-22T11:28:45.3661690Z SBY 11:28:45 [cover] engine_0: ##   0:00:00  Unexpected EOF response from solver.
2020-07-22T11:28:45.3719500Z SBY 11:28:45 [cover] engine_0: finished (returncode=1)
2020-07-22T11:28:45.3779920Z SBY 11:28:45 [cover] ERROR: engine_0: Engine terminated without status.
2020-07-22T11:28:45.3892720Z SBY 11:28:45 [cover] DONE (ERROR, rc=16)
2020-07-22T11:28:45.4223840Z 
2020-07-22T11:28:45.4299320Z ##[error]Bash exited with code '16'.
nakengelhardt commented 4 years ago

Yes, looking at it more closely it's just a question of moving the check for returncode == 127 before the call to self.handle_exit on line 186. I've had improving the error handling on my to-do list for an embarrassingly long time, that's why I was a bit sensitive to it. Those error traces you posted do look fine, it's relatively clear what the problem is.

We have added some tests to this repo the other day, which we also use in CI. If you merge recent master you can use make test now, and that should cover some ground (but they're not unit tests, and they will miss things). It also tries out all the solvers, so if you don't have some of them installed, you may have to call the tests individually that only use solvers you have, as there's no skipping mechanism yet. For the bug I linked, I don't really know how to reproduce it either - it was thrown by the CI runner code, which happens to use asyncio, not in the sby code itself, which doesn't yet. It might be a nonissue, I was just wondering if it was considered.

I'll be honest though, it's going to be an uphill battle getting this PR merged, as there's a lot of platforms to consider (every flavor of linux from ancient CentOS distributions to cutting-edge Arch, MacOS, several ways of launching things on Windows...) and we don't have the structure in place to test a major rewrite like this on all of them. The current implementation is at a point where we're mostly sure that it works most of the time on most of these platforms only because we got enough bug reports from users that told us when it didn't work, and it's an unappealing proposition to make our customers go through that again. Do I understand the description of the original PR correctly that the problem this addresses is just the performance impact of not running multiple solvers in parallel on windows?

cr1901 commented 4 years ago

Do I understand the description of the original PR correctly that the problem this addresses is just the performance impact of not running multiple solvers in parallel on windows?

I can't remember if multiple solvers in parallel worked in Windows before this PR. When I said "parallel" in the original PR, that was in reference to the intended select functionality.

The performance impact comes from select returning immediately on Windows because selecting on file descriptors isn't actually supported. Because select returns immediately, the Windows version of sby busy-waits for all events to complete, rather than blocking in select until either an event arrives or a timeout is reached. I think there should be parity in resource usage between Windows and Unix here.

From what I remember, sby's main loop is basically: If an event arrives, handle event. If timeout is reached, check to see if we should bail, and if not, wait for another event. IIRC, the async rewrite removes the "check to see if we should bail part", and thus the need for any looping logic, by making the Python event loop decide when to bail.

nakengelhardt commented 4 years ago

Ah, I see. So it's still working correctly before this change though, just suboptimally because it wastes one core busy-looping, right?

cr1901 commented 4 years ago

@nakengelhardt I apologize, it looks like I misremembered a bit. The busy-waiting is a problem, but by "parallel" it looks like I was referring to this comment in a previous PR:

However, the support is limited to one task per job, because the poll loop relies on select.

Unfortunately, I don't remember the difference between SbyJob and SbyTask anymore (Job is per solver, and Task is "unit of work associated with a Job?") .

At the time, what I meant by this is "if you're doing prove mode, which will run BMC and k-induction, sby on Linux will run BMC and k-induction at the same time for a given solver- i.e. in parallel. On Windows, BMC will run first and then k-induction will run- i.e. sequentially." While sby does order Jobs (or is it Tasks?) to run in the proper order, BMC and k-induction do not depend on each other, so they can run at the same time. FWIW, I can confirm this problem still exists on Windows as of the current HEAD (b172357).

I didn't write a good bug report- if I did, I would described this behavior in more detail other than "the culprit is select doesn't work on fds on Windows". At this time, I don't remember why select's behavior on Windows breaks running jobs in parallel. The async rewrite fixed this behavior as well as the "busy waiting pinning a core" problem, at the cost of... well, other issues as described in this PR and #39.