StanfordAHA / Halide-to-Hardware

H2H failing daily regression #81

Closed Kuree closed 4 years ago

Kuree commented 4 years ago

See https://buildkite.com/stanford-aha/garnetflow/builds/1476#fddb505e-d74c-4c86-82a5-aae72b9f399f/6-10334

This is using the latest release: https://github.com/StanfordAHA/Halide-to-Hardware/releases/tag/lakelib

dillonhuff commented 4 years ago

@Kuree I cannot reproduce this locally, but it looks like an error in deleting division instructions that I simplify into shifts. Do you remember the first build where this error happened?

dillonhuff commented 4 years ago

*do you remember the first build where this happened?

Kuree commented 4 years ago

This is the first build that has the div issue: https://buildkite.com/stanford-aha/garnetflow/builds/1476#fddb505e-d74c-4c86-82a5-aae72b9f399f

dillonhuff commented 4 years ago

@Kuree I cannot reproduce the issue on my garnetflow build either. That said, I think it is caused by the use of an unordered set of pointers in the peephole divide optimizations. I have merged a fix into Halide-to-Hardware master.

https://github.com/StanfordAHA/Halide-to-Hardware/pull/82

Can we change garnetflow to build Halide-to-Hardware from source instead of downloading a tagged library? The build already takes hours to run anyway and it would ensure the flow is up to date and hopefully we would catch some of these issues earlier.

Kuree commented 4 years ago

We can do that for daily regressions. I will work on a fix to enable the build from source when the build is not triggered by a PR.

On a side note, regarding the build speed, I need to track down where it is slowing down.

Kuree commented 4 years ago

I have isolated the cause of the 2x slowdown in the regression. I will fix this after the performance bug is fixed.

Kuree commented 4 years ago

@dillonhuff https://buildkite.com/stanford-aha/garnetflow/builds/1512#a5a8cc27-711b-4ad0-b7ac-9b5294aa44e8 This build failed even though it built halide from scratch. Can you take a look?

dillonhuff commented 4 years ago

@Kuree this looks like a different error. The generator is segfaulting on cascade and the printouts stop very early. I'll try to recreate it in the travis tb.

dillonhuff commented 4 years ago

@Kuree I cannot get cascade to fail in garnetflow, but I can get a unit test of cascade to crash early in compilation in the garnetflow build on travis:

https://travis-ci.com/StanfordAHA/Halide-to-Hardware/builds/150709481

The error seems to happen before getting to coreir generation. @jeffsetter have you ever seen this error?

jeffsetter commented 4 years ago

I have never encountered that type of error.

jeffsetter commented 4 years ago

Is it failing cascade? Because near the end of the log it says "Generating coreir for function coreir_cascade" which suggests to me that it does get to coreir generation for cascade.

dillonhuff commented 4 years ago

@jeffsetter @Kuree sorry I wasn't clear. The build I linked to starts with a unit test that builds the cascade app, which runs to completion, passes, and then moves on to the rest of the unit tests. The rest of the unit tests contain another test with two convolutions back to back (which I also call cascade). That one fails with:

No linebuffer inserted after function conv2.
terminate called after throwing an instance of 'std::domain_error'
  what():  type must be number, but is string
./test/scripts/run_hw_unit_tests.sh: line 18: 14079 Aborted                 (core dumped) ./all-tests
Extracting testbench files...

Which seems to be before coreir code generation. I cannot get the cascade app itself to fail either on travis or locally.
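
The `terminate called after throwing an instance of 'std::domain_error'` line in the log above means the exception left the test binary uncaught, so the process aborted without saying which test threw. A minimal sketch of a runner that catches exceptions per test and prints the test name is shown below. `TestCase` and `run_all` are hypothetical names for this sketch; how the real `all-tests` binary is organised is not shown in this thread.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical test record: a name plus a callable that throws on failure.
struct TestCase {
    std::string name;
    std::function<void()> body;
};

// Runs every test. An exception derived from std::exception is reported with
// the test's name and message instead of aborting the whole process.
// Returns the number of tests that threw.
int run_all(const std::vector<TestCase>& tests) {
    int failures = 0;
    for (const auto& t : tests) {
        try {
            t.body();
        } catch (const std::exception& e) {
            std::cerr << "FAILED " << t.name << ": " << e.what() << "\n";
            ++failures;
        }
    }
    return failures;
}
```

With a wrapper like this, the log would name the failing test next to the `type must be number, but is string` message, and the remaining tests would still run.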

Kuree commented 4 years ago

@dillonhuff To reproduce the cascade bug, can you attach to the docker container keyi-debug-flow on kiwi?

dillonhuff commented 4 years ago

@Kuree @joyliu37 when I run cascade in that container the code crashes inside the unified buffer rewrites (which now run on each execution of Halide-to-Hardware). It seems that the crash is here:

https://github.com/StanfordAHA/Halide-to-Hardware/blob/fa9fbb73a3aaa3e83cf2e3e0ce4dffdf391a5a20/src/UBufferRewrites.cpp#L832-L838

Joey, do you have any idea why this would be crashing on kiwi?

Kuree commented 4 years ago

@joyliu37 Still fails: https://buildkite.com/stanford-aha/garnetflow/builds/1543#9278e7a8-da98-4172-a2ed-c5dd798315b3/6-7048

joyliu37 commented 4 years ago

There is a typo in gaussian, and it should be fixed in the newest PR.