ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.
MIT License
225 stars, 151 forks

Guard against inconsistent state #2041

Closed ellosel closed 3 weeks ago

ellosel commented 3 weeks ago

We keep hitting a failure in the extended pipelines that I can't reproduce. When the build-client step runs with Tensile, it creates a cache file containing the computed capabilities. The subsequent step, where we run precheckin or extended tests, reuses that cache file, which reduces runtime by about 30 seconds per test.

The spurious failure occurs when the cache file is read in and the resulting dictionary is empty. The only way I can see this happening is if the file is somehow deleted during the run. This new logic guards against that state, where a file exists on disk but reading it yields an empty dictionary. In that case we return None, which is equivalent to disabling the cache for that test.
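For readers following along, the guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `load_capabilities_cache` is hypothetical, and the sketch assumes the cache is a JSON file, which may not match the actual on-disk format Tensile uses.

```python
import json
from pathlib import Path
from typing import Optional


def load_capabilities_cache(cache_path: Path) -> Optional[dict]:
    """Return the cached capabilities, or None if the cache is unusable.

    Hypothetical helper: the name and the JSON format are assumptions
    made for illustration only.
    """
    # No file on disk: nothing to reuse.
    if not cache_path.is_file():
        return None

    try:
        with cache_path.open("r", encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        # The file vanished between the check and the read, or its
        # contents are truncated or otherwise unreadable.
        return None

    # Guard against the inconsistent state: a file was present, but
    # reading it produced an empty (or non-dictionary) result.
    if not isinstance(data, dict) or not data:
        return None

    return data
```

A caller would treat a None result as a cache miss and recompute the capabilities, so the worst case is losing the time saving for that test rather than failing it.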