[CI] Status / Release builds with assertions

Flakebi commented 4 years ago

Moving the discussion with @kuhar from #634 to an issue to not clutter up pull requests too much.

I didn’t make progress on release builds with assertions, the ‘Fast Jenkins Build’ is doing a debug build with sanitizers, so we currently have a CI build with assertions switched on. There are some outstanding issues with this, most prominently, you cannot see the error messages without going to the Jenkins website. There are some falsely failing builds/tests (sorry @trenouf, you got most of them) caused by rebuilding the base image once a day only and some failing builds because of md5 sums of the cache not matching (whatever goes wrong there…). I plan to fix these and also run CTS with sanitizers, but getting there will take time, especially as we will need to build own infrastructure for some of these things.

Sanitizer support got pushed to GitHub today (yay :tada:), so we can add GitHub CI builds. Things needed in the GitHub workflow:

Set -DXGL_USE_SANITIZER="Address;Undefined"
Set the environment variable ASAN_OPTIONS=detect_leaks=0, I would appreciate seeing no leaks, someone has to fix them though
I think libclang-common-9.0-dev needs to be installed
There is one patch that I think still needs to be applied to drivers/spvgen/external/glslang: https://gist.github.com/Flakebi/530ff13056407237fdbdeaab5dd740bf This was merged upstream a while ago but nobody updated our pinned version.

So, about assertions in release builds. There does not seem to be a clean way to do this as CMake adds -DNDEBUG to release options by default. I see two options:

Follow the way LLVM implements LLVM_ENABLE_ASSERTIONS, which means using regex to strip out NDEBUG from compilation flags.
Add a new RelWithAsserts target that sets release flags manually and does not add NDEBUG. This sounds like it needs every subproject to support this target in the as Release, which does not make it appealing either.

Any opinions on this?

kuhar commented 4 years ago

Thanks for working on this, Sebastian.

I didn’t make progress on release builds with assertions, the ‘Fast Jenkins Build’ is doing a debug build with sanitizers, so we currently have a CI build with assertions switched on. There are some outstanding issues with this, most prominently, you cannot see the error messages without going to the Jenkins website. There are some falsely failing builds/tests (sorry @trenouf, you got most of them) caused by rebuilding the base image once a day only and some failing builds because of md5 sums of the cache not matching (whatever goes wrong there…). I plan to fix these and also run CTS with sanitizers, but getting there will take time, especially as we will need to build own infrastructure for some of these things.

Yep, the jenkins builds don't produce much information useful to the external contributors right now. As an alternative for the fast builder, have you considered hosting a github actions runner with a machine beefy enough to handle debug builds? This won't solve the issue of running CTS, as it requires physical GPUs.

I think libclang-common-9.0-dev needs to be installed

This should be already installed IIRC.

So, about assertions in release builds. There does not seem to be a clean way to do this as CMake adds -DNDEBUG to release options by default. I see two options:

I think that while the LLVM approach is not super elegant, it seems to work pretty well and I wouldn't mind adding something similar for PAL/XGL.

kuhar commented 4 years ago

@Flakebi As for debug builds, I think we should be able to use gcloud builds for debug builds on GCP instead running docker build locally on the current github actions runner: https://cloud.google.com/cloud-build/docs/building/build-containers. The way I understand this, it's basically running the same docker command on a some remote machine and sending back the logs, so we should see the same logs as now.

The benefit over using self-hosted runners is that it's either expensive to have VM running all the time and waiting for jobs, or time consuming to setup some smarter solution that would acquire and release VMs on demand as builds come and go.

This could run on the stadia-open-source GCP project. I can follow up on that if that's something you would be interested in.

Flakebi commented 4 years ago

Yep, the jenkins builds don't produce much information useful to the external contributors right now. As an alternative for the fast builder, have you considered hosting a github actions runner with a machine beefy enough to handle debug builds? This won't solve the issue of running CTS, as it requires physical GPUs.

I really like GitHub actions, their declarations and the ui, Jenkins feels much more “enterprise” (meaning overly complicated, my Jenkinsfile is about 400 lines and it took two weeks to get everything working). There are two problems with it for which I currently don’t know a solution:

Eventually, I’d like to test LLVM patches on phabricator automatically. I’m not quite sure how doable that is with the volume of changes there, but I like testing code before it gets pushed.
More importantly, it feels risky to run arbitrary and potentially harmful code on a machine with internet access. Even when cutting down permissions, I think it’s hard to prevent sending spam mails or DoS-ing some servers.

So my current plan is to download and setup things with a static script (maybe it makes sense to put it into the repo but it won’t be fetched from PRs) and run the build and tests without network access.

Flakebi commented 4 years ago

Things, that I think are still missing from CI:

Running parts of CTS with sanitizers and assertions (should probably be a release build)
Measuring run-time performance
Measuring compile-time performance
More static analyzers: cppcheck, maybe oclint, @kuhar mentioned clang-tidy and clang static analyzer

If you have other ideas, I’d be interested to hear them.

Long-term statistics for run- and compile-time performance would be interesting. Performance measurements don’t need to run for every PR but the ability to request them would be useful. LLVM changes definitely come into play here and when there is the possibility to do performance measurements easily, I’d like to reuse it between llpc and llvm.

Thinking about it, the needed builds are then release+sanitizers+assertions and plain release. So, to take load off that to-be-built system, running a debug build on GCP sounds like a nice idea.

The benefit over using self-hosted runners is that it's either expensive to have VM running all the time and waiting for jobs, or time consuming to setup some smarter solution that would acquire and release VMs on demand as builds come and go.

I agree, using a pool of machines like GitHub actions or Jenkins would definitely be better. But for running untrusted builds and running things on GPUs, I don’t think we have a choice. Unless of course the stadia-open-source GCP project has machines with suitable GPUs, then we could run CTS there and with some care performance measurements probably too :)

Flakebi commented 4 years ago

Small status update: XGL_ENABLE_ASSERTIONS was merged internally, so it should appear on GitHub in two weeks.

kuhar commented 4 years ago

Measuring run-time performance Measuring compile-time performance

A thing a good proxy for finding performance regressions without having to use machines with GPUs might be running offline compilation of pipeline dumps and then running some statistics over the generated elfs. Mesa does something similar: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3151. I think this could be a good starting point to at least be able to tell that a commits is likely to be performance-affecting or not (based on vpgr counts, instruction counts, etc.).

Unless of course the stadia-open-source GCP project has machines with suitable GPUs, then we could run CTS there and with some care performance measurements probably too :)

I don't think GCP has any machines with AMD cards right now. If we were to use some internal VMs, the best we could do would be probably the same as the Jenkins builds, i.e., producing a short execution log with no way of reproducing failures.

Flakebi commented 4 years ago

As a heads up to CI maintainers (not sure how much this applies to the internal jobs), LLVM will start to increase the minimum CMake version to 3.13.4 in about two weeks. Reference: D78646

I somehow need to upgrade the CMake version of the ‘fast jenkins build’. @kuhar, the CMake version in the GitHub CI jobs won’t survive this.

kuhar commented 4 years ago

Thanks for pushing on this, @Flakebi. What is missing before we can enable sanitizers in CI builds?

kuhar commented 4 years ago

PR with sanitizers support: #805.

kuhar commented 4 years ago

We have CI bots running builds with assertions and sanitizers enabled now. I think this should cover the scope of this issue. @Flakebi can we close this issue, or is there anything else left?

Flakebi commented 4 years ago

It’s done, thanks for the reminder!

_{CTS with asan would be nice but that’s for another time}

GPUOpen-Drivers / llpc

[CI] Status / Release builds with assertions #727