GitLab CI testing expansion #1644

Open long58 opened 1 month ago

Summary

This PR is a feature
It does the following:
- Adds SYCL CI checks on corona using the custom intel compiler
- Adds instructions for building and installing the custom intel compiler
- Adds OpenMP Target offload CI checks on lassen using ibm clang
- Fixes GitLab cross project testing for RAJAPerf on changes to RAJA develop branch.

In order to accomplish this, I needed to make changes to the radiuss_spack_configs github project to add new compilers, and modify the raja and camp package.py files. I plan to make another pull request there integrating those changes, but for now, my branch of radiuss spack configs (used here) is up to date with the latest version of radiuss_spack_configs/develop.

Also, some (~9) of the OpenMPTarget tests on lassen are not consistent in whether they pass or fail. They will pass sometimes, and fail other times, and I'm really not entirely sure why.

This feature completes all objectives of Github issue #1611

@long58 I started reading your PR.

First, thank you for fixing the multi-project! I was actually looking at that today, and couldn’t see what was wrong. I think GitLab changed the syntax at some point to remove those quotes. So, thanks for the help. Now, I see in RAJAPerf build_and_test.sh script that the multi-project will import RAJA@develop whatever the branch actually passed as "MP_BRANCH". This should be fixed to support any reference used (let’s start with the branch, but I think supporting commit hash would be useful too). I can work on that, but in this PR, we should at least use the branch passed with "MP_BRANCH".

We can see that the status reports a failure named "skipped-draft-pr". I think this will go away if you merge "develop" in you branch.

I’ll complete the review tomorrow.

@long58 good work on this so far. I'll review in more detail tomorrow. For now, I suggest that we remove the OpenMP target testing from this branch. Then, we can work on the OpenMP target stuff in a separate PR. It will take some effort to try various compilers to figure out what works. We've always had difficulty with compiler support for that back-end.

@adrienbernede I think it would be best to test RAJAPerf against the PR branch to be merged. However, it would be sufficient I think to see whether RAJA Perf passes or fails (so we can fix breakages in RAJA Perf) without requiring that check to pass before merging the RAJA branch. Is that what you have in mind?

@rhornung67 the new data point I find in your comment is that you don’t want to bind the RAJA PR status to the RAJAPerf status. I am find with that, and it can be achieve removing the strategy: depend in the trigger job (see .gitlab-ci.yml in RAJA.)

@rhornung67 The intel 19.1.2 + gcc 10.3.1 job is timing out on poodle. Poodle interactive jobs have a 30 min hard limit. If the intel job is gonna be longer than 30 minutes on a regular basis, then we should consider options:

deactivate that job on poodle
shorten that job somehow (if that were possible, I guess you’d have done so already)
use batch allocations Using batch allocations does not suits CI very well IMHO. I would advocate against it.

Given that poodle is quite small and allocations limited, I would argue that our CI on it should be more targeted. Currently any ruby job is also run on poodle, is that really necessary ?

@adrienbernede I think we can remove that job. No users are exercising that compiler anymore, as far as I know. I suggest adding a job for intel/2023.2.1, which is the latest oneapi compiler.

@rhornung67 the new data point I find in your comment is that you don’t want to bind the RAJA PR status to the RAJAPerf status. I am find with that, and it can be achieve removing the strategy: depend in the trigger job (see .gitlab-ci.yml in RAJA.)

Fixed this in my most recent commit.

@long58 good work on this so far. I'll review in more detail tomorrow. For now, I suggest that we remove the OpenMP target testing from this branch. Then, we can work on the OpenMP target stuff in a separate PR. It will take some effort to try various compilers to figure out what works. We've always had difficulty with compiler support for that back-end.

@adrienbernede I think it would be best to test RAJAPerf against the PR branch to be merged. However, it would be sufficient I think to see whether RAJA Perf passes or fails (so we can fix breakages in RAJA Perf) without requiring that check to pass before merging the RAJA branch. Is that what you have in mind?

I went ahead and deactivated the OpenMP Target job, so that we can spend more time on it in a separate PR, as you suggested.

Yes, let's tackle OpenMP target testing in a separate PR. @long58 I owe you some guidance on compiler choice. I hope to get back to you on this sometime this week.

@long58 while trying to fix the conflicts in radiuss-spack-configs for you, I stumbled on something: https://github.com/LLNL/radiuss-spack-configs/pull/103#pullrequestreview-2102132816

@long58 apparently Spack does not like your new compiler. I’m trying to find why...

@long58 apparently Spack does not like your new compiler. I’m trying to find why...

The Spack error message is pretty cryptic:

==> Error: concretization failed for the following reasons:

Cannot set the required compiler: raja%clang@19.0.0.sycl.gcc.10.3.1

I double-checked that service user rajasa is in the raja-dev group which is required to access the compiler. Could there be another permission issue someplace?

@adrienbernede It looks like the update from spack version 02-18 to 05-26 broke the concretization of the new compiler. Any thoughts as to why this might be the case? Everything works when I change the spack version back to the old one.

For the records, my impression so far is that there is a bug in Spack with the detection of the compiler @long58 added. Clang@19.0.0 does not seem to be an official release. Testing suggests that naming the compiler clang@19.0.0 would be fine (spack would extrapolate) but clang@19.0.0.something is too much. I haven’t looked further than that yet. If changing the spec to clang@19.0.0 is a reasonable workaround, I suggest you try that.

That does seem to work based on my testing. I'm fine with doing that for the sake of having it work, but I do think it would be a lot clearer if the extra stuff at the end were there 🤷🏻‍♀️. Since it's a spack bug though, I guess we could always change the name once they fix the bug. I've updated the spec, as well as the compiler def in the radiuss spack configs repo.

@long58 The build started but timed out. We should bump up the allocation time. Also, there are a lot of compiler warnings about unused variables, etc. When this PR passes checks and is good to merge, can you start a new PR branch and try to resolve the compiler warnings. Many appear easy to fix (unused variables, etc.).

Also, I think I found a path forward for adding a OpenMP target offload testing pipeline on lassen. It's suboptimal, but will at least lest us check compilation. I'll start a PR for this.

@rhornung67 I don't think it timed out actually, I think it just failed for some other reason. It failed in 14min 30 sec, but the timeout limit is 3 hours. Unfortunately, because of all of the compiler warnings, GitLab is truncating the output, so I can't actually see what the problem is. I think I might make a branch off of this one for the compiler warnings and fix some of them, just so I can actually see what the issue is that's killing it in the CI.

@long58 sounds like a good path

@long58 GitLab is out a.t.m. but I have some information for you:

Logs too long to display:

When a job is too long to display, you can still see the full log by hitting the raw button at the top of the job logs. This will display the full, raw, log.

Time out:

There are several ways for a job to time out it GitLab:

Timing out waiting for a runner
Timing out because the job lasted too long (the 3 hours you mention)
Timing out because the allocation itself timed out. Rich was referring to the later. In our pipelines, all the jobs on a given machine run within the same shared allocation defined in .gitlab/custom-jobs-and-variables.yml. For corona, it looks like:
```
CORONA_SHARED_ALLOC: "--exclusive --time-limit=60m --nodes=1 -o per-resource.count=2"
```
The time-limit is 60 minutes , shared between all the jobs, but 2 jobs can run at the same time (per-resource.count). So, if the jobs running before the failing one already had consumed 45m and 30sec, then it is possible that it timed out because of the allocation.

Time out messages: On corona and tioga, flux is pretty explicit. On lassen, lrun would give you an obscure type 2 error. On ruby and poodle, srun should be rather clear too.

@long58 @adrienbernede note that LC is doing maintenance on CZ GitLab and is expected to be down until Monday morning (7/8)

@adrienbernede I had tried the complete log before, but it still doesn't show the entire output. It shows the log from the beginning, but ends with "Job's log exceeded limit of 32866304 bytes. Job execution will continue but no more output will be collected." As for the timeout, I'll try upping the time limit and try again.

Merging develop into this branch seemed to fix the CI build issues.

Isn't this fun?!?

LLNL / RAJA