Use reusable workflows for Tier tests

ashiklom commented 4 months ago

This takes a stab at partially addressing #9. Two common steps --- installing SWELL and running a SWELL suite --- have been refactored into a reusable workflow that takes the suite name and test tier as input. This means that running a particular SWELL suite now takes much less code; e.g.:

swell-tier_2-build_jedi:
    needs: swell-tier_2-setup
    uses: ./.github/workflows/run-swell-suite.yml
    with:
      tier: "tier2"
      suite: "build_jedi"
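
For readers unfamiliar with reusable workflows: the called file declares its inputs under `on.workflow_call`. The following is a minimal sketch of what such a file might look like, not the actual `run-swell-suite.yml` from this PR. The job name, runner label, and echoed commands are placeholders.

```yaml
# Illustrative sketch of .github/workflows/run-swell-suite.yml (not the file from this PR)
name: Run SWELL suite

on:
  workflow_call:
    inputs:
      tier:
        description: "Test tier, e.g. tier1 or tier2"
        required: true
        type: string
      suite:
        description: "Name of the SWELL suite to run"
        required: true
        type: string

jobs:
  run-suite:
    # Assumption: a self-hosted runner; the real runner labels may differ.
    runs-on: self-hosted
    steps:
      - name: Install SWELL
        # Placeholder for the real installation commands.
        run: echo "install SWELL here"
      - name: Run suite
        # Inputs are passed through env rather than interpolated into the script.
        env:
          TIER: ${{ inputs.tier }}
          SUITE: ${{ inputs.suite }}
        run: echo "Running suite ${SUITE} for ${TIER}"
```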

I also took a stab at using a workflow matrix to iterate over suites for Tier 1. This shortens the code even more because we can represent all the Tier 1 tests as:

  tier_1_matrix:
    needs: swell-tier_1-setup
    strategy:
      matrix:
        suite: ["ufo_testing", "hofx", "3dvar"]
    # Reusable workflows are called at the job level, not as a step
    uses: ./.github/workflows/run-swell-suite.yml
    with:
      tier: "tier1"
      suite: ${{ matrix.suite }}

Some important caveats:

The run-swell-suite job is slightly janky in handling special cases for Tier 2 tests. There are a few things like if tier2 and jedi_bundle; then <do some stuff...> fi blocks. These may be OK for now, but if we start layering on complexity into that script, it'll quickly get out of hand. We may need to rethink that in the future.
This also doesn't implement what I originally proposed, which was pulling most of this logic out into shell scripts that are easier to run locally. That's because what we're defining here are workflows that are run by the SWELL repo, not this one, and I don't think there's a clean way to run shell scripts stored in this repo in workflows executed by the SWELL repo.
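
Regarding the first caveat, one possible way to keep the special cases from accumulating inside a single shell script is to express them as step-level `if:` conditions in the reusable workflow. This is only a sketch: the `tier` and `suite` input names follow the examples above, while the step names, the exact condition, and the echoed commands are placeholders rather than the logic in this PR.

```yaml
# Fragment of the steps list inside the reusable workflow's job (hypothetical)
steps:
  - name: Install SWELL
    run: echo "install SWELL here"

  - name: Tier 2 special-case setup
    # Runs only for the combination that needs extra handling.
    # The condition shown is an assumed example.
    if: inputs.tier == 'tier2' && inputs.suite == 'build_jedi'
    run: echo "do the Tier 2 specific setup here"

  - name: Run suite
    env:
      SUITE: ${{ inputs.suite }}
    run: echo "Running suite ${SUITE}"
```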

Dooruk commented 4 months ago

I see the logic here, but I'm not sure I can test it with the Discover permissions I have. Is there a way to generate the final YAML configuration that GitHub Actions uses right before it starts running, without actually running it, @jardizzo or @mathomp4?

I guess a more general question would be, is there a better strategy (obviously, things can always be designed better 😄) for the on-prem CI-workflows process with the constraints we have (which would require a discussion)?

jardizzo commented 4 months ago

Hi, sorry to be late to the conversation. I'm still trying to comprehend the challenges. I see the comments about not being able to invoke scripts that are outside of the SWELL repo. However, the workflows can have the main requirements to clone and install all repo dependencies up front. The remaining logic should be able to inherit or assume where the installed codes are located on Discover. Apologies if I'm still not up to speed on where things stand. I'll try to spend more time on this.

jardizzo commented 4 months ago

Hi Doruk, it sounds like you want to test things up to a point (i.e. an abbreviated workflow)? If so, you need another sanctioned workflow entry for testing. By sanctioned I mean that we need to enter the test workflow into the allowable runners list. You can then dispatch that test workflow. Let me know if this is what you are asking.

Dooruk commented 4 months ago

> Hi Doruk, it sounds like you want to test things up to a point (i.e. an abbreviated workflow)? If so, you need another sanctioned workflow entry for testing. By sanctioned I mean that we need to enter the test workflow into the allowable runners list. You can then dispatch that test workflow. Let me know if this is what you are asking.

Not sure if abbreviated workflow is what I'm asking. I would use the term "dry-run" but perhaps they are the same thing.

ashiklom commented 4 months ago

> Is there a way to generate the final YAML configuration that GitHub Actions uses right before it starts running, without actually running it, @jardizzo or @mathomp4?

Not that I know of. I don't know that there's a way around this other than to run the workflow.

> I guess a more general question would be, is there a better strategy (obviously, things can always be designed better 😄) for the on-prem CI-workflows process with the constraints we have (which would require a discussion)?

Agreed that this warrants discussion. But, here's a radical alternative:

We abandon the idea of using GitHub Actions to launch workflows and force users to run tests themselves. Instead, the role of GitHub Actions is to automatically validate that a workflow was successful by checking for a specific set of diagnostics in a predictable location. Here's how this might work:

1) Before opening a PR, a user runs tests themselves. The tests automatically produce, among other things, a machine-readable (YAML? JSON?) result file whose name corresponds to the current Git commit hash.
2) The user "publishes" that result file to a specific location on the NCCS HTTP data portal. (Or even, as a JSON file attached to the PR. I think
3) When the user submits their pull request, GitHub Actions automatically grabs the result file and checks the test results. If they meet certain criteria, the Actions check passes. If the criteria aren't met, or if the correct result file doesn't exist, the check fails.
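
The steps above could be sketched as a check-only workflow along the following lines. Everything specific here is an assumption for illustration: the base URL, the `<sha>.json` naming, and the JSON schema (a `tests` list whose entries carry a `status` field) are hypothetical, since the thread does not define them.

```yaml
name: Validate published test results

on:
  pull_request:

jobs:
  check-results:
    runs-on: ubuntu-latest
    env:
      # Hypothetical location; the real portal URL and layout are not specified.
      RESULTS_BASE_URL: https://example.invalid/swell-ci-results
      # Head commit of the PR (github.sha would be the temporary merge commit).
      HEAD_SHA: ${{ github.event.pull_request.head.sha }}
    steps:
      - name: Download result file
        # --fail makes curl exit non-zero on HTTP errors, so a missing file fails the job.
        run: curl --fail --silent --show-error --location --output results.json "${RESULTS_BASE_URL}/${HEAD_SHA}.json"
      - name: Check results
        # Assumed schema: {"tests": [{"name": "...", "status": "passed"}, ...]}
        # jq -e exits non-zero when the expression evaluates to false or null.
        run: jq -e '(.tests | length) > 0 and all(.tests[]; .status == "passed")' results.json
```

In this sketch the workflow never touches Discover beyond a read-only HTTP download, which matches the intent of keeping GitHub's interaction with NCCS read-only.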

The main disadvantage of this is that users have to run their tests themselves (and that it requires completely redesigning our test interface). But, in practice, most SWELL developers should probably do this anyway. This also has a bunch of advantages:

It's much easier to run tests locally, without having to debug GitHub Actions-specific quirks.
Tests can be run with arbitrary resources available to the user; we don't have to rely strictly on dastest nodes.
This minimizes NCCS security concerns because GitHub interacts with NCCS Discover in a purely read-only way.
Related to the above, we can open up the ability to submit PRs with tests to SWELL (or other things) to anyone with NCCS Discover access.

Dooruk commented 4 months ago

> We abandon the idea of using GitHub Actions to launch workflows and force users to run tests themselves.

Well, that is how it operates right now. Tier 1 tests get triggered after a PR is merged, so I (as the current maintainer) have to run the Tier 1 tests myself first to make sure nothing breaks; essentially, the user's burden falls onto the maintainer. So what you are suggesting is that we would have an automated CI plus a way for users to trigger the tests themselves locally? I can see this implemented as a single command that triggers multiple suites and configs, or, say, a swell run upload tier1 tests command.

But before we get to that, are there any alternatives to GitHub Actions? Could GitLab or CircleCI (which GEOSgcm already uses, see below) be more feasible alternatives (contingent upon whether we can use the free services or pay nominal fees)?

https://app.circleci.com/pipelines/github/GEOS-ESM/GEOSgcm

Like I said, happy to discuss this further.

jardizzo commented 4 months ago

Hi Alexey and Doruk, I think we need to have a discussion. It sounds like you are debating items that were debated at some length back when Dan was still around. The initial idea was that the SWELL repo must remain a public repo with potential outside users contributing. That eliminates GitLab. Any of the CI workflows can be set up to trigger on a PR. In fact, the workflows are written to pull the code associated with the PR. The cron-initiated runs use the latest committed code on the develop or main branch, so the same workflow code works within the context of the initiating code base. However, we also do not want all users to be able to launch on-prem CI. The solution was to create labels under SWELL that only some users are authorized to apply. Those users can assign the label to a PR, and then it will execute the CI workflows. Some code is needed within the dispatch workflow to check the label. Users can hack that logic, but this is more or less a weak constraint to avoid too many CI workflow launches. We can save the rest for when we meet. Let me know if you want to have a special meeting on this.
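
As a rough illustration of the label check described above, a dispatch workflow can gate its job on the PR's labels. This is a sketch under assumptions, not the existing dispatch workflow: the label name `run-ci`, the job name, and the runner label are hypothetical.

```yaml
name: Dispatch on-prem CI

on:
  pull_request:
    types: [labeled, synchronize]

jobs:
  dispatch-ci:
    # Skip the job unless the PR carries the authorized label.
    # "run-ci" is a hypothetical label name.
    if: contains(github.event.pull_request.labels.*.name, 'run-ci')
    runs-on: self-hosted
    steps:
      - name: Launch tiered CI
        # Placeholder for the real dispatch logic.
        run: echo "launch CI for this PR"
```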

ashiklom commented 4 months ago

> So what you are suggesting is that we would have an automated CI plus a way for users to trigger the tests themselves locally? I can see this implemented as a single command that triggers multiple suites and configs, or, say, a swell run upload tier1 tests command.

I'm actually suggesting that the automated CI does not execute workflows; it only checks that workflows have been executed successfully, by looking in a particular spot. So yes, your swell run tier1_tests --upload (or whatever) is what I had in mind.

> But before we get to that, are there any alternatives to GitHub Actions? Could GitLab or CircleCI (which GEOSgcm already uses, see below) be more feasible alternatives (contingent upon whether we can use the free services or pay nominal fees)?

Maybe. As long as any of these systems are triggering NCCS Discover jobs from outside of NCCS Discover, they will have the same issues. But there may be good local-only CI solutions.

@jardizzo To be clear, the proposal here is that no workflows ever get executed from GitHub. The point is that we force users to execute tests themselves locally on Discover. It sounds like this is basically what @Dooruk already does right now, in addition to the automated GitHub CI.

But yes, maybe we need a meeting to make sure we're all on the same page to hash out some details.

jardizzo commented 4 months ago

Happy to meet when it is convenient. We worked with a set of requirements when the on-prem CI was implemented: (1) automated nightly tests to make sure that we stay in sync with JCSDA changes, (2) automated reporting on CI results, (3) public repo, (4) free, (5) no data transfers, especially for the expensive Tier 3 tests. Maybe we no longer have the same requirements? I'm interested in learning more about using GitHub Actions to monitor only. There are some things we could do with the GH API from Discover if necessary. I suppose the GitHub-initiated CI would basically lay an egg on Discover that would then be visited by processes on Discover that would execute the tiered CI using the information contained in the egg? Probably better to discuss verbally!!

Dooruk commented 4 months ago

Thanks @jardizzo for writing these down. Some quick thoughts before we have a discussion at some point:

1) Regarding nightlies, we are considering freezing the builds using hashes. We recently found out that EMC updates their JEDI builds monthly, and we have yet to define our frequency. Akira is working on this on our end:

https://github.com/GEOS-ESM/swell/issues/345

2) The reporting part is still valid; however, GMAO, EMC, and JCSDA will likely share the same configuration builder tool. That way JCSDA could directly test jcb-gmao or jcb-emc configurations to ensure nothing breaks with their changes, without depending on our reporting. Of course, this is contingent upon GMAO's and JCSDA's commitment to the JEDI Config Builder tool.

3) Yes.

4) If there is a magical tool that fits our needs, could we pay for it? 😃

5) Data transfer from where?

Finally, I think the new Swell engineer will likely bring in some good insight, so we could wait until then to make a final CI/CD decision? For now we could proceed with what we have plus Alexey's modifications, simply changing the cron frequency for Tier 2 in the short term.
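
For reference, changing the cron frequency amounts to editing the `schedule` trigger of the Tier 2 workflow. The schedule below is only an example value (weekly instead of nightly); the thread does not say what the new frequency should be.

```yaml
on:
  schedule:
    # Example only: 06:00 UTC every Monday. Cron times in GitHub Actions are in UTC.
    - cron: "0 6 * * 1"
  # Keeps the option of launching the workflow manually between scheduled runs.
  workflow_dispatch:
```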

jardizzo commented 4 months ago

By data transfer I meant that we tried to avoid any containerization process that would have necessitated moving large amounts of data. The "free" part refers to not paying any license fees for GitHub Enterprise (for example). I think we calculated $10-20K per year, but I need to go back and check. I'm glad that it sounds like you have a new hire assigned to engineer SWELL? Not sure if that is the case, but someone dedicated would go a long way toward finding optimal solutions. It might be good to know what the biggest bottleneck or challenge is with the on-prem CI/CD. We could approach NCCS and see whether they would be OK with relaxing the runner restrictions. That would basically enable workflows to be more easily installed. It might be possible to give runners permission to execute all workflows on a designated branch under a given workflow repo. I do see the allure of implementing things on NCCS and moving away from GitHub Actions' convoluted YAML syntax. The only disadvantage is that you lose the automatic triggering on GitHub events. I assume NCCS workflow development will use CYLC in order to simulate parallelism and dependencies for tasks.

GEOS-ESM / CI-workflows

Use reusable workflows for Tier tests #11