APSIMInitiative / ApsimX

ApsimX is the next generation of APSIM
http://www.apsim.info

Refactor the APSIM build system #8230

Open hol353 opened 1 year ago

hol353 commented 1 year ago

Describe the new feature

Problems:

Some related issues:

Design decisions

hol353 commented 1 year ago

Capturing some more thoughts:

When a pull request (PR) is raised, the following workflow is created by a tool/script written by us. The tool would scan .apsimx files and extract simulations that need running. All simulation runs will write to .csv files.
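For illustration, a minimal sketch of such a scan, assuming .apsimx files are JSON and that simulation nodes can be recognised by their `$type` field (the matching rule and the example folder name are assumptions, not a settled design):

```python
import json
from pathlib import Path

def find_simulations(repo_root):
    """Walk a checkout, parse every .apsimx file (they are JSON) and yield
    (file path, simulation name) pairs.

    The rule used to spot a simulation node - a "$type" of
    "Models.Core.Simulation" - is an assumption for illustration only.
    """
    for path in Path(repo_root).rglob("*.apsimx"):
        with open(path, encoding="utf-8") as f:
            tree = json.load(f)
        stack = [tree]
        while stack:
            node = stack.pop()
            if not isinstance(node, dict):
                continue
            type_name = str(node.get("$type", "")).split(",")[0].strip()
            if type_name == "Models.Core.Simulation":
                yield path, node.get("Name")
            stack.extend(node.get("Children", []))

# Example: list everything the build tool would need to schedule.
# for path, name in find_simulations("Tests/Validation"):
#     print(path, name)
```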

The workflow then needs to be executed somewhere. The workflow items below are to happen asynchronously, but with dependencies between them (there is a sketch of generating such a workflow after the list below).

The workflow groups simulations into batches of 1000; the batch size should be configurable.

workflow:

  1. build apsim
  2. push docker images.
  3. depends on 2: run apsim for 1000 simulations of wheat and send csv output to temp storage
  4. depends on 2: run apsim for 1000 simulations of wheat and send csv output to temp storage
  5. depends on 2: run apsim for 1000 simulations of wheat and send csv output to temp storage
  6. depends on 2: run apsim for 1000 simulations of wheat and send csv output to temp storage
  7. depends on 2: run apsim for 1000 simulations of wheat and send csv output to temp storage
  8. depends on 2: run apsim for 1000 simulations of wheat and send csv output to temp storage
  9. depends on 2: run apsim for 200 simulations of wheat and 800 simulations of barley and send csv output to temp storage
  10. depends on 2: run apsim for 1000 simulations of barley and send csv output to temp storage
  11. depends on 2: run apsim for 1000 simulations of barley and send csv output to temp storage
  12. depends on 3-9: get csv values for wheat, build doc and send predicted-observed to POStats API.
  13. depends on 9-11: get csv values for barley, build doc and send predicted-observed to POStats API. ...
  14. depends on 3-13: create a Windows release of APSIM (using a linux container)
  15. depends on 3-13: create a Linux release of APSIM (using a linux container)
  16. depends on 3-13: create an OSX release of APSIM (using a linux container)
  17. depends on 14-16: Send pass/fail status flag to GitHub.
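A rough Python sketch of how a tool could generate that job graph. The batch size, the per-crop grouping (this sketch does not pack mixed-crop batches like item 9) and the job descriptions are illustrative assumptions, not a settled design:

```python
from dataclasses import dataclass, field

BATCH_SIZE = 1000  # configurable, as noted above

@dataclass
class Job:
    id: int
    description: str
    depends_on: list = field(default_factory=list)

def build_workflow(simulations_by_crop):
    """Build the job list sketched above: build, push images, one run job per
    batch of simulations, per-crop doc/POStats jobs, releases, GitHub status."""
    jobs = [Job(1, "build apsim"),
            Job(2, "push docker images", depends_on=[1])]
    run_ids_by_crop = {}
    for crop, sims in simulations_by_crop.items():
        ids = []
        for i in range(0, len(sims), BATCH_SIZE):
            batch = sims[i:i + BATCH_SIZE]
            jobs.append(Job(len(jobs) + 1,
                            f"run apsim for {len(batch)} simulations of {crop}, "
                            "send csv output to temp storage",
                            depends_on=[2]))
            ids.append(jobs[-1].id)
        run_ids_by_crop[crop] = ids
    crop_doc_ids = []
    for crop, ids in run_ids_by_crop.items():
        jobs.append(Job(len(jobs) + 1,
                        f"get csv values for {crop}, build doc, "
                        "send predicted-observed to POStats API",
                        depends_on=ids))
        crop_doc_ids.append(jobs[-1].id)
    all_run_ids = [i for ids in run_ids_by_crop.values() for i in ids]
    release_ids = []
    for platform in ("Windows", "Linux", "OSX"):
        jobs.append(Job(len(jobs) + 1,
                        f"create a {platform} release of APSIM (using a linux container)",
                        depends_on=all_run_ids + crop_doc_ids))
        release_ids.append(jobs[-1].id)
    jobs.append(Job(len(jobs) + 1, "send pass/fail status flag to GitHub",
                    depends_on=release_ids))
    return jobs

# e.g. build_workflow({"wheat": wheat_sims, "barley": barley_sims})
```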
jbrider commented 1 year ago
peter-devoil commented 1 year ago

Even at 1200/month, you're still ahead of buying your own hardware every 3 years. It's a legitimate cost of operation, just like salaries.

If the 96 core machine is inadequate, you can spread the compute load across more of them. Setting up a shared network amongst a group of VMs isn't hard (in Google Compute or OpenStack) - so all simulations could see the same "disk" area, with no need for complicated file transfer. You're already exploring methods to aggregate simulations; there shouldn't be a need to change output formats for that.

It's worth being sure there isn't an IO bottleneck here - only this morning I was spammed with offers to buy a 192 core gaming machine. They're not far away...

Would like to think that we could do platform-specific tests as well - e.g. to be sure the Mac installer hasn't broken again...

And of course - are we sure that this testing is telling us something useful?

hol353 commented 12 months ago

Is it possible to load a large number of these tests into memory and just change parameters before rerunning different configurations? For trials that only run for 1 season, it makes a huge difference.

Yep, I've thought about this and agree it would be much quicker. It's a big job, though, converting our existing validation data sets (~12,000 simulations) to this way of running.
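Purely as a sketch of the pattern (load once, mutate parameters, re-run), under the assumption that some ApsimX API can do the mutation and in-memory run; the `set_parameter` and `run` callables below are stand-ins, not real ApsimX calls:

```python
from typing import Any, Callable, Iterable

def run_parameter_sweep(sim: Any,
                        parameter_sets: Iterable[dict],
                        set_parameter: Callable[[Any, str, Any], None],
                        run: Callable[[Any], Any]) -> list:
    """Re-run an already-loaded simulation for each parameter set, so the
    .apsimx file is only parsed once. set_parameter and run stand in for
    whatever ApsimX API would actually do this; they are assumptions."""
    results = []
    for params in parameter_sets:
        for path, value in params.items():
            set_parameter(sim, path, value)   # e.g. map a model path to a new value
        results.append(run(sim))              # in-memory re-run, no reload from disk
    return results
```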

do we need to run 6000 wheat tests to know something changed? Can we rely on a smaller set?

We're not just looking for something to change. We're trying to convince ourselves that the model works and that it stays validated across a broad range of GxExM. Most modellers I talk to want more tests, not fewer.

can we find a reliable way to separate gui changes from model changes?

Yes, this would be nice. Am I brave enough to say that any change to GUI code or documentation won't break a model validation?

can we add some performance testing to be able to compare runtimes to ensure the model isn't getting slower.

Yes, we need to do this!
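A minimal sketch of what a runtime-regression check could look like, assuming per-suite baseline timings are stored from an earlier build; the baseline file format, the command being timed and the 20% threshold are all illustrative assumptions:

```python
import json
import subprocess
import time

def time_run(command):
    """Run one batch of simulations via its command line; return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start

def check_runtime(name, elapsed, baseline_file="runtime-baseline.json", tolerance=1.2):
    """Fail if this run is more than `tolerance` times slower than the baseline.
    The baseline format ({"Wheat": 310.5, ...}) and the 20% threshold are
    illustrative assumptions, not an agreed policy."""
    with open(baseline_file) as f:
        baseline = json.load(f)
    allowed = baseline.get(name, float("inf")) * tolerance
    if elapsed > allowed:
        raise SystemExit(f"{name}: {elapsed:.1f}s exceeds {allowed:.1f}s - model is getting slower")
    return elapsed
```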

Even at 1200/month, you're still ahead of buying your own hardware every 3 years. It's a legitimate cost of operation, just like salaries.

Agreed, we don't want to go back to buying our own hardware.

If the 96 core machine is inadequate, you can spread the compute load across more of them. Setting up a shared network amongst a group of VMs isn't hard (in Google Compute or OpenStack) - so all simulations could see the same "disk" area, with no need for complicated file transfer. You're already exploring methods to aggregate simulations; there shouldn't be a need to change output formats for that.

True. Having two 96-core machines would almost double the cost; they are much more expensive than, say, 200 dual-core VMs. I'm not too worried about changing the output file format. APSIM already supports CSV output via a command line switch. CSV is super easy to work with and upload to the POStats web API.
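For example, once a run has produced a predicted-observed CSV, uploading it could be as simple as an HTTP POST; the endpoint URL and the raw text/csv payload below are assumptions for illustration, not the real POStats interface:

```python
import urllib.request

def upload_csv(csv_path, url):
    """POST a predicted-observed CSV file to a stats web API and return the
    HTTP status. The endpoint URL and the assumption that it accepts raw
    text/csv are illustrative only."""
    with open(csv_path, "rb") as f:
        request = urllib.request.Request(url, data=f.read(),
                                         headers={"Content-Type": "text/csv"},
                                         method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status

# e.g. upload_csv("WheatPredictedObserved.csv", "https://postats.example/api/upload")
```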

Would like to think that we could do platform specific tests as well - eg to be sure the mac installer hasn't broken again..

Agreed. I guess we need to write a test that runs the installer to make sure it works.

And of course - are we sure that this testing is telling us something useful?

Agreed. It is telling us the models stay validated when we change something. It's a rather brute-force way of doing that, though. I do wonder if there is a simpler way, as @jbrider alludes to above.

jbrider commented 12 months ago

@hol353 I'm not suggesting we don't need to run all 6000 at some stage - I agree with having more and better tests - just not as part of every build.

lie112 commented 12 months ago

I had written a response last week but wasn't sure how much I understood the Git, validation and build process, so I deleted it. I too was thinking about how to reduce the load to just the builds that are needed. It is currently far too easy to raise a full-rebuild pull request (@Resolves XXXX) with no real understanding of its cost. I also thought GUI updates wouldn't really need a full validation run before a build, other than unit tests.

I also wondered whether there is a way to handle pull requests that aren't critical, but where you also don't want to wait an unknown amount of time until someone else raises a request that triggers a rebuild during quiet times. If there were a regular (say Friday night) build that picked up all outstanding commits, we'd at least know all changes would be in the Friday night upgrade and could work to that schedule. This is equivalent to @Working on XXXX, except we'd want the issue to be closed. Is there another Git tag we could use?

For code changes that are entirely within Models.CLEM, do we need to run all the wheat validations? Can we have a smarter way of knowing which namespaces fire which validations?
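One possible shape for that "smarter way" is a simple map from source folders to the validation suites they should trigger, driven by the pull request's diff. The folder prefixes and suite names below are illustrative assumptions, not an agreed scheme:

```python
import subprocess

# Folder prefixes -> validation suites to run (illustrative mapping only).
VALIDATION_MAP = {
    "Models/CLEM/": ["CLEM"],
    "Models/PMF/": ["Wheat", "Barley"],
    "ApsimNG/": [],   # GUI-only change: unit tests only
    "Docs/": [],      # documentation-only change: no validation runs
}

def suites_for_pull_request(base="origin/master"):
    """Return the set of validation suites to run, based on the files this
    branch changes relative to `base`. Unmapped files trigger everything."""
    diff = subprocess.run(["git", "diff", "--name-only", base],
                          capture_output=True, text=True, check=True)
    suites = set()
    for path in diff.stdout.splitlines():
        hits = [names for prefix, names in VALIDATION_MAP.items()
                if path.startswith(prefix)]
        if hits:
            for names in hits:
                suites.update(names)
        else:
            suites.add("ALL")   # be safe: an unknown area triggers the full run
    return suites
```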

But that led me to realise that, just to get a nod that code changes will be accepted, the whole validation needs to be run so we know nothing is broken. I assume this process, for each and every pull request, is what is contributing to the server CPU time. How often do changes result in a true validation failure in the build process?

Anyway, this isn't an answer, but it might suggest a few ways to approach this differently.