Scientific sanity checks

mabruzzo commented 3 years ago

This issue is for discussing the design requirements for "gold standard testing" (since this is not necessarily obvious). I wanted to flesh out my own ideas and hear your ideas. As I write this out, I'm starting to think that we might not want to do this

High Level Description: The basic idea is to perform calculations under a standardized set of conditions and then cache the results (which serve as our "gold standard"). Then, after making changes we perform the same calculations and compare our result

Objectives: I would argue that there are 3 main objectives for this tool

Provide confidence that changes to one part of the code don't produce unexpected altered results in another part of the code.
Provide confidence that our intermediate calculations (e.g. continuum opacity) give an accurate answer. Unfortunately there is no "standardized answer set" for the intermediate calculations. In a sense the "gold standard" acts as our bootstrapped answer set.
(Optional) Possibly reuse the code both for unit tests and integration/end-to-end tests.

High-Level Design: To be useful, I think our tools need to consist of 3 pieces

We obviously need to run the test
We need to be able to generate new gold standard answers
We need to be able to easily visualize the difference between new results and the gold standard if there is disagreement. This might sound unnecessary, but I think this is crucial (without this, the rest of the framework becomes much more of a burden). I'm not sure what the best way to accomplish this is. (Ordinarily, I would recommend making a standalone command-line script. I'm less sure about what to do since this is Julia).

Other considerations:

What features are we using this approach to test?
Should this work through the traditional Julia unit-testing code? I'm starting to think that the answer is "no"
How do we want to serialize results?
- Should we package results for tests using similar functionallity together or keep one test per file?
- Should we use HDF5? If we package things in individual files, JSON might be better? If we want to use this for full integration tests, what is the output format of our synthetic spectra going to be?
- How much meta-data do we want to store in the result?
We need to make sure that there is a straight-forward procedure for introducing new tests
How do we set our tolerances?
Do we reuse the same results for different precisions? (It may be more necessary to keep a different answer set when producing full synthetic spectra)

Comparison to other testing approaches

"Gold standard" testing isn't the only tool at our disposal for verifying that our intermediate calculations give the correct result. Other types of tests include: 1. Comparing our synthetic spectra to the spectra for stars with known properties or the synthetic spectra produced by MOOG/Turbospectrum 2. There are some figures showing intermediate results of our calculations. For example, this is currently what we are doing with our continuum opacity calculations. 3. We could imagine extracting equivalent sections of code from MOOG/Turbospectrum and produce tables of expected results. Approaches 1 and 2 share the shortcoming that they don't provide strong constraints on all intermediate calculations. For example, consider a subdominant opacity source (that makes up say 1% of the total opacity). Approaches 1 and 2 could conceivably allow the result of the calculation to vary by 10%, in cases where the result should not have changed. Approach 1 also has the disadvantage that it's fairly expensive and it's not particularly useful for diagnosing where code got broken. Approaches 2 and 3 share the shortcoming that they are somewhat dependent on our choice of approximation. If our approximations differ from those used to produce the test answers, then the tests become less useful for addressing objective 1. In general, I would be much less keen on pursuing this "gold standard" approach, if approach 3 were easier to do. However, it's arguably the most difficult/error-prone because it would require us to significantly modify FORTRAN code. In general these other approaches provide a baseline for validating our "bootstrapped" answer tests.

ajwheeler commented 3 years ago

We talked about this on a call and I want to more clearly articulate what I think is the way to go. I propose that we write some code to generate a plots comparing two versions of the code to each other and to external data from observations and other packages.

I'm imagining one script to generate a bunch of data which is run on two versions of the code, and another script that ingests those files and makes several comparison plots. Hopefully this can be automated to some degree. Because we would generate the data on the fly, long-term storage isn't as much of a concern, although it's still relevant for storing external data. I think JLD2 is appropriate for internal data.

I think at least these plots should be included (all with big residuals panels):

a comparison to a solar spectrum with apples-to-apples binning and resolution
for a few different sets of stellar parameters (these should also include spectra from MOOG or turbospectrum):
- a full solar spectrum
- zoom-ins to a few lines of interest: perhaps H-alpha, the the Ca triplet, etc.
for the same sets of stellar parameters, but without comparison to other codes:
- plots of the number density of several species through the stellar atmospheres
- plots of the continuum and line opacities at a few atmospheric layers

I think that turning all this into a binary outcome ultimately wont be that helpful, since hopefully as the code improves our results will change for the better, and a picture is worth a thousand words.

ajwheeler commented 3 years ago

Suggestion: Gaia-ESO reference sample

andycasey commented 2 years ago

I have made some progress on implementing this:

https://github.com/andycasey/Korg.jl/actions/runs/3042613888

On every push it will compile MOOG, TurboSpectrum, install Python and Julia, install "grok" (to execute the tests), and install Korg from the package repository. It will then run a set of 'experiments', each one specifying a model photosphere, set of transitions, which spectral synthesis code(s) to execute, and various options. Right now it works in that it will execute one example with Korg and it will upload a figure of the resultant spectrum, but there are numerous to-do items:

[ ] Each artifact seems to be a zip file, meaning we can't upload a single image as an artifact and then reference that image in the build summary. Figure out a way around that.
- [ ] Include the reference images in the build summary (currently just "hello world" :rocket:)
[ ] Allow grok to take in many experiments in one YAML file so that we can execute all the Korg codes in one instance (reduce overhead time)
[ ] Decide what should be stored in an artefact of each build. The formatted input files for each execution? The output spectra? Just figures of the spectra? Intermediate calculations as well?
[ ] Each time GitHub actions is run we are interested to know what changed in the outputs. Should we define a reference set of spectra that were executed with some previous version of Korg? Or use the code from the previous commit?
[ ] Should we only execute MOOG/Turbospectrum when new experiments are added, and instead download their reference spectra for the on-push GitHub actions?
[ ] The experiment config files are currently stored in Korg.jl/tests/ScienceVerification... is that where they should live?
[ ] Running these experiments requires the transitions, photospheres, and potentially observed spectra. The transition lists can be pretty large (~tens of MB). Should we store these in the Korg.jl repository, or package them up and have GitHub actions retrieve them from Zenodo/elsewhere?
[ ] Right now this is installing the Korg version from the Julia package repository,... how do I get it to install from a GitHub repository (+ branch + hash?,.. eg the one triggering the build)

ajwheeler commented 2 years ago

Wow, this is cool!

When I looked into generating and posting images before, I concluded that we were going to have to host them elsewhere, but you probably know more about that than me at this point.

andycasey commented 2 years ago

I think we can use the imgur API. Free, but requires handing over a credit card in case we break free limits.

I think I need input from you guys about what other outputs you’d want saved or plotted for each time this is executed. My guess is that spectra alone is not enough.

ajwheeler / Korg.jl

Scientific sanity checks #10