mabruzzo opened this issue 3 years ago
We talked about this on a call, and I want to articulate more clearly what I think is the way to go. I propose that we write some code to generate plots comparing two versions of the code to each other and to external data from observations and other packages.
I'm imagining one script that generates a bunch of data (run once per version of the code) and another script that ingests those files and makes several comparison plots. Hopefully this can be automated to some degree. Because we would generate the data on the fly, long-term storage isn't as much of a concern, although it's still relevant for storing external data. I think JLD2 is appropriate for the internal data.
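To make this concrete, here is a rough sketch of the two scripts I'm picturing. It assumes JLD2 + Plots; the Korg entry points, file names, and stored keys are placeholders, not a settled interface:

```julia
# --- generate_data.jl: run once per version of the code ---
using Korg, JLD2

atm      = Korg.read_model_atmosphere("sun.mod")             # placeholder inputs
linelist = Korg.read_linelist("linelist.vald"; format="vald")
sol      = Korg.synthesize(atm, linelist, Korg.format_A_X(), 5000, 5100)

outfile = get(ARGS, 1, "results.jld2")   # e.g. results_v1.jld2 / results_v2.jld2
jldsave(outfile; wavelengths=sol.wavelengths, flux=sol.flux)

# --- compare.jl: ingest two result files and make a comparison plot ---
using JLD2, Plots

v1 = load("results_v1.jld2")
v2 = load("results_v2.jld2")

p1 = plot(v1["wavelengths"], [v1["flux"] v2["flux"]];
          label=["version 1" "version 2"], ylabel="flux")
p2 = plot(v1["wavelengths"], v2["flux"] .- v1["flux"];
          label="v2 - v1", ylabel="residual", xlabel="wavelength [Å]")
plot(p1, p2; layout=(2, 1), link=:x)
savefig("comparison.png")
```

Storing only the arrays we actually plot keeps the JLD2 files small, and the comparison script doesn't need to know which code version produced them.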
I think at least these plots should be included (all with big residuals panels):
I think that turning all this into a binary outcome ultimately won't be that helpful, since hopefully our results will change for the better as the code improves, and a picture is worth a thousand words.
Suggestion: Gaia-ESO reference sample
I have made some progress on implementing this:
https://github.com/andycasey/Korg.jl/actions/runs/3042613888
On every push it will compile MOOG and TurboSpectrum, install Python and Julia, install grok (to execute the tests), and install Korg from the package repository. It will then run a set of 'experiments', each one specifying a model photosphere, a set of transitions, which spectral synthesis code(s) to execute, and various options. Right now it will execute one example with Korg and upload a figure of the resultant spectrum, but there are numerous to-do items:
- Allow `grok` to take in many experiments in one YAML file so that we can execute all the Korg experiments in one instance (reduce overhead time); there's a rough sketch of what that might look like below
- The experiments currently live in `Korg.jl/tests/ScienceVerification` ... is that where they should live?
- Keep the data files in the `Korg.jl` repository, or package them up and have GitHub Actions retrieve them from Zenodo/elsewhere?
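For the multi-experiment YAML, this is very roughly what I have in mind. The schema and field names below are invented for illustration (not grok's actual format), and the Korg calls assume the usual read_model_atmosphere / read_linelist / synthesize entry points:

```julia
using YAML, Korg

# invented schema: each experiment names a model photosphere, a transition
# list, the codes to run, and a wavelength range
spec = YAML.load("""
experiments:
  - name: solar_5000_5100
    photosphere: sun.mod
    transitions: linelist.vald
    codes: [korg, moog, turbospectrum]
    lambda: [5000.0, 5100.0]
  - name: arcturus_6500_6600
    photosphere: arcturus.mod
    transitions: linelist.vald
    codes: [korg]
    lambda: [6500.0, 6600.0]
""")

# run every Korg experiment in a single Julia session, so the
# startup/compilation cost is paid once rather than once per experiment
for ex in spec["experiments"]
    "korg" in ex["codes"] || continue
    atm      = Korg.read_model_atmosphere(ex["photosphere"])
    linelist = Korg.read_linelist(ex["transitions"]; format="vald")
    sol      = Korg.synthesize(atm, linelist, Korg.format_A_X(), ex["lambda"]...)
    # ... write out sol.wavelengths / sol.flux for the comparison step
end
```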
Wow, this is cool!
When I looked into generating and posting images before, I concluded that we were going to have to host them elsewhere, but you probably know more about that than me at this point.
I think we can use the imgur API. It's free, but requires handing over a credit card in case we exceed the free-tier limits.
I think I need input from you guys about what other outputs you'd want saved or plotted each time this is executed. My guess is that spectra alone are not enough.
This issue is for discussing the design requirements for "gold standard testing" (since this is not necessarily obvious). I wanted to flesh out my own ideas and hear yours. As I write this out, I'm starting to think that we might not want to do this.
High-Level Description: The basic idea is to perform calculations under a standardized set of conditions and then cache the results (which serve as our "gold standard"). Then, after making changes, we perform the same calculations and compare our results against the cached values.
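As a minimal sketch of the mechanics (the case names and functions like compute_Hminus_bf_opacity below are hypothetical stand-ins for whatever internal calculations we decide to pin down):

```julia
using JLD2, Test

# the standardized calculations, as zero-argument functions
const CASES = Dict(
    "Hminus_bf_5000K" => () -> compute_Hminus_bf_opacity(5000.0),
    "Hminus_ff_5000K" => () -> compute_Hminus_ff_opacity(5000.0),
)

# run on a trusted version of the code: cache the results (the "gold standard")
write_gold_standard(path) = jldsave(path; results=Dict(k => f() for (k, f) in CASES))

# run after making changes: recompute everything and compare to the cache
function check_gold_standard(path; rtol=1e-10)
    cached = load(path, "results")
    @testset "gold standard" begin
        for (name, f) in CASES
            @test isapprox(f(), cached[name]; rtol=rtol)
        end
    end
end
```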
Objectives: I would argue that there are 3 main objectives for this tool
High-Level Design: To be useful, I think our tools need to consist of 3 pieces
Other considerations:
Comparison to other testing approaches
"Gold standard" testing isn't the only tool at our disposal for verifying that our intermediate calculations give the correct result. Other types of tests include: 1. Comparing our synthetic spectra to the spectra for stars with known properties or the synthetic spectra produced by MOOG/Turbospectrum 2. There are some figures showing intermediate results of our calculations. For example, this is currently what we are doing with our continuum opacity calculations. 3. We could imagine extracting equivalent sections of code from MOOG/Turbospectrum and produce tables of expected results. Approaches 1 and 2 share the shortcoming that they don't provide strong constraints on all intermediate calculations. For example, consider a subdominant opacity source (that makes up say 1% of the total opacity). Approaches 1 and 2 could conceivably allow the result of the calculation to vary by 10%, in cases where the result should not have changed. Approach 1 also has the disadvantage that it's fairly expensive and it's not particularly useful for diagnosing where code got broken. Approaches 2 and 3 share the shortcoming that they are somewhat dependent on our choice of approximation. If our approximations differ from those used to produce the test answers, then the tests become less useful for addressing objective 1. In general, I would be much less keen on pursuing this "gold standard" approach, if approach 3 were easier to do. However, it's arguably the most difficult/error-prone because it would require us to significantly modify FORTRAN code. In general these other approaches provide a baseline for validating our "bootstrapped" answer tests.