ajwheeler / Korg.jl

fast 1D LTE stellar spectral synthesis
https://ajwheeler.github.io/Korg.jl/
BSD 3-Clause "New" or "Revised" License

Scientific sanity checks #10

Open mabruzzo opened 3 years ago

mabruzzo commented 3 years ago

This issue is for discussing the design requirements for "gold standard testing" (since these are not necessarily obvious). I wanted to flesh out my own ideas and hear yours. As I write this out, I'm starting to think that we might not want to do this.

High Level Description: The basic idea is to perform calculations under a standardized set of conditions and cache the results (which serve as our "gold standard"). Then, after making changes, we perform the same calculations and compare the new results against the cached gold standard.

Objectives: I would argue that there are 3 main objectives for this tool:

  1. Provide confidence that changes to one part of the code don't unexpectedly alter the results produced by another part of the code.
  2. Provide confidence that our intermediate calculations (e.g. continuum opacity) give an accurate answer. Unfortunately there is no "standardized answer set" for the intermediate calculations. In a sense the "gold standard" acts as our bootstrapped answer set.
  3. (Optional) Possibly reuse the code both for unit tests and integration/end-to-end tests.

High-Level Design: To be useful, I think our tools need to consist of 3 pieces (see the sketch after this list):

  1. We obviously need to run the test
  2. We need to be able to generate new gold standard answers
  3. We need to be able to easily visualize the difference between new results and the gold standard if there is disagreement. This might sound unnecessary, but I think this is crucial (without this, the rest of the framework becomes much more of a burden). I'm not sure what the best way to accomplish this is. (Ordinarily, I would recommend making a standalone command-line script. I'm less sure about what to do since this is Julia).
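To make this concrete, here is a rough sketch (in Julia, using JLD2 and Plots) of how the three pieces could fit together. It's only an illustration: `compute_test_quantities`, the file name, and the tolerance are placeholders I'm inventing here, not existing Korg functionality.

```julia
using JLD2, Plots

const GOLD_FILE = "gold_standard.jld2"   # cached "gold standard" answers

# Piece 2: (re)generate the gold standard from the current code.
# compute_test_quantities is a placeholder returning e.g. Dict("continuum_opacity" => ..., "flux" => ...)
function regenerate_gold_standard(compute_test_quantities)
    results = compute_test_quantities()
    jldsave(GOLD_FILE; results)
end

# Piece 1: rerun the standardized calculations and compare against the cache
function check_against_gold_standard(compute_test_quantities; rtol=1e-8)
    gold = jldopen(f -> f["results"], GOLD_FILE, "r")
    current = compute_test_quantities()
    mismatched = [k for (k, v) in current if !isapprox(v, gold[k]; rtol=rtol)]
    isempty(mismatched) || @warn "gold-standard disagreement" mismatched
    return gold, current, mismatched
end

# Piece 3: visualize the disagreement for one cached quantity
plot_disagreement(gold, current, key) =
    plot(current[key] .- gold[key]; xlabel="index", ylabel="new - gold", title=key)
```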

Other considerations:

Comparison to other testing approaches: "Gold standard" testing isn't the only tool at our disposal for verifying that our intermediate calculations give the correct result. Other types of tests include:

  1. Comparing our synthetic spectra to the spectra of stars with known properties, or to the synthetic spectra produced by MOOG/Turbospectrum.
  2. Making figures showing intermediate results of our calculations. For example, this is currently what we are doing with our continuum opacity calculations.
  3. Extracting equivalent sections of code from MOOG/Turbospectrum and producing tables of expected results.

Approaches 1 and 2 share the shortcoming that they don't provide strong constraints on all intermediate calculations. For example, consider a subdominant opacity source that makes up, say, 1% of the total opacity: approaches 1 and 2 could conceivably allow the result of that calculation to vary by 10% in cases where it should not have changed at all. Approach 1 also has the disadvantage that it's fairly expensive and not particularly useful for diagnosing where the code got broken. Approaches 2 and 3 share the shortcoming that they are somewhat dependent on our choice of approximations: if our approximations differ from those used to produce the test answers, the tests become less useful for addressing objective 1. In general, I would be much less keen on pursuing this "gold standard" approach if approach 3 were easier to do; however, it's arguably the most difficult and error-prone option because it would require us to significantly modify FORTRAN code. Ultimately, these other approaches provide a baseline for validating our "bootstrapped" answer set.
ajwheeler commented 3 years ago

We talked about this on a call and I want to more clearly articulate what I think is the way to go. I propose that we write some code to generate plots comparing two versions of the code to each other and to external data from observations and other packages.

I'm imagining one script, run on each of the two code versions, that generates a bunch of data, and another script that ingests those files and makes several comparison plots. Hopefully this can be automated to some degree. Because we would generate the data on the fly, long-term storage isn't as much of a concern, although it's still relevant for storing external data. I think JLD2 is appropriate for the internal data.
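For example, the second script could look something like the sketch below, assuming the first script writes one JLD2 file per code version; the file names and the "wavelengths"/"flux" keys are made up for illustration.

```julia
using JLD2, Plots

# Read the wavelength grid plus one stored quantity from a file written by the data-generation script
read_run(file; key="flux") = jldopen(f -> (f["wavelengths"], f[key]), file, "r")

function comparison_plot(file_a, file_b; key="flux")
    λ_a, y_a = read_run(file_a; key)
    λ_b, y_b = read_run(file_b; key)
    @assert λ_a == λ_b "both runs must be evaluated on the same wavelength grid"

    # spectrum panel on top, big residuals panel below
    top    = plot(λ_a, [y_a y_b]; label=["version A" "version B"], ylabel=key)
    bottom = plot(λ_a, y_b .- y_a; label="", ylabel="residual (B - A)", xlabel="λ [Å]")
    plot(top, bottom; layout=(2, 1), link=:x)
end

# e.g. comparing a branch against the last release (file names are hypothetical)
comparison_plot("korg_release.jld2", "korg_branch.jld2")
```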

I think at least these plots should be included (all with big residuals panels):

I think that turning all this into a binary outcome ultimately won't be that helpful, since hopefully as the code improves our results will change for the better, and a picture is worth a thousand words.

ajwheeler commented 3 years ago

Suggestion: Gaia-ESO reference sample

andycasey commented 2 years ago

I have made some progress on implementing this:

https://github.com/andycasey/Korg.jl/actions/runs/3042613888

On every push it will compile MOOG and TurboSpectrum, install Python and Julia, install "grok" (to execute the tests), and install Korg from the package repository. It will then run a set of 'experiments', each one specifying a model photosphere, a set of transitions, which spectral synthesis code(s) to execute, and various options. Right now it works: it will execute one example with Korg and upload a figure of the resultant spectrum, but there are numerous to-do items remaining.
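To help think about what else should be captured, here is an illustrative Julia snippet of how one of those 'experiments' could be declared; the field names, file paths, and the `run_experiment` stub are hypothetical, not grok's or Korg's actual interface.

```julia
# Illustrative only: none of these names or paths are real.
experiments = [
    (
        name        = "solar_gaia_eso",
        photosphere = "photospheres/sun.mod",          # model photosphere (path made up)
        transitions = "linelists/gaia_eso.vald",       # line list (path made up)
        codes       = [:korg, :moog, :turbospectrum],  # which synthesis codes to run
        options     = (λ_start = 5160.0, λ_stop = 5190.0, vmic = 1.0),
    ),
]

# Stub: a real runner would dispatch to Korg / MOOG / TurboSpectrum and save the spectrum plus a figure
run_experiment(code, ex) = @info "would synthesize" code ex.name ex.options

for ex in experiments, code in ex.codes
    run_experiment(code, ex)
end
```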

ajwheeler commented 2 years ago

Wow, this is cool!

When I looked into generating and posting images before, I concluded that we were going to have to host them elsewhere, but you probably know more about that than me at this point.

andycasey commented 2 years ago

I think we can use the imgur API. It's free, but requires handing over a credit card in case we exceed the free-tier limits.

I think I need input from you guys about what other outputs you'd want saved or plotted each time this is executed. My guess is that spectra alone are not enough.