ImperialCollegeLondon / pyrealm

Development of the pyrealm package, providing an integrated toolbox for modelling plant productivity, growth and demography using Python.
https://pyrealm.readthedocs.io/
MIT License

Add code to generate SPLASH golden datasets #185

Closed. davidorme closed this pull request 6 months ago.

davidorme commented 6 months ago

Description

This PR fixes #184. The changes:

ETA: Although there are a lot of files changed, many of them are data files, and the code files in splash_py_version should not be reviewed: they are intentionally a clone of the reference implementation, with only very minor edits for paths and formatting style.

Still to do:

Type of change

Key checklist

Further checks

codecov-commenter commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 93.80%. Comparing base (8ace3d7) to head (d16f792).

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           develop     #185   +/-   ##
========================================
  Coverage    93.80%   93.80%
========================================
  Files           28       28
  Lines         1468     1468
========================================
  Hits          1377     1377
  Misses          91       91
```


davidorme commented 6 months ago

Well that was no fun.

I was unable to reproduce the original test datasets when re-running the dataset creation: nothing major, just small floating point differences. It turns out I was passing the Julian day of the year into the old SPLASH run_one_day function as a single-value xarray.DataArray rather than as an integer. Fixing that gave a massive speed-up (960 seconds down to 26 seconds), but the DataArray input also seems to have done something truly odd to the calculations, producing those floating point differences. It's an int64 DataArray, so it isn't being treated as a float. I have no idea what is going on here, it's very puzzling, and it wasn't happening when I created the datasets originally.
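For reference, a minimal sketch of the difference (illustrative values only, not the actual dataset generation code):

```python
# Minimal sketch, not the actual generation script: passing a 0-d int64
# xarray.DataArray into a function that expects a plain scalar keeps all of
# the downstream arithmetic wrapped in xarray objects, which is slow and can
# change how intermediate values are computed.
import numpy as np
import xarray as xr

julian_day = xr.DataArray(np.int64(172))  # single-value int64 DataArray

# Slow path: run_one_day(..., julian_day) propagates DataArray arithmetic.
# Fast path: extract a built-in Python int first and pass that instead.
julian_day_int = int(julian_day.item())  # plain Python int, e.g. 172
```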

MarionBWeinzierl commented 6 months ago

Question: Are all the CSV and NetCDF files which would be uploaded to the repository in this PR needed to generate the datasets? Could an alternative way of storing the data (figshare or Zenodo, for example) make sense? GitHub is not really the place to store a lot of data, so maybe now would be the time to think about a PyRealm data repository somewhere else?

davidorme commented 6 months ago

> Could an alternative way of storing the data (figshare or Zenodo, for example) make sense?

Definitely could. I've tried to keep them reasonably small - and they are deliberately not in the main pyrealm package so that they don't bloat it. They get used in the documentation as annotated examples for usage, but I guess we could have a roll-your-own LFS style thing that fetches a Zenodo deposit? I'm inclined to leave it for now, but it is a good point.
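As a rough illustration of that roll-your-own approach, a helper along these lines could fetch a named file from a Zenodo record on demand. This is only a sketch: the record ID, filename, helper name, and the exact response fields of the Zenodo API are assumptions, not anything currently in pyrealm.

```python
# Hypothetical "roll-your-own LFS" fetch: download one named file from a
# Zenodo record, caching it locally. Record ID and filename are placeholders,
# and the response layout ("files" list with "key"/"links") is an assumption.
from pathlib import Path

import requests

ZENODO_RECORD = "1234567"          # placeholder record ID
FILENAME = "splash_test_grid.nc"   # placeholder data file name


def fetch_zenodo_file(record: str, filename: str, dest: Path) -> Path:
    """Download a named file from a Zenodo record unless already cached."""
    target = dest / filename
    if target.exists():
        return target

    meta = requests.get(f"https://zenodo.org/api/records/{record}").json()
    for entry in meta.get("files", []):
        if entry.get("key") == filename:
            dest.mkdir(parents=True, exist_ok=True)
            with requests.get(entry["links"]["self"], stream=True) as resp:
                resp.raise_for_status()
                with open(target, "wb") as out:
                    for chunk in resp.iter_content(chunk_size=1 << 20):
                        out.write(chunk)
            return target

    raise FileNotFoundError(f"{filename} not found in Zenodo record {record}")
```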

MarionBWeinzierl commented 6 months ago

> > Could an alternative way of storing the data (figshare or Zenodo, for example) make sense?
>
> Definitely could. I've tried to keep them reasonably small - and they are deliberately not in the main pyrealm package so that they don't bloat it. They get used in the documentation as annotated examples for usage, but I guess we could have a roll-your-own LFS style thing that fetches a Zenodo deposit? I'm inclined to leave it for now, but it is a good point.

I created an issue (#189) for this and put it into the icebox for now.

tztsai commented 6 months ago

Why is splash_py_version added to clone the functionality of pyrealm.splash?

davidorme commented 6 months ago

> Why is splash_py_version added to clone the functionality of pyrealm.splash?

@tztsai That directory contains a previous implementation that had all sorts of implementation and usability issues. This PR is all about using that previous version to generate regression test inputs, to validate that I have re-implemented the approach successfully.
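In other words, the reference implementation's outputs become golden files, and the new code gets checked against them along these lines. This is a sketch only: the file path, variable name, tolerance, and helper call are placeholders, not the actual test suite.

```python
# Illustrative sketch of a golden-file regression test, not the actual
# pyrealm test code. Path, variable name and tolerance are placeholders.
import numpy as np
import xarray as xr


def test_splash_matches_reference_outputs():
    # Outputs generated once with the reference SPLASH implementation.
    golden = xr.load_dataset("splash_reference_outputs.nc")  # placeholder path

    # Run the re-implemented model over the same inputs (hypothetical call).
    # result = run_pyrealm_splash(golden)
    result = golden  # stand-in so the sketch is self-contained

    np.testing.assert_allclose(
        result["aet"].values, golden["aet"].values, rtol=1e-8
    )
```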

davidorme commented 6 months ago

@surbhigoel77 and @tztsai - Can someone give an approving review so I can merge 😄

I've made a couple of minor additions to make it easier to generate benchmarks and access the reference implementation from other code.