ImperialCollegeLondon / pyrealm

Development of the pyrealm package, providing an integrated toolbox for modelling plant productivity, growth and demography using Python.
https://pyrealm.readthedocs.io/
MIT License

Add code to generate SPLASH golden datasets #185

Closed. davidorme closed this pull request 6 months ago.

davidorme commented 6 months ago

Description

This PR fixes #184. The changes:

ETA: Although there are a lot of files changed, many of them are data files, and the code files in splash_py_version should not be reviewed: they are intentionally a clone of the reference implementation, with only very minor edits for paths and formatting style.

Still to do:

Type of change

Key checklist

Further checks

codecov-commenter commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 93.80%. Comparing base (8ace3d7) to head (d16f792).

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           develop     #185   +/-   ##
========================================
  Coverage    93.80%   93.80%
========================================
  Files           28       28
  Lines         1468     1468
========================================
  Hits          1377     1377
  Misses          91       91
```


davidorme commented 6 months ago

Well that was no fun.

I was unable to reproduce the original test datasets when re-running the dataset creation: nothing major, just small floating point differences. It turns out I was passing the Julian day of the year into the old SPLASH run_one_day function as a single-value xarray.DataArray rather than as an integer. Fixing that gave a massive speed-up (960 seconds down to 26 seconds), but the DataArray input also seems to have done something truly odd to the calculations, producing those floating point differences. It's an int64 DataArray, so it isn't being treated as a float. I have no idea what is going on here, it's very puzzling, and it wasn't happening when I created the datasets originally.
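For reference, a minimal sketch of the difference (illustrative values only, not the actual dataset generation code):

```python
# Minimal sketch, not the actual generation script: passing a 0-d int64
# xarray.DataArray into a function that expects a plain scalar keeps all of
# the downstream arithmetic wrapped in xarray objects, which is slow and can
# change how intermediate values are computed.
import numpy as np
import xarray as xr

julian_day = xr.DataArray(np.int64(172))  # single-value int64 DataArray

# Slow path: run_one_day(..., julian_day) propagates DataArray arithmetic.
# Fast path: extract a built-in Python int first and pass that instead.
julian_day_int = int(julian_day.item())  # plain Python int, e.g. 172
```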

MarionBWeinzierl commented 6 months ago

Question: Are all the CSV and NetCDF files which would be uploaded to the repository in this PR needed to generate the datasets? Could an alternative way of storing the data (figshare or Zenodo, for example) make sense? GitHub is not really the place to store a lot of data, so maybe now would be the time to think about a PyRealm data repository somewhere else?

davidorme commented 6 months ago

> Could an alternative way of storing the data (figshare or Zenodo, for example) make sense?

Definitely could. I've tried to keep them reasonably small - and they are deliberately not in the main pyrealm package so that they don't bloat it. They get used in the documentation as annotated examples for usage, but I guess we could have a roll-your-own LFS style thing that fetches a Zenodo deposit? I'm inclined to leave it for now, but it is a good point.
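As a rough illustration of that roll-your-own approach, a helper along these lines could fetch a named file from a Zenodo record on demand. This is only a sketch: the record ID, filename, helper name, and the exact response fields of the Zenodo API are assumptions, not anything currently in pyrealm.

```python
# Hypothetical "roll-your-own LFS" fetch: download one named file from a
# Zenodo record, caching it locally. Record ID and filename are placeholders,
# and the response layout ("files" list with "key"/"links") is an assumption.
from pathlib import Path

import requests

ZENODO_RECORD = "1234567"          # placeholder record ID
FILENAME = "splash_test_grid.nc"   # placeholder data file name


def fetch_zenodo_file(record: str, filename: str, dest: Path) -> Path:
    """Download a named file from a Zenodo record unless already cached."""
    target = dest / filename
    if target.exists():
        return target

    meta = requests.get(f"https://zenodo.org/api/records/{record}").json()
    for entry in meta.get("files", []):
        if entry.get("key") == filename:
            dest.mkdir(parents=True, exist_ok=True)
            with requests.get(entry["links"]["self"], stream=True) as resp:
                resp.raise_for_status()
                with open(target, "wb") as out:
                    for chunk in resp.iter_content(chunk_size=1 << 20):
                        out.write(chunk)
            return target

    raise FileNotFoundError(f"{filename} not found in Zenodo record {record}")
```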

MarionBWeinzierl commented 6 months ago

> > Could an alternative way of storing the data (figshare or Zenodo, for example) make sense?
>
> Definitely could. I've tried to keep them reasonably small - and they are deliberately not in the main pyrealm package so that they don't bloat it. They get used in the documentation as annotated examples for usage, but I guess we could have a roll-your-own LFS style thing that fetches a Zenodo deposit? I'm inclined to leave it for now, but it is a good point.

I created an issue (#189) for this and put it into the icebox for now.

tztsai commented 6 months ago

Why is splash_py_version added to clone the functionality of pyrealm.splash?

davidorme commented 6 months ago

> Why is splash_py_version added to clone the functionality of pyrealm.splash?

@tztsai That directory contains a previous implementation that had all sorts of implementation and usability issues. This PR is all about using that previous version to generate regression test inputs, to validate that I have re-implemented the approach successfully.
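In other words, the reference implementation's outputs become golden files, and the new code gets checked against them along these lines. This is a sketch only: the file path, variable name, tolerance, and helper call are placeholders, not the actual test suite.

```python
# Illustrative sketch of a golden-file regression test, not the actual
# pyrealm test code. Path, variable name and tolerance are placeholders.
import numpy as np
import xarray as xr


def test_splash_matches_reference_outputs():
    # Outputs generated once with the reference SPLASH implementation.
    golden = xr.load_dataset("splash_reference_outputs.nc")  # placeholder path

    # Run the re-implemented model over the same inputs (hypothetical call).
    # result = run_pyrealm_splash(golden)
    result = golden  # stand-in so the sketch is self-contained

    np.testing.assert_allclose(
        result["aet"].values, golden["aet"].values, rtol=1e-8
    )
```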

davidorme commented 6 months ago

@surbhigoel77 and @tztsai - Can someone give an approving review so I can merge 😄

I've made a couple of minor additions to make it easier to generate benchmarks and access the reference implementation from other code.