Closed: RobertPincus closed this issue 9 months ago.
I like this, especially for the verification data.
On 9 Jan 2023, at 20:31, Robert Pincus wrote:
To date, data for the RRTMGP schemes have been distributed alongside code in the repository, while data for verification and validation, including continuous integration, are distributed via FTP.
I propose to package the data used by the schemes, along with validation and verification data, via e.g. Zenodo, where each version of the data gets a new (but related) DOI. Something like:
rrtmgp_data/
rrtmgp_data/gas_optics/
rrtmgp_data/cloud_optics
rrtmgp_verification_data/
rrtmgp_verification_data/examples/
rrtmgp_verification_data/examples/rfmip-clear-sky
rrtmgp_verification_data/examples/all-sky
Files could be fetched for continuous integration and/or local use with the shell (e.g. wget https://zenodo.org/record/XXX/files/FILENAME.tgz) and/or with Pooch https://www.fatiando.org/pooch/latest/protocols.html#digital-object-identifiers-dois in Python.
@Chiil https://github.com/Chiil @vectorflux https://github.com/vectorflux @skosukhin https://github.com/skosukhin Any thoughts?
Might git-lfs be the right solution to this problem? It is designed for applications like this!
@RobertPincus I like both options. We use Zenodo at AER with scientists and publications in mind. Git-LFS is nicely integrated and, I think, more geared toward software engineers. One thing I'd add about Zenodo: there is also a Python API available for it.
Public S3 buckets in the AWS Open Data registry might also be an option.
Thanks for the comments so far. @m214089 I'm inclined to go with Zenodo to manage data with DOIs, and to retain flexibility for Git solutions beyond GitHub.
Paging @dustinswales @dr0cloud for any input.
Over at RRTMGP.jl (and other CliMA repos), we've been using Julia artifacts, but those artifacts are stored on Box on the Caltech cluster, which is a bit out of sight and out of public view.
We've been thinking it would be nice to move the data to a GitHub repo (and use Git LFS). I don't (yet) have experience with it, but I'm keen to hear what you think if you do try it out.
Having the data version controlled seems like a nice benefit.
Given the interest in keeping the data versioned, and seeing that managing Zenodo records seems to require some handwork, a proposal: the rrtmgp-data repo will be included as a Git submodule in the RTE+RRTMGP repo. @jbusecke @skosukhin Any comments?
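As a sketch, the submodule setup proposed above might look like the following; the rrtmgp-data repository URL is an assumption inferred from the earth-system-radiation organization used elsewhere in this thread.

```shell
# Hypothetical commands; the rrtmgp-data URL is a placeholder.
# Inside the RTE+RRTMGP repo, register the data repo as a submodule:
git submodule add https://github.com/earth-system-radiation/rrtmgp-data.git rrtmgp-data
git submodule update --init

# Consumers can then clone code and data together in one step:
git clone --recurse-submodules https://github.com/earth-system-radiation/rte-rrtmgp.git
```

One design note: a submodule pins an exact data commit to each code commit, which gives the code+data reproducibility discussed above without git-lfs.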
@vectorflux Jonas has done a nice job in extpar and knows the git-lfs how-to...
Hey everyone, as someone who does not know the data and workflow in detail, I was wondering how much data we are talking about, and how often it would be updated or amended. Answers to these questions would help me understand the use case. I'd imagine the data is updated much less frequently than the code. Do you envision running the code against different versions of the data (with a matrix strategy), or is the use case mostly to run the latest code with older versions of the data for full reproducibility and provenance?
@jbusecke The data are updated infrequently (on the order of yearly). In total they are less than, say, a few hundred MB. We would be testing only the most recent combination of code and data.
In that case I wonder if git-lfs is worth the effort. The data are barely too large to keep in a regular GitHub repo, but a dataset of this size can easily be downloaded from Zenodo during CI (at least I think that would not take too long). You could make a Zenodo archive with versioning and download the latest data (I believe there is a way to have a DOI always resolve to the latest version of a dataset) via a bash command in a GitHub Action. Or, if you also want to use this data in examples/docs, Pooch might be the way to centralize the data URLs.
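A sketch of what such a CI download step could look like as a GitHub Actions workflow fragment; the Zenodo record ID and archive name are placeholders, not a real record.

```yaml
# Hypothetical workflow step; "XXXXXXX" and the archive name are
# placeholders for the eventual rrtmgp-data Zenodo record.
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Fetch verification data from Zenodo
        run: |
          wget https://zenodo.org/record/XXXXXXX/files/rrtmgp-data.tgz
          tar -xzf rrtmgp-data.tgz
```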
That being said, I have limited experience with git-lfs and so I might be missing something here.
Following @jbusecke, I've opened a PR (#217) that moves the data to an external Git repo that will be synced to Zenodo at releases. The data are less than 100 MB and don't use git-lfs.
That sounds good. Will it also be possible to fetch files directly, rather than having to check out the repo?
@Chiil Yes, we plan to publish a new release each time the data change; each release will be archived with a DOI at Zenodo, and files can be fetched directly, e.g. in Python with Pooch.
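A minimal sketch of the direct-fetch idea in Python; the record ID and filename are placeholders, and the URL follows the standard Zenodo download scheme. Pooch's `retrieve` wraps the same kind of URL with local caching and optional checksum verification.

```python
# Build the direct-download URL for a file in a Zenodo record.
# The record ID and filename below are placeholders, not a real record.
def zenodo_file_url(record_id: str, filename: str) -> str:
    """Return the direct download URL for a file in a Zenodo record."""
    return f"https://zenodo.org/record/{record_id}/files/{filename}"

url = zenodo_file_url("1234567", "rrtmgp-data.tgz")
print(url)  # https://zenodo.org/record/1234567/files/rrtmgp-data.tgz

# With Pooch (third-party), the same URL can be fetched with caching and
# an optional checksum, e.g.:
#   import pooch
#   path = pooch.retrieve(url=url, known_hash=None)
```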
Merged into develop with a6ccd35
Closed with 3ac0636