earth-system-radiation / rte-rrtmgp

RTE+RRTMGP is a set of codes for computing radiative fluxes in planetary atmospheres.
BSD 3-Clause "New" or "Revised" License

Data external to code repo? #203

Closed RobertPincus closed 9 months ago

RobertPincus commented 1 year ago

To date, data for the RRTMGP schemes have been distributed alongside the code in the repository, while data for verification and validation, including continuous integration, are distributed via FTP.

I propose to package the data used by the schemes, along with validation and verification data, via e.g. Zenodo, where each version of the data gets a new (but related) DOI. Something like:

rrtmgp_data/
rrtmgp_data/gas_optics/
rrtmgp_data/cloud_optics

rrtmgp_verification_data/
rrtmgp_verification_data/examples/
rrtmgp_verification_data/examples/rfmip-clear-sky
rrtmgp_verification_data/examples/all-sky

Files could be fetched for continuous integration and/or local use with the shell (e.g. wget https://zenodo.org/record/XXX/files/FILENAME.tgz) and/or with Pooch (https://www.fatiando.org/pooch/latest/protocols.html#digital-object-identifiers-dois) in Python.
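For anyone who prefers Python to wget, the shell fetch might look roughly like the sketch below, using only the standard library. The URL pattern is the one quoted in this comment; "XXX" and "FILENAME.tgz" remain placeholders, and the function names are hypothetical.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL pattern quoted in the comment above; the record id is not known yet.
ZENODO_FILE_URL = "https://zenodo.org/record/{record}/files/{filename}"


def zenodo_file_url(record: str, filename: str) -> str:
    """Build a download URL for one file of a Zenodo record."""
    return ZENODO_FILE_URL.format(record=record, filename=filename)


def fetch_and_unpack(url: str, dest: Path) -> Path:
    """Download a .tgz archive into dest and unpack it there."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return dest


if __name__ == "__main__":
    # Placeholders only: this prints the URL and does not touch the network.
    print(zenodo_file_url("XXX", "FILENAME.tgz"))
```

In CI, a single call such as fetch_and_unpack(zenodo_file_url(record, name), Path("data")) would stand in for the wget-plus-tar step once a real record exists.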

@Chiil @vectorflux @skosukhin Any thoughts?

Chiil commented 1 year ago

I like this, especially for the verification data.

m214089 commented 1 year ago

Might git-lfs be the right solution to the problem? It is designed for applications like this!

pernak18 commented 1 year ago

@RobertPincus i like both options. we use Zenodo at AER with scientists and publications in mind. Git-LFS is nicely integrated and i think more geared to software engineers. one thing i'd add with Zenodo is that there is also a Python API available for it.

public S3 buckets in the AWS OpenData registry might also be an option

RobertPincus commented 1 year ago

Thanks for the comments so far. @m214089 I'm inclined to go with Zenodo to manage data with DOIs, and retain the flexibility for Git solutions beyond GitHub.

Paging @dustinswales @dr0cloud for any input.

charleskawczynski commented 1 year ago

Over at RRTMGP.jl (and other CliMA repos), we've been using Julia artifacts, but those artifacts are stored on Box on the Caltech cluster, which is a bit out of sight and out of public view.

We've been thinking that it'd be nice to move the data to a GitHub repo (and use Git LFS). I don't (yet) have experience with it, but I'm keen to hear what you think of it if you do try it out.

Having the data version controlled seems like a nice benefit.

RobertPincus commented 1 year ago

Given the interest in keeping the data versioned, and seeing that managing Zenodo records seems to require some handwork, a proposal:

@jbusecke @skosukhin Any comments?

m214089 commented 1 year ago

@vectorflux Jonas has done a nice job in extpar and knows the git-lfs howto ...

jbusecke commented 1 year ago

Hey everyone, as someone who does not know the data and workflow in detail, I was wondering how much data we are talking about and how often it would be updated/amended. The answers would help me understand the use case better. I'd imagine the data is updated much less frequently than the code. Do you envision running code against different versions of the data (with a matrix strategy), or is the use case mostly to use the latest version, with older versions of the data kept for full reproducibility and provenance?

RobertPincus commented 1 year ago

@jbusecke The data are updated infrequently (order yearly). In total they are less than, say, a few hundred MB. We would be testing only the most recent combination of code+data.

jbusecke commented 1 year ago

In that case I wonder if git-lfs is worth the effort. The data is barely too large to keep in a regular GitHub repo, but I think this size can easily be downloaded from Zenodo during CI (at least I think that would not take too long). You could make a Zenodo archive with versioning and download the latest data via a bash command in a GitHub Action (I believe there is a way to always resolve a DOI to the latest version of a dataset). Or, if you also want to use this data in examples/docs, Pooch might be the way to centralize the data URLs.

That being said, I have limited experience with git-lfs and so I might be missing something here.
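To make the "centralize the data URLs" idea concrete, here is a rough sketch of a single Pooch registry that CI, examples, and docs could all import. It assumes a Zenodo DOI of the usual 10.5281/zenodo.<id> form with the id left as the "XXX" placeholder, and the archive names are invented for illustration. Pooch is a third-party package and is imported lazily.

```python
from typing import Optional

# Hypothetical values: "XXX" is a placeholder id and the file names are
# invented. Nothing here reflects an actual published record.
DATA_DOI = "10.5281/zenodo.XXX"
REGISTRY: dict[str, Optional[str]] = {
    "gas_optics.tgz": None,    # None skips the hash check; pin real hashes in practice
    "cloud_optics.tgz": None,
}


def base_url() -> str:
    """Base URL in Pooch's DOI notation; every file in REGISTRY resolves against it."""
    return f"doi:{DATA_DOI}/"


def make_fetcher():
    """Create one Pooch instance shared by CI, examples, and docs."""
    import pooch  # third-party; lazy import keeps the constants usable without it

    return pooch.create(
        path=pooch.os_cache("rrtmgp_data"),  # per-user cache directory
        base_url=base_url(),
        registry=REGISTRY,
    )
```

With a real DOI in place, make_fetcher().fetch("gas_optics.tgz") would download the file on first use and return the cached local path afterwards, so moving to a new data version means changing one constant.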

RobertPincus commented 1 year ago

Following @jbusecke I've opened a PR (#217) that moves the data to an external Git repo that will be synced with Zenodo at releases. It is less than 100 MB and doesn't use git-lfs.

Chiil commented 1 year ago

That sounds good. Will it then also be possible to fetch a file directly, rather than having to check out the repo?

RobertPincus commented 1 year ago

@Chiil Yes, we plan to publish a new release each time the data changes; this will be archived with a DOI at Zenodo and files could be fetched directly e.g. in Python with Pooch.
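A direct single-file fetch might look something like the sketch below. It assumes the release DOI and file name are known (neither exists in this thread yet, so both are parameters), and the helper names are hypothetical. Pinning a hash per release would let a stale or corrupted download fail loudly.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_spec(path: Path) -> str:
    """Return a file's hash in the 'sha256:<hex>' form that Pooch accepts."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return f"sha256:{digest}"


def fetch_release_file(doi: str, filename: str, known_hash: Optional[str] = None) -> str:
    """Fetch one file from the archive behind a DOI and return its local path.

    With known_hash=None, Pooch skips verification; pass the value recorded
    for the release (e.g. from sha256_spec) to check the download.
    """
    import pooch  # third-party, imported lazily

    return pooch.retrieve(url=f"doi:{doi}/{filename}", known_hash=known_hash)
```

The hash string could be computed once when a data release is cut and stored next to the code that depends on it.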

RobertPincus commented 1 year ago

Merged into develop with a6ccd35

RobertPincus commented 9 months ago

Closed with 3ac0636