abkfenris / xarray_fmrc

A way to manage forecast Xarray datasets using datatrees
MIT License

Test data #14

Open emfdavid opened 1 year ago

emfdavid commented 1 year ago

Checklist

❓ Question

How best to provide test data for unit testing xarray_fmrc behavior?

📎 Additional context

Must work in local development and CI. Lightweight solutions that make it easy to write tests are important.

Ideas from initial Slack DM with Alex:

  1. Store test data in the main repo, as Xarray does; its datasets are extremely small. If larger files are needed, https://git-lfs.com/ is an option.
  2. Another option is to put test data in a separate repo like the xarray tutorials do with https://github.com/pydata/xarray-data
  3. Write code to generate test data on the fly as needed for testing (Use a with TemporaryDirectory(suffix=".test") block in the run method of your test class)
  4. We can put datasets in a public bucket (requires adding more dev/test dependencies)
  5. ...

Please edit to add more...
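Idea 3 above (generating data on the fly) could look something like the following sketch. All names here are hypothetical, and the dataset contents are fabricated purely for illustration:

```python
import tempfile
from contextlib import contextmanager

import numpy as np
import pandas as pd
import xarray as xr


def make_forecast_dataset(run_time: str, n_steps: int = 4) -> xr.Dataset:
    """Build a tiny synthetic forecast dataset for one model run."""
    times = pd.date_range(run_time, periods=n_steps, freq="h")
    temp = np.random.default_rng(0).normal(15.0, 2.0, size=(n_steps, 3, 3))
    return xr.Dataset(
        {"temp": (("time", "lat", "lon"), temp)},
        coords={"time": times, "lat": [10.0, 20.0, 30.0], "lon": [100.0, 110.0, 120.0]},
        attrs={"forecast_reference_time": run_time},
    )


@contextmanager
def temp_forecast_file(ds: xr.Dataset):
    """Yield a path to a throwaway NetCDF copy of ds, removed on exit."""
    with tempfile.TemporaryDirectory(suffix=".test") as tmpdir:
        path = f"{tmpdir}/forecast.nc"
        ds.to_netcdf(path)
        yield path
```

A test would then build a dataset, write it inside the context manager, and exercise xarray_fmrc against the temporary path with no files checked into the repo at all.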

Solution after discussion:

Lorem ipsum...

github-actions[bot] commented 1 year ago

Hello @emfdavid, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

abkfenris commented 1 year ago

I haven't dug into how exactly Xarray is using its test data, but here is what it looks like:

```
ls -lah
total 56
drwxr-xr-x   8 akerney  staff   256B Aug  8  2021 .
drwxr-xr-x  53 akerney  staff   1.7K Aug  8  2021 ..
-rw-r--r--   1 akerney  staff   1.2K Aug  8  2021 bears.nc
-rw-r--r--   1 akerney  staff   5.1K Aug  8  2021 example.grib
-rw-r--r--   1 akerney  staff   703B Aug  8  2021 example.ict
-rw-r--r--   1 akerney  staff   608B Aug  8  2021 example.uamiv
-rw-r--r--   1 akerney  staff   1.7K Aug  8  2021 example_1.nc
-rw-r--r--   1 akerney  staff   470B Aug  8  2021 example_1.nc.gz
```

```
ncdump -h example_1.nc
netcdf example_1 {
dimensions:
    lat = 5 ;
    lon = 10 ;
    level = 4 ;
    time = UNLIMITED ; // (1 currently)
variables:
    float temp(time, level, lat, lon) ;
        temp:long_name = "temperature" ;
        temp:units = "celsius" ;
    float rh(time, lat, lon) ;
        rh:long_name = "relative humidity" ;
        rh:valid_range = 0., 1. ;
    int lat(lat) ;
        lat:units = "degrees_north" ;
    int lon(lon) ;
        lon:units = "degrees_east" ;
    int level(level) ;
        level:units = "millibars" ;
    short time(time) ;
        time:units = "hours since 1996-1-1" ;

// global attributes:
        :source = "Fictional Model Output" ;
}
```
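For reference, a dataset with the same schema as the `example_1.nc` dump above can be built entirely in memory with xarray; the values below are fabricated, and this is just a sketch of how a checked-in file could instead be generated:

```python
import numpy as np
import xarray as xr


def make_example_dataset() -> xr.Dataset:
    """Recreate the structure of Xarray's example_1.nc with random values."""
    rng = np.random.default_rng(42)
    return xr.Dataset(
        {
            "temp": (
                ("time", "level", "lat", "lon"),
                rng.normal(20.0, 5.0, (1, 4, 5, 10)).astype("float32"),
                {"long_name": "temperature", "units": "celsius"},
            ),
            "rh": (
                ("time", "lat", "lon"),
                rng.uniform(0.0, 1.0, (1, 5, 10)).astype("float32"),
                {"long_name": "relative humidity", "valid_range": [0.0, 1.0]},
            ),
        },
        coords={
            "lat": ("lat", np.arange(5, dtype="int32"), {"units": "degrees_north"}),
            "lon": ("lon", np.arange(10, dtype="int32"), {"units": "degrees_east"}),
            "level": (
                "level",
                np.array([1000, 850, 700, 500], dtype="int32"),
                {"units": "millibars"},
            ),
            "time": (
                "time",
                np.array([0], dtype="int16"),
                {"units": "hours since 1996-1-1"},
            ),
        },
        attrs={"source": "Fictional Model Output"},
    )
```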
abkfenris commented 1 year ago

We probably don't need a ton of data for testing, but maybe more than Xarray has.

abkfenris commented 1 year ago

I chatted a little with @mpiannucci

He pointed to another library that carries test data in the tree https://github.com/wavespectra/wavespectra but mentioned that it can get problematic past 20 MB.

He also mentioned that the GFS and HRRR should be continuously accessible archives in NODD (NOAA Open Data Dissemination).


I'm thinking that maybe we want to have a small set of data in the repo for quick tests (up to about 1 MB), then we can access NODD data for more thorough tests. We could use pytest marks (see the Scientific Python packaging reference) so we can choose whether or not to test against remote data. We can strive to get minimum reproductions in the repo, but it shouldn't block us from moving forward.

We could also make a pre-commit hook that errors if we ever add too much test data to the repo, so that we don't inadvertently blow things up. We should also probably make sure we have another copy of the NODD data we're testing against somewhere.
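The pytest-marks idea could be wired up in a `conftest.py` along these lines. The mark name (`remote`) and the flag (`--run-remote`) are assumptions, not anything the project has today:

```python
# conftest.py sketch: register a "remote" mark and skip network-backed
# tests unless the user explicitly opts in with --run-remote.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--run-remote",
        action="store_true",
        default=False,
        help="run tests that fetch data from NODD",
    )


def pytest_configure(config):
    config.addinivalue_line("markers", "remote: test needs network access to NODD")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-remote"):
        return
    skip_remote = pytest.mark.skip(reason="needs --run-remote to fetch NODD data")
    for item in items:
        if "remote" in item.keywords:
            item.add_marker(skip_remote)
```

Tests against NODD would then be decorated with `@pytest.mark.remote`, and CI could run the quick in-repo suite by default while a scheduled job passes `--run-remote`.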

abkfenris commented 1 year ago

@ocepaf mentioned uploading files to releases and then using pooch for access.

While we don't necessarily need to upload files to releases (except maybe to make sure we have a copy of whatever files from NODD), pooch will make local testing nicer as it will cache the files.

emfdavid commented 1 year ago

Okay, let's start with some data to check in, and keep it under 1 MB total. How about I build a script to generate some test data and check in both the script and four NetCDF files to FMRC? I'll riff off the example above. The complex behavior is around the forecast run time and how to access it... That seems to require a couple of different groups of files to test against.
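A generator script for those four files could be sketched roughly as below. The run spacing, grid, variable names, and file naming scheme are all assumptions to be settled in review:

```python
# Hypothetical generator: four tiny forecast runs, 6 hours apart, each
# destined for its own NetCDF file so overlapping valid times can be tested.
import numpy as np
import pandas as pd
import xarray as xr


def build_run(run_time: pd.Timestamp, n_steps: int = 6) -> xr.Dataset:
    """One synthetic model run: a small temp cube plus its reference time."""
    valid_times = pd.date_range(run_time, periods=n_steps, freq="h")
    rng = np.random.default_rng(int(run_time.timestamp()))
    return xr.Dataset(
        {"temp": (("time", "lat", "lon"), rng.normal(15, 2, (n_steps, 2, 2)))},
        coords={
            "time": valid_times,
            "forecast_reference_time": run_time,
            "lat": [40.0, 41.0],
            "lon": [-70.0, -69.0],
        },
    )


def build_all_runs(start="2023-01-01"):
    """Map assumed file names to datasets for four successive runs."""
    runs = pd.date_range(start, periods=4, freq="6h")
    return {f"run_{t:%Y%m%dT%H}.nc": build_run(t) for t in runs}
```

The script would loop over `build_all_runs()` and call `to_netcdf` on each entry; since consecutive runs overlap in valid time, the files exercise exactly the forecast-run-time access patterns mentioned above.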

abkfenris commented 1 year ago

Before checking in any data (even on another branch) how about tossing it into Google Drive or similar and sharing it?

If it can be reliably built with a script quickly, then we probably don't need to check it in at all. Both Xarray and Kerchunk do some of that.