Open emfdavid opened 1 year ago
I haven't dug into how exactly Xarray uses its test data, but here is what it looks like:
ls -lah
total 56
drwxr-xr-x 8 akerney staff 256B Aug 8 2021 .
drwxr-xr-x 53 akerney staff 1.7K Aug 8 2021 ..
-rw-r--r-- 1 akerney staff 1.2K Aug 8 2021 bears.nc
-rw-r--r-- 1 akerney staff 5.1K Aug 8 2021 example.grib
-rw-r--r-- 1 akerney staff 703B Aug 8 2021 example.ict
-rw-r--r-- 1 akerney staff 608B Aug 8 2021 example.uamiv
-rw-r--r-- 1 akerney staff 1.7K Aug 8 2021 example_1.nc
-rw-r--r-- 1 akerney staff 470B Aug 8 2021 example_1.nc.gz
ncdump -h example_1.nc
netcdf example_1 {
dimensions:
	lat = 5 ;
	lon = 10 ;
	level = 4 ;
	time = UNLIMITED ; // (1 currently)
variables:
	float temp(time, level, lat, lon) ;
		temp:long_name = "temperature" ;
		temp:units = "celsius" ;
	float rh(time, lat, lon) ;
		rh:long_name = "relative humidity" ;
		rh:valid_range = 0., 1. ;
	int lat(lat) ;
		lat:units = "degrees_north" ;
	int lon(lon) ;
		lon:units = "degrees_east" ;
	int level(level) ;
		level:units = "millibars" ;
	short time(time) ;
		time:units = "hours since 1996-1-1" ;

// global attributes:
		:source = "Fictional Model Output" ;
}
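A file matching that header could be generated with xarray; here is a minimal sketch, assuming the same dimensions and attributes as the ncdump output above, with arbitrary data values:

```python
import tempfile
from pathlib import Path

import numpy as np
import xarray as xr

# Rebuild the example_1.nc schema shown above; data values are arbitrary.
ds = xr.Dataset(
    data_vars={
        "temp": (
            ("time", "level", "lat", "lon"),
            np.random.rand(1, 4, 5, 10).astype("float32"),
            {"long_name": "temperature", "units": "celsius"},
        ),
        "rh": (
            ("time", "lat", "lon"),
            np.random.rand(1, 5, 10).astype("float32"),
            {"long_name": "relative humidity", "valid_range": [0.0, 1.0]},
        ),
    },
    coords={
        "lat": ("lat", np.arange(5, dtype="int32"), {"units": "degrees_north"}),
        "lon": ("lon", np.arange(10, dtype="int32"), {"units": "degrees_east"}),
        "level": ("level", np.array([1000, 850, 700, 500], dtype="int32"),
                  {"units": "millibars"}),
        "time": ("time", np.array([0], dtype="int16"),
                 {"units": "hours since 1996-1-1"}),
    },
    attrs={"source": "Fictional Model Output"},
)

# Write to a throwaway location; "time" stays the unlimited dimension.
out = Path(tempfile.mkdtemp()) / "example_1.nc"
ds.to_netcdf(out, unlimited_dims=["time"])
```

The level values (1000, 850, 700, 500) are made up for illustration; ncdump -h only shows the header, not the data.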
We probably don't need a ton of data for testing, but maybe more than Xarray has.
I chatted a little with @mpiannucci
He pointed to another library that carries test data in the tree https://github.com/wavespectra/wavespectra but mentioned that it can get problematic past 20 MB.
He also mentioned that the GFS and HRRR should be continually accessible archives in NODD.
I'm thinking we may want a small set of data in the repo for quick tests (up to about 1 MB), then access NODD data for more thorough tests. We could use pytest marks (see the Scientific Python packaging reference) so we can choose whether or not to test against remote data. We can strive to get minimum reproductions into the repo, but that shouldn't block us from moving forward.
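The pytest-marks idea could look like the sketch below; the marker name "network" is an assumption for illustration, not settled API:

```python
import pytest

# conftest.py sketch: register a custom marker so remote-data tests can be
# deselected. The marker name "network" is an illustration only.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "network: tests that fetch GFS/HRRR data from the NODD archives",
    )

# In a test module, tag the slow remote tests:
@pytest.mark.network
def test_open_remote_dataset_placeholder():
    pass  # a real test would open NODD data here
```

Local runs could then skip remote data with pytest -m "not network".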
We could also add a pre-commit hook that errors if we ever put too much test data into the repo, so we don't unwittingly blow things up. We should also probably make sure we keep another copy of the NODD data we're testing against somewhere.
@ocepaf mentioned uploading files to releases and then using pooch for access.
While we don't necessarily need to upload files to releases (except maybe to keep a copy of whatever files we use from NODD), pooch will make local testing nicer since it caches the files.
Okay, let's start with some data to check in, keeping it under 1 MB total. How about I build a script to generate some test data and check in both the script and four NetCDF files for FMRC? I will riff off the above example. The complex behavior is around the forecast run time and how to access it... that seems to require a couple of different groups of files to test against.
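A generator for those file groups might look like the sketch below; the 6-hourly run cadence, the variable name, and the file naming are all assumptions for illustration, not a settled layout:

```python
import tempfile
from pathlib import Path

import numpy as np
import xarray as xr

def write_fmrc_runs(out_dir: Path, n_runs: int = 4, n_steps: int = 6) -> list:
    """Write one small NetCDF file per forecast run time (hypothetical layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_runs):
        # One model run every 6 hours; the start date is arbitrary.
        run_time = np.datetime64("2021-08-08T00") + np.timedelta64(6 * i, "h")
        ds = xr.Dataset(
            {"temp": (("step", "lat", "lon"),
                      np.random.rand(n_steps, 5, 10).astype("float32"))},
            coords={
                "step": np.arange(n_steps).astype("timedelta64[h]"),
                "lat": np.arange(5, dtype="int32"),
                "lon": np.arange(10, dtype="int32"),
                "run_time": run_time,
            },
        )
        path = out_dir / f"run_{i:02d}.nc"
        ds.to_netcdf(path)
        paths.append(path)
    return paths
```

Checking in the script alongside the files would make it easy to regenerate or extend them later.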
Before checking in any data (even on another branch) how about tossing it into Google Drive or similar and sharing it?
If it can be reliably built with a script quickly, then we probably don't need to check it in at all. Both Xarray and Kerchunk do some of that.
❓ Question
How best to provide test data for unit testing xarray_fmrc behavior?
📎 Additional context
Must work in local development and CI. Lightweight solutions that make it easy to write tests are important.
Ideas from initial Slack DM with Alex:
- a with TemporaryDirectory(suffix=".test") block in the run method of your test class

Please edit to add more...
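That TemporaryDirectory idea could look like this minimal sketch, generating throwaway data inside the test itself:

```python
from pathlib import Path
from tempfile import TemporaryDirectory

def test_with_throwaway_dir():
    # Generate whatever files the test needs inside a throwaway directory;
    # everything under it is deleted when the with-block exits.
    with TemporaryDirectory(suffix=".test") as tmp:
        path = Path(tmp) / "example.nc"
        path.write_bytes(b"")  # a real test would write a small dataset here
        assert path.exists()
```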
Solution after discussion:
Lorem ipsum...