Closed: nilshempelmann closed this issue 4 years ago
@cehbrecht @huard is there a place to centralise test data? Or should data from the Ouranos THREDDS server be used?
@huard got it. Will take the files available on the Ouranos THREDDS server. They may be too large for testing purposes, so we'll add test data later on.
@nilshempelmann If you use data on our Ouranos THREDDS server, try to limit yourself to data under testdata/flyingpigeon/{cordex, cmip3, cmip5}
(see comment https://github.com/bird-house/birdhouse-deploy/issues/6#issuecomment-587939489).
This is for legal/size reasons and hopefully (I am not sure about this one) so that your notebooks also work against other THREDDS servers deployed as part of PAVICS at other organizations when testing under Jenkins.
@tlvu OK. got it.
Climate signal processes use ensembles as input. That's not available in testdata/flyingpigeon/{cordex, cmip3, cmip5}. I'll use what's there to get the notebooks running, and we'll find an option for the test data, including its licence.
@tlvu @cehbrecht I merged the PR, still with the hardcoded paths. Integration tests should be performed on at least two servers (e.g. pavics and bovec) as part of the release cycle, avoiding hardcoded paths and using the available small test data. Any opinions?
Link to first round of implementation PR https://github.com/bird-house/flyingpigeon/pull/321
Options are:

- Ouranos THREDDS server. Pros: already running and maintained. Cons: devs need to ask a PAVICS admin to add new files; requires download to a local server.
- GitHub Large File Storage. Pros: can use GitHub permissions. Cons: no DAP access; requires download to a local server.
- Synthetically generated local files. Pros: no memory footprint, easy to modify, can be customized for testing. Cons: more up-front work; not available from the client (unless we write a new test_dataset process that would return all the test files).
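As a sketch of the "synthetically generated local files" option: a tiny CF-like NetCDF file can be generated on the fly. The variable name, grid sizes, and the xarray dependency here are illustrative assumptions, not something agreed on in this thread.

```python
import numpy as np
import xarray as xr


def make_test_dataset(path="tasmax_test.nc", ntime=4):
    """Write a minimal synthetic dataset, small enough to regenerate in any test run."""
    ds = xr.Dataset(
        {
            # Hypothetical variable: 4 time steps on a 2x3 lat/lon grid.
            "tasmax": (
                ("time", "lat", "lon"),
                np.random.default_rng(0).random((ntime, 2, 3)),
            )
        },
        coords={
            "time": np.arange(ntime),
            "lat": [45.0, 46.0],
            "lon": [-75.0, -74.0, -73.0],
        },
    )
    ds.to_netcdf(path)
    return path
```

Because the file is seeded and rebuilt on demand, there is nothing to version or license, which sidesteps the "who changed the test data" problem entirely.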
Option 2 "Github Large File Storage" and option 3 "Synthetically generated local files" provide traceability (a history of commits), so we know who, when, and why a change to the test data occurred, and we can revert if needed. They also make the dataset reproducible on another THREDDS server if needed.
Option 1 "Ouranos THREDDS server" is good to ensure the production Thredds server is working properly.
I think the best solution would be to combine them to get the advantages of both: the test data deployed to the THREDDS server would itself be version controlled, so it stays traceable and reproducible.
Edit:
It might just be my experience, but I remember git-lfs being a bit of a hassle to manage properly. I would opine for options 1 and 3, but if there is a way to ensure that git-lfs doesn't break any existing repositories (maybe we can make a data-specific repo that can be called with git clone?), I would be all for it.
> Integration tests should be performed on at least two servers (e.g. pavics and bovec) as part of the release cycle, avoiding hardcoded paths and using the available small test data. Any opinions?
@nilshempelmann Oh, I didn't know there was a PAVICS deployment on a "bovec" server! What's the full hostname, just so I can check it out?
So I agree with you: integration tests should be done against real servers. Our Jenkins is already testing against pavics.ouranos.ca, so it will be trivial for us to add the FP notebooks to the collection of notebooks we test nightly (right now Pavics-sdi and Finch). The FP notebooks have to pass, otherwise our nightly will always fail, defeating the purpose of continuous testing.
So if you can make your test data as small as possible, with proper licencing, I can help host it on our Ouranos Thredds to remove the hardcode.
As for notebook compatibility with the "bovec" server: if bovec is also a PAVICS deployment, the notebooks should also be compatible.
> It might just be my experience, but I remember git-lfs being a bit of a hassle to manage properly. I would opine for options 1 and 3, but if there is a way to ensure that git-lfs doesn't break any existing repositories (maybe we can make a data-specific repo that can be called with git clone?), I would be all for it.
If the test data files are small enough (maybe 5 MB or less), there are not too many of them (try to make the same test data reusable across several test cases), and they do not change as frequently as source code files, then a regular git repo is probably okay in terms of performance and disk space consumption. It all depends on how disciplined we are when creating those test data files.
@tlvu https://bovec.dkrz.de/ It's the playground and sandbox for several projects, as well as the birdhouse demo server. Some of the WPS are directly related to the C3S API (https://climate.copernicus.eu/climate-data-store). When we design tests, they should be interoperable with these servers.
> Options are:
>
> - Ouranos THREDDS server Pros: Already running and maintained. Cons: Devs need to ask a PAVICS admin to add new files, requires download to local server
> - Github Large File Storage Pros: Can use github permissions Cons: No DAP access, requires download to local server
> - Synthetically generated local files Pros: No memory footprint, easy to modify, can be customized for testing Cons: More up-front work, not available from client (unless we write a new test_dataset process that would return all the test files)
@cehbrecht Isn't there also an option to use the Climate Data Store (CDS) or the DKRZ THREDDS server? Wouldn't a combination of the Ouranos THREDDS server and the CDS be a good solution for our purposes?
The CDS needs valid credentials in order to work, but I imagine we could generate some secret keys for Jenkins?
> The CDS needs valid credentials in order to work, but I imagine we could generate some secret keys for Jenkins?
Yes, Jenkins has a way to securely store credentials.
It's used in this notebook: https://github.com/Ouranosinc/pavics-sdi/blob/master/docs/source/notebooks/esgf-dap.ipynb

```python
import os

username = os.environ.get('ESGF_AUTH_USERNAME', '<your openid>')
password = os.environ.get('ESGF_AUTH_PASSWORD', '<password>')
```
In Copernicus we have started to build our own test data ... reduced real data: https://github.com/roocs/mini-esgf-data
CMIP6 data is now public on ESGF ... no access restrictions. But you need to select small files for tests ...
> In Copernicus we have started to build our own test data ... reduced real data: https://github.com/roocs/mini-esgf-data
@huard @tlogan2000 if you guys think it's useful, I can look into deploying those test datasets on our Thredds.
@cehbrecht is there a public Thredds server already hosting those test datasets? Maybe we can use that Thredds server instead of having to deploy those test datasets on ours?
@tlvu The data in that repo looks good. I say go for it.
Sounds good to me as well
> In Copernicus we have started to build our own test data ... reduced real data: https://github.com/roocs/mini-esgf-data
>
> @huard @tlogan2000 if you guys think it's useful, I can look into deploying those test datasets on our Thredds.
>
> @cehbrecht is there a public Thredds server already hosting those test datasets? Maybe we can use that Thredds server instead of having to deploy those test datasets on ours?
@tlvu The data size is quite small ... we don't need THREDDS. We just add it as a git submodule to our repos for testing. Example: https://github.com/roocs/daops/blob/master/.gitmodules
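The submodule approach above can be sketched like this; the target path `tests/mini-esgf-data` is an assumption for illustration, not taken from the daops repo:

```shell
# Vendor the shared test data repo as a git submodule (sketch).
git submodule add https://github.com/roocs/mini-esgf-data tests/mini-esgf-data

# After cloning the main repo, collaborators (and CI) fetch the data with:
git submodule update --init --recursive
```

This keeps the data versioned and pinned to a specific commit from the main repo's point of view, while its history lives in a separate repository.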
@tlvu: how to fix this?

```
assert reference_output == test_output failed:

  'Metalink con..._subset.nc.\n' == 'Metalink con..._subset.nc.\n'
  Metalink content-type detected.
- Downloading to /tmp/tmpRANDOM/slp.2000_bbox_subset.nc.
?                        ^^^^^^
+ Downloading to /tmp/tmp_v07l0_1/slp.2000_bbox_subset.nc.
?                        ^^^^^^^^
- Downloading to /tmp/tmpRANDOM/slp.2001_bbox_subset.nc.
?                        ^^^^^^
+ Downloading to /tmp/tmp_v07l0_1/slp.2001_bbox_subset.nc.
?                        ^^^^^^^^
```
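The diff above fails only because the temp directory name is random on each run. A common way to make such comparisons deterministic (a sketch, not necessarily the fix that was applied here) is to normalize the random directory name before comparing:

```python
import re


def normalize_tmp_paths(text):
    """Replace random tempdir names like /tmp/tmp_v07l0_1/ with a stable token."""
    return re.sub(r"/tmp/tmp\w+/", "/tmp/tmpRANDOM/", text)
```

The test then compares `normalize_tmp_paths(test_output)` against a reference output that already contains the stable `/tmp/tmpRANDOM/` token.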
@nilshempelmann should be fixed.
@tlvu Great, fixed! Still missing: https://github.com/bird-house/flyingpigeon/issues/322
Description
Based on https://github.com/bird-house/flyingpigeon/pull/319