bird-house / flyingpigeon

WPS processes for climate model data, indices and extreme events
http://flyingpigeon.readthedocs.io/en/latest/
Apache License 2.0

add tests and test-notebooks #320

Closed nilshempelmann closed 4 years ago

nilshempelmann commented 4 years ago

Description

based on https://github.com/bird-house/flyingpigeon/pull/319

nilshempelmann commented 4 years ago

@cehbrecht @huard Is there a place to centralise test data? Or should data from the Ouranos THREDDS server be used?

huard commented 4 years ago

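Options are:

* Ouranos THREDDS server
  Pros: Already running and maintained.
  Cons: Devs need to ask a PAVICS admin to add new files, requires download to local server

* Github Large File Storage
  Pros: Can use github permissions
  Cons: No DAP access, requires download to local server

* Synthetically generated local files
  Pros: No memory footprint, easy to modify, can be customized for testing
  Cons: More up-front work, not available from client (unless we write a new test_dataset process that would return all the test files)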

nilshempelmann commented 4 years ago

@huard Got it. I will use the files available on the Ouranos THREDDS server. They may be too large for testing purposes, so we'll add test data later on.

tlvu commented 4 years ago

@nilshempelmann If you use data on our Ouranos Thredds server, try to limit yourself to data under testdata/flyingpigeon/{cordex, cmip3, cmip5} (see comment https://github.com/bird-house/birdhouse-deploy/issues/6#issuecomment-587939489).

This is for legal and size reasons, and hopefully (I am not sure about this one) so that your notebooks also work against other Thredds servers deployed as part of PAVICS deployments at other organizations when testing under Jenkins.

nilshempelmann commented 4 years ago

@tlvu OK, got it. The climate signal processes use ensembles as input, and those are not available in testdata/flyingpigeon/{cordex, cmip3, cmip5}. I'll use what's there to get the notebooks running, and we'll find an option for the test data, including the licence.

nilshempelmann commented 4 years ago

@tlvu @cehbrecht I merged the PR with the hardcoded values still in place. Integration tests should be performed on at least two servers (e.g. pavics and bovec) as part of the release cycle, avoiding hardcoded values and using the available small test data. Any opinions?

tlvu commented 4 years ago

Link to first round of implementation PR https://github.com/bird-house/flyingpigeon/pull/321

tlvu commented 4 years ago

Options are:

* Ouranos THREDDS server
  Pros: Already running and maintained.
  Cons: Devs need to ask a PAVICS admin to add new files, requires download to local server

* Github Large File Storage
  Pros: Can use github permissions
  Cons: No DAP access, requires download to local server

* Synthetically generated local files
  Pros: No memory footprint, easy to modify, can be customized for testing
  Cons: More up-front work, not available from client (unless we write a new test_dataset process that would return all the test files)

Option 2 "Github Large File Storage" and option 3 "Synthetically generated local files" provide traceability (the commit history), so we know "who, when, why" a change to the test data occurred and can revert it if needed. They also provide reproducibility of the dataset on another Thredds server if needed.

Option 1 "Ouranos THREDDS server" is good to ensure the production Thredds server is working properly.

I think the best solution would be to combine both to have both advantages. I mean that the test data deployed to the Thredds server is version controlled so it can be traceable and reproducible.
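
As a rough illustration of option 3 and of the reproducibility point, here is a minimal sketch of a seeded generator whose output can be regenerated from version-controlled code. It is an assumption-laden example: the function names, the sine-plus-noise shape and the parameter values are all hypothetical, and it uses plain Python lists instead of NetCDF (a real test fixture would more likely be written with a NetCDF library) to stay dependency-free.

```python
import hashlib
import math
import random


def synthetic_daily_series(n_days=365, seed=42, mean=10.0, amplitude=8.0, noise=1.5):
    """Return a deterministic list of pseudo-temperatures, one value per day.

    Hypothetical helper: the seasonal sine shape and defaults are illustrative only.
    """
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return [
        round(mean + amplitude * math.sin(2 * math.pi * day / 365.0) + rng.gauss(0.0, noise), 3)
        for day in range(n_days)
    ]


def fingerprint(values):
    """Hash the values so a test can detect accidental changes to the dataset."""
    payload = ",".join(f"{v:.3f}" for v in values).encode("ascii")
    return hashlib.sha256(payload).hexdigest()


series_a = synthetic_daily_series()
series_b = synthetic_daily_series()
# Same seed, same data: the fingerprints match on every run.
print(len(series_a), fingerprint(series_a) == fingerprint(series_b))
```

Because the data is derived from the seed and the code, the commit history of the generator doubles as the history of the test data.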

Edit:

Zeitsperre commented 4 years ago

It might just be my experience, but I remember git-lfs being a bit of a hassle to manage properly. I would opt for options 1 and 3, but if there is a way to ensure that git-lfs doesn't break any existing repositories (maybe we can make a data-specific repo that can be called with git clone?), I would be all for it.

tlvu commented 4 years ago

Integration tests should be performed on at least two servers (e.g. pavics and bovec) as part of the release cycle, avoiding hardcoded values and using the available small test data. Any opinions?

@nilshempelmann Oh, I didn't know there's a PAVICS deployment on a "bovec" server! What's the full hostname? I'd like to check it out.

I agree with you that integration tests should be done against real servers. Our Jenkins is already testing against pavics.ouranos.ca, so it will be trivial for us to add the FP notebooks to the collection of notebooks we test nightly (right now Pavics-sdi and Finch). The FP notebooks should pass, else our nightly will always fail, defeating the purpose of continuous testing.

So if you can make your test data as small as possible, with proper licensing, I can help host it on our Ouranos Thredds to remove the hardcoded values.

As for notebook compatibility with the "bovec" server: if bovec is also a PAVICS deployment, the notebooks should be compatible with it as well.
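
One way to avoid a hardcoded server in a notebook is to read the host from the environment, so the same notebook can be pointed at different deployments by Jenkins. The sketch below is only an illustration: the variable name `TEST_SERVER_HOST`, the helper `opendap_url`, the example file name and the assumption that both servers expose THREDDS under `/thredds/dodsC/` are all hypothetical and not confirmed by the thread.

```python
import os

# Assumed names: TEST_SERVER_HOST and this default are illustrative only.
DEFAULT_HOST = "pavics.ouranos.ca"


def opendap_url(dataset_path, env=None):
    """Build an OPeNDAP URL for a dataset on the server under test."""
    env = os.environ if env is None else env
    host = env.get("TEST_SERVER_HOST", DEFAULT_HOST)
    # "/thredds/dodsC/" is the usual THREDDS OPeNDAP prefix; a deployment may differ.
    return f"https://{host}/thredds/dodsC/{dataset_path.lstrip('/')}"


# Hypothetical dataset path; only the directory layout comes from the thread.
path = "testdata/flyingpigeon/cmip5/example.nc"
print(opendap_url(path, env={}))
print(opendap_url(path, env={"TEST_SERVER_HOST": "bovec.dkrz.de"}))
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.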

tlvu commented 4 years ago

It might just be my experience, but I remember git-lfs being a bit of a hassle to manage properly. I would opt for options 1 and 3, but if there is a way to ensure that git-lfs doesn't break any existing repositories (maybe we can make a data-specific repo that can be called with git clone?), I would be all for it.

If the test data files are small enough (maybe 5 MB or less), there are not too many of them (try to make the same test data reusable across several test cases), and they do not change as frequently as source code files do, then a regular git repo is probably okay in terms of performance and disk space consumption.

It all depends on how disciplined we are when creating those test data files.
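
The discipline mentioned above could be enforced by a small check that fails when a test data file grows past the agreed limit. This is a minimal sketch under stated assumptions: the helper name `oversized_files` is hypothetical, and the 5 MB default simply mirrors the rough figure suggested in this comment.

```python
import tempfile
from pathlib import Path

MAX_BYTES = 5 * 1024 * 1024  # the "maybe 5 MB or less" suggested above


def oversized_files(data_dir, max_bytes=MAX_BYTES):
    """Return (relative path, size in bytes) for every file larger than max_bytes."""
    root = Path(data_dir)
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file() and p.stat().st_size > max_bytes
    )


# Demonstration on a throwaway directory with a tiny limit of 50 bytes.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "small.nc").write_bytes(b"\0" * 10)
    (Path(tmp) / "big.nc").write_bytes(b"\0" * 100)
    print(oversized_files(tmp, max_bytes=50))  # [('big.nc', 100)]
```

In a test suite, asserting that the returned list is empty for the test data directory would flag oversized files before they are committed.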

nilshempelmann commented 4 years ago

@tlvu https://bovec.dkrz.de/ It's the playground and sandbox for several projects, as well as the birdhouse demo server. Some of the WPS are directly related to the C3S API (https://climate.copernicus.eu/climate-data-store). When we design tests, they should be interoperable with these servers.

nilshempelmann commented 4 years ago

Options are:

  • Ouranos THREDDS server Pros: Already running and maintained. Cons: Devs need to ask a PAVICS admin to add new files, requires download to local server
  • Github Large File Storage Pros: Can use github permissions Cons: No DAP access, requires download to local server
  • Synthetically generated local files Pros: No memory footprint, easy to modify, can be customized for testing Cons: More up-front work, not available from client (unless we write a new test_dataset process that would return all the test files)

@cehbrecht Isn't there also an option to use the Climate Data Store (CDS) or DKRZ THREDDS server? Wouldn't a combination of Ouranos THREDDS server and CDS be a good solution for our purposes?

Zeitsperre commented 4 years ago

The CDS needs valid credentials in order to work, but I imagine we could generate some secret keys for Jenkins?

tlvu commented 4 years ago

The CDS needs valid credentials in order to work, but I imagine we could generate some secret keys for Jenkins?

Yes, Jenkins has a way to securely store credentials.

It's used in this notebook https://github.com/Ouranosinc/pavics-sdi/blob/master/docs/source/notebooks/esgf-dap.ipynb

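import os
# Credentials come from the environment (e.g. injected by Jenkins); the placeholders are only fallbacks.
username = os.environ.get('ESGF_AUTH_USERNAME', '<your openid>')
password = os.environ.get('ESGF_AUTH_PASSWORD', '<password>')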
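
A possible variation on this pattern, sketched here as an assumption and not as what the linked notebook does, is to detect missing credentials explicitly so that a test run can skip authenticated tests instead of trying to log in with placeholder values. The helper name `esgf_credentials` is hypothetical; the two variable names are the ones shown above.

```python
import os


def esgf_credentials(env=None):
    """Return (username, password), or None when either variable is unset or empty."""
    env = os.environ if env is None else env
    username = env.get("ESGF_AUTH_USERNAME")
    password = env.get("ESGF_AUTH_PASSWORD")
    if not username or not password:
        return None
    return username, password


creds = esgf_credentials()
if creds is None:
    print("ESGF credentials not set; skipping tests that need authentication")
else:
    print("ESGF credentials found for", creds[0])
```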

cehbrecht commented 4 years ago

Options are:

  • Ouranos THREDDS server Pros: Already running and maintained. Cons: Devs need to ask a PAVICS admin to add new files, requires download to local server
  • Github Large File Storage Pros: Can use github permissions Cons: No DAP access, requires download to local server
  • Synthetically generated local files Pros: No memory footprint, easy to modify, can be customized for testing Cons: More up-front work, not available from client (unless we write a new test_dataset process that would return all the test files)

@cehbrecht Isn't there also an option to use the Climate Data Store (CDS) or DKRZ THREDDS server? Wouldn't a combination of Ouranos THREDDS server and CDS be a good solution for our purposes?

In Copernicus we have started to build our own test data ... reduced real data: https://github.com/roocs/mini-esgf-data

CMIP6 data is now public in ESGF ... no access restrictions. But you need to select small files for tests ...

tlvu commented 4 years ago

In Copernicus we have started to build our own test data ... reduced real data: https://github.com/roocs/mini-esgf-data

@huard @tlogan2000 if you guys think it's useful, I can look into deploying those test datasets on our Thredds.

@cehbrecht is there a public Thredds server already hosting those test datasets? Maybe we can use that Thredds server instead of having to deploy those test datasets on ours?

Zeitsperre commented 4 years ago

@tlvu The data in that repo looks good. I say go for it.

tlogan2000 commented 4 years ago

Sounds good to me as well

cehbrecht commented 4 years ago

In Copernicus we have started to build our own test data ... reduced real data: https://github.com/roocs/mini-esgf-data

@huard @tlogan2000 if you guys think it's useful, I can look into deploying those test datasets on our Thredds.

@cehbrecht is there a public Thredds server already hosting those test datasets? Maybe we can use that Thredds server instead of having to deploy those test datasets on ours?

@tlvu The data size is quite small ... we don't need THREDDS. We just add it as a git submodule to our repos for testing. Example: https://github.com/roocs/daops/blob/master/.gitmodules

nilshempelmann commented 4 years ago

related: https://github.com/bird-house/flyingpigeon/issues/322

nilshempelmann commented 4 years ago

@tlvu How can I fix this?

assert reference_output == test_output failed:                                                 

  'Metalink con..._subset.nc.\n' == 'Metalink con..._subset.nc.\n'                              
    Metalink content-type detected.                                                             
  - Downloading to /tmp/tmpRANDOM/slp.2000_bbox_subset.nc.                                      
  ?                        ^^^^^^                                                               
  + Downloading to /tmp/tmp_v07l0_1/slp.2000_bbox_subset.nc.                                    
  ?                        ^^^^^^^^                                                             
  - Downloading to /tmp/tmpRANDOM/slp.2001_bbox_subset.nc.                                      
  ?                        ^^^^^^                                                               
  + Downloading to /tmp/tmp_v07l0_1/slp.2001_bbox_subset.nc.                                    
  ?                        ^^^^^^^^                                                             

tlvu commented 4 years ago

@nilshempelmann should be fixed.

  + Downloading to /tmp/tmp_v07l0_1/slp.2001_bbox_subset.nc.                                    
  ?                        ^^^^^^^^                                                             
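
The thread does not say how the mismatch was fixed. One common approach for this kind of failure, sketched here purely as an assumption, is to replace the random temporary directory name with a fixed placeholder before comparing outputs. The function name `normalize_tmp_paths` is hypothetical; the `tmpRANDOM` placeholder is the one visible in the reference output above.

```python
import re

# Matches directories such as /tmp/tmp_v07l0_1 created by Python's tempfile module.
TMP_DIR = re.compile(r"/tmp/tmp\w+")


def normalize_tmp_paths(text):
    """Replace random temporary directory names with a stable placeholder."""
    return TMP_DIR.sub("/tmp/tmpRANDOM", text)


reference = "Downloading to /tmp/tmpRANDOM/slp.2000_bbox_subset.nc."
actual = "Downloading to /tmp/tmp_v07l0_1/slp.2000_bbox_subset.nc."
# After normalization both sides are identical, so the comparison holds.
print(normalize_tmp_paths(actual) == normalize_tmp_paths(reference))  # True
```

The substitution is idempotent, so applying it to an already normalized reference output leaves that output unchanged.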

nilshempelmann commented 4 years ago

@tlvu Great, fixed! This issue is still outstanding: https://github.com/bird-house/flyingpigeon/issues/322