Even a manual script we could run (on the Sandbox/NCI) to re-run every notebook in the repo and report passes/fails before we merge develop into master would be better than what we currently have.
The ideal solution, I think, is to have a minimal subset database that we automatically compute and keep up to date: run the notebooks under test against the real database in some instrumented way that simply records all the datasets they access, then recreate a database containing just those datasets, products and metadata. We could then use something like GitHub Actions to run tests against it. Note this will also be useful for local dev work and for running tests under our own infrastructure, so that tests don't impact database performance.
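To make the "instrumented" recording step concrete, here is a minimal sketch only (the wrapper and output filename are hypothetical, and a real solution would also need to capture the associated product and metadata definitions). It monkey-patches Datacube.find_datasets, which dc.load uses internally, to record the id of every dataset a notebook touches:

```python
# Hypothetical sketch: record every dataset accessed by notebooks under test,
# so a minimal test database can later be rebuilt from just those datasets.
import json
import datacube

_seen_ids = set()
_original_find = datacube.Datacube.find_datasets

def _recording_find_datasets(self, **search_terms):
    # Delegate to the real method, then note the ids of all datasets returned.
    datasets = _original_find(self, **search_terms)
    _seen_ids.update(str(ds.id) for ds in datasets)
    return datasets

datacube.Datacube.find_datasets = _recording_find_datasets

# ... execute the notebooks under test here (e.g. via nbconvert/nbval) ...

# Dump the accessed dataset ids; these can then be re-indexed into a small
# Postgres database used purely for testing.
with open("accessed_datasets.json", "w") as f:
    json.dump(sorted(_seen_ids), f, indent=2)
```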
Tagging @omad @uchchwhash @caitlinadams @alexgleith @emmaai @vnewey @GypsyBojangles @whatnick in case any of you have any thoughts - this is probably a medium/long-term goal, but it's the kind of thing that could help a lot in reducing the amount of manual notebook re-running and code checking we currently need to do every time a function is altered or a module version is updated.
heavily support ++++
This collection of notebooks and scripts is a vital piece of infrastructure for ourselves and our external stakeholders, and I fully support any measure to reduce the workload on maintaining the repo. I personally don't know anything about automated testing, but I'm happy to be allocated any tasks if/when this method materializes (I will also implement any solution on the de-africa repo).
For typical automated tests, you specify the pass/fail test very clearly. What do you expect in the case of a notebook?
For example, if more data is added to the index over time, a plot might look slightly different. Should that still pass?
It is probably fairly easy though to test that all notebooks run with no uncaught error messages.
I think it's also worth exploring whether the scripts should be converted to an installable python package/module (possibly through pip). This would likely help with maintenance, make importing them simpler, and mean that the useful tools could be incorporated easily into other DEA/ODC deployments.
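For illustration only, packaging the existing scripts could be as small as the sketch below (the dea_tools package name and the dependency list are placeholders, not a decided layout):

```python
# Hypothetical minimal setup.py, assuming the shared scripts are collected
# into a package directory called "dea_tools" (name illustrative only).
from setuptools import setup, find_packages

setup(
    name="dea-tools",
    version="0.1.0",
    description="Reusable utilities from the dea-notebooks repository",
    packages=find_packages(),
    install_requires=["datacube", "xarray", "numpy"],
)
```

Notebooks (and other DEA/ODC deployments) could then pip install the package and import the functions directly, instead of relying on relative paths into the scripts directory.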
@benjimin For this, I think simple is probably best, e.g. perhaps even just whether a notebook returned an error at any point in its run. The biggest problem we have is that changes to the environment/scripts cause notebooks to break; the breakage then lies dormant until a stakeholder tries the notebook themselves and hits errors, which isn't a great look.
If we could get even a simple list of passed/failed notebooks, that would help greatly in cutting down the workload of finding all the relevant notebooks to update after each change.
@robbibt I think we can start with a manually generated success/failure report: a single script launched after manually logging into the Sandbox/NCI.
Should be relatively straightforward to set up Travis to report if any notebooks are emitting error messages:
pytest --nbval-lax
What about notebooks that are slow to execute, however?
@benjimin they will all report errors on Travis, because there's no database access there.
Well, @robbibt could run:
pip install --user nbval
pytest --nbval-lax
For example: the beginner notebooks pass (aside from a deprecation warning), the initial template fails (because the os.path manipulations do not expect it to be at the root level), and the real world examples probably require a pytest --timeout=…
@benjimin That's very promising: it works well and gives a good summary of the issues:
(some of those fails haven't been reported yet, so good proof of concept for the kinds of things we want to pick up on!)
We'd probably need to use that list in the context of what notebooks are meant to be compatible with either the NCI or Sandbox, but it's a great start.
@robbibt regarding failures:
Opening_GeoTIFFs_NetCDFs.ipynb failures are due to a bad environment, see here: https://github.com/opendatacube/jupyterhub/issues/37 (it runs fine with the work-around of a downgraded netcdf lib).
Exporting_NetCDFs.ipynb is also down to the bad environment setup, but it might also fail if the output file was not cleaned up from a previous partial run, as I think the write function will fail if the file already exists.
Image_segmentation.ipynb failures are also due to the environment -- no support for kea files in the sandbox.
@alexgleith we can probably assemble a good environment smoke test for testing sandbox docker deploys from a lot of these notebooks; the problem is db access, and s3 to a lesser extent I suppose. Also some take a long time to complete, so we will need to pick a good set (fast, and using libraries we don't use directly in datacube).
DB access could be done by indexing ONLY the datasets that are required into a Postgres image, either fixed or ... routinely indexed. Perhaps that's a good excuse to write a nice indexing Docker image? It can be tested by indexing the data required by ☝️. Not sure how to get that to result in an actual postgres DB image, but it's plausible.
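As a rough sketch of what "indexing only the required datasets" might look like against a throwaway Postgres-backed index (the dataset document paths are placeholders, and it assumes datacube system init has been run against the test database and the relevant product definitions have already been added):

```python
# Sketch: index a handful of known dataset documents into a small
# Postgres-backed datacube index used only for testing.
import yaml
import datacube
from datacube.index.hl import Doc2Dataset

# Assumes the datacube config points at the throwaway test database.
dc = datacube.Datacube(app="test-indexer")
resolver = Doc2Dataset(dc.index)

# Placeholder list of (metadata document, uri) pairs required by the notebooks.
required = [("ls8_dataset.yaml", "file:///data/ls8_dataset.yaml")]

for path, uri in required:
    with open(path) as f:
        doc = yaml.safe_load(f)
    dataset, err = resolver(doc, uri)
    if err is not None:
        raise RuntimeError("Could not resolve {}: {}".format(path, err))
    dc.index.datasets.add(dataset)
```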
We can test against S3 data easily enough, so long as we set the AWS_NO_SIGN_REQUEST flag and the data is public.
@Kirill888 Yep, some of those fails are to be expected as not all notebooks are written to be compatible with the Sandbox or NCI (e.g. Image_segmentation.ipynb)... the errors with the two NetCDF notebooks though are exactly the kind of thing that would be great to be able to automatically catch.
For processing time, a lot of the notebooks have recently become slower with Landsat C3 data now coming from THREDDS... we should revisit some of them and make sure the data loads are as minimal as possible.
@alexgleith Using this image: opendatacube/sandbox (https://hub.docker.com/layers/opendatacube/sandbox/latest/images/sha256-a14601514daf26e8374ffadd4bb89ad508dd8e46e6c143082b1014b067c5d6e0?context=repo) I get errors only in the segmentation notebook, which requires rsgislib, a conda-only library.
jovyan@notebook-7844f944b6-nnckr:~/dea-notebooks/Frequently_used_code$ AWS_NO_SIGN_REQUEST=true pytest --nbval-lax
============================================================================= test session starts =============================================================================
platform linux -- Python 3.6.9, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /home/jovyan/dea-notebooks/Frequently_used_code
plugins: nbval-0.9.5
collected 261 items
Animated_timeseries.ipynb ................ [ 6%]
Applying_WOfS_bitmasking.ipynb ................. [ 12%]
Calculating_band_indices.ipynb ......... [ 16%]
Contour_extraction.ipynb .............. [ 21%]
Exporting_GeoTIFFs.ipynb .......... [ 25%]
Exporting_NetCDFs.ipynb ......... [ 28%]
Geomedian_composites.ipynb ........ [ 31%]
Image_segmentation.ipynb F.FFFFFFF.FFFF. [ 37%]
Imagery_on_web_map.ipynb ............ [ 42%]
Integrating_external_data.ipynb ............ [ 46%]
Machine_learning_with_ODC.ipynb .................... [ 54%]
Masking_data.ipynb .............. [ 59%]
Opening_GeoTIFFs_NetCDFs.ipynb ............ [ 64%]
Pan_sharpening_Brovey.ipynb ............. [ 69%]
Rasterize_vectorize.ipynb .......... [ 73%]
Using_load_ard.ipynb ............... [ 78%]
Virtual_products.ipynb ........................................... [ 95%]
Working_with_time.ipynb ............ [100%]
================================================================================== FAILURES ===================================================================================
That's positive, @Kirill888! How do you run it? On the actual Sandbox?
Yep, it's looking really useful for what we're after. The full test suite would also include the following directories:
Beginners_guide
DEA_datasets
Frequently_used_code
Real_world_examples
Some of the Real_world_examples might need to be shortened for run time.
An interesting question I haven't investigated yet: how does this work for interactive notebooks that require user input before they continue through the code? I imagine they simply won't complete, rather than return an error...
@alexgleith I can't get that image into the actual Sandbox, and the dev Sandbox is too different database-wise, so I'm running via a custom deployment in the kk namespace.
=============================== test session starts ===============================
platform linux -- Python 3.6.9, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /home/jovyan/dea-notebooks/Beginners_guide
plugins: nbval-0.9.5
collected 55 items
01_Jupyter_notebooks.ipynb ... [ 5%]
03_Products_and_measurements.ipynb ...... [ 16%]
04_Loading_data.ipynb ................ [ 45%]
05_Plotting.ipynb .................. [ 78%]
06_Basic_analysis.ipynb ............ [100%]
================================ warnings summary =================================
In DEA_datasets, Sentinel_2 succeeds but TSmask is taking forever so far.
@Kirill888 That might be because the TSmask product is being worked on (I remember @uchchwhash saying it wouldn't be accessible for a while)
No, it's not that, it's just slow; anyway, it has failed now. Changes were made to the product definition: masking flags are now consistent with fmask, so this notebook needs updating.
I wonder if a reduced database index would be impractical, because one purpose of the notebooks is to demonstrate diverse insights that utilise different specific slices of the data archive. (I also think it would be easier to monkey patch in a mock Datacube.load that always returns, say, a dozen layers thick of numpy.random. Still, it's not hard to imagine notebooks where dummy data would trip them into an exception.)
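A minimal sketch of that monkey-patching idea (the measurement names, grid size and time range are arbitrary, and notebooks that rely on CRS/geobox attributes would still trip up):

```python
# Sketch: replace Datacube.load with a stub returning random data, so
# notebooks can execute without any database or S3 access.
import numpy as np
import pandas as pd
import xarray as xr
import datacube

def _mock_load(self, measurements=None, **kwargs):
    measurements = measurements or ["red", "green", "blue"]
    times = pd.date_range("2020-01-01", periods=12)  # a dozen layers thick
    ny, nx = 100, 100
    coords = {"time": times, "y": np.arange(ny), "x": np.arange(nx)}
    data_vars = {
        name: (("time", "y", "x"), np.random.rand(len(times), ny, nx))
        for name in measurements
    }
    return xr.Dataset(data_vars, coords=coords)

datacube.Datacube.load = _mock_load
```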
Is there any way to configure Travis (or some other github hook service) to ssh into both NCI and sandbox, and test against the real data? (This seems like a more relevant test.)
AWS_NO_SIGN_REQUEST=true pytest --nbval-lax
================================ test session starts ================================
platform linux -- Python 3.6.9, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /home/jovyan/dea-notebooks/Frequently_used_code
plugins: nbval-0.9.5
collected 262 items
Animated_timeseries.ipynb ................ [ 6%]
Applying_WOfS_bitmasking.ipynb ................. [ 12%]
Calculating_band_indices.ipynb ......... [ 16%]
Contour_extraction.ipynb .............. [ 21%]
Exporting_GeoTIFFs.ipynb .......... [ 25%]
Exporting_NetCDFs.ipynb ......... [ 28%]
Geomedian_composites.ipynb ........ [ 31%]
Image_segmentation.ipynb ................ [ 37%]
Imagery_on_web_map.ipynb ............ [ 42%]
Integrating_external_data.ipynb ............ [ 46%]
Machine_learning_with_ODC.ipynb .................... [ 54%]
Masking_data.ipynb .............. [ 59%]
Opening_GeoTIFFs_NetCDFs.ipynb ............ [ 64%]
Pan_sharpening_Brovey.ipynb ............. [ 69%]
Rasterize_vectorize.ipynb .......... [ 73%]
Using_load_ard.ipynb ............... [ 79%]
Virtual_products.ipynb ........................................... [ 95%]
Working_with_time.ipynb ............ [100%]
================================= warnings summary ==================================
Animated_timeseries.ipynb::Cell 0
/env/lib/python3.6/site-packages/jupyter_client/manager.py:62: DeprecationWarning: KernelManager._kernel_spec_manager_changed is deprecated in traitlets 4.1: use @observe and @unobserve instead.
def _kernel_spec_manager_changed(self):
-- Docs: https://docs.pytest.org/en/latest/warnings.html
==================== 262 passed, 1 warning in 1607.01s (0:26:47) ====================
With rsgislib installed in this docker: https://hub.docker.com/layers/opendatacube/sandbox/latest/images/sha256-da54e19602bb635abfa6671353d64a1d0ce0f008a00fd54daae3c39cbb141c71?context=repo
Required a small patch to Image_segmentation.ipynb, see #518.
From an infrastructure perspective, it is difficult to reproduce the Sandbox single-user containers in a third-party CI with injected environment variables and DB access included. However, we can move the code into the environment and run a daily Kubernetes cron job to check out the notebooks and run jupyter nbconvert --to notebook --execute sample_notebook.ipynb on them within the cluster. This will take a while and won't work for the interactive ones, but it will solve the manual validation currently taking place in one hit.
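A rough sketch of what such a scheduled runner could do inside the cluster (the repository path, timeout and output location are placeholders):

```python
# Sketch: execute every notebook under a directory with `jupyter nbconvert
# --execute` and print a simple pass/fail summary for a scheduled job.
import subprocess
from pathlib import Path

results = {}
for nb in sorted(Path("dea-notebooks").rglob("*.ipynb")):
    try:
        proc = subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute",
             "--output-dir", "/tmp/executed", str(nb)],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=1800,
        )
        results[str(nb)] = "PASS" if proc.returncode == 0 else "FAIL"
    except subprocess.TimeoutExpired:
        results[str(nb)] = "TIMEOUT"

for name, status in sorted(results.items()):
    print("{}: {}".format(status, name))
```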
Another reason for testing against the real data is to notice if notebooks are broken by changes in curated data collections. Two possible alternatives that I think would be compatible with our workflows:
Use Travis: get a Travis public key, modify .travis.yml to contain encrypted credentials (which will automatically be decrypted into environment variables) and an ssh command (to run tests on our platform and report back the exit code). For security reasons I think Travis will only report on pull requests from our internal branches, and not from other forks.
Don't integrate with GitHub. Make something behave kind of like a cron job, run twice per week, and broadcast a condensed summary result to a popular slack channel. That way a person who happens to do a minor PR is not the one having to deal with the sudden failures of other notebooks caused by circumstances unrelated to that PR (such as by updates to the module, libraries, collections, and infrastructure).
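Posting that condensed summary to Slack could be as simple as the sketch below (the incoming-webhook URL is a placeholder, and results is assumed to be a notebook-name to PASS/FAIL mapping like the one produced by the runner sketched earlier):

```python
# Sketch: broadcast a condensed pass/fail summary to a Slack channel via
# an incoming webhook (URL below is a placeholder).
import requests

def post_summary(results, webhook_url="https://hooks.slack.com/services/XXX/YYY/ZZZ"):
    failed = sorted(name for name, status in results.items() if status != "PASS")
    text = "dea-notebooks test run: {} passed, {} failed".format(
        len(results) - len(failed), len(failed))
    if failed:
        text += "\nFailed: " + ", ".join(failed)
    requests.post(webhook_url, json={"text": text}, timeout=30)
```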
@whatnick @benjimin Thanks for the great suggestions - I think a regularly-run (e.g. weekly or bi-weekly) test directly in the environment itself would be a huge advance on what we currently have, especially if the reports could be posted somewhere easily accessed like Slack (email being an OK but less ideal option). I also agree that GitHub integration might not be required... we could just get in the habit of only pushing develop to master on the day that the tests are run, once we verify that everything is working as it should be.
I guess this could be a solution for testing on the Sandbox, but we might need a related approach for the NCI too?
We had a meeting to discuss automated testing on the DEA Notebooks repo yesterday, and these were the outcomes (additional notes here):
pytest --nbval-lax testing will be implemented as an Airflow task run daily, initially in the dev environment but ideally also in the public-facing Sandbox.
Core team tasks:
Science team tasks:
Following a full review of the current dea-notebooks repo, I have flagged the following notebooks with a new 'no_testing' flag, being notebooks that I think will fail automated testing. The flagged notebooks either 1) require interactive user input to allow the notebook to proceed to finish, 2) contain a pip install --user call, or 3) are Sandbox-incompatible. The associated pull request to merge these changes into the develop branch is 'Interactive notebooks #572'. The candidate notebooks are:
I've just gone through and audited typical run times for all non-interactive notebooks (i.e. those without a no_testing tag) in preparation for the development of automated testing for dea-notebooks.
I originally did this on module load dea, but unfortunately discovered that many of our notebooks fail using that module (see "Failed on module load dea" below). Thankfully, most of the notebooks run correctly on module load dea/20200526, probably because this module is much closer to the Sandbox images. Once #583 and #581 have been addressed, all NCI-compatible notebooks should work correctly on module load dea/20200526 (which makes me think that module should be made the default on the NCI as a priority).
The longest run-time is 191 seconds for the "dea-notebooks/Real_world_examples/Change_detection.ipynb" notebook.
Beginners guide

Frequently used code
Failed on module load dea (dea/20200526: 50 seconds)
Failed on module load dea (dea/20200526: 106 seconds)
Failed on module load dea and dea/20200526 (documented by #583)
Failed on module load dea and dea/20200526 (see #581)
Failed on module load dea (dea/20200526: 7 seconds)
Failed on module load dea (dea/20200526: 7 seconds)
Failed on module load dea (dea/20200526: 7 seconds)
Failed on module load dea (dea/20200526: 125 seconds)

Real world examples
dea-notebooks/Real_world_examples/Change_detection.ipynb: 191 seconds
dea-notebooks/Real_world_examples/Chlorophyll_monitoring.ipynb: 52 seconds
dea-notebooks/Real_world_examples/Coastal_erosion.ipynb: Failed on module load dea (dea/20200526: 160 seconds)
dea-notebooks/Real_world_examples/Intertidal_elevation.ipynb: Failed on module load dea (dea/20200526: 33 seconds)
Ping @Kirill888 @alexgleith
@robbibt Please see the associated issue #770 for optional regression testing, and #771 as a sprint candidate for removing manual input dependencies. I would like someone on the science teams to work with me to try out the notebook testing in Airflow DAGs soon.
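For reference, a daily Airflow DAG along these lines could be as small as the sketch below (assuming an Airflow 2.x deployment; the DAG id, repository path, schedule and directory list are all placeholders):

```python
# Sketch: a daily Airflow DAG that runs the nbval check over the repo
# (all names and paths below are placeholders).
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dea_notebooks_nbval",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_nbval = BashOperator(
        task_id="run_nbval",
        bash_command=(
            "cd /opt/dea-notebooks && "
            "AWS_NO_SIGN_REQUEST=true pytest --nbval-lax "
            "Beginners_guide Frequently_used_code"
        ),
    )
```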
We now have a minimal set of automated tests on this repo, so I think we can close this for now
As this repository continues to grow, a larger and larger amount of time is going to be spent re-running notebooks and verifying that all our functionality still works after every change to the DEA environments and our scripts/functions.
This seems like a natural fit for some kind of automated testing, e.g. to automatically re-run all notebooks after changes are pushed to develop, to verify that updates haven't broken any notebooks before we ultimately update master. Similarly, being able to automatically re-build a test version of the DEA User Guide from develop (added as of #719) would also greatly assist in reducing the formatting errors and re-writes required to generate the documentation. This issue is to capture discussion around the practicality of implementing automated testing on this repository. Any thoughts would be greatly appreciated!