Even a manual script we could run (on the Sandbox/NCI) to re-run every notebook in the repo and report passes/fails before we merge develop into master would be better than what we currently have.
The ideal solution, I think, is to have a minimal subset database that we automatically compute and keep up to date: run the notebooks under test against the real database in some instrumented way that simply records all the datasets they access, then recreate a database containing just those datasets, products and metadata. We could then use something like GitHub Actions to run tests against it. Note this will also be useful for local dev work and for running tests under our own infrastructure, so that tests don't impact database performance.
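To make the "instrumented" recording step concrete, here is a minimal sketch only (the wrapper and output filename are hypothetical, and a real solution would also need to capture the associated product and metadata definitions). It monkey-patches Datacube.find_datasets, which dc.load uses internally, to record the id of every dataset a notebook touches:

```python
# Hypothetical sketch: record every dataset accessed by notebooks under test,
# so a minimal test database can later be rebuilt from just those datasets.
import json
import datacube

_seen_ids = set()
_original_find = datacube.Datacube.find_datasets

def _recording_find_datasets(self, **search_terms):
    # Delegate to the real method, then note the ids of all datasets returned.
    datasets = _original_find(self, **search_terms)
    _seen_ids.update(str(ds.id) for ds in datasets)
    return datasets

datacube.Datacube.find_datasets = _recording_find_datasets

# ... execute the notebooks under test here (e.g. via nbconvert/nbval) ...

# Dump the accessed dataset ids; these can then be re-indexed into a small
# Postgres database used purely for testing.
with open("accessed_datasets.json", "w") as f:
    json.dump(sorted(_seen_ids), f, indent=2)
```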
Tagging @omad @uchchwhash @caitlinadams @alexgleith @emmaai @vnewey @GypsyBojangles @whatnick in case any of you have any thoughts - this is probably a medium/long-term goal, but it's the kind of thing that could help a lot in reducing the amount of manual notebook re-running and code checking we currently need to do every time a function is altered or a module version is updated.
heavily support ++++
This collection of notebooks and scripts is a vital piece of infrastructure for ourselves and our external stakeholders, and I fully support any measure to reduce the workload on maintaining the repo. I personally don't know anything about automated testing, but I'm happy to be allocated any tasks if/when this method materializes (I will also implement any solution on the de-africa repo).
For typical automated tests, you specify the pass/fail test very clearly. What do you expect in the case of a notebook?
For example, if more data is added to the index over time, a plot might look slightly different. Should that still pass?
It is probably fairly easy though to test that all notebooks run with no uncaught error messages.
I think it's also worth exploring whether the scripts should be converted to an installable python package/module (possibly through pip). This would likely help with maintenance, make importing them simpler, and mean that the useful tools could be incorporated easily into other DEA/ODC deployments.
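For illustration only, packaging the existing scripts could be as small as the sketch below (the dea_tools package name and the dependency list are placeholders, not a decided layout):

```python
# Hypothetical minimal setup.py, assuming the shared scripts are collected
# into a package directory called "dea_tools" (name illustrative only).
from setuptools import setup, find_packages

setup(
    name="dea-tools",
    version="0.1.0",
    description="Reusable utilities from the dea-notebooks repository",
    packages=find_packages(),
    install_requires=["datacube", "xarray", "numpy"],
)
```

Notebooks (and other DEA/ODC deployments) could then pip install the package and import the functions directly, instead of relying on relative paths into the scripts directory.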
@benjimin For this, I think simple is probably best, e.g. perhaps even just whether a notebook returned an error at any point in its run. The biggest problem we have is that changes to the environment/scripts cause notebooks to break; the breakage then lies dormant until a stakeholder tries the notebook themselves and hits errors, which isn't a great look.
If we could get even a simple list of passed/failed notebooks, that would help greatly in cutting down the workload of finding all the relevant notebooks to update after each change.
@robbibt I think we can start with a manually generated success/failure report: a single script launched after manually logging into the Sandbox/NCI.
Should be relatively straightforward to set up Travis to report if any notebooks are emitting error messages:
pytest --nbval-lax
What about notebooks that are slow to execute, however?
@benjimin they will all report errors on Travis, because there's no database access there.
Well, @robbibt could run:
pip install --user nbval
pytest --nbval-lax
For example: the beginner notebooks pass (aside from a deprecation warning), the initial template fails (because the os.path manipulations do not expect it to be at the root level), and the real world examples probably require a pytest --timeout=…
@benjimin That's very promising: it works well and gives a good summary of the issues:
(some of those fails haven't been reported yet, so good proof of concept for the kinds of things we want to pick up on!)
We'd probably need to use that list in the context of what notebooks are meant to be compatible with either the NCI or Sandbox, but it's a great start.
@robbibt regarding failures:
Opening_GeoTIFFs_NetCDFs.ipynb failures are due to a bad environment, see here: https://github.com/opendatacube/jupyterhub/issues/37 (it runs fine with the work-around of a downgraded netcdf lib).
Exporting_NetCDFs.ipynb is also down to the bad environment setup, but it might also fail if the output file was not cleaned up from a previous partial run, as I think the write function will fail if the file already exists.
Image_segmentation.ipynb failures are also due to the environment -- no support for kea files in the sandbox.
@alexgleith we can probably assemble a good environment smoke test for testing sandbox docker deploys from a lot of these notebooks; the problem is db access, and s3 to a lesser extent I suppose. Also some take a long time to complete, so we will need to pick a good set (fast, and using libraries we don't use directly in datacube).
DB access could be done by indexing ONLY the datasets that are required into a Postgres image, either fixed or ... routinely indexed. Perhaps that's a good excuse to write a nice indexing Docker image? It can be tested by indexing the data required by ☝️. Not sure how to get that to result in an actual postgres DB image, but it's plausible.
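As a rough sketch of what "indexing only the required datasets" might look like against a throwaway Postgres-backed index (the dataset document paths are placeholders, and it assumes datacube system init has been run against the test database and the relevant product definitions have already been added):

```python
# Sketch: index a handful of known dataset documents into a small
# Postgres-backed datacube index used only for testing.
import yaml
import datacube
from datacube.index.hl import Doc2Dataset

# Assumes the datacube config points at the throwaway test database.
dc = datacube.Datacube(app="test-indexer")
resolver = Doc2Dataset(dc.index)

# Placeholder list of (metadata document, uri) pairs required by the notebooks.
required = [("ls8_dataset.yaml", "file:///data/ls8_dataset.yaml")]

for path, uri in required:
    with open(path) as f:
        doc = yaml.safe_load(f)
    dataset, err = resolver(doc, uri)
    if err is not None:
        raise RuntimeError("Could not resolve {}: {}".format(path, err))
    dc.index.datasets.add(dataset)
```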
We can test against S3 data easily enough, so long as we set the AWS_NO_SIGN_REQUEST flag and the data is public.
@Kirill888 Yep, some of those fails are to be expected as not all notebooks are written to be compatible with the Sandbox or NCI (e.g. Image_segmentation.ipynb)... the errors with the two NetCDF notebooks though are exactly the kind of thing that would be great to be able to automatically catch.
For processing time, a lot of the notebooks have recently become slower with Landsat C3 data now coming from THREDDS... we should revisit some of them and make sure the data loads are as minimal as possible.
@alexgleith Using this image: opendatacube/sandbox (https://hub.docker.com/layers/opendatacube/sandbox/latest/images/sha256-a14601514daf26e8374ffadd4bb89ad508dd8e46e6c143082b1014b067c5d6e0?context=repo) I get errors only in the segmentation notebook, which requires rsgislib, a conda-only library.
jovyan@notebook-7844f944b6-nnckr:~/dea-notebooks/Frequently_used_code$ AWS_NO_SIGN_REQUEST=true pytest --nbval-lax
============================================================================= test session starts =============================================================================
platform linux -- Python 3.6.9, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /home/jovyan/dea-notebooks/Frequently_used_code
plugins: nbval-0.9.5
collected 261 items
Animated_timeseries.ipynb ................ [ 6%]
Applying_WOfS_bitmasking.ipynb ................. [ 12%]
Calculating_band_indices.ipynb ......... [ 16%]
Contour_extraction.ipynb .............. [ 21%]
Exporting_GeoTIFFs.ipynb .......... [ 25%]
Exporting_NetCDFs.ipynb ......... [ 28%]
Geomedian_composites.ipynb ........ [ 31%]
Image_segmentation.ipynb F.FFFFFFF.FFFF. [ 37%]
Imagery_on_web_map.ipynb ............ [ 42%]
Integrating_external_data.ipynb ............ [ 46%]
Machine_learning_with_ODC.ipynb .................... [ 54%]
Masking_data.ipynb .............. [ 59%]
Opening_GeoTIFFs_NetCDFs.ipynb ............ [ 64%]
Pan_sharpening_Brovey.ipynb ............. [ 69%]
Rasterize_vectorize.ipynb .......... [ 73%]
Using_load_ard.ipynb ............... [ 78%]
Virtual_products.ipynb ........................................... [ 95%]
Working_with_time.ipynb ............ [100%]
================================================================================== FAILURES ===================================================================================
That's positive, @Kirill888! How do you run it? On the actual Sandbox?
Yep, it's looking really useful for what we're after. The full test suite would also include the following directories:
Beginners_guide
DEA_datasets
Frequently_used_code
Real_world_examples
Some of the Real_world_examples might need to be shortened for run time.
An interesting question I haven't investigated yet: how does this work for interactive notebooks that require user input before they continue through the code? I imagine they simply won't complete, rather than return an error...
@alexgleith I can't get that image into the actual Sandbox, and the dev Sandbox is too different database-wise, so I'm running via a custom deployment in the kk namespace.
=============================== test session starts ===============================
platform linux -- Python 3.6.9, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /home/jovyan/dea-notebooks/Beginners_guide
plugins: nbval-0.9.5
collected 55 items
01_Jupyter_notebooks.ipynb ... [ 5%]
03_Products_and_measurements.ipynb ...... [ 16%]
04_Loading_data.ipynb ................ [ 45%]
05_Plotting.ipynb .................. [ 78%]
06_Basic_analysis.ipynb ............ [100%]
================================ warnings summary =================================
In DEA_datasets, Sentinel_2 succeeds but TSmask is taking forever so far.
@Kirill888 That might be because the TSmask product is being worked on (I remember @uchchwhash saying it wouldn't be accessible for a while)
No, it's not that, it's just slow; anyway, it has failed now. Changes were made to the product definition: masking flags are now consistent with fmask, so this notebook needs updating.
I wonder if a reduced database index would be impractical, because one purpose of the notebooks is to demonstrate diverse insights that utilise different specific slices of the data archive. (I also think it would be easier to monkey patch in a mock Datacube.load that always returns, say, a dozen layers thick of numpy.random. Still, it's not hard to imagine notebooks where dummy data would trip them into an exception.)
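A minimal sketch of that monkey-patching idea (the measurement names, grid size and time range are arbitrary, and notebooks that rely on CRS/geobox attributes would still trip up):

```python
# Sketch: replace Datacube.load with a stub returning random data, so
# notebooks can execute without any database or S3 access.
import numpy as np
import pandas as pd
import xarray as xr
import datacube

def _mock_load(self, measurements=None, **kwargs):
    measurements = measurements or ["red", "green", "blue"]
    times = pd.date_range("2020-01-01", periods=12)  # a dozen layers thick
    ny, nx = 100, 100
    coords = {"time": times, "y": np.arange(ny), "x": np.arange(nx)}
    data_vars = {
        name: (("time", "y", "x"), np.random.rand(len(times), ny, nx))
        for name in measurements
    }
    return xr.Dataset(data_vars, coords=coords)

datacube.Datacube.load = _mock_load
```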
Is there any way to configure Travis (or some other github hook service) to ssh into both NCI and sandbox, and test against the real data? (This seems like a more relevant test.)
AWS_NO_SIGN_REQUEST=true pytest --nbval-lax
================================ test session starts ================================
platform linux -- Python 3.6.9, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /home/jovyan/dea-notebooks/Frequently_used_code
plugins: nbval-0.9.5
collected 262 items
Animated_timeseries.ipynb ................ [ 6%]
Applying_WOfS_bitmasking.ipynb ................. [ 12%]
Calculating_band_indices.ipynb ......... [ 16%]
Contour_extraction.ipynb .............. [ 21%]
Exporting_GeoTIFFs.ipynb .......... [ 25%]
Exporting_NetCDFs.ipynb ......... [ 28%]
Geomedian_composites.ipynb ........ [ 31%]
Image_segmentation.ipynb ................ [ 37%]
Imagery_on_web_map.ipynb ............ [ 42%]
Integrating_external_data.ipynb ............ [ 46%]
Machine_learning_with_ODC.ipynb .................... [ 54%]
Masking_data.ipynb .............. [ 59%]
Opening_GeoTIFFs_NetCDFs.ipynb ............ [ 64%]
Pan_sharpening_Brovey.ipynb ............. [ 69%]
Rasterize_vectorize.ipynb .......... [ 73%]
Using_load_ard.ipynb ............... [ 79%]
Virtual_products.ipynb ........................................... [ 95%]
Working_with_time.ipynb ............ [100%]
================================= warnings summary ==================================
Animated_timeseries.ipynb::Cell 0
/env/lib/python3.6/site-packages/jupyter_client/manager.py:62: DeprecationWarning: KernelManager._kernel_spec_manager_changed is deprecated in traitlets 4.1: use @observe and @unobserve instead.
def _kernel_spec_manager_changed(self):
-- Docs: https://docs.pytest.org/en/latest/warnings.html
==================== 262 passed, 1 warning in 1607.01s (0:26:47) ====================
With rsgislib installed in this docker: https://hub.docker.com/layers/opendatacube/sandbox/latest/images/sha256-da54e19602bb635abfa6671353d64a1d0ce0f008a00fd54daae3c39cbb141c71?context=repo
Required a small patch to Image_segmentation.ipynb, see #518.
From an infrastructure perspective, it is difficult to reproduce the Sandbox single-user containers in a third-party CI with injected environment variables and DB access included. However, we can move the code into the environment and run a daily Kubernetes cron job to check out the notebooks and run jupyter nbconvert --to notebook --execute sample_notebook.ipynb on them within the cluster. This will take a while and won't work for the interactive ones, but it will solve the manual validation currently taking place in one hit.
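A rough sketch of what such a scheduled runner could do inside the cluster (the repository path, timeout and output location are placeholders):

```python
# Sketch: execute every notebook under a directory with `jupyter nbconvert
# --execute` and print a simple pass/fail summary for a scheduled job.
import subprocess
from pathlib import Path

results = {}
for nb in sorted(Path("dea-notebooks").rglob("*.ipynb")):
    try:
        proc = subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute",
             "--output-dir", "/tmp/executed", str(nb)],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=1800,
        )
        results[str(nb)] = "PASS" if proc.returncode == 0 else "FAIL"
    except subprocess.TimeoutExpired:
        results[str(nb)] = "TIMEOUT"

for name, status in sorted(results.items()):
    print("{}: {}".format(status, name))
```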
Another reason for testing against the real data is to notice if notebooks are broken by changes in curated data collections. Two possible alternatives that I think would be compatible with our workflows:
Use Travis: get a Travis public key, modify .travis.yml to contain encrypted credentials (which will automatically be decrypted into environment variables) and an ssh command (to run tests on our platform and report back the exit code). For security reasons I think Travis will only report on pull requests from our internal branches, and not from other forks.
Don't integrate with GitHub. Make something behave kind of like a cron job, run twice per week, and broadcast a condensed summary result to a popular slack channel. That way a person who happens to do a minor PR is not the one having to deal with the sudden failures of other notebooks caused by circumstances unrelated to that PR (such as by updates to the module, libraries, collections, and infrastructure).
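Posting that condensed summary to Slack could be as simple as the sketch below (the incoming-webhook URL is a placeholder, and results is assumed to be a notebook-name to PASS/FAIL mapping like the one produced by the runner sketched earlier):

```python
# Sketch: broadcast a condensed pass/fail summary to a Slack channel via
# an incoming webhook (URL below is a placeholder).
import requests

def post_summary(results, webhook_url="https://hooks.slack.com/services/XXX/YYY/ZZZ"):
    failed = sorted(name for name, status in results.items() if status != "PASS")
    text = "dea-notebooks test run: {} passed, {} failed".format(
        len(results) - len(failed), len(failed))
    if failed:
        text += "\nFailed: " + ", ".join(failed)
    requests.post(webhook_url, json={"text": text}, timeout=30)
```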
@whatnick @benjimin Thanks for the great suggestions - I think a regularly-run (e.g. weekly or bi-weekly) test directly in the environment itself would be a huge advance on what we currently have, especially if the reports could be posted somewhere easily accessed like Slack (email being an OK but less ideal option). I also agree that GitHub integration might not be required... we could just get in the habit of only pushing develop to master on the day that the tests are run, once we verify that everything is working as it should be.
I guess this could be a solution for testing on the Sandbox, but we might need a related approach for the NCI too?
We had a meeting to discuss automated testing on the DEA Notebooks repo yesterday, and these were the outcomes (additional notes here):
pytest --nbval-lax testing will be implemented as an Airflow task run daily, initially in the dev environment but ideally also in the public-facing Sandbox.
Core team tasks:
Science team tasks:
Following a full review of the current dea-notebooks repo, I have flagged the following notebooks with a new 'no_testing' flag, being notebooks that I think will fail automated testing. The flagged notebooks either 1) require interactive user input to allow the notebook to proceed to finish, 2) contain a pip install --user call, or 3) are Sandbox-incompatible. The associated pull request to merge these changes into the develop branch is 'Interactive notebooks #572'. The candidate notebooks are:
I've just gone through and audited typical run times for all non-interactive notebooks (i.e. those without a no_testing tag) in preparation for the development of automated testing for dea-notebooks.
I originally did this on module load dea, but unfortunately discovered that many of our notebooks fail using that module (see "Failed on module load dea" below). Thankfully, most of the notebooks run correctly on module load dea/20200526, probably because this module is much closer to the Sandbox images. Once #583 and #581 have been addressed, all NCI-compatible notebooks should work correctly on module load dea/20200526 (which makes me think that module should be made the default on the NCI as a priority).
The longest run-time is 191 seconds for the "dea-notebooks/Real_world_examples/Change_detection.ipynb" notebook.
Beginners guide

Frequently used code
Failed on module load dea (dea/20200526: 50 seconds)
Failed on module load dea (dea/20200526: 106 seconds)
Failed on module load dea and dea/20200526 (documented by #583)
Failed on module load dea and dea/20200526 (see #581)
Failed on module load dea (dea/20200526: 7 seconds)
Failed on module load dea (dea/20200526: 7 seconds)
Failed on module load dea (dea/20200526: 7 seconds)
Failed on module load dea (dea/20200526: 125 seconds)

Real world examples
dea-notebooks/Real_world_examples/Change_detection.ipynb: 191 seconds
dea-notebooks/Real_world_examples/Chlorophyll_monitoring.ipynb: 52 seconds
dea-notebooks/Real_world_examples/Coastal_erosion.ipynb: Failed on module load dea (dea/20200526: 160 seconds)
dea-notebooks/Real_world_examples/Intertidal_elevation.ipynb: Failed on module load dea (dea/20200526: 33 seconds)
Ping @Kirill888 @alexgleith
@robbibt Please see the associated issue #770 for optional regression testing, and #771 as a sprint candidate for removing manual input dependencies. I would like someone on the science teams to work with me to try out the notebook testing in Airflow DAGs soon.
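For reference, a daily Airflow DAG along these lines could be as small as the sketch below (assuming an Airflow 2.x deployment; the DAG id, repository path, schedule and directory list are all placeholders):

```python
# Sketch: a daily Airflow DAG that runs the nbval check over the repo
# (all names and paths below are placeholders).
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dea_notebooks_nbval",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_nbval = BashOperator(
        task_id="run_nbval",
        bash_command=(
            "cd /opt/dea-notebooks && "
            "AWS_NO_SIGN_REQUEST=true pytest --nbval-lax "
            "Beginners_guide Frequently_used_code"
        ),
    )
```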
We now have a minimal set of automated tests on this repo, so I think we can close this for now
As this repository continues to grow, a larger and larger amount of time is going to be spent re-running notebooks and verifying that all our functionality still works after every change to the DEA environments and our scripts/functions.
This seems like a natural fit for some kind of automated testing, e.g. to automatically re-run all notebooks after changes are pushed to develop, to verify that updates haven't broken any notebooks before we ultimately update master. Similarly, being able to automatically re-build a test version of the DEA User Guide from develop (added as of #719) would also greatly assist in reducing the formatting errors and re-writes required to generate the documentation. This issue is to capture discussion around the practicality of implementing automated testing on this repository. Any thoughts would be greatly appreciated!