drivendataorg / water-supply-forecast-rodeo-runtime

Data and runtime repository for the Water Supply Forecast Rodeo competition on DrivenData
https://watersupply.drivendata.org
MIT License

Water Supply Forecast Rodeo


Welcome to the data and runtime repository for the Water Supply Forecast Rodeo competition on DrivenData! This repository contains a few things:

  1. Data download code (data_download/) — a Python package with code and a CLI for downloading data from each approved feature data source. DrivenData will download datasets from certain approved data sources and mount them to the competition runtime for code execution submissions. Use the CLI to reproduce the saved file structure in the runtime.
  2. Data reading code (data_reading/) — a Python library with example code for loading each of the feature datasets downloaded by the data download package, available for you to optionally use. It will be installed in the code execution runtime environment and you will be able to import it.
  3. Submission template (examples/template/) — a template with the function signatures that you should implement in your submission
  4. Example submission (examples/moving_average/) — a submission with a simple demonstration solution. It runs successfully in the code execution runtime and outputs a valid submission.
  5. Runtime environment specification (runtime/) — the definition of the environment where your code will run.

You can use this repository to:

⬇️ Get feature data: The same code that is used to get feature data for the runtime environment is available for you to use locally.

🔧 Test your submission: Test your submission using a locally running version of the competition runtime to discover errors before submitting to the competition website.

📦 Request new packages in the official runtime: Since your submission will not have general access to the internet, all dependencies must be pre-installed. If you want to use a package that is not in the runtime environment, make a pull request to this repository. Make sure to test out adding the new package to both official environments, CPU and GPU.

Changes to the repository are documented in CHANGELOG.md.


1. Data download

2. Data reading

3. Testing a submission locally

4. Updating runtime packages

5. Makefile commands


Data download

This repo contains a Python package named wsfr-download located in the data_download/ directory. It provides a command-line interface (CLI) for downloading approved challenge datasets. DrivenData will use this package to download the test feature data that will be made available to the code execution runtime. You can use it to download feature data in the same way for testing your submission or for training.

[!NOTE] Data download code may be added for requested data sources that get approved.

Requirements and installation

Requires Python 3.10. To install with the exact dependencies that will be used by DrivenData, create a new virtual environment and run:

pip install -r ./data_download/requirements.txt
pip install ./data_download/

By default, data is saved into a subdirectory named data/ relative to your current working directory. You can explicitly override this by setting the environment variable WSFR_DATA_ROOT with another directory path. The expected default usage is that you run all commands with the root directory of this repository as your working directory.
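For example, the same default-with-override behavior can be reproduced in your own scripts (a minimal sketch; only the WSFR_DATA_ROOT variable name and the data/ default come from this README):

```python
import os
from pathlib import Path

# Resolve the data directory as described above: use WSFR_DATA_ROOT if it
# is set, otherwise fall back to data/ relative to the working directory.
DATA_ROOT = Path(os.environ.get("WSFR_DATA_ROOT", "data"))
print(DATA_ROOT)
```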

You will also need to download the following files from the competition data download page and place them into your data directory. The data download scripts will depend on some of these files.

Additionally, the following data products are static releases and involve large single files. If you plan to use any of these datasets, please manually download each one from its approved source and move it into the designated location.

You will need at least 115 GB of free disk space to download all datasets. See the "Expected files" section below for a breakdown by data source.

Usage

To simply download all test feature data that will be available, use the bulk command. From the repository root as your working directory, run:

python -m wsfr_download bulk data_download/hindcast_test_config.yml

Details

You can invoke the CLI with python -m wsfr_download. For example, to see a list of all available commands:

python -m wsfr_download --help

The CLI is organized with one command per data source, e.g., see

python -m wsfr_download grace_indicators --help

There is also the bulk command for downloading multiple data sources at once, as shown in the previous section. A bulk download is configured by a YAML configuration file. The configuration file for the Hindcast test set is data_download/hindcast_test_config.yml. To download feature data for training, create your own YAML configuration file for the years and data sources that you need using the test set file as an example.

By default, all download functions will skip downloading data for files that already exist in your data directory. This is controlled by an option called skip_existing. To force downloads to overwrite existing files, set skip_existing to false in the bulk download config file when using the bulk command, or use the --no-skip-existing flag when using an individual data source's download command.
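In a bulk config, that override is a single key (a sketch; only the key name and value are taken from this README — check hindcast_test_config.yml for where the key sits in the file):

```yaml
# In your copy of the bulk download config:
skip_existing: false  # re-download and overwrite files that already exist
```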

Expected files

A list of all files present in the runtime data volume is available in data.find.txt. You can generate an equivalent version of this file for your local data directory with the following command:

find data -type f ! -name '.DS_Store' ! -name '.gitkeep' | sort
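If you prefer to generate the same listing from Python rather than the shell, an equivalent sketch (the two excluded file names match the find command above):

```python
from pathlib import Path

# List every file under the data directory, excluding .DS_Store and
# .gitkeep, sorted — the same output as `find ... | sort` above.
def list_data_files(root: str = "data") -> list[str]:
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.name not in {".DS_Store", ".gitkeep"}
    )

print("\n".join(list_data_files()))
```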

You can also find a listing of subdirectory sizes in data.du.txt, which will give you an idea of the disk space needed for each data source. You can generate an equivalent version of this file for your local data directory with the following command:

du -sh data/*

Data reading

This repo contains a Python package named wsfr-read located in the data_reading/ directory. It provides a library with example functions to read the data downloaded by wsfr-download. This package will be installed into the code execution runtime for you to optionally use during inference on the test set. These functions may be helpful because they implement subsetting by site_id and issue_date. You are not required to use these functions in your solution.

[!NOTE] Data reading code may be added for requested data sources that get approved.

Requirements and installation

Requires Python 3.10. Install with pip:

pip install ./data_reading/

Usage

Modules are provided with names matching the data source names in the wsfr-download package. Each module contains read_*_data functions that are basic ways you can load that data for use as features for your models. See the docstrings on the functions for more details on usage.
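To illustrate the kind of subsetting these helpers perform, here is a pandas sketch with made-up column names — not the actual wsfr-read implementation (check the docstrings for each function's exact convention):

```python
import pandas as pd

# Illustrative only: the real read_*_data functions handle file formats
# and paths; this shows the site_id / issue_date subsetting idea of
# keeping one site's rows from before the issue date, so no future
# data leaks into a forecast.
def subset_features(df: pd.DataFrame, site_id: str, issue_date: str) -> pd.DataFrame:
    mask = (df["site_id"] == site_id) & (df["date"] < pd.Timestamp(issue_date))
    return df.loc[mask]

features = pd.DataFrame(
    {
        "site_id": ["site_a", "site_a", "site_b"],
        "date": pd.to_datetime(["2023-01-01", "2023-04-01", "2023-01-01"]),
        "value": [1.0, 2.0, 3.0],
    }
)
print(subset_features(features, "site_a", "2023-03-15"))
```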

By default, data is assumed to be in a subdirectory named data/ relative to your current working directory. You can explicitly override this by setting the environment variable WSFR_DATA_ROOT.

Testing a submission locally

When you make a submission on the DrivenData competition site, we run your submission inside a Docker container, an isolated environment that provides a consistent software stack across machines. The best way to make sure your submission to the site will run is to first run it successfully in the container on your local machine.

Prerequisites

Additional requirements to run with GPU:

Setting up the data directory

In the official code execution platform, code_execution/data will contain data provided for the test set. This will include data from the data download page as well as feature data downloaded by the data pipelines in data_download/. See the data download section for more about setting up the test data.

In addition to the files detailed in the data download section, you will also need the following two files from the data download page:

When testing your submission locally, the data/ directory in the repository root will be mounted into the container. You can explicitly override this by setting the environment variable WSFR_DATA_ROOT with another directory path.

Code submission format

Your final submission should be a zip archive named with the extension .zip (for example, submission.zip). The root level of the archive must contain a solution.py that implements a predict function returning predictions for a single site on a single issue date.

A template for solution.py is included at examples/template/solution.py. For more detail, see the "what to submit" section of the code submission page.
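As a hypothetical sketch only — the authoritative signature lives in examples/template/solution.py, and the parameter names and three-value return below are assumptions for illustration:

```python
# solution.py — hypothetical skeleton. Copy the real function signature
# from examples/template/solution.py; the names and return shape here
# are illustrative assumptions, not the template's actual interface.
def predict(site_id: str, issue_date: str):
    """Return a prediction for one site on one issue date."""
    # Replace this placeholder with real feature loading and inference.
    return 100.0, 150.0, 200.0
```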

Running your submission locally

This section provides instructions on how to run your submission in the code execution container from your local machine. To simplify the steps, key processes have been defined in the Makefile and are run with make {command_name}. The basic steps are:

make pull
make pack-submission
make test-submission

Run make help for more information about the available commands as well as information on the official and built images that are available locally.

Here's the process in a bit more detail:

  1. First, make sure you have set up the prerequisites.
  2. Download the official competition Docker image:

    make pull

[!NOTE] If you have built a local version of the runtime image with make build, that image will take precedence over the pulled image when using any make commands that run a container. You can explicitly use the pulled image by setting the SUBMISSION_IMAGE shell/environment variable to the pulled image or by deleting all locally built images.

  3. Save all of your submission files, including the required solution.py script, in the submission_src folder of the runtime repository. Make sure any needed model weights and other assets are saved in submission_src as well.

  4. Create a submission/submission.zip file containing your code and model assets:

    make pack-submission
    #> mkdir -p submission/
    #> cd submission_src; zip -r ../submission/submission.zip ./*
    #>   adding: solution.py (deflated 73%)
  5. Launch an instance of the competition Docker image, and run the same inference process that will take place in the official runtime:

    make test-submission

This runs the container entrypoint script. First, it unzips submission/submission.zip into /code_execution/src/ in the container. Then, it runs the supervisor.py script, which will import code from your submitted solution.py. In the local testing setting, the final submission is saved out to submission/submission.csv on your local machine.

When you run make test-submission the logs will be printed to the terminal and written out to submission/log.txt. If you run into errors, use the log.txt to determine what changes you need to make for your code to execute successfully.

Example submission

An example code submission is provided in examples/moving_average that can run successfully and generate valid predictions. Please note that this model is not a realistic solution to the problem. You can use the example in place of steps 3 and 4 above. To pack this submission for testing or for submission to the platform, run:

make pack-example

Smoke tests

When submitting on the platform, you will have the ability to submit "smoke tests". Smoke tests run on a reduced version of the test set in order to run more quickly. They will not be considered for prize evaluation and are intended to let you test your code for correctness.

Smoke tests use the smoke_submission_format.csv file instead of the full submission_format.csv file. When testing locally, a submission will run as a smoke test if the IS_SMOKE shell variable is set to a non-empty string. For example,

IS_SMOKE=1 make test-submission

You can read more about smoke tests on the code submission format page.

Runtime network access

In the real competition runtime, all internet access is blocked except to the hosts documented in allowed_hosts.txt corresponding to the approved data sources labeled with "Direct API access permitted" on the Approved data sources page.

The local test runtime does not impose any network restrictions; as a result, submissions that require internet access might succeed in local tests but fail in the actual competition runtime. It's up to you to make sure that your code does not make requests to unauthorized web resources. To verify that your submission works without internet access, run your local test with the internet blocked: BLOCK_INTERNET=true make test-submission.

Updating runtime packages

If you want to use a package that is not in the environment, you are welcome to make a pull request to this repository. If you're new to the GitHub contribution workflow, check out this guide by GitHub.

The runtime manages dependencies using conda environments and conda-lock. Here is a good general guide to conda environments. The official runtime uses Python 3.10.13 environments.

To submit a pull request for a new package:

  1. Fork this repository.

  2. Install conda-lock. See here for installation options.

  3. Edit the conda environment YAML files, runtime/environment-cpu.yml and runtime/environment-gpu.yml. There are two ways to add a requirement:

    • Conda package manager (preferred): Add an entry to the dependencies section. This installs from the conda-forge channel using conda install. Conda performs robust dependency resolution with other packages in the dependencies section, so we can avoid package version conflicts.
    • Pip package manager: Add an entry to the pip section. This installs from PyPI using pip, and is an option for packages that are not available in a conda channel.
  4. Run make update-lockfiles. This will read environment-cpu.yml and environment-gpu.yml, resolve exact package versions, and save the pinned environments to conda-lock-cpu.yml and conda-lock-gpu.yml.

  5. Locally test that the Docker image builds successfully for CPU and GPU images:

    CPU_OR_GPU=cpu make build
    CPU_OR_GPU=gpu make build
  6. Commit the changes to your forked repository. Ensure that your branch includes updated versions of all of the following:

    • runtime/conda-lock-cpu.yml
    • runtime/conda-lock-gpu.yml
    • runtime/environment-cpu.lock
    • runtime/environment-cpu.yml
    • runtime/environment-gpu.lock
    • runtime/environment-gpu.yml
  7. Open a pull request from your branch to the main branch of this repository. Navigate to the Pull requests tab in this repository, and click the "New pull request" button. For more detailed instructions, check out GitHub's help page.

  8. Once you open the pull request, we will use GitHub Actions to build the Docker images with your changes and run the tests in runtime/tests. For security reasons, administrators may need to approve the workflow run before it starts. Once running, the process can take up to 30 minutes, and may take longer if your build is queued behind others. You will see a section on the pull request page showing the status of the tests, with links to the logs.

  9. You may be asked to submit revisions to your pull request if the tests fail or if a DrivenData staff member has feedback. Pull requests won't be merged until all tests pass and the team has reviewed and approved the changes.
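The environment YAML edits described in step 3 can be sketched as follows. The package names and surrounding entries are illustrative, not the files' actual contents; only the conda-forge channel, Python 3.10.13, and the dependencies/pip split come from this guide:

```yaml
# Sketch of runtime/environment-cpu.yml after adding new packages.
channels:
  - conda-forge
dependencies:
  - python=3.10.13
  - lightgbm          # new package from conda-forge (preferred)
  - pip
  - pip:
      - somepackage   # illustrative pip fallback for PyPI-only packages
```

After editing, make update-lockfiles regenerates the pinned lockfiles from these specifications, so the YAML files stay the single source of truth.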

Make commands

A Makefile with several helpful shell recipes is included in the repository. The runtime documentation above uses it extensively. Running make by itself in your shell will list relevant Docker images and print the following list of available commands:

Available commands:

build               Builds the container locally
clean               Delete temporary Python cache and bytecode files
interact-container  Open an interactive bash shell within the running container (with network access)
pack-example        Creates a submission/submission.zip file from the source code in examples_src
pack-submission     Creates a submission/submission.zip file from the source code in submission_src
pull                Pulls the official container from Azure Container Registry
test-container      Ensures that your locally built image can import all the Python packages successfully when it runs
test-submission     Runs container using code from `submission/submission.zip` and data from WSFR_DATA_ROOT (default `data/`)
update-lockfiles    Updates runtime environment lockfiles