OCHA-DAP / ds-raster-pipelines


Mars seas5 grib download #2

Closed hannahker closed 2 months ago

hannahker commented 3 months ago

This PR adds code to download and process archival SEAS5 data from ECMWF's MARS service. The proposed directory structure and method of calling the pipeline are by no means set in stone -- I'm assuming this is something we'll iterate on further as the codebase grows.

Usage:

The pipeline can be run locally from the command line by calling:

python run_mars.py <scope> <start_year> <end_year>

This code is also configured as a Job on Databricks, called "Update SEAS5 Archive". It can be triggered manually and has been used for bulk tasks (i.e., more than a couple of years) due to its significantly better performance.
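
For reference, the entry point can be as simple as an argparse wrapper along these lines. This is a minimal sketch only -- the argument handling and the example scope values are assumptions, not necessarily what's in this PR:

```python
# Hypothetical sketch of a run_mars.py entry point (not the PR's actual code).
import argparse


def parse_args():
    parser = argparse.ArgumentParser(
        description="Download and process archival SEAS5 data from MARS"
    )
    # "scope" values are assumptions for illustration
    parser.add_argument("scope", help="run scope, e.g. 'dev' or 'prod' (assumed values)")
    parser.add_argument("start_year", type=int, help="first year to download")
    parser.add_argument("end_year", type=int, help="last year to download (inclusive)")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    for year in range(args.start_year, args.end_year + 1):
        # placeholder for the per-year download + processing steps described below
        print(f"scope={args.scope}: processing {year}")
```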

Processing details

Raw files:

Global, monthly precipitation forecasts are downloaded in yearly .grib files. Each raw .grib contains all ensemble members (26 or 51, depending on the year) and lead times (0-6 months ahead). See this JIRA ticket for more detailed docs on how the MARS API call is parameterized. All raw .grib data is stored in the dev Azure storage container under global/mars/raw/. Files are named seas5_mars_tprate_{year}.grib.
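
For illustration, a retrieval for one year might look something like the sketch below, using the ecmwf-api-client library. Every request value here is a placeholder assumption -- the real parameterization is documented in the JIRA ticket above:

```python
# Illustrative MARS retrieval for one year of SEAS5 monthly precipitation
# forecasts. All request values below are assumptions for illustration;
# see the JIRA ticket for the actual parameterization.
from ecmwfapi import ECMWFService

server = ECMWFService("mars")
year = 2020
request = {
    "class": "od",
    "stream": "msmm",   # assumed: monthly means of the seasonal forecast
    "expver": "0001",
    "levtype": "sfc",
    "param": "tprate",  # total precipitation rate
    # all 12 publication dates in the year
    "date": "/".join(f"{year}-{m:02d}-01" for m in range(1, 13)),
    # lead times: forecast months 1-7 cover 0-6 months ahead
    "fcmonth": "/".join(str(m) for m in range(1, 8)),
    # 51 ensemble members for recent years (26 for earlier ones)
    "number": "/".join(str(n) for n in range(51)),
}
server.execute(request, f"seas5_mars_tprate_{year}.grib")
```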

Processed files:

The .grib file from each year is processed to output 84 cloud-optimized GeoTIFFs (.tif) -- 12 publication months × 7 lead times (see the sketch after the list below):

  1. Take the mean of all ensemble members
  2. Separate by publication month and lead time
  3. Set a CRS (EPSG:4326)
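
Roughly, this amounts to something like the following, using xarray's cfgrib engine and rioxarray. The variable and dimension names are assumptions based on typical cfgrib output, not necessarily what this PR's code uses:

```python
# Rough sketch of the per-year processing. Dimension/variable names
# ("number", "time", "forecastMonth", "tprate") are assumptions based on
# typical cfgrib output and may not match the PR's actual code.
import xarray as xr
import rioxarray  # noqa: F401  (registers the .rio accessor)

ds = xr.open_dataset("seas5_mars_tprate_2020.grib", engine="cfgrib")

# 1. Mean over all ensemble members
ds_mean = ds.mean(dim="number")

# 2. One raster per publication month x lead time; 3. set CRS and write a COG
for pub_date in ds_mean["time"].values:
    for leadtime in ds_mean["forecastMonth"].values:
        da = ds_mean["tprate"].sel(time=pub_date, forecastMonth=leadtime)
        da = da.rio.set_spatial_dims(x_dim="longitude", y_dim="latitude")
        da = da.rio.write_crs("EPSG:4326")
        date_str = str(pub_date)[:10].replace("-", "")  # hypothetical date formatting
        da.rio.to_raster(f"seas5_mars_tprate_em_i{date_str}_lt{leadtime}.tif", driver="COG")
```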

All processed files are saved to the prod Azure storage container under raster/seas5/ (TODO!). Files are named seas5_mars_tprate_em_i{pub_date}_lt{leadtime}.tif.

NOTE: Outputs will be saved to prod following this PR review!

hannahker commented 2 months ago

@t-downing @zackarno thanks both for the reviews! I've addressed most comments, so this should be ready for another review.

@t-downing also totally agree that having an organized structure across pipelines will be important. However, this isn't something that I want to spend too much time designing while we're still quite early in the process of setting up these pipelines -- there's still a lot we don't know! My thinking in this PR was to keep the setup and folder structure quite straightforward and lightweight so that we can iterate as we get a better sense of requirements over time. Once this PR and @isatotun's work on IMERG are complete, I think we'll be in a much better place to plan the best way to keep things organized.