NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Covariates: Functions for downloading covariate data #25

Closed mitchellmanware closed 10 months ago

mitchellmanware commented 1 year ago

Example: NCEP North American Regional Reanalysis

Notes

## Script to write 'wget' commands for batch downloading of covariate data from
## NCEP North American Regional Reanalysis
## Metadata and links available at:
## https://psl.noaa.gov/data/gridded/data.narr.html
## Mitchell Manware
## September 1, 2023

## example url for format structure
## https://downloads.psl.noaa.gov//Datasets/NARR/Dailies/monolevel/weasd.1979.nc

## define base variable for building URLs (host matches the example URL above)
base <- "https://downloads.psl.noaa.gov/Datasets/NARR/Dailies/"

## define list of years
years <- c("2018", "2019", "2020", "2021", "2022")

## define list of months
months <- stringr::str_pad(as.character(1:12), 2, pad = "0")

## define list of variables obtained from NCEP-NARR Reanalysis
variables <- c("monolevel/weasd.",     # accumulated snow
               "monolevel/evap.",      # accumulated total evaporation
               "monolevel/apcp.",      # accumulated total precipitation
               "monolevel/air.sfc.",   # surface air temperature
               "monolevel/albedo.",    # surface albedo
               "monolevel/tcdc.",      # total cloud coverage
               "monolevel/dswrf.",     # downward shortwave radiation flux
               "monolevel/hcdc.",      # high cloud area fraction
               "monolevel/lhtfl.",     # latent heat flux
               "monolevel/lcdc.",      # low cloud area fraction
               "monolevel/mcdc.",      # medium cloud area fraction
               "pressure/omega.",      # omega
               "monolevel/hpbl.",      # planetary boundary layer height
               "monolevel/pr_wtr.",    # precipitable water for entire atmosphere
               "monolevel/prate.",     # precipitation rate
               "monolevel/pres.sfc.",  # pressure at surface
               "monolevel/shtfl.",     # sensible heat flux
               "monolevel/snowc.",     # snow cover
               "monolevel/soilm.",     # soil moisture content
               "pressure/shum.",       # specific humidity
               "monolevel/uwnd.10m.",  # u-wind
               "monolevel/ulwrf.sfc.", # upward longwave radiation flux
               "monolevel/vwnd.10m.",  # v-wind
               "monolevel/vis.")       # visibility)

## initiate NCEP-NARR-Reanalysis_wget_commands.txt file
sink("/ddn/gs1/home/manwareme/NRT-AP-Model/code/Data_Download/NCEP-NARR-Reanalysis_wget_commands.txt")

## for loop to download data for each variable
for (v in seq_along(variables)) {

  if (startsWith(variables[v], "monolevel/")) {

    ## define current variable
    variable <- variables[v]

    ## define folder to save data based on current variable
    folder <- sub(".*/", "", variable)

    for (y in seq_along(years)) {

      ## define url
      url <- paste0(base, variable, years[y], ".nc")

      ## define command to be saved to text file
      command <- paste0("wget --no-check-certificate -P /ddn/gs1/home/manwareme/NRT-AP-Model/input/NCEP-NARR-Reanalysis/",
                        folder, " ", url, "\n")

      ## write the command to the text file opened by sink()
      cat(command)
    }
    }

  } else {

    ## month indicator required for data from "Dailies/pressure/"
    ## example: August 2023 is 202308

    ## define current variable
    variable <- variables[v]

    ## define folder to save data based on current variable
    folder <- sub(".*/", "", variable)

    for (y in seq_along(years)) {

      for (m in seq_along(months)) {

        ## define url
        url <- paste0(base, variable, years[y], months[m], ".nc")

        ## define command to be saved to text file
        command <- paste0("wget --no-check-certificate -P /ddn/gs1/home/manwareme/NRT-AP-Model/input/NCEP-NARR-Reanalysis/",
                          folder, " ", url, "\n")

        ## write the command to the text file opened by sink()
        cat(command)

      }

    }

  }

}

## finish NCEP-NARR-Reanalysis_wget_commands.txt file
sink()

#!/bin/bash

## generate the command file, then execute each wget command in it
R CMD BATCH NCEP-NARR-Reanalysis_download.R
. NCEP-NARR-Reanalysis_wget_commands.txt
sigmafelix commented 11 months ago

@mitchellmanware

Thank you for working on the covariate data download functions. For my PEGS project, I tried using download_noaa_hms_smoke_data.R for an extended period of over 10 years and got HTTP 404 errors when I ran the code below:

download_noaa_hms_smoke_data(
    date_start = startdate,
    date_end = enddate,
    directory_to_download = "/Users/songi2/Documents/input/noaa_hms/raw/",
    directory_to_save = "/Users/songi2/Documents/input/noaa_hms/shapefile/",
    data_download_acknowledgement = TRUE,
    remove_download = FALSE,
    time_wait_download = 1L
)
# Downloading requested files...
# Requested files downloaded.
# Unzipping shapefiles to /Users/songi2/Documents/input/noaa_hms/shapefile/...
# Files unzipped and saved in/Users/songi2/Documents/input/noaa_hms/shapefile/.
# There were 50 or more warnings (use warnings() to see the first 50)
# > warnings()
# Warning messages:
# 1: In download.file(file_urls, download_names, method = "libcurl",  ... :
#   cannot open URL 'https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2008/12/hms_smoke20080821.zip': HTTP status was '404 Not Found'
# 2: In download.file(file_urls, download_names, method = "libcurl",  ... :
#   cannot open URL 'https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2008/12/hms_smoke20080902.zip': HTTP status was '404 Not Found'
# 3: In download.file(file_urls, download_names, method = "libcurl",  ... :
#   cannot open URL 'https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2008/12/hms_smoke20080501.zip': HTTP status was '404 Not Found'
# 4: In download.file(file_urls, download_names, method = "libcurl",  ... :
#   cannot open URL 'https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2008/12/hms_smoke20080502.zip': HTTP status was '404 Not Found'
# 5: In download.file(file_urls, download_names, method = "libcurl",  ... :
#   cannot open URL 'https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2008/12/hms_smoke20080505.zip': HTTP status was '404 Not Found'
# 6: In download.file(file_urls, download_names, method = "libcurl",  ... :
#   cannot open URL 'https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2008/12/hms_smoke20080506.zip': HTTP status was '404 Not Found'
# 7: In download.file(file_urls, download_names, method = "libcurl",  ... :
#   cannot open URL 'https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2008/12/hms_smoke20080507.zip': HTTP status was '404 Not Found'
# [truncated to save space ...]

I could only find files for the dates in the final month (December 2008) in the target directory. I think the looped local variables year and month do not affect the date_sequence vector used to generate the links (i.e., lines 54-71 in the link). If this is a design choice, how about adding it to the roxygen description?
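For illustration, the per-date links could be derived from the date sequence itself, so no separate year/month loop variables are needed. A minimal sketch, with the URL pattern taken from the 404 messages above and the names date_sequence and file_urls assumed to match the function's internals:

```r
## sketch: derive the year/month subdirectory from each date itself
date_sequence <- seq(as.Date("2008-05-01"), as.Date("2008-12-31"), by = "day")
base <- "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/"
file_urls <- paste0(
  base,
  format(date_sequence, "%Y/%m/"),  # e.g. "2008/05/"
  "hms_smoke", format(date_sequence, "%Y%m%d"), ".zip"
)
```

Because each URL is built from its own date, ranges spanning months and years cannot fall out of sync with the subdirectory.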

The other thing I noticed is that the Sys.sleep call I added does not make sense, since download.file operates on the entire URL vector in a single call. I will make a patch for both issues and open a pull request.

mitchellmanware commented 11 months ago

@sigmafelix Thank you for bringing this up. I believe the issue stems from the #### 2. define data download URLs and download names section. The URLs are built properly, but they are not saved to file_urls correctly because an incorrect date/year combination is used with date_sequence[f]. Working on this now.

mitchellmanware commented 11 months ago

@sigmafelix Can you try the patched function on branch mm_noaa_smoke_patch_1103? The updated lines should accommodate input ranges that span different months and years.

sigmafelix commented 11 months ago

Thank you @mitchellmanware for the prompt response. The patch works well. I pushed a change directly to the file in your patch branch to remove Sys.sleep.

mitchellmanware commented 11 months ago

@sigmafelix Great, thank you. The patch branch I created became muddled with unmerged changes from another branch, so I will clean it up and create a new pull request with both of our changes to download_noaa_smoke_data.R.

mitchellmanware commented 11 months ago

New pull request created after adding lintr requirements to download_noaa_smoke_data.R.

sigmafelix commented 11 months ago

@mitchellmanware

When running download_noaa_hms_smoke_data multiple times* in the same session, I found that the unzipped shapefiles had the string Shapefile appended to their names on each iteration. Eventually this prevents unzipping the shapefiles to the target directory because of the path length limit (likely 255 characters). I think the naming error comes from line 149 in the file, since the file format is already included in the zip file names. In addition to fixing that line, perhaps an additional unzip argument would be helpful to allow users to keep only the zip files.

*The reason I ran the function multiple times (6 months per run) was that bulk downloads (e.g., for 1+ years) returned network errors.
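One way to make the rename safe to repeat would be to guard against re-appending the suffix. A hedged sketch (the helper name is hypothetical; the thread ultimately resolved this by removing the `file.rename()` step entirely):

```r
## sketch: insert the "Shapefile" marker only when it is not already present
append_suffix_once <- function(paths, suffix = "Shapefile") {
  has_suffix <- grepl(suffix, basename(paths), fixed = TRUE)
  ifelse(
    has_suffix,
    paths,  # already marked; leave untouched on repeated runs
    file.path(dirname(paths), paste0(suffix, "_", basename(paths)))
  )
}
```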

mitchellmanware commented 11 months ago

@sigmafelix

The string "Shapefile" was appended to the file names to help differentiate shapefiles from KML files read within the same directory. I will work on changing this.

The remove_download argument in the function is equivalent to your recommended unzip argument. The default is set to TRUE (removes zip files), but given your experience, defaulting to retaining the zip files may be better. I will check back in soon.

Update: I now understand what you are suggesting with the unzip argument. I will include it with the other changes.

mitchellmanware commented 11 months ago

@sigmafelix

The following changes have been made to the download_noaa_hms_smoke_data function

  1. Removed the `file.rename()` step to avoid "hms_smoke_Shapefile_Shapefile..." names when the function is used iteratively.
  2. Added an `unzip` argument for users who wish to download only the zip files.
  3. Changed the default of `remove_zip` to FALSE so users cannot unknowingly remove the zip files.
  4. Removed unnecessary code related to `data_format = "KML"`, because many of the arguments (`unzip`, `remove_zip`, `directory_to_download`) do not apply.
  5. Removed `url_noaa_hms_smoke_data` from the arguments. Users have no reason to edit the base used to form the download URLs; changing it can only make the function fail, so exposing it as an argument has no benefit.
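Under these changes, a call that downloads and keeps only the zip files might look like the following. This is a sketch, not the function's documented signature: the exact arguments and defaults are assumed from the list above and the earlier example call.

```r
## sketch: download zip files only and retain them on disk
download_noaa_hms_smoke_data(
  date_start = "2008-05-01",
  date_end = "2008-12-31",
  directory_to_download = "/Users/songi2/Documents/input/noaa_hms/raw/",
  data_download_acknowledgement = TRUE,
  unzip = FALSE,      # new argument: skip unzipping
  remove_zip = FALSE  # new default: zip files are retained
)
```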

Please find all changes reflected in branch mm_noaa_download_patch_1114. I will wait for further discussion before opening any pull requests.

sigmafelix commented 11 months ago

@mitchellmanware Thank you for the update. I tested the function and it works well. I pushed a minor change to your patch branch.

mitchellmanware commented 11 months ago

@sigmafelix

Thank you for the contribution. I will open a pull request shortly. On a related note, I am going back through all of the download functions to check for bugs similar to those identified in the download_noaa_hms_smoke_data function, to add an unzip argument (where applicable), and to set the remove_zip default to FALSE. I will create a new branch to patch these potential bugs.

sigmafelix commented 11 months ago

@mitchellmanware All look great. Thank you for the hard work!