environmentalinformatics-marburg / GSODTools

R functions to select, download, and process GSOD data.
Other
16 stars 5 forks source link

For the latest stable release, please refer to
doi\_GSODTools\_v1.0.0.

What it is all about

But to return to the actual topic: Every person dealing with long-term climatological data (e.g. of daily air temperature, relative humidity, and precipitation amounts) will sooner or later stumble across the Global Summary Of Day (GSOD) climate data collection provided by the National Oceanic and Atmospheric Association (NOAA). I’ve been recently looking for available GSOD stations in close vicinity to Mt. Kilimanjaro, Tanzania, and as I am trying to realize most of my coding work using R, I quickly noticed that there are only a few packages that provide convenient tools for processing GSOD data. Therefore, I started to write this package that includes both downloading data sets of selected climate stations for a given time span as well as some processing steps for quality assurance and gap filling.

Introducing the processing chain

Getting started

The starting point for each GSOD-related search query is the selection of a particular station (or even multiple stations). Although there are tools that allow interactive station selection and data acquisition, I thought it was a good thing to implement a couple of search functions to speed things up a little bit.

The GSODTools package comes with a built-in dataset from NOAA’s FTP server holding information about all available GSOD stations that is automatically attached via lazy-loading when loading the package. Let’s have a quick look at it.

##     USAF  WBAN STATION NAME CTRY STATE ICAO   LAT    LON ELEV(M)      BEGIN        END
## 1 007018 99999   WXPOD 7018                  0.00  0.000  7018.0 2011-03-09 2013-07-30
## 2 007026 99999   WXPOD 7026   AF             0.00  0.000  7026.0 2012-07-13 2017-08-22
## 3 007070 99999   WXPOD 7070   AF             0.00  0.000  7070.0 2014-09-23 2015-09-26
## 4 008260 99999    WXPOD8270                  0.00  0.000     0.0 2005-01-01 2012-07-31
## 5 008268 99999    WXPOD8278   AF            32.95 65.567  1156.7 2010-05-19 2012-03-23
## 6 008307 99999   WXPOD 8318   AF             0.00  0.000  8318.0 2010-04-21 2010-04-21

To transform the built-in dataset into a spatial object, either run sf::st_as_sf() manually or use the convenience function gsodDf2Sp(). Past inconveniences (elevation in decimeters, coordinates in thousandth of a degree, longitude and latitude outside of -180 to 180 and -90 to 90, respectively) are no longer an issue, which is why gsodReformat() has been deprecated and will eventually be removed from the package.

# Reformat data and convert to spatial object
gsod_shp <- gsodDf2Sp(data = gsodstations)

par(mar = c(0, 0, 0, 0))
plot(gsod_shp)

Selecting a station

Now that the list of available GSOD stations is in a reasonable format and holds spatial information, the next step would be to select a station you would like to download data from. Using the GIS Data Locator, this involves quite some clicking around until you finally reach the download page. GSODTools offers multiple functions to facilitate station selection and data acquisition, allowing the user to select stations based on spatial characteristics or by hand.

stationFromCoords takes a x (longitude) and y (latitude) coordinate as input, and returns all available GSOD stations that fall within a user-defined buffer around that location. Alternatively, a ‘SpatialPoints’ object may be provided rather than two separate numerics. For instance, let’s search for GSOD stations in a circle of 500 km around Kibo summit, Mt. Kilimanjaro, Tanzania. The referring coordinates are c(37.359031, -3.065053).

shp_kibo <- stationFromCoords(x = 37.359031, y = -3.065053, width = 500)
# or: stationFromCoords(x = c(37.359031, -3.065053), width = 500)
# or: stationFromCoords(x = SpatialPoints(data.frame(x = 37.359031, 
#                                                    y = -3.065053), 
#                                         proj4string = CRS("+init=epsg:4326")), 
#                       width = 500)

rworldmap::mapGriddedData(
  mapRegion = "africa"
  , plotData = FALSE
  , borderCol = "black"
  , addLegend = FALSE
)
points(sf::st_coordinates(shp_kibo), col = "red", pch = 20, cex = 2)

stationFromExtent, just like stationFromCoords, allows station selection based on spatial criteria. There are actually two options how to handle this function. If no bounding box is defined, the user is automatically prompted to manually draw an extent on a map rather than directly supplying specific coordinates. The advantage is that spatial selection is not performed in a circular shape, i.e. in a uniform distance around a given location, but depends on user preferences. With respect to the aforementioned example, this means that GSOD stations in the southern Mt. Kilimanjaro region could be selected rather than all stations in a given distance from the summit. Alternatively, an extent object from an arbitrary spatial object, e.g. ‘RasterLayer’, ‘SpatialPolygons’ etc, may be defined. In this case, drawExtent (which is actually quite difficult to include in a README file) is automatically disabled.

bbox_kibo_south <- sf::st_bbox(
  c(
    xmin = 36.6
    , xmax = 37.72
    , ymin = -3.5
    , ymax = -3.065053
  )
)
shp_kili_south <- stationFromExtent(bb = bbox_kibo_south)

rworldmap::mapGriddedData(
  mapRegion = "africa"
  , plotData = FALSE
  , borderCol = "black"
  , addLegend = FALSE
)
points(sf::st_coordinates(shp_kili_south), col = "red", pch = 20, cex = 2)

The third and, at the moment, final possibility to select a GSOD station is to simply choose a name from the built-in station list. This is, however, a quite tricky approach since you have to know the precise spelling of a station’s name. Again referring to the above example where we selected Arusha, Moshi, and Kilimanjaro International Airport (KIA), this would more or less look like this.

station_names <- c("ARUSHA", "KILIMANJARO INTL", "MOSHI")

(
  shp_kili_south <- 
    gsodstations |> 
    gsodReformat() |> 
    subset(`STATION NAME` %in% station_names) |> 
    gsodDf2Sp()
)

## Simple feature collection with 3 features and 9 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 36.633 ymin: -3.429 xmax: 37.333 ymax: -3.35
## Geodetic CRS:  WGS 84
##         USAF  WBAN     STATION NAME CTRY STATE ICAO ELEV(M)      BEGIN        END              geometry
## 13485 637890 99999           ARUSHA   TZ       HTAR  1386.8 1960-01-11 2022-05-28 POINT (36.633 -3.368)
## 13486 637900 99999            MOSHI   TZ       HTMS   831.0 1949-09-09 2022-05-23  POINT (37.333 -3.35)
## 13487 637910 99999 KILIMANJARO INTL   TZ       HTKJ   893.7 1973-01-01 2022-05-28 POINT (37.074 -3.429)

Downloading data

Ideally, you have now found an appropriate station you would like to acquire data from. It usually takes some patience to click through the download procedure inherent to the GIS Data Locator, and having to perform the same steps again and again can soon get quite frustrating. I implemented a function called dlGsodStations that helps to overcome these inconveniences. The function works with USAF codes that are unique to each GSOD station and, in my opinion, catchier than the associated station names. The USAF code can be determined by having a look at the outcome of the various station selection functions.

# Subset station list by name, and display related USAF code
moshi <- subset(
  gsodstations
  , `STATION NAME` == "MOSHI"
)
head(moshi)

##         USAF  WBAN STATION NAME CTRY STATE ICAO   LAT    LON ELEV(M)      BEGIN        END
## 13486 637900 99999        MOSHI   TZ       HTMS -3.35 37.333     831 1949-09-09 2022-05-23

If you are not willing to download the entire dataset from a given station (which is the default setting), but rather a limited period of time, you can specify the desired years through start_year and end_year. It is possible to adjust the destination folder for data download (dsn) if you do not want to save the data in the current working directory. Furthermore, you can also determine whether to extract the zipped files and remove the .gz files after extraction. In this case, the function returns a data.frame rather than a vector of .gz filepaths. Note that, for each year and station, the function automatically checks if data is already present in the specified destination folder. If so, the referring download will be skipped and the subsequent download query will be evaluated. Here is an example on data download including visualization of the daily mean air temperature from Moshi, Tanzania, for the years 1990 to 2000.

# Not run: Download data from Moshi, Tanzania, from 1990 to 1995
gsod_moshi <- dlGsodStations(usaf = moshi$USAF,
                             start_year = 1990, end_year = 1997,
                             dsn = tempdir(),
                             unzip = TRUE)

# Plot temperature data (but: time series not continuous!)
library(ggplot2)

# Convert temperature column
gsod_moshi$TEMP <- toCelsius(gsod_moshi$TEMP, digits = 1)

ggplot(aes(y = TEMP, x = YEARMODA), data = gsod_moshi) + 
  geom_line(color = "grey50") + 
  stat_smooth(method = "lm") + 
  labs(x = "Index", y = expression("Temperature (" ~ degree ~ C ~ ")")) +
  theme_bw()

Side note: toCelsius

You may have already noticed toCelsius in the preceding code chunk. Indeed, this function is as small as useful, as it converts temperature values from degree Fahrenheit (which is native GSOD format) to degree Celsius.

# RNG
set.seed(10)

# Degree Fahrenheit
val_fah <- runif(10, 60, 80)
# -> Degree Celsius
toCelsius(val_fah, digits = 1)

##  [1] 21.2 19.0 20.3 23.3 16.5 18.1 18.6 18.6 22.4 20.3