AustralianAntarcticDataCentre / raadsync

clean up raadtools/raadsync initialization #24

Open mdsumner opened 8 years ago

mdsumner commented 8 years ago

I discovered a few gotchas while setting up a local test repo.

I consider this raadsync territory, not to pass the buck but because raadsync needs to own the collection - it might be used by other toolkits, and it needs all this. Consider this issue a placeholder for assigning tasks.

Also the following is a good example to promote with.

raadsync from scratch

At the Australian Antarctic Division we maintain a collection of publicly available data sets for general use. To maintain and read these data we use the R for Australian Antarctic Division (RAAD) packages raadsync and raadtools.

This document aims to describe the capabilities of the tools used to build, maintain, and use these data sets, and to highlight the exciting new interactive visualizations provided by mapview.

A good example data set is the NSIDC 25km passive microwave sea ice concentration. Here we

The RAAD packages may be installed with devtools from GitHub (please note that the two packages live in different GitHub repositories).

devtools::install_github("AustralianAntarcticDivision/raadtools")
devtools::install_github("AustralianAntarcticDataCentre/raadsync")

Data repository - administrator task

Register a location for the data to be stored locally. This can be anywhere that is writable by the maintainer of the collection.

NOTE Before running this code, please be aware that these tools are designed to download very long time series from dozens of data sets. A single collection can amount to many gigabytes of files, and we hold several terabytes because we tend to register every available data set and keep them all complete and up to date. Not everybody can do this! It is meant for shared resources at a large research institute.

That said, the sea ice concentration data is relatively small and can be obtained on its own. The download below is ...

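Here we create a new directory in our local user directory and set it as the default location understood by RAAD. This should be a shared location for general usage, but to keep this example as widely usable as possible a local user installation is reasonable.

## mustWork = FALSE avoids a warning when the directory does not exist yet
dfd <- normalizePath("~/raad", winslash = "/", mustWork = FALSE)
## showWarnings = FALSE keeps this quiet if the directory is already there
dir.create(dfd, showWarnings = FALSE)
options(default.datadir = dfd)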
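The same setup can be wrapped so that it also checks the location is writable, which is the one requirement stated above. This is a minimal sketch: `ensure_datadir()` is a hypothetical helper, not part of raadsync, and a temporary directory stands in for `~/raad` so the example can be run without side effects.

```r
# Hypothetical helper (not part of raadsync): create the directory if needed,
# confirm it is writable, and return its normalized path.
ensure_datadir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  # file.access() returns 0 on success; mode 2 tests write permission
  if (file.access(path, mode = 2) != 0) {
    stop("data directory is not writable: ", path)
  }
  normalizePath(path, winslash = "/")
}

# A temporary location stands in for "~/raad" in this sketch
dfd <- ensure_datadir(file.path(tempdir(), "raad"))
options(default.datadir = dfd)

# Calling it again is harmless: the directory is reused and no warning is raised
dfd_again <- ensure_datadir(file.path(tempdir(), "raad"))
identical(dfd, dfd_again)
```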

Load raadsync. The first time you do this interactively you need to confirm a setting for caching. Just enter "Y".

library(raadsync)

Read the built-in default config file and process it for only the sea ice data of interest. We use NULL for the local config as we are not overriding any defaults.

cfg <- read_repo_config(local_config_file = NULL)

Explore this configuration data set.

Nothing is set to synchronize.

any(cfg$do_sync)

What data sets are about "ice"?

grep("ice", cfg$name, ignore.case = TRUE, value = TRUE)

We want the NSIDC SMMR-SSM/I Nasateam sea ice concentration, though we will be selective and only obtain the southern hemisphere and a recent time series to save time and storage. These data are excellent for exploration as they are relatively low-volume, delivered in a straightforward binary format on the native Polar Stereographic map projection, with complete spatial coverage for a complete daily time series from 1979 to now. The northern and southern poles are stored separately, and we exclude the north by default here.

dnames <- c("NSIDC SMMR-SSM/I Nasateam sea ice concentration", 
            "NSIDC SMMR-SSM/I Nasateam near-real-time sea ice concentration" )

myconfig <- subset(cfg, name %in% dnames)
myconfig$local_file_root  <- file.path(dfd, "data")
myconfig$do_sync <- TRUE
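One gotcha with `subset(cfg, name %in% dnames)` is that a misspelled name is silently dropped rather than reported. A minimal sketch of a guard follows, using a toy data frame in place of the real configuration. The column names `name` and `do_sync` are taken from the code above; the third data set name is made up for illustration.

```r
# Toy stand-in for the configuration returned by read_repo_config();
# only the two columns used here are included.
cfg <- data.frame(
  name = c("NSIDC SMMR-SSM/I Nasateam sea ice concentration",
           "NSIDC SMMR-SSM/I Nasateam near-real-time sea ice concentration",
           "Some other data set"),  # hypothetical name
  do_sync = FALSE,
  stringsAsFactors = FALSE
)

dnames <- c("NSIDC SMMR-SSM/I Nasateam sea ice concentration",
            "NSIDC SMMR-SSM/I Nasateam near-real-time sea ice concentration")

# Fail early if any requested name is absent from the configuration
missing_names <- setdiff(dnames, cfg$name)
if (length(missing_names) > 0) {
  stop("not found in config: ", paste(missing_names, collapse = "; "))
}

myconfig <- subset(cfg, name %in% dnames)
myconfig$do_sync <- TRUE
nrow(myconfig)
```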

Investigate the download options and modify them to suit. We only want recent data and the southern hemisphere.

myconfig$method_flags

## only this year and last year (probably only 2015 is available for "final" anyway)
myconfig$method_flags[1] <- 
  paste(myconfig$method_flags[1], "--accept=\"*nt_2016*\"", "--accept=\"*nt_2015*\"")

## only this year for near-real-time
myconfig$method_flags[2] <- 
  paste(myconfig$method_flags[2], "--accept=\"*nt_2016*\"")
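The accept patterns can also be generated from a vector of years instead of being typed out. This is a minimal sketch, assuming the same `--accept="*nt_YYYY*"` pattern used above. `accept_flags()` is a hypothetical helper, and `existing_flags` is a placeholder for whatever `myconfig$method_flags` already holds.

```r
# Hypothetical helper: one accept pattern per year, in the form used above
accept_flags <- function(years) {
  sprintf("--accept=\"*nt_%d*\"", as.integer(years))
}

# Placeholder for the flags already present in the configuration
existing_flags <- "<existing flags>"

this_year <- 2016L

# "final" data: this year and last year
flags_final <- paste(c(existing_flags,
                       accept_flags(c(this_year, this_year - 1L))),
                     collapse = " ")

# near-real-time data: this year only
flags_nrt <- paste(c(existing_flags, accept_flags(this_year)), collapse = " ")

cat(flags_final, "\n")
cat(flags_nrt, "\n")
```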

Synchronize away. Please note that this process is time-consuming as it thoroughly checks the remote and local sources, including hash signatures for changed files when possible.

sync_repo(myconfig)

Build the file list cache. This is a convenience mechanism that saves the read functions from scanning the file system. Administrators, please note that the synchronization and file list caching may be set up as routine system jobs to keep everything up to date.

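## create the admin/filelist directory that holds the cache
dir.create(file.path(dfd, "admin", "filelist"), recursive = TRUE)
## list every file under the data directory, as full normalized paths
fs1 <- list.files(file.path(getOption('default.datadir'), 'data'), all.files = TRUE, recursive = TRUE, full.names = TRUE, no.. = TRUE)
fs1 <- normalizePath(fs1, winslash = "/")

## store paths relative to the data root; fixed = TRUE treats the root as a literal string, not a regex
fs <- gsub(paste0(getOption('default.datadir'), "/"), "", fs1, fixed = TRUE)
save(fs, file = file.path(dfd, 'admin', 'filelist', 'allfiles2.Rdata'))
writeLines(fs, file.path(dfd, 'admin', 'filelist', 'allfiles2.txt'))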
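To illustrate what the cache contains, the sketch below builds the same kind of relative file list over a throwaway directory tree and reads it back. The directory and file names are invented for the example, and `fixed = TRUE` is used so that the root path is matched as a literal string rather than as a regular expression.

```r
# Throwaway data root standing in for getOption("default.datadir")
root <- file.path(tempdir(), "raad_cache_demo")
dir.create(file.path(root, "data", "ice"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "admin", "filelist"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "data", "ice", c("a.bin", "b.bin")))  # invented file names
root <- normalizePath(root, winslash = "/")

# Absolute, normalized paths of everything under data/
full <- list.files(file.path(root, "data"), all.files = TRUE, recursive = TRUE,
                   full.names = TRUE, no.. = TRUE)
full <- normalizePath(full, winslash = "/")

# Strip the root prefix so the cache holds paths relative to the root
fs <- sub(paste0(root, "/"), "", full, fixed = TRUE)

# Write the text form of the cache and read it back
cache_file <- file.path(root, "admin", "filelist", "allfiles2.txt")
writeLines(fs, cache_file)
readLines(cache_file)
```

Storing relative paths keeps the cache valid if the whole collection is moved or mounted at a different location.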

Now check what we have in terms of files.

library(raadtools)
icf <- icefiles()
range(icf$date)
## size of the collection in megabytes, pretty small given the limits we set above
sum(file.info(icf$fullname)$size)/1e6

Read the data and plot!

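## option 1: every 14th day across the whole collection
ice <- readice(icf$date[seq(1, nrow(icf), by = 14)])
## option 2 (replaces the result above): every day from 2015-06-09 onwards
ice <- readice(subset(icf, date >= as.POSIXct("2015-06-09"))$date)
library(mapview)
cubeView(ice)

raymondben commented 8 years ago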

Yah, the current NSIDC configuration includes southern hemisphere only but that isn't reflected in the dataset name. Will fix ...