BigelowLab / referee

Scripts for building reference databases
MIT License
0 stars 1 forks source link

referee

Scripts for building reference databases for Maine eDNA project

Here we store scripts and ancillary functions for building reference databases. Data is stored elsewhere.

All scripts are run with a configuration input - we have chosen YAML as the configuration format.

Code

All coding environments can, for now, be established by using source() on a special setup script cleverly named setup.R. You must create a special hidden file that helps this code suite know where to find source code. Use this command in R to set that up for your user.

cat("/some/path/to/the/referee/folder", sep = "\n", filename = "~/.referee")

Obviously you will want to edit the path specification. If you are operating on charlie it will be /mnt/storage/data/edna/packages/referee.

Workflow scripts

Running workflow scripts

Some of the workflows take a long time (days), so you'll want to be able to invoke the workflow and let it run in the background. You have two options.

Use screen

screen is a utility available on linux and macos that allows you to start a processes in a shell, detach from the running process, and then reattach at some later time. In that case you kick off a workflow script within your shell with the familiar call toRscript.

$ Rscript /path/to/worflow/script.R /path/to/config/file.yaml

When available use a scheduler like PBS

Scheduling software is generally available on High Performance Computing (HPC) platforms. We have created a shell script to manage the numerous configurations for each workflow. The shell script makes sure the compute environment has the resources it needs to run. Note that Bigelow's PBS configuration may be sloghtly different than your own. Roughly speaking, you submit a request to enter the workflow into the queue, provide any named optionam arguments followed by the name of the shell script. The shell script will call Rscript (as shown above) after it completes the session setup.

$ qsub -v cfgfile=/path/to/config/file.yaml /path/to/worflow/script.sh

Code organization

Scripts live at the top level (possibly witha companion PBS submission shell script). Each of these source the setup.R file which handles package installation and sourcing ancilalry functionality. Ancillary functionaltiy resides in the the functions subdirectory.

Script organization

Each script is organized roughly like this to make interactive development easy (on the developer!) by providing a default input configuration file path and burying the code steps in a main() function that accepts just one argument, the configuration list.

main = function(cfg){
  charlier::info("starting version: %s", cfg$version)
  # do stuff here
  return(0 or non-zero)
}

source("setup.R")
args = commandArgs(trailingOnly = TRUE)
cfgfile = if (length(args) <= 0)  "/some/path/to/a/default/config.yaml" else args[1]
cfg = refdbtools::read_configuration(cfgfile)
charlier::start_logger(filename = file.path(cfg$output$path, cfg$version))
if (!interactive()){
  ok = main(cfg)
  charlier::info("done!")
  quit(save = "no", status = ok)
}

Fun fact! Ben would like it if all true scripts used the extension .Rscript instead of .R. But nobody asked Ben.

Function organization

Functions are small reueseable bits of code (atoms!) that are assembled into a logical order in the scripts (molecules!). This style of coding is often called modular or atomistic (hence the atoms-molecule paradigm). 99% of the time functions do not invoke the library or require function - put any of those in either the setup.R script (preferred) or under rare circumstances in the script you are running. Currently, functions are gathered into text files, with the .R extension, into groupings around either a workflow/task or around a particular package.

Function documentation

We use the Roxygen style of documenting functions. This style has us prepending documentation lines with #' and using the @keyword style of flagging arguments, ntes, return values, etc. Starting this way makes for very consistent documentation and an easy transition to building a package. Here is an example...

#' Like isTRUE and isFALSE but for vectors
#' 
#' @param x logical vector with or without NAs
#' @param na.rm logical, if TRUE remove NAs before consideration
#' @return logical vector
is_false = function(x = c(TRUE, FALSE, NA), na.rm = FALSE){
  sapply(x, isFALSE)
}

Vocabulary

Accession ID: more here specifically about GenBank

Taxa ID:

Restez dastabasii: plant, vertebrate, invertebrate, mammalian and rodents