SPI-Birds / pipelines

Pipelines for generating a standard data format for bird data

 SPI-Birds Network and Database: Pipelines 

    Blog    Email    Twitter    

 Welcome to the pipeline repository for the SPI-Birds Network and Database. Follow the links above to visit our website, contact us via e-mail or Twitter. This README contains all the details required to work with the pipelines, including workflow guidelines for developers. 


Table of Contents (general user)

  • [Load the pipeline package](#load)
  • [Pipeline documentation](#docs)
  • [Run the pipelines for yourself](#run)

Table of Contents (developer guidelines)

  • [Data storage conventions](#storage)
  • [Naming conventions](#naming)
  • [Recommended workflow](#workflow)
  • [Data requests](#requests)
  • [Archiving](#archiving)
  • [Quality check](#quality_check)

SPI-Birds pipeline: Introduction (for the general user)

Welcome to the SPI-Birds pipeline package. This section of the README will give you an introduction on how to load the package, how to find out details about each pipeline, and how to use the package for creating bird data following the SPI-Birds community data standard and generating standard quality checks.

Load the pipeline package

The pipeline package can be installed in R using the following code with the package remotes:

remotes::install_github("SPI-Birds/pipelines")
library(pipelines)

This will install all pipelines and quality check code on your computer and attach our pipeline package to your R session. Individual pipelines are built as separate functions for each data owner (where one data owner can administer multiple populations). Each function is given the name format_X(), where X is the letter code for the data owner. The codes for different data owners and corresponding populations are described in the SPI-Birds standard protocol. Note that in cases where a data owner administers a single population, the unique three-letter population ID code and the data owner code are identical.

Pipeline documentation

To process each set of primary data into the structure described in the SPI-Birds standard protocol it is often necessary to make assumptions about how each variable is interpreted. All assumptions made during the pipeline process are described in the help documentation for a given function. This can be accessed using the ? in R. For example, to read about the assumptions made when processing data from the NIOO, you can use the code:

?format_NIOO

Run the pipelines for yourself

Each set of primary data is in a slightly different format. Therefore, to run all pipelines successfully, your system will require additional software and drivers (in addition to R).

Set up your computer to run pipelines

Pipelines for some populations require additional software and drivers. Setup instructions describe how to install the required software for both a Windows 10 64-bit operating system and Mac OSX. The setup procedure should be similar for other Windows 64-bit systems. If you are unsure which version of Windows is running on your computer, check 'System Type' in 'System Information'. To run the pipelines for all populations, a user's system must have:

  • Microsoft Access Driver (*.mdb, *.accdb) (Windows only)
  • Python 3
  • Python libraries pandas and pypxlib

Note Users running Mac OSX will not be able to run pipelines with primary data stored in Microsoft Access format without purchasing a commercial driver.


Windows 10 64-bit

Microsoft Access Driver

Firstly, you must check that you are running a 64-bit version of R. Open an R session and see whether you have a 64-bit or a 32-bit installation.

If you do not have a 64-bit version you will need to install one here.


Once you have a 64-bit version of R, search for 'ODBC' in the Windows taskbar. There will be two versions (32-bit and 64-bit); select the 64-bit version. This will open the 'ODBC Data Source Administrator (64-bit)' window.

In the new window check for 'Microsoft Access Driver'. If this already exists you can skip to the Python stage.

If 'Microsoft Access Driver' does not exist click 'Add' to install a new driver on your system.


Select 'Microsoft Access Driver (*.mdb, *.accdb)' and click 'Finish'.

If 'Microsoft Access Driver (*.mdb, *.accdb)' does not appear, you will need to download the 64-bit driver here.

In the next window, you must add a 'Data Source Name'. Everything else can be left blank.

Check if this driver is installed and recognised by R using the function odbcListDrivers() from the odbc package. Note that you will need to open a new session of R before the driver will appear.
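A minimal sketch of this check (assuming the odbc package is installed):

# List the ODBC drivers that R can see; run in a fresh R session
odbc::odbcListDrivers()
# The 'name' column should contain "Microsoft Access Driver (*.mdb, *.accdb)"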

Python 3

To install Python, we recommend using the Anaconda distribution. Make sure to download the 3.X version of Python. The Anaconda distribution comes with some libraries (including pandas) pre-loaded.

Once installed, open the 'Anaconda prompt' and type:

pip install pypxlib

This will install the pypxlib library on your system.
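To confirm that R can find this Python installation and the required libraries, you can run the following (a minimal sketch, assuming the reticulate package is installed):

# Show which Python installation R will use
reticulate::py_config()

# Both should return TRUE once pandas and pypxlib are installed
reticulate::py_module_available("pandas")
reticulate::py_module_available("pypxlib")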

Restart your computer before running the pipelines.

MiKTeX

To generate the pdf quality check report on Windows you will need to have MiKTeX installed. If MiKTeX is not installed, only the html version of the quality check report can be created.

An alternative LaTeX distribution that works well in R is TinyTeX.
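For example, TinyTeX can be installed directly from R:

# Install the tinytex R package and then the TinyTeX distribution itself
install.packages("tinytex")
tinytex::install_tinytex()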

Mac

Microsoft Access Driver

At present, no free Microsoft Access Driver is available for Mac.

As a consequence, the pipelines package currently does not run pipelines requiring a Microsoft Access Driver on Mac OSX (the affected pipelines are skipped and an information message is displayed when attempting to run them on Mac).

Python 3 for Mac

The following notes detail how to set up the Python environment on MacOS, including the necessary libraries:

With this setup, Python should be ready to extract Paradox databases. (Note that when you install Anaconda, the r-reticulate environment should already be present. If that is not the case, you may have to first generate the environment and link it to RStudio.)

Pdf compilation on Mac

At present, the pipelines package does not create pdf outputs when run on a Mac. This is work in progress and will be changed in the future.

Troubleshooting

If you are still unable to run the pipelines following these setup instructions try these troubleshooting techniques:


Running the pipelines

Once your computer is set up and the primary data follow the correct naming protocol, you can run the pipeline function. R will ask you to select the folder where the primary data are stored. You can decide on the output created by the pipeline using the argument output_type, which can be either "csv" (separate .csv files, the default) or "R" (an R object).

format_NIOO(output_type = "R")

If you want to run multiple pipelines at once, you can use the run_pipelines() function instead.
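A minimal sketch of both options (the arguments of run_pipelines() shown here are assumptions; check ?run_pipelines for the exact interface):

# Run a single pipeline and keep the output as an R object
NIOO_data <- format_NIOO(output_type = "R")

# Run several pipelines in one call; output_type is assumed to behave as above
run_pipelines(output_type = "csv")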

Developer guidelines

Data storage conventions

The N drive data folder

All files relevant to SPI-Birds are stored in the N drive on the NIOO server (N:\Dep.AnE\SPI_Birds\data). This data folder contains separate folders for every data owner in the format X_Name_Country, where X is the data owner code, Name is the name of the data owner, and Country is the country where the data owner is based. For example, the NIOO data are stored in the folder:

NIOO_NetherlandsInstituteOfEcology_Netherlands

Data owner folders

The folder for each data owner will contain all relevant information for all populations administered by the data owner. This will usually include:

  • Primary data
  • Meta data
  • Archive meta data
  • The archive folder

The naming convention of each of these files is described below.

The .standard_format folder

In addition to folders for each data owner, the data folder contains the most recent output of all pipelines in the standard format, including an archiving folder. When a data request is made, this version of the standard format can be filtered to meet a given data request (see Data requests below). This is more efficient than re-running pipelines for each data request.

Naming conventions

All files used to run pipelines and store data should follow the standard naming convention.

Primary data

Primary data should be named with the format X_PrimaryData_Y, where X is the data owner code (described above) and Y is additional information used to distinguish between multiple primary data files. For example, a data owner ABC may have separate primary data files for great and blue tits. These files might then be named:

ABC_PrimaryData_Greattit.csv
ABC_PrimaryData_Bluetit.csv

Meta data

All data owners should also provide meta-data about their population(s) in an .xlsx file with the format X_MetaData.xlsx, where X is the data owner code.

Archive meta data

The folder of each data owner will also include an archive meta data .txt file (the archiving process is explained in more detail below). This file will be in the format X_ArchiveMetaData.txt, where X is the data owner code.

Additional files

The data owner may provide other files (e.g. field work protocols, relevant papers). The possible types of files here are unrestricted, so the naming convention must be more flexible. Files can contain any information and be of any file type; however, all files should start with the data owner code. For example, the field protocol for data owner ABC may be stored as:

ABC_FieldProtocol.docx

Pipelines

Code of all pipelines is stored in the /R folder of the pipelines repository. Every pipeline file should follow the naming convention format_X.R, where X is the data owner code. More details on the structure of pipeline code can be found below.

Recommended developer workflow

Below we describe the workflow that any developer should follow when building a new pipeline.

Before starting

  • Contact the data owner and let them know you have started to work on their data. At this point, it is usually helpful to ask about any changes or additions that may have been made to the primary data since it was first included in the SPI-Birds database.

  • Update the SPI-Birds Google Sheet and list the pipeline as 'in progress'.

Create a new branch

  • Pull the newest version of the master branch (git pull).

  • Create a new branch from the master where you will build your pipeline (git checkout -b new_branch_name). Make sure the branch name is clear and concise.

  • As you work, you should stage (git add format_X.R) and commit (git commit -m 'commit header' -m 'commit details') your work regularly.

Note Commits should ideally be distinct blocks of changes with a concise header and detailed description. See some commit best practices here.

Build the pipeline

Note: We recommend you look at other pipelines as a guide.

Create unit tests

Every pipeline should have a set of unit tests in the /tests/testthat folder using the testthat package.
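A minimal sketch of such a test is shown below; the data owner code ABC is hypothetical and the expected table names are illustrative, so adapt them to the pipeline you are testing:

# tests/testthat/test-format_ABC.R (hypothetical data owner ABC)
library(testthat)

test_that("format_ABC returns the standard-format tables", {
  pipeline_output <- format_ABC(output_type = "R")
  expect_true(all(c("Brood_data", "Capture_data", "Individual_data", "Location_data")
                  %in% names(pipeline_output)))
})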

test_pipeline()

Once you have finished the pipeline and written relevant unit tests you should make sure these tests pass.

devtools::check()

Once your branch is passing all unit tests you should next check the package structure. This will more broadly check things like the documentation, check for unreadable characters, and ensure all the code can be loaded. This will not re-run the pipeline unit tests, which are skipped at this stage.

You may see the following NOTE about the number of imported packages:

Imports includes 27 non-default packages.
Importing from so many packages makes the package vulnerable to any of
them becoming unavailable.  Move as many as possible to Suggests and
use conditionally.

Package dependencies are discussed in more detail below.

Tips for passing devtools::check()

"no visible binding for global variable"

This will often occur when working with dplyr code. All references to columns in a data frame should be prefixed by .data$.
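For example, a grouped summary inside a package function might look like this (the function and column names are illustrative; within a package, also import the .data pronoun, e.g. with @importFrom rlang .data, so that the check passes):

# Count broods per population; every column reference is prefixed with .data$
count_broods <- function(Brood_data) {
  grouped <- dplyr::group_by(Brood_data, .data$PopID)
  dplyr::summarise(grouped, n_broods = dplyr::n(), .groups = "drop")
}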

"no visible global function"

All functions except those in the base package should have the package namespace explicitly stated (e.g. stats::na.omit).

"Undocumented arguments in documentation object 'XXX'"

The function XXX includes an argument that is not documented with @param in the roxygen2 documentation code. Check for spelling mistakes!

"Documented arguments not in \usage in documentation object 'XXX'"

The function XXX includes documentation for an argument in @param in the roxygen2 documentation code that does not appear in the function. Check for spelling mistakes!
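As an illustration, every @param entry must match an argument name of the function exactly (the data owner code ABC and the argument names below are hypothetical):

#' Construct standard format for data from hypothetical data owner ABC
#'
#' @param db Location of the primary data.
#' @param output_type Should the output be saved as csv files ("csv") or returned as an R object ("R")?
#'
#' @export
format_ABC <- function(db, output_type = "csv") {
  # Both arguments are documented above, so neither of the two
  # documentation errors described here will be triggered
}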

"Found the following file with non-ASCII characters"

Packages can only include ASCII characters. You can check the character types being used in a line of code with stringi::stri_enc_mark(). For example:

#Will return ASCII
stringi::stri_enc_mark("ABC")

#Will return UTF-8
stringi::stri_enc_mark("是")

Watch out for cases where slanted quotation marks are used (‘’) instead of straight ones ('')! Slanted quotation marks can often be introduced when copying text from outside R, but they are NOT ASCII.

If a non-ASCII character must be used, it can be encoded with unicode \uxxxx.
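For example, the non-ASCII character used above can be written with its unicode escape so that the source file itself contains only ASCII:

#Produces the same string as "是", but the .R file stays ASCII-only
x <- "\u662f"
print(x)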

Create a pull request

Once your pipeline is stable and passes all tests and checks it should be reviewed by other developers.

Note One key aspect of the code review should also be to test the pipelines on both Mac OSX and Windows.

Note The pull request should not be merged until after the data owner confirmation.

Data owner confirmation

The code review should ensure that there are no major bugs or oversights. At this point, we can contact the data owner to discuss the pipeline.

Merge and quality check

Note Remember to pull the newest version of the master branch at this point; it will include the new pipeline.

Update and archive

Data requests

  • A data request will specify the PopIDs and Species of interest. We can return the relevant data in the standard format by running subset_datarqst() on the most recent version of the standard format in the .standard_format folder (see the sketch after this list).

  • You can choose to include or exclude individuals where the species is uncertain using the include_conflicting argument (FALSE by default).

  • Run quality_check() on the subset of the standard format.

  • Provide the user with the subset of the standard format and the quality check report.
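A hedged sketch of this workflow (the population and species codes are made up for illustration, and argument names other than include_conflicting are assumptions; check ?subset_datarqst and ?quality_check for the exact interfaces):

# Subset the most recent standard format for the requested populations and species
requested_data <- subset_datarqst(PopID = c("AAA", "BBB"),
                                  Species = "PARMAJ",
                                  include_conflicting = FALSE)

# Run the quality check on the subset before sending it to the user
quality_check(requested_data)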

Archiving

Archiving a new population

  1. Create a new folder in N:\Dep.AnE\SPI_Birds\data. It should follow the syntax <OWNERID>_<NAME>_<COUNTRY> (e.g. NIOO_NetherlandsInstituteOfEcology_Netherlands).
  2. Rename files.
    • Primary data should follow the syntax <OWNERID>_PrimaryData. If there are multiple primary data files, provide different suffixes to differentiate them (e.g. <OWNERID>_PrimaryData_GTData).
    • Population meta-data should follow the syntax <OWNERID>_MetaData.
    • All other data that is not meta-data or primary data can be named in any way, but should always start with <OWNERID>_
  3. Create the initial archive. The below code will generate an ArchiveMetaData.txt file and an archive folder for the new population. Important: Make sure you specify that this is the initial archive with initial = TRUE.
    archive(data_folder = "N:\\Dep.AnE\\SPI_Birds\\data", OwnerID = <OWNERID>, new_data_date = <DATE WHEN DATA WERE RECEIVED>, initial = TRUE)

Archiving updated data

  1. Rename new files to match existing data files (i.e. with the syntax <OWNERID>_PrimaryData). Important: Files should have the exact same name, otherwise the pipelines may break. If you do need to use new file names (and rework the pipeline) you will be given a prompt to continue.
  2. Decide if we are dealing with a 'minor' update (e.g. fix typos) or a 'major' update (e.g. new year of data).
  3. Run archiving code:
    archive(data_folder = "N:\\Dep.AnE\\SPI_Birds\\data", OwnerID = <OWNERID>, update_type = <"major"/"minor">,
        new_data_path = <LOCATION OF NEW FILES. Consider using choose.files()>,
        new_data_date = <DATE WHEN DATA WERE RECEIVED>, initial = FALSE)

Archiving standard format data

Note: this step is still done manually and needs to be updated. Every time a new pipeline is finished, we should add the newest version of the standard format in .standard_format and also in a new folder .standard_format/archive/.

Quality check

Note: the quality check is built for pipelines tailored to versions 1.0.0 and 1.1.0 of the standard format. Updating the quality checks to match pipelines tailored to version 2.0.0 of the standard format is in progress.

Creating checks

The quality_check() function is a wrapper function that combines 4 dataset-specific wrapper functions:

  • brood_check()
  • capture_check()
  • individual_check()
  • location_check()

Each of the dataset-specific functions contains a series of individual quality check functions. These individual quality check functions should be named ‘check’ or ‘compare’ followed by a short description of the check and come with a CheckID (e.g. B2 is the second individual check within the brood_check() wrapper).

All individual checks should function on rows and flag records as ‘warning’ (unlikely values) or ‘potential error’ (impossible values).
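A minimal sketch of what such a row-wise check might look like (the function name, column name, and thresholds are illustrative, not the package's actual implementation):

# Flag impossible clutch sizes as potential errors and unlikely ones as warnings
check_clutch_size <- function(Brood_data) {
  dplyr::mutate(Brood_data,
                Error = .data$ClutchSize < 0,     # impossible value
                Warning = .data$ClutchSize > 20)  # unlikely value
}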

Approve-listing

Approve-listed records (i.e. flagged records that are subsequently verified by data owners) should not be flagged by the checks.

If the data owner verifies any records flagged by the quality check (i.e. classifies them as legitimate values) add them to brood_approved_list.csv, capture_approved_list.csv, individual_approved_list.csv or location_approved_list.csv.

Running quality check

The quality check is run on data in the standard format using quality_check().

The output of the quality check includes:

  • A summary table of which checks resulted in warnings and potential errors
  • The pipeline output, where each table of the standard format includes two additional columns (Warning and Error) marking the records that resulted in warnings and potential errors
  • A report (in html and/or pdf) with a description of all checks and a list of all warnings and potential errors that have been flagged in the pipeline output.

Troubleshooting

If you have any issues with running the quality check, try these troubleshooting tips:

  • Often pipelines make use of several grouping structures (inserted by dplyr::group_by() or dplyr::rowwise()). Removing these groups (by dplyr::ungroup() or dplyr::summarise(..., .groups = "drop")) reduces the run time of the quality check considerably.
  • If you have trouble creating the pdf, try setting the LaTeX engine to LuaLaTeX (i.e. quality_check(..., latex_engine = "lualatex")).