Pandora-IsoMemo / iso-data

ETL for IsoMemo Database
https://pandora-isomemo.github.io/iso-data/
GNU General Public License v3.0

IsoMemo Data Package (iso-data)

Infrastructure

Content

This README contains instructions on how to:

  1. add a new data source,
  2. modify an existing data source,
  3. test the ETL process and the code,
  4. access the data via the API.

Add a New Data Source

There are two ways to add a new data source, depending on where the data is retrieved from:

  1. data retrieved from a MySQL database -> execute the function createNewDBSource()
  2. data retrieved from a local or remote file -> execute the function createNewFileSource()

Executing one of these functions will automatically:

  1. create a new file R/02-<datasource>.R that contains the function to extract the data: extract.<datasource>(),
  2. add a new entry for the source into the file R/00-databases.R,
  3. (only for MySQL databases) create/update the .Renviron file that contains the database credentials.

The files R/02-<datasource>.R for the different data sources may contain individual, sometimes extensive, data preparation steps that can be adjusted manually. For an example, see R/02-LiVES.R, and read the section Modify an Existing Data Source.

Specify the type of data and the data mapping

For both ways of adding a data source (from a database or from a file), four mandatory parameters must be specified: dataSourceName, datingType, coordType, and mappingName.

Specify the data source

MySql database:

Here, the database credentials <dbName>, <dbUser>, <dbPassword>, <dbHost>, <dbPort> and the <tableName> must be specified. The credentials are never stored in any file that is uploaded to GitHub; they are only needed for local development and for testing the database connection.

createNewDBSource(dataSourceName = <datasource>,
                  datingType = <datingType>,
                  coordType = <coordType>,
                  mappingName = <mappingName>,
                  dbName = <dbName>,
                  dbUser = <dbUser>,
                  dbPassword = <dbPassword>,
                  dbHost = <dbHost>,
                  dbPort = <dbPort>,
                  tableName = <tableName>)

File:

Data can be loaded either

  1. from a local file, or
  2. from a remote location.

Please set <location> = "local" in the first case, and <location> = "remote" in the second case.

Please provide the <filename> with its extension (only *.csv or *.xlsx are supported), e.g. "data.csv" or "14SEA_Full_Dataset_2017-01-29.xlsx".

Optionally, the following parameters can be specified: remotePath (for remote files), sheetNumber (for *.xlsx files), and sep and dec (the field separator and decimal mark for *.csv files).

createNewFileSource(dataSourceName = <datasource>,
                    datingType = <datingType>,
                    coordType = <coordType>,
                    mappingName = <mappingName>,
                    fileName = <filename>,
                    locationType = <location>,
                    remotePath = <remotePath>,
                    sheetNumber = 1,
                    sep = ";",
                    dec = ",")

Modify an Existing Data Source

Data extraction for all data sources is defined in the files R/02-<datasource>.R. Within the function extract.<datasource>() you can retrieve the data and modify values as you like. You only need to ensure these points:

  1. the function receives a list x as its argument,
  2. the prepared data is assigned to the list element x$dat,
  3. the function returns x.

A minimal example of the extract function looks like this:

extract.testdb <- function(x) {
    dat <- mtcars # dummy dataset

    x$dat <- dat # assign data to list element x$dat

    x # return x
}
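As a self-contained sketch, the minimal function above can be exercised directly; the empty list stands in for the object that the ETL process would normally pass in (an assumption based on the example):

```r
# Minimal, runnable sketch of the extract function from the example above.
extract.testdb <- function(x) {
    dat <- mtcars # dummy dataset

    x$dat <- dat # assign data to list element x$dat

    x # return x
}

# Hypothetical usage: in the real ETL process, x is supplied by the package;
# here an empty list stands in for it.
x <- extract.testdb(list())
head(x$dat)
```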

ETL process of the Data Sources

Test the ETL process

Run the following commands in R to install the package locally and run the extract function.

devtools::install() # install package locally
devtools::load_all() # load all functions from package

res <- etlTest()

Inspect the results in res. Data from the n-th data source will be in the element res[[n]]$dat.

IMPORTANT: Only 5 rows will be processed during the test! If you want to process all data, specify full = TRUE:

res <- etlTest(full = TRUE)

To test only the n-th data source, execute the function like this:

res <- etlTest(databases()[n])

Results will be in the object res[[1]]$dat
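The structure described above can be illustrated with a stand-in result object (a sketch only; the real res comes from etlTest()):

```r
# Sketch: a stand-in for the etlTest() result, assuming the structure
# described above (a list with one element per data source, data under $dat).
res <- list(list(dat = head(mtcars, 5))) # 5 rows, as in a non-full test run

dat <- res[[1]]$dat # data from the first data source
nrow(dat)           # 5
```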

Test the Code

Test your code by running

devtools::check()

Deployment

Code from the main branch will be automatically deployed to the production system on the MPI server (given a successful devtools::check()) and will be available in the main version of the API and App.

Likewise, code from the beta branch will be automatically deployed to the beta version of the API and App.

Access to Data

Data is returned in JSON format.

You can use the following query parameters, as in the example call below: dbsource (the data source), category (the variable category), and field (a comma-separated list of field names).

Example call:

https://isomemodb.com/api/v1/iso-data?dbsource=LiVES&category=Location&field=site,longitude
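A hedged sketch of reading this endpoint from R (assumes the jsonlite package is available; the URL and parameters are taken from the example call above):

```r
# Build the example query URL from the call above.
url <- paste0(
  "https://isomemodb.com/api/v1/iso-data",
  "?dbsource=LiVES&category=Location&field=site,longitude"
)

# Parse the JSON response if jsonlite is installed (requires network access).
if (requireNamespace("jsonlite", quietly = TRUE)) {
  res <- jsonlite::fromJSON(url)
  str(res)
}
```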

Helper endpoints

For the production API, use /api instead of /testapi.