BgeeDB / bgee_pipeline

Source code of the Bgee pipeline used to build the Bgee database
https://bgee.org/
Creative Commons Zero v1.0 Universal

Pipeline of Bgee release 15.1

Bgee pipeline documentation: content

General information

  1. Introduction
  2. Directory structure
  3. Pipeline steps

Developer guidelines

  1. Documenting the pipeline
  2. Writing Makefiles
  3. Versioning input/output files
  4. To do at each release of the pipeline

General information

Introduction

Bgee is a database for retrieving and comparing gene expression patterns across multiple animal species, built from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data).

Bgee is based exclusively on curated "normal", healthy, expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression.

Bgee produces ranked calls of presence/absence of expression and of differential over-/under-expression, integrated with information on gene orthology and on homology between organs. This allows expression patterns to be compared between species.

[Figure: Bgee pipeline overview]

Directory structure:

Pipeline steps

Each step of the pipeline is represented by a sub-directory of the pipeline/ directory. See pipeline/README.md for a description of the steps and their configuration.

Developer guidelines

  1. Documenting the pipeline
  2. Writing Makefiles
  3. Versioning input/output files
  4. To do at each release of the pipeline

Documenting the pipeline

The pipeline is documented directly in its relevant directories, through README.md files. See the pipeline/ directory as a starting point.

Recommended sections:

Writing Makefiles

Mandatory variables and import

Two variables used in pipeline/Makefile.common must be defined by each Makefile, PIPELINEROOT and DIR_NAME (both shown in the example below):

These variables must be defined before importing pipeline/Makefile.common. They allow pipeline/Makefile.common to define useful common variables, e.g., VERIFICATIONFILE, INPUT_DIR, OUTPUT_DIR.

Example start of a Bgee Makefile in pipeline/species/:

    PIPELINEROOT := ../
    DIR_NAME := species/
    include $(PIPELINEROOT)Makefile.common
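
The ordering requirement follows from how GNU Make handles simple (`:=`) assignments: their right-hand side is expanded at the moment the line is read. The following is a minimal sketch of what the receiving side could look like, not the actual content of pipeline/Makefile.common. The guards, the assumption that source_files/ and generated_files/ sit next to pipeline/, and the verification file name `step_verification.txt` are all hypothetical.

```makefile
# Hypothetical excerpt; the real pipeline/Makefile.common may differ.

# Fail early if the including Makefile forgot a mandatory variable.
ifndef PIPELINEROOT
$(error PIPELINEROOT must be defined before including Makefile.common)
endif
ifndef DIR_NAME
$(error DIR_NAME must be defined before including Makefile.common)
endif

# ':=' expands immediately, so both variables must already hold a value here.
# Assumed layout: source_files/ and generated_files/ are siblings of pipeline/.
INPUT_DIR        := $(PIPELINEROOT)../source_files/$(DIR_NAME)
OUTPUT_DIR       := $(PIPELINEROOT)../generated_files/$(DIR_NAME)
# Hypothetical file name for the step verification file.
VERIFICATIONFILE := $(OUTPUT_DIR)step_verification.txt
```

Under these assumptions, the pipeline/species/ example above would yield `INPUT_DIR` equal to `../../source_files/species/`.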

Step verification file

The all target of the Makefile must always generate a step verification file, whose path is accessible through the variable $(VERIFICATIONFILE). The all target should either be the first target defined in the Makefile, or be assigned as the default goal (i.e., .DEFAULT_GOAL := all).

This verification file must contain information that makes it easy to assess whether all the targets of the Makefile ran successfully. For instance, to generate this file, a Makefile could query the database to verify that the data were inserted correctly.

Example Makefile targets:

    all: $(VERIFICATIONFILE)

    ...

    # Write to a temporary file first, so that $@ only exists if the query succeeded
    $(VERIFICATIONFILE): dependency1 dependency2 ...
    	@$(MYSQL) -e "SELECT * FROM taxon WHERE bgeeSpeciesLCA = TRUE ORDER BY taxonLeftBound" > $@.temp
    	@$(MV) $@.temp $@
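
When all cannot be the first target in the file, the default goal can be set explicitly. The sketch below is a self-contained illustration of that case. The directory, the file names, and the use of a line count as verification content are hypothetical stand-ins; in a real step `OUTPUT_DIR` and `VERIFICATIONFILE` would come from pipeline/Makefile.common.

```makefile
# Hypothetical stand-ins for variables normally set by pipeline/Makefile.common.
OUTPUT_DIR       ?= generated_files/species/
VERIFICATIONFILE ?= $(OUTPUT_DIR)step_verification.txt

# 'all' is not the first target below, so declare it as the default goal.
.DEFAULT_GOAL := all

# Hypothetical intermediate file produced by this step.
$(OUTPUT_DIR)species.tsv:
	@mkdir -p $(OUTPUT_DIR)
	@printf 'speciesId\tname\n' > $@.temp
	@mv $@.temp $@

# The verification file records evidence that the step completed
# (here, simply the number of lines in the generated file).
$(VERIFICATIONFILE): $(OUTPUT_DIR)species.tsv
	@wc -l < $< > $@.temp
	@mv $@.temp $@

all: $(VERIFICATIONFILE)

.PHONY: all
```

Running `make` with no arguments then builds the verification file, even though another target is defined before all.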

Input/output folders

Input files and files generated by a pipeline run are kept apart from the pipeline scripts, in source_files/ and generated_files/ respectively, so that they can be saved more efficiently. The directory structure in these folders should mirror that of the pipeline/ directory.

No Makefile should read directly from or write directly into the pipeline/ folder.
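
As an illustration of this rule, the sketch below reads only from the input folder and writes only to the output folder. The default paths, the `.sorted.tsv` naming, and the sorting step itself are hypothetical; in the pipeline, `INPUT_DIR` and `OUTPUT_DIR` would be provided by pipeline/Makefile.common.

```makefile
# Hypothetical defaults, as seen from pipeline/species/.
INPUT_DIR  ?= ../../source_files/species/
OUTPUT_DIR ?= ../../generated_files/species/

# Pattern rule: every file read comes from $(INPUT_DIR), every file written
# goes to $(OUTPUT_DIR); nothing is read from or written into pipeline/.
$(OUTPUT_DIR)%.sorted.tsv: $(INPUT_DIR)%.tsv
	@mkdir -p $(OUTPUT_DIR)
	@sort $< > $@.temp
	@mv $@.temp $@
```

Because this fragment only defines a pattern rule, a goal must be named explicitly, e.g. `make ../../generated_files/species/taxa.sorted.tsv` for a hypothetical input file `taxa.tsv`.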

Common variables

When a file name, URL, or similar value is used in several Makefiles, it should be assigned to a variable in pipeline/Makefile.common. Notable variables:

Secured variables

If a variable contains sensitive information (e.g., a password), it should be defined in pipeline/Makefile.Config. The actual values of these variables should not be versioned (this is simpler than encrypting the file).
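
One way to keep real values out of version control is to commit only placeholders and have Make refuse to run until they are replaced locally. This is a sketch of that idea, not the pipeline's actual configuration: the variable names `DBUSER` and `DBPASS`, the placeholder string, and the guard are all assumptions.

```makefile
# Hypothetical pipeline/Makefile.Config; the variable names are assumptions.
# Only placeholders like these are committed; real values stay local.
DBUSER := CHANGE_ME
DBPASS := CHANGE_ME

# Hypothetical guard for a consuming Makefile: stop while the placeholder remains.
ifeq ($(DBPASS),CHANGE_ME)
$(error DBPASS still holds its placeholder; edit pipeline/Makefile.Config locally)
endif
```

As written, the fragment stops with an error until the placeholder is replaced, either by editing the local copy or by overriding the variable on the command line (`make DBPASS=...`).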

Versioning input/output files

Source and generated files should be versioned using git, if not too large. This versioning is not performed automatically by the Makefiles. It is the responsibility of the person in charge to version the relevant files when a step is completed.

To do at each release of the pipeline

The master branch always reflects the pipeline for the current release of Bgee; the develop branch holds the pipeline in development for the next version of Bgee. When you do a new release: