RNAcentral / rnacentral-import-pipeline

RNAcentral data import pipeline
Apache License 2.0


About

This is the main pipeline used internally for loading data into the RNAcentral database. The pipeline is Nextflow based and the main entry point is main.nf.

The pipeline is typically run as:

nextflow run -profile env -with-singularity pipeline.sif main.nf


Configuring the pipeline

The pipeline requires a local.config file to exist and contain some information. Notably, a PGDATABASE environment variable must be defined so data can be imported or fetched. In addition, to import specific databases there must be a params.import_data.databases dict defined. The keys must be known database names and the values should be truthy to indicate that those databases should be imported.

There are more advanced configuration options available, such as turning specific parts of the pipeline, like genome mapping or QA, on or off.
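As a rough sketch, a minimal local.config might look like the following. The connection string, database name (ensembl), and toggle keys (genome_mapping, qa) are illustrative assumptions; check the pipeline's own configuration files for the exact names it expects:

```groovy
// Minimal local.config sketch -- key names below are assumptions
env {
    // Connection string used when importing or fetching data
    PGDATABASE = "postgres://user:password@localhost:5432/rnacentral"
}

params {
    import_data {
        databases {
            // Truthy values mark a database for import
            ensembl = true
        }
    }

    // Hypothetical toggles for optional pipeline stages
    genome_mapping = false
    qa = true
}
```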

Using with Docker

The pipeline is meant to run in docker or singularity. You should build or fetch a suitable container. Some example commands are below.
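For example, one plausible way to produce a pipeline.sif image is to build the repository's Docker image and convert it; the image tag used here is an assumption, not an official name:

```shell
# Build the image from the Dockerfile in the repository root
# (the tag name is an assumption)
docker build -t rnacentral-import-pipeline .

# Convert the local Docker image into a Singularity image file,
# suitable for use with `-with-singularity pipeline.sif`
singularity build pipeline.sif docker-daemon://rnacentral-import-pipeline:latest
```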

Testing

Several tests require fetching some data files prior to testing. The files can be fetched with:

./scripts/fetch-test-data.sh

The tests can then be run using py.test. For example, running Ensembl importing tests can be done with:

py.test tests/databases/ensembl/

Other environment variables

The pipeline requires the NXF_OPTS environment variable to be set to -Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000; a module for doing this is in modules/cluster. In addition, some configuration settings for efficient usage on EBI's LSF cluster are in config/cluster.config.
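If the modules/cluster module is not available in your environment, the variable can be set by hand before launching Nextflow:

```shell
# Set the JVM options Nextflow needs for this pipeline
# (value taken from the section above)
export NXF_OPTS='-Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000'
echo "$NXF_OPTS"
```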

License

See LICENSE for more information.