genome / analysis-workflows

Open workflow definitions for genomic analysis from MGI at WUSM.
MIT License

Pangenome: RNA-seq Pipeline Requirements #1055

Open jasonwalker80 opened 2 years ago

jasonwalker80 commented 2 years ago

Determine the requirements and potential options for the basis of the Pangenome Annotation RNA-seq pipeline.

chad388 commented 2 years ago

The pipeline for evaluating Illumina RNA-seq data that Xiaoyu expressed interest in is the Nextflow-based nf-core/rnaseq pipeline, available in this repo: https://github.com/nf-core/rnaseq

This software is designed to use Nextflow and to run in containers (Docker in our case) for maximum reproducibility. Conda can also be used to set up the software dependencies, but this approach is discouraged.

Quick Start

- Install Nextflow (>=21.04.0).
- Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (please only use Conda as a last resort; see docs).

Note: This pipeline does not currently support running with Conda on macOS if the --remove_ribo_rna parameter is used, because the latest version of the SortMeRNA package is not available for this platform.

Download the pipeline and test it on a minimal dataset with a single command: nextflow run nf-core/rnaseq -profile test,<docker/singularity/podman/shifter/charliecloud/conda/institute>
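As a rough sketch of the Quick Start steps above, the script below checks the installed Nextflow version against the stated minimum and then prints (or, with DRY_RUN=0, runs) the test command with the Docker profile. The helper names `version_ge` and `test_command`, the `PROFILE` and `DRY_RUN` variables, and the assumed output format of `nextflow -v` are illustrative assumptions, not part of nf-core/rnaseq.

```shell
#!/bin/sh
# Minimum Nextflow version quoted in the Quick Start.
MIN_NEXTFLOW="21.04.0"
# Container profile; "docker" matches our intended setup.
PROFILE="${PROFILE:-docker}"

# version_ge A B: succeeds when version A >= version B.
# Assumes a `sort` that supports -V (GNU coreutils, recent BSD/macOS).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: prints the Quick Start test command for a given profile.
test_command() {
    printf 'nextflow run nf-core/rnaseq -profile test,%s\n' "$1"
}

if command -v nextflow >/dev/null 2>&1; then
    # Assumption: `nextflow -v` prints "nextflow version X.Y.Z.BUILD".
    installed="$(nextflow -v | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p')"
    if version_ge "$installed" "$MIN_NEXTFLOW"; then
        # DRY_RUN=1 (the default here) only prints the command.
        if [ "${DRY_RUN:-1}" = "1" ]; then
            test_command "$PROFILE"
        else
            nextflow run nf-core/rnaseq -profile "test,$PROFILE"
        fi
    else
        echo "Nextflow '$installed' is older than $MIN_NEXTFLOW" >&2
    fi
else
    echo "nextflow not found on PATH; would run: $(test_command "$PROFILE")"
fi
```

Defaulting to a dry run keeps the sketch safe to try on a machine where Docker is not yet configured; set DRY_RUN=0 once the environment is ready.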

There is a user-friendly GUI available at the URL below for specification of parameters to use when running the pipeline: https://nf-co.re/launch?id=1635975833_2626a88494eb

A custom configuration file can be created to define generic settings for our compute environment: https://github.com/nf-core/configs#using-an-existing-config

These custom configuration files can be hosted in the nf-core/configs GitHub repo. Some examples of custom configuration files used by other institutions can be found on this page: https://github.com/nf-core/configs/tree/master/conf
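To make the idea of a site-specific configuration file concrete, here is a minimal sketch that writes one from a shell script and prints the command that would load it through Nextflow's `-c` option. The file name, the choice of the LSF executor, and the queue name `general` are placeholders chosen for illustration; the real values would need to come from our compute environment.

```shell
#!/bin/sh
# Hypothetical file name; override with CONFIG_FILE=/path/to/file.
CONFIG_FILE="${CONFIG_FILE:-./example_institute.config}"

# Write a small Nextflow config. All values below are placeholders.
cat > "$CONFIG_FILE" <<'EOF'
// Run every process inside a Docker container.
docker.enabled = true

process {
    // Assumption: jobs are submitted to an LSF cluster.
    executor = 'lsf'
    // Hypothetical queue name; replace with a real one.
    queue = 'general'
}
EOF

# Print (rather than run) the command that would use this file.
echo "nextflow run nf-core/rnaseq -profile test -c $CONFIG_FILE"
```

Keeping the file local and passing it with `-c` lets us iterate on settings before deciding whether to contribute a profile to nf-core/configs.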

We discussed setting up this pipeline to run as is, using Nextflow, and then potentially converting the entire pipeline, or components of it, into a custom WDL pipeline.

Other RNA-seq pipelines that we could potentially use, either whole or in part, are listed below. These pipelines are already in WDL or CWL format:

The MGI rnaseq pipeline: https://github.com/genome/analysis-workflows/blob/master/definitions/pipelines/rnaseq.cwl

The ENCODE-DCC/rna-seq-pipeline: https://github.com/ENCODE-DCC/rna-seq-pipeline/blob/dev/rna-seq-pipeline.wdl

Xiaoyu seems to favor the use of the nf-core/rnaseq pipeline. He prefers the use of Trim Galore for adapter trimming. We could likely package and incorporate Trim Galore into one of these other pipelines as well.

We can discuss the best approach, but we likely need to get started on something soon.

I feel that setting up and testing the nf-core/rnaseq pipeline might be the best initial approach. After that, perhaps we can develop a custom workflow based on components from this pipeline as well as components from the MGI rnaseq and ENCODE-DCC pipelines, but I am open to other ideas that would make this easier. I can certainly go back to Xiaoyu and ask whether he is open to using one of the other two pipelines that are already in CWL or WDL format.