FredHutch / invadeseq

Analysis of combined 10X human and microbial data
MIT License

INVADEseq

Input Data

To run INVADEseq, the user must provide single-cell sequencing data produced by the 10X Chromium platform, both for gene expression (GEX) and for 16S-enriched sequences. To specify the location of the input data, the user constructs a manifest file listing the paired datasets which were produced from each sample.

The inputs to this workflow are the FASTQ files output by cellranger mkfastq. Those FASTQ files are tagged with the sample names used when preparing the samples for the 10X Chromium platform. All of the FASTQ files used as inputs must be contained at some level within a shared directory.

The manifest file is a CSV with the column names sample, gex, and microbial. Each row describes one biological source sample: the gex and microbial columns hold the dataset IDs of the GEX and 16S-enriched datasets produced from that sample.

For example:

sample,gex,microbial
sampleA,sampleA_gex,sampleA_microbial
sampleB,sampleB_gex,sampleB_microbial
sampleC,sampleC_gex,sampleC_microbial
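A manifest in this format can be sanity-checked before launching a run. The following is a minimal sketch, not part of the workflow: the function name `read_manifest` is hypothetical, and the checks it applies (all three columns present, no empty values, no repeated sample names) are assumptions about what a well-formed manifest looks like rather than checks the workflow is known to perform.

```python
import csv
import io

# Column names taken from the manifest format described above.
REQUIRED_COLUMNS = ["sample", "gex", "microbial"]


def read_manifest(handle):
    """Parse a manifest CSV from an open text handle.

    Returns a list of dicts with the keys sample, gex, and microbial.
    Raises ValueError if a column is missing, a value is empty, or a
    sample name is repeated (the last two checks are assumptions).
    """
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in fieldnames]
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")

    rows = []
    seen = set()
    # Data rows start on line 2, after the header.
    for line_number, row in enumerate(reader, start=2):
        values = {c: (row[c] or "").strip() for c in REQUIRED_COLUMNS}
        empty = [c for c, v in values.items() if not v]
        if empty:
            raise ValueError(f"line {line_number}: empty value for {empty}")
        if values["sample"] in seen:
            raise ValueError(f"line {line_number}: repeated sample {values['sample']!r}")
        seen.add(values["sample"])
        rows.append(values)
    return rows


if __name__ == "__main__":
    example = io.StringIO(
        "sample,gex,microbial\n"
        "sampleA,sampleA_gex,sampleA_microbial\n"
        "sampleB,sampleB_gex,sampleB_microbial\n"
    )
    for row in read_manifest(example):
        print(row["sample"], row["gex"], row["microbial"])
```

To check a real file, open it with `open(path, newline="")` and pass the handle to `read_manifest`.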

The manifest file must be provided to the workflow using the parameter manifest.

The root folder which contains all of the FASTQ files used in the analysis must be provided with the parameter fastq_dir. Note that any files with the extension .fastq.gz can be used in the analysis, even if they are nested within additional subfolders.
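The recursive lookup described above can be previewed with a short script. This is a sketch under stated assumptions, not the workflow's own logic: `find_fastqs` and `datasets_without_fastqs` are hypothetical helpers, and the matching rule (a file belongs to a dataset when its name starts with the dataset ID followed by an underscore, as in the file names cellranger mkfastq typically writes) is an assumption about how files relate to the IDs in the manifest.

```python
import tempfile
from pathlib import Path


def find_fastqs(fastq_dir):
    """Return every .fastq.gz file under fastq_dir, at any depth, sorted by path."""
    return sorted(p for p in Path(fastq_dir).rglob("*.fastq.gz") if p.is_file())


def datasets_without_fastqs(dataset_ids, fastq_paths):
    """Return the dataset IDs that have no matching FASTQ file.

    ASSUMPTION: a file matches a dataset when its name starts with
    '<dataset_id>_'. The workflow may match files differently.
    """
    names = [p.name for p in fastq_paths]
    return [d for d in dataset_ids if not any(n.startswith(d + "_") for n in names)]


if __name__ == "__main__":
    # Build a throwaway directory tree so the example is self-contained.
    with tempfile.TemporaryDirectory() as root:
        nested = Path(root) / "run1" / "outs"
        nested.mkdir(parents=True)
        (nested / "sampleA_gex_S1_L001_R1_001.fastq.gz").touch()
        (nested / "sampleA_gex_S1_L001_R2_001.fastq.gz").touch()

        found = find_fastqs(root)
        print(len(found), "FASTQ files found")
        print("missing:", datasets_without_fastqs(["sampleA_gex", "sampleA_microbial"], found))
```

Running a check like this against the real fastq_dir and the IDs in the manifest can catch a misspelled dataset ID before any compute time is spent.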

Reference Data

The user must provide reference databases for both the CellRanger-compatible transcriptome and the PathSeq database.

Those databases are provided using the parameters:

Running the Workflow

Nextflow

The workflow can be run using the Nextflow workflow management system, which can be set up following their user documentation.

Containers

The software used in each step of the workflow has been provided via Docker containers which are specified in the workflow. Those software containers can be used either via Docker (installation instructions) or Singularity (installation instructions). Singularity is typically used on HPC systems which do not allow users the root access needed for running Docker.

After installing either Docker or Singularity, Nextflow must be configured to use whichever system is available. The most convenient way to set up this configuration is to create a file called nextflow.config which follows the configuration instructions for Nextflow. It is also possible to set up other types of job execution systems (e.g. AWS, Google Cloud, Azure, SLURM, PBS) which can be managed directly by Nextflow. This configuration file can be reused across multiple runs of the workflow on the same computational system.

Reference Databases

To run the workflow, two reference databases are required. The PathSeq database can be downloaded from the Broad FTP server. The CellRanger database can be downloaded from the 10X Genomics website. The path to the root directory of each database is required to run the workflow.

Parameters

For each individual run, a file containing that run's parameters should be created in JSON format, typically called params.json. The required parameters for the workflow are:
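As an illustration of the file format only, the sketch below writes a params.json containing the two parameters named earlier in this README, manifest and fastq_dir. The helper `write_params` and the example paths are hypothetical. The parameter names for the two reference databases are not given in this text, so they are left to the caller through the `extra` argument and should be taken from the workflow itself.

```python
import json
import tempfile
from pathlib import Path


def write_params(path, manifest, fastq_dir, extra=None):
    """Write a JSON parameters file and return the dict that was written.

    Only `manifest` and `fastq_dir` are named in this README. Any other
    parameters (for example the reference database locations) must be
    supplied through `extra` using the names the workflow expects.
    """
    params = {"manifest": str(manifest), "fastq_dir": str(fastq_dir)}
    params.update(extra or {})
    Path(path).write_text(json.dumps(params, indent=2) + "\n")
    return params


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        out = Path(workdir) / "params.json"
        # Placeholder paths; replace with the real locations for the run.
        write_params(out, manifest="manifest.csv", fastq_dir="/path/to/fastqs")
        print(out.read_text())
```

Nextflow can read a file like this through its `-params-file` option.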

Authors

Analysis code was written by Hanrui Wu (hwu at fredhutch dot org). Workflow code was written by Samuel Minot (sminot at fredhutch dot org).