TomHarrop / tcdemux

GNU General Public License v3.0

tcdemux

Demultiplex files and prepare reads for the target capture analysis pipeline.

  1. Check external barcodes
  2. If the libraries are pooled, demux by internal barcode
  3. Verify pairing
  4. Trim adaptors
  5. Mask low-complexity regions and, optionally, trim low-quality bases
  6. Collect stats on steps 1 to 5

Demultiplexing by internal barcode is done with cutadapt. The other steps are done with bbmap scripts.

Read pairing is maintained by processing the R1 and R2 files together. Singletons generated by adaptor trimming are written to a separate file with "unpaired" in its name.

Installation

tcdemux is in bioconda. You can use the biocontainer hosted on quay.io with Docker or Apptainer/Singularity, e.g.:

# apptainer / singularity
apptainer exec \
    docker://quay.io/biocontainers/tcdemux:0.0.17--pyhdfd78af_0 \
    tcdemux

# Docker
docker pull \
   quay.io/biocontainers/tcdemux:0.0.17--pyhdfd78af_0

You can also install it with conda, e.g.

mamba create --name tcdemux tcdemux

Manual installation

Manual installation is not supported, but if you need to do it, here are the steps:

  1. Install bbmap and make sure it's in your path.
  2. Install R with the packages data.table, bit64, ggplot2 and viridis.
  3. Install tcdemux with python3 -m pip install git+https://github.com/tomharrop/tcdemux.git (GitHub no longer serves the unauthenticated git:// protocol). pip will install the Python dependencies biopython, cutadapt, pandas and snakemake.

Usage

External barcodes only

tcdemux requires a sample_data file in csv format with the fields name, i5_index, i7_index, r1_file, and r2_file.

Provide the csv to tcdemux using the --sample_data argument.

Here's an example sample_data file:

i5_index,i7_index,name,r1_file,r2_file
AGCGCTAG,CCGCGGTT,sample1,sample1_r1.fastq,sample1_r2.fastq
GAACATAC,GCTTGTCA,sample2,sample2_r1.fastq,sample2_r2.fastq

tcdemux will process the sample1 and sample2 files separately, resulting in output files called sample1.r1.fastq.gz, sample1.r2.fastq.gz and sample1.unpaired.fastq.gz, and the equivalents for sample2. tcdemux does not demultiplex the samples in this case.
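
Before running the pipeline, it can help to confirm that the csv has the required fields. The following is a minimal sketch using only the Python standard library. The helper names `missing_fields` and `expected_outputs` are hypothetical and are not part of tcdemux; the output names simply follow the pattern described above.

```python
import csv
import io

# Fields the README lists as required in the sample_data file.
REQUIRED_FIELDS = {"name", "i5_index", "i7_index", "r1_file", "r2_file"}


def missing_fields(csv_text):
    """Return the required fields that are absent from the csv header, sorted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted(REQUIRED_FIELDS - set(reader.fieldnames or []))


def expected_outputs(name):
    """Output file names described in the README for one sample (hypothetical helper)."""
    return [
        f"{name}.r1.fastq.gz",
        f"{name}.r2.fastq.gz",
        f"{name}.unpaired.fastq.gz",
    ]


example = """i5_index,i7_index,name,r1_file,r2_file
AGCGCTAG,CCGCGGTT,sample1,sample1_r1.fastq,sample1_r2.fastq
GAACATAC,GCTTGTCA,sample2,sample2_r1.fastq,sample2_r2.fastq
"""

print(missing_fields(example))
print(expected_outputs("sample1"))
```

An empty list from `missing_fields` means every required field is present. Column order does not matter in this sketch.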

Additional, internal barcodes

If the sample_data file also has a pool_name field, tcdemux will demultiplex the pools by internal index sequence. This also requires the internal_index_sequence field in the csv.

Here's an example sample_data file:

pool_name,i5_index,i7_index,name,internal_index_sequence,r1_file,r2_file
pool1,AGCGCTAG,CCGCGGTT,sample1,GTGACATC,pool_r1.fastq,pool_r2.fastq
pool1,AGCGCTAG,CCGCGGTT,sample2,ACTGGCTA,pool_r1.fastq,pool_r2.fastq

In this case, sample1 and sample2 are multiplexed in pool1 with internal barcodes. tcdemux will demultiplex the pool before trimming and masking, resulting in the same files as above.
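
To see how the pooled layout maps internal barcodes to samples, the csv can be grouped by pool_name. This is only an illustrative sketch, not the tcdemux implementation. The function name `index_to_sample_by_pool` is hypothetical, and the duplicate-barcode check is an assumption about what a sensible input looks like rather than documented tcdemux behaviour.

```python
import csv
import io
from collections import defaultdict


def index_to_sample_by_pool(csv_text):
    """Map each pool_name to {internal_index_sequence: sample name}."""
    pools = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        pool = row["pool_name"]
        barcode = row["internal_index_sequence"]
        # Assumption: an internal barcode should identify one sample per pool.
        if barcode in pools[pool]:
            raise ValueError(f"duplicate internal index {barcode} in {pool}")
        pools[pool][barcode] = row["name"]
    return dict(pools)


pooled_example = """pool_name,i5_index,i7_index,name,internal_index_sequence,r1_file,r2_file
pool1,AGCGCTAG,CCGCGGTT,sample1,GTGACATC,pool_r1.fastq,pool_r2.fastq
pool1,AGCGCTAG,CCGCGGTT,sample2,ACTGGCTA,pool_r1.fastq,pool_r2.fastq
"""

print(index_to_sample_by_pool(pooled_example))
```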

Sample names

Sample names will be checked for characters that are not uppercase or lowercase letters, digits, or underscores. The names will also be checked for double underscores. If any name fails these checks, the pipeline will print a message and exit.

These characters cause issues for other software used in target capture analysis.

You can fix this by changing the names in the sample_data and running tcdemux again.
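
You can pre-check names yourself before running the pipeline. The sketch below encodes the two rules described above as a regular expression plus a substring test. The exact pattern tcdemux uses internally is not shown here, so treat `is_valid_sample_name` as a hypothetical approximation.

```python
import re

# Letters, digits and underscores only (an assumption about the exact pattern).
ALLOWED = re.compile(r"[A-Za-z0-9_]+")


def is_valid_sample_name(name):
    """True if the name uses only allowed characters and has no double underscore."""
    return ALLOWED.fullmatch(name) is not None and "__" not in name


for candidate in ["sample1", "sample_1", "sample-1", "sample__1"]:
    print(candidate, is_valid_sample_name(candidate))
```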

tcdemux does not allow barcode errors

External barcodes are checked for errors before trimming and masking, and reads with barcode errors are discarded.

Barcode errors are sometimes allowed in the Illumina workflow. You can check if your fastq files have barcode errors like this:

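# Header lines are every fourth line, starting with the first.
# (Matching on a leading '@' can also pick up quality lines.)
awk 'NR % 4 == 1' path/to/file.fastq \
    | head -n 1000 \
    | cut -d':' -f10 \
    | sort \
    | uniq -c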

If you see more than one barcode, then barcode errors were allowed in the Illumina workflow.
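
The same check can be sketched in Python. It assumes four-line FASTQ records and an Illumina-style header in which the tenth colon-separated field holds the index sequence. The function name `tally_header_barcodes` and the example headers are made up for illustration.

```python
import io
from collections import Counter


def tally_header_barcodes(handle, max_reads=1000):
    """Count the barcode strings found in the first max_reads FASTQ headers."""
    counts = Counter()
    for line_number, line in enumerate(handle):
        if line_number // 4 >= max_reads:
            break
        # Headers are identified by position, not by a leading '@',
        # because quality lines can also start with '@'.
        if line_number % 4 == 0:
            fields = line.rstrip("\n").split(":")
            if len(fields) >= 10:
                counts[fields[9]] += 1
    return counts


# Two made-up records whose i5 index differs by one base.
example_fastq = """@M00123:45:000000000-ABCDE:1:1101:15589:1332 1:N:0:AGCGCTAG+CCGCGGTT
ACGT
+
@@@@
@M00123:45:000000000-ABCDE:1:1101:15590:1333 1:N:0:AGCGCTAT+CCGCGGTT
ACGT
+
IIII
"""

for barcode, count in tally_header_barcodes(io.StringIO(example_fastq)).items():
    print(count, barcode)
```

For a real file, pass an open file handle instead of `io.StringIO`, for example one from `gzip.open(path, "rt")` for compressed reads.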

tcdemux uses exact barcode matches with no errors allowed when it demultiplexes by internal barcode.

Other options

You also need to provide paths to the raw read directory and an output directory, and at least one adaptor file for trimming.

If you want to keep the intermediate files, pass the --keep_intermediate_files argument.

The pipeline uses 5 threads and about 8 GB of RAM per sample. Provide multiples of these using the --threads and --mem_gb arguments.
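
As a rough guide, the values to pass can be worked out from the per-sample figures above. This sketch assumes that "multiples" means one multiple per sample you want processed at the same time; the 8 GB figure is approximate, and `resources_for` is a hypothetical helper.

```python
THREADS_PER_SAMPLE = 5
MEM_GB_PER_SAMPLE = 8  # "about 8 GB" per sample, so treat this as an estimate


def resources_for(parallel_samples):
    """Suggest --threads and --mem_gb values for a number of concurrent samples."""
    if parallel_samples < 1:
        raise ValueError("parallel_samples must be at least 1")
    return {
        "threads": THREADS_PER_SAMPLE * parallel_samples,
        "mem_gb": MEM_GB_PER_SAMPLE * parallel_samples,
    }


print(resources_for(3))
```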

usage: tcdemux [-h] [-n] [--threads int] [--mem_gb int] [--restart_times RESTART_TIMES]
               --sample_data SAMPLE_DATA_FILE --read_directory READ_DIRECTORY --adaptors
               ADAPTOR_FILES [ADAPTOR_FILES ...] --outdir OUTDIR
               [--keep_intermediate_files | --no-keep_intermediate_files]

options:
  -h, --help            show this help message and exit
  -n                    Dry run
  --threads int         Number of threads.
  --mem_gb int          Amount of RAM in GB.
  --restart_times RESTART_TIMES
                        number of times to restart failing jobs (default 0)
  --sample_data SAMPLE_DATA_FILE
                        Sample csv (see README)
  --read_directory READ_DIRECTORY
                        Directory containing the read files
  --adaptors ADAPTOR_FILES [ADAPTOR_FILES ...]
                        FASTA file(s) of adaptors. Multiple adaptor files can be used.
  --outdir OUTDIR       Output directory
  --keep_intermediate_files, --no-keep_intermediate_files
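
If you launch tcdemux from a script, the argument list can be assembled from the options shown above. The sketch below only builds and prints the command; it does not run it. The file and directory names are placeholders, and `build_command` is a hypothetical helper rather than part of tcdemux.

```python
import shlex


def build_command(sample_data, read_directory, adaptors, outdir,
                  threads, mem_gb, dry_run=False):
    """Build a tcdemux argument list using only options from the usage text."""
    argv = ["tcdemux"]
    if dry_run:
        argv.append("-n")
    argv += [
        "--threads", str(threads),
        "--mem_gb", str(mem_gb),
        "--sample_data", sample_data,
        "--read_directory", read_directory,
        "--adaptors", *adaptors,  # one or more FASTA files
        "--outdir", outdir,
    ]
    return argv


# Placeholder paths; replace them with your own files.
argv = build_command(
    sample_data="sample_data.csv",
    read_directory="raw_reads",
    adaptors=["adaptors.fasta"],
    outdir="output",
    threads=10,
    mem_gb=16,
    dry_run=True,
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once the dry run looks right.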

Overview

With internal barcodes

Snakemake rulegraph

With only external barcodes

Snakemake rulegraph