Demultiplex files and prepare reads for the target capture analysis pipeline.
Demultiplexing by internal barcode is done with cutadapt.
The other steps are done with bbmap scripts.
Read pairing is maintained by processing the R1 and R2 files together. Singletons generated by adaptor trimming are also output to a file with unpaired in the name.
tcdemux is in bioconda. You can use the biocontainer hosted on quay.io with Docker or Apptainer/Singularity, e.g.:
# apptainer / singularity
apptainer exec \
docker://quay.io/biocontainers/tcdemux:0.0.17--pyhdfd78af_0 \
tcdemux
# Docker
docker pull \
quay.io/biocontainers/tcdemux:0.0.17--pyhdfd78af_0
You can also install it with conda, e.g.
mamba create --name tcdemux tcdemux
Manual installation is not supported, but if you need to do it, here are the steps:
1. Install bbmap and make sure it's in your path.
2. Install R with the packages data.table, bit64, ggplot2 and viridis.
3. Install tcdemux with python3 -m pip install git+git://github.com/tomharrop/tcdemux.git. pip will install the python3 dependencies biopython, cutadapt, pandas and snakemake.
tcdemux requires a sample_data file in csv format with the fields name, i5_index, i7_index, r1_file, and r2_file.
Provide the csv to tcdemux using the --sample_data argument.
Here's an example sample_data file:
i5_index,i7_index,name,r1_file,r2_file
AGCGCTAG,CCGCGGTT,sample1,sample1_r1.fastq,sample1_r2.fastq
GAACATAC,GCTTGTCA,sample2,sample2_r1.fastq,sample2_r2.fastq
tcdemux will process the sample1 and sample2 files separately, resulting in output files called sample1.r1.fastq.gz, sample1.r2.fastq.gz and sample1.unpaired.fastq.gz, and the equivalents for sample2.
tcdemux does not demultiplex the samples in this case.
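As a rough illustration of the layout described above, the following sketch reads a sample_data csv, checks that the required fields are present, and lists the output file names that the text says to expect. It is not part of tcdemux; the function names `read_sample_data` and `expected_outputs` are made up for this example.

```python
import csv
import io

# Required fields as listed in this README.
REQUIRED_FIELDS = {"name", "i5_index", "i7_index", "r1_file", "r2_file"}


def read_sample_data(handle):
    """Return the csv rows as dicts, or raise ValueError if a field is missing."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_FIELDS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"sample_data is missing fields: {sorted(missing)}")
    return list(reader)


def expected_outputs(name):
    """Output file names for one sample, following the pattern in the text."""
    return [f"{name}.{suffix}.fastq.gz" for suffix in ("r1", "r2", "unpaired")]


# The example sample_data file from this README.
example = io.StringIO(
    "i5_index,i7_index,name,r1_file,r2_file\n"
    "AGCGCTAG,CCGCGGTT,sample1,sample1_r1.fastq,sample1_r2.fastq\n"
    "GAACATAC,GCTTGTCA,sample2,sample2_r1.fastq,sample2_r2.fastq\n"
)

rows = read_sample_data(example)
for row in rows:
    print(row["name"], expected_outputs(row["name"]))
```

Checking the csv this way before a run can catch a misspelt column name early.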
If the sample_data file also has a pool_name field, tcdemux will demultiplex the pools by internal index sequence.
This also requires the internal_index_sequence field in the csv.
Here's an example sample_data file:
pool_name,i5_index,i7_index,name,internal_index_sequence,r1_file,r2_file
pool1,AGCGCTAG,CCGCGGTT,sample1,GTGACATC,pool_r1.fastq,pool_r2.fastq
pool1,AGCGCTAG,CCGCGGTT,sample2,ACTGGCTA,pool_r1.fastq,pool_r2.fastq
In this case, sample1 and sample2 are multiplexed in pool1 with internal barcodes.
tcdemux will demultiplex the pool before trimming and masking, resulting in the same files as above.
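To make the pooled layout concrete, here is a minimal sketch that groups the rows of a pooled sample_data csv by pool_name and maps each internal index sequence to its sample name. It is an illustration only and not how tcdemux itself is implemented. The function name `barcodes_by_pool` is hypothetical, and the check that an internal index is used only once per pool is an assumption about what a sensible input looks like.

```python
import csv
import io
from collections import defaultdict


def barcodes_by_pool(handle):
    """Map pool_name -> {internal_index_sequence: sample name}."""
    pools = defaultdict(dict)
    for row in csv.DictReader(handle):
        pool = pools[row["pool_name"]]
        barcode = row["internal_index_sequence"]
        # Assumption: each internal index identifies one sample in a pool.
        if barcode in pool:
            raise ValueError(
                f"{barcode} is used twice in pool {row['pool_name']}"
            )
        pool[barcode] = row["name"]
    return dict(pools)


# The pooled example sample_data file from this README.
example = io.StringIO(
    "pool_name,i5_index,i7_index,name,internal_index_sequence,r1_file,r2_file\n"
    "pool1,AGCGCTAG,CCGCGGTT,sample1,GTGACATC,pool_r1.fastq,pool_r2.fastq\n"
    "pool1,AGCGCTAG,CCGCGGTT,sample2,ACTGGCTA,pool_r1.fastq,pool_r2.fastq\n"
)

pools = barcodes_by_pool(example)
print(pools)
```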
Sample names will be checked for characters that are not uppercase or lowercase letters, digits, or underscores. The names will also be checked for double underscores. If a name contains a disallowed character or a double underscore, the pipeline will print a message and exit.
These characters cause issues for other software used in target capture analysis.
You can fix this by changing the names in the sample_data and running tcdemux again.
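The naming rule above can be checked before running the pipeline. This is a minimal sketch of the rule as stated in the text, not the check that tcdemux performs internally; the function name `name_problems` is made up for this example.

```python
import re

# Anything other than letters, digits and underscores is a problem.
DISALLOWED = re.compile(r"[^A-Za-z0-9_]")


def name_problems(name):
    """Return a list of reasons why a sample name would be rejected."""
    problems = []
    if DISALLOWED.search(name):
        problems.append("disallowed character")
    if "__" in name:
        problems.append("double underscore")
    return problems


for name in ["sample1", "sample-2", "sample__3"]:
    print(name, name_problems(name))
```

An empty list means the name passes both checks.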
tcdemux does not allow barcode errors.
External barcodes are checked for errors before trimming and masking, and reads with barcode errors are discarded.
Barcode errors are sometimes allowed in the Illumina workflow. You can check if your fastq files have barcode errors like this:
grep '^@' path/to/file.fastq \
| head -n 1000 \
| cut -d':' -f10 \
| sort \
| uniq -c
If you see more than one barcode, then barcode errors were allowed in the Illumina workflow.
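The same check can be sketched in Python. This version reads the header of each four-line fastq record instead of matching lines that start with `@`, so quality lines that happen to start with `@` are not counted. It assumes an uncompressed, four-line-per-record fastq file whose headers have the barcode in the tenth colon-separated field, as in the command above. The function name `count_barcodes` and the reads in the example are invented for illustration.

```python
import io
from collections import Counter
from itertools import islice


def count_barcodes(handle, max_reads=1000):
    """Count the barcode field in the first max_reads fastq headers."""
    # Every fourth line, starting from the first, is a header line.
    headers = islice(handle, 0, max_reads * 4, 4)
    # Field 10 of the colon-separated header, like cut -d':' -f10.
    # A header with fewer than ten fields raises IndexError.
    return Counter(h.rstrip("\n").split(":")[9] for h in headers)


# Three invented reads, one of them with a barcode error.
example = io.StringIO(
    "@A00123:45:HXXXXXXXX:1:1101:1000:2000 1:N:0:CCGCGGTT+AGCGCTAG\n"
    "ACGT\n+\n@FFF\n"
    "@A00123:45:HXXXXXXXX:1:1101:1001:2000 1:N:0:CCGCGGTT+AGCGCTAG\n"
    "ACGT\n+\nFFFF\n"
    "@A00123:45:HXXXXXXXX:1:1101:1002:2000 1:N:0:CCGCGGTA+AGCGCTAG\n"
    "ACGT\n+\nFFFF\n"
)

counts = count_barcodes(example)
for barcode, n in counts.most_common():
    print(n, barcode)
```

To use it on a real file, open the file in text mode and pass the handle, e.g. `count_barcodes(open("path/to/file.fastq"))`.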
tcdemux uses exact barcode matches with no errors allowed when it demultiplexes by internal barcode.
You also need to provide paths to the raw read directory and an output directory, and at least one adaptor file for trimming.
If you want to keep the intermediate files, pass the --keep_intermediate_files argument.
The pipeline uses 5 threads and about 8 GB of RAM per sample.
Provide multiples of these using the --threads and --mem_gb arguments.
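The arithmetic for the resource arguments can be sketched as follows. The per-sample figures are the ones given above. The number of samples to process at once and the file and directory names (`sample_data.csv`, `reads`, `adaptors.fa`, `output`) are placeholders chosen for this example.

```python
import shlex

# Per-sample requirements stated in this README.
THREADS_PER_SAMPLE = 5
MEM_GB_PER_SAMPLE = 8


def resources(parallel_samples):
    """Threads and RAM (GB) needed to process this many samples at once."""
    return (
        THREADS_PER_SAMPLE * parallel_samples,
        MEM_GB_PER_SAMPLE * parallel_samples,
    )


threads, mem_gb = resources(4)

# Placeholder paths; replace them with your own files and directories.
command = [
    "tcdemux",
    "--sample_data", "sample_data.csv",
    "--read_directory", "reads",
    "--adaptors", "adaptors.fa",
    "--outdir", "output",
    "--threads", str(threads),
    "--mem_gb", str(mem_gb),
]
print(shlex.join(command))
```

With four samples at a time this gives 20 threads and 32 GB of RAM.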
usage: tcdemux [-h] [-n] [--threads int] [--mem_gb int] [--restart_times RESTART_TIMES]
--sample_data SAMPLE_DATA_FILE --read_directory READ_DIRECTORY --adaptors
ADAPTOR_FILES [ADAPTOR_FILES ...] --outdir OUTDIR
[--keep_intermediate_files | --no-keep_intermediate_files]
options:
-h, --help show this help message and exit
-n Dry run
--threads int Number of threads.
--mem_gb int Amount of RAM in GB.
--restart_times RESTART_TIMES
number of times to restart failing jobs (default 0)
--sample_data SAMPLE_DATA_FILE
Sample csv (see README)
--read_directory READ_DIRECTORY
Directory containing the read files
--adaptors ADAPTOR_FILES [ADAPTOR_FILES ...]
FASTA file(s) of adaptors. Multiple adaptor files can be used.
--outdir OUTDIR Output directory
--keep_intermediate_files, --no-keep_intermediate_files