tophat-recondition is a post-processor for TopHat unmapped reads (contained in unmapped.bam), making them compatible with downstream tools (e.g., the Picard suite, samtools, GATK) (TopHat issue #17). It also works around bugs in TopHat:
This software was developed as part of a PhD research project in the laboratory of Lao H. Saal, Translational Oncogenomics Unit, Department of Oncology and Pathology, Lund University, Sweden.
A detailed description of the software can be found in Brueffer and Saal (2016).
TopHat-Recondition is available for installation with the conda package manager via the bioconda channel:

    conda install -c bioconda tophat-recondition
    usage: tophat-recondition.py [-h] [-l LOGFILE] [-m MAPPED_FILE] [-q]
                                 [-r RESULT_DIR] [-u UNMAPPED_FILE] [-v]
                                 tophat_result_dir

    Post-process TopHat unmapped reads. For detailed information on the issues
    this software corrects, please consult the software homepage:
    https://github.com/cbrueffer/tophat-recondition

    positional arguments:
      tophat_result_dir     directory containing TopHat mapped and unmapped read
                            files.

    optional arguments:
      -h, --help            show this help message and exit
      -l LOGFILE, --logfile LOGFILE
                            log file (optional, (default:
                            result_dir/tophat-recondition.log)
      -m MAPPED_FILE, --mapped-file MAPPED_FILE
                            Name of the file containing mapped reads (default:
                            accepted_hits.bam)
      -q, --quiet           quiet mode, no console output
      -r RESULT_DIR, --result_dir RESULT_DIR
                            directory to write unmapped_fixup.bam to (default:
                            tophat_output_dir)
      -u UNMAPPED_FILE, --unmapped-file UNMAPPED_FILE
                            Name of the file containing unmapped reads (default:
                            unmapped.bam)
      -v, --version         show program's version number and exit
Please make sure tophat_output_dir contains both the mapped file (default: accepted_hits.bam) and the unmapped file (default: unmapped.bam). The fixed reads are written to result_dir, in a file named after the unmapped file's stem with the suffix _fixup appended, e.g. unmapped_fixup.bam.
Note: The unmapped file is read into memory, so make sure your computer has enough RAM to fit it.
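The output naming convention described above can be illustrated with a short Python sketch. It is not taken from the script itself; the function name `fixup_path` and its parameters are hypothetical and only show how the stem and the `_fixup` suffix combine.

```python
import os


def fixup_path(unmapped_file="unmapped.bam", result_dir="."):
    """Return result_dir/<stem>_fixup<ext> for the given unmapped file.

    Hypothetical helper for illustration; not part of tophat-recondition.
    """
    # Split "unmapped.bam" into the stem "unmapped" and the extension ".bam".
    stem, ext = os.path.splitext(os.path.basename(unmapped_file))
    return os.path.join(result_dir, stem + "_fixup" + ext)


# With the default file name this yields "unmapped_fixup.bam" inside result_dir.
print(fixup_path("unmapped.bam", "tophat_out"))
```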
Specifically, the script does the following (see SAM format specification for details on the fields in capital letters):
Fixes wrong flags resulting from a bug in TopHat:
Removes /1 and /2 suffixes from read names (QNAME), if present.
Sets mapping quality (MAPQ) for unmapped reads to 0. TopHat sets it to 255, which some downstream tools do not handle well, even though it is a valid value according to the SAM specification.
If an unmapped read's mate is mapped, sets the following fields in the unmapped read (downstream tools like Picard AddOrReplaceReadGroups get confused by the values TopHat fills in for those fields):
For unmapped reads whose mapped mate is missing, unsets the mate-related flags, effectively making them unpaired. The following flags are unset:
Examples of error messages emitted by downstream tools when trying to process unmapped reads without some or all of these modifications can be found in this thread in the SEQanswers forum, which led to the development of this software.
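Two of the modifications listed above, removing the /1 and /2 read name suffixes and resetting MAPQ for unmapped reads, can be sketched in plain Python on a tab-separated SAM text record. This is an illustration under stated assumptions, not the script's actual implementation: the function names are hypothetical, the record is made up, and the field positions (QNAME first, FLAG second, MAPQ fifth) and the unmapped flag bit 0x4 follow the SAM format specification.

```python
import re

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def strip_mate_suffix(qname):
    """Remove a trailing /1 or /2 from a read name, if present."""
    return re.sub(r"/[12]$", "", qname)


def recondition_line(sam_line):
    """Apply the QNAME and MAPQ fixes to one SAM text record.

    Hypothetical helper for illustration; not part of tophat-recondition.
    """
    fields = sam_line.rstrip("\n").split("\t")
    fields[0] = strip_mate_suffix(fields[0])  # QNAME
    if int(fields[1]) & UNMAPPED:             # FLAG has the unmapped bit set
        fields[4] = "0"                       # MAPQ: 255 -> 0
    return "\t".join(fields)


# Made-up unmapped record: FLAG 69 = paired (1) + unmapped (4) + first in pair (64).
record = "read7/1\t69\t*\t0\t255\t*\t*\t0\t0\tACGT\tIIII"
print(recondition_line(record))
```

Mapped records keep their MAPQ in this sketch; only the read name suffix is stripped from them.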
If you use this software in your research and would like to cite it, please use the citation information in the CITATION file.