Mark PCR duplicates and sequencing platform / optical duplicates in single-end SAM files
Picard has a tool called MarkDuplicates for marking duplicates in SAM files, but it can only mark optical duplicates for paired-end reads. This script is written in Python and is a very fast program for doing the same for single-end reads.
PyPy is strongly recommended for its performance improvement over standard Python. It's several orders of magnitude faster for this program.
The input and output of the program are uncompressed SAM files in fastq coordinates. The program could easily be extended to add support for compression and decompression (though full BAM support would be more complicated), but if you have space concerns I suggest compression on a file-system basis.
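As a rough illustration of the compression extension mentioned above, the sketch below opens a SAM file transparently whether or not it is gzip-compressed. The helper name `open_sam` and the `.gz` suffix convention are assumptions for illustration and are not part of mark_duplicates.py; the code also assumes a Python 3 interpreter (for example PyPy3).

```python
import gzip


def open_sam(path, mode="r"):
    """Open a plain or gzip-compressed SAM file in text mode.

    Hypothetical helper: compression is inferred from a '.gz' suffix.
    """
    if path.endswith(".gz"):
        # gzip.open needs an explicit 't' to return str instead of bytes.
        return gzip.open(path, mode + "t")
    return open(path, mode)
```

With a helper like this, the rest of the program could keep reading and writing lines as it does for uncompressed files.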
To run the program:
pypy mark_duplicates.py input.sam output.sam pixels
Note that pixels is a required argument. It is a radius (in qseq pixel coordinates) within which reads will be marked as optical duplicates.
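To make the role of the radius concrete, here is a minimal sketch of a distance test between two reads. It assumes that each read's lane, tile, and x/y cluster coordinates have already been parsed (for example from the read name) and that distance is Euclidean; the actual script may extract coordinates or measure distance differently. The names `Cluster` and `is_optical_duplicate` are hypothetical.

```python
from collections import namedtuple

# Hypothetical record: lane and tile identify the imaged area,
# x and y are the cluster's pixel coordinates on that tile.
Cluster = namedtuple("Cluster", ["lane", "tile", "x", "y"])


def is_optical_duplicate(a, b, pixels):
    """Return True if two clusters lie on the same tile within `pixels` of each other."""
    if (a.lane, a.tile) != (b.lane, b.tile):
        return False
    dx = a.x - b.x
    dy = a.y - b.y
    # Compare squared distances to avoid computing a square root.
    return dx * dx + dy * dy <= pixels * pixels
```

Under these assumptions, two clusters on the same tile at (1000, 2000) and (1003, 2004) are exactly 5 pixels apart, so they would count as optical duplicates with a radius of 5 but not with a radius of 4.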
Similar to the Picard tool, this program marks duplicates with the following notation:
DT:Z:SQ for optical duplicates
DT:Z:LB for PCR duplicates
The duplicate flag is also set. Note that any candidate optical duplicate is also detected as a PCR duplicate but only marked as an optical duplicate. If your analysis requires removing one but not the other you should take this into account.
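The marking described above can be sketched as follows. This is an illustration of the notation rather than the script's own code: it assumes a tab-separated SAM alignment line, sets the duplicate bit (0x400 in the SAM FLAG field), and appends the DT tag. The function name `mark_duplicate` is hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"


def mark_duplicate(sam_line, kind):
    """Return sam_line with the duplicate flag set and a DT tag appended.

    kind is "SQ" (optical duplicate) or "LB" (PCR duplicate).
    """
    fields = sam_line.rstrip("\n").split("\t")
    # FLAG is the second column of an alignment line.
    fields[1] = str(int(fields[1]) | DUPLICATE_FLAG)
    fields.append("DT:Z:" + kind)
    return "\t".join(fields) + "\n"
```

For a read with FLAG 16 (reverse strand), marking it as an optical duplicate yields FLAG 1040 and a trailing DT:Z:SQ field.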
If you want either of these duplicates removed instead, this can be accomplished by changing only one line of the program. (Readability and making the program easy to change were major goals.)
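If you would rather not edit the script, filtering its output afterwards is another option. The sketch below is not the one-line change referred to above; it assumes the DT:Z:SQ and DT:Z:LB tags used by this program and drops alignment records carrying any of the chosen tags. The function name `drop_tagged` is hypothetical.

```python
def drop_tagged(lines, tags=("DT:Z:SQ", "DT:Z:LB")):
    """Yield SAM lines, skipping alignment records that carry any tag in `tags`."""
    for line in lines:
        if line.startswith("@"):
            # Header lines pass through untouched.
            yield line
            continue
        # Optional fields follow the 11 mandatory SAM columns.
        optional = line.rstrip("\n").split("\t")[11:]
        if not any(tag in optional for tag in tags):
            yield line
```

Passing `tags=("DT:Z:SQ",)` would remove only the reads marked as optical duplicates and keep those marked as PCR duplicates. Because candidate optical duplicates are marked only as optical, removing all PCR duplicates requires both tags.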
This program was written to be used for a specific biology paper, and it makes the following assumptions, which may or may not be safe for your project.
If n reads are detected to be duplicates, only the n-1 lower sequence quality reads will be marked as duplicates. The best read will remain unmarked.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.