alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars, 506 forks

bamRemoveDuplicatesType - explanation not clear #805

Open gevro opened 4 years ago

gevro commented 4 years ago

Hi, the manual describes bamRemoveDuplicatesType as: "string: mark duplicates in the BAM file, for now only works with (i) sorted BAM fed with inputBAMfile, and (ii) for paired-end alignments only"

Do both (i) and (ii) need to be true, or is either one sufficient?

alexdobin commented 4 years ago

Hi @gevro

Both need to be true: the algorithm is designed for PE reads, and it needs a sorted BAM file as input. Duplicate removal works by comparing the alignments of different reads and removing the identical ones. This would not work for SE reads, because it would over-collapse reads in highly expressed loci.

Cheers, Alex
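
A minimal sketch of what such a run could look like, written as a Python helper that only assembles the command line (it does not execute STAR). The function name `build_star_dedup_command` and the file names are hypothetical. The option names (`--runMode inputAlignmentsFromBAM`, `--inputBAMfile`, `--bamRemoveDuplicatesType`) and the two values `UniqueIdentical` and `UniqueIdenticalNotMulti` are taken from the STAR manual as I understand it, so please check them against the manual for your STAR version.

```python
def build_star_dedup_command(sorted_bam, out_prefix, dup_type="UniqueIdentical"):
    """Assemble (but do not run) a STAR command that marks duplicates.

    Assumptions: `sorted_bam` is a coordinate-sorted BAM of paired-end
    alignments, as required in the reply above. Option names should be
    verified against the STAR manual for the installed version.
    """
    allowed = {"UniqueIdentical", "UniqueIdenticalNotMulti"}
    if dup_type not in allowed:
        raise ValueError(f"dup_type must be one of {sorted(allowed)}")
    return [
        "STAR",
        "--runMode", "inputAlignmentsFromBAM",   # process an existing BAM
        "--inputBAMfile", sorted_bam,             # condition (i): sorted BAM
        "--bamRemoveDuplicatesType", dup_type,    # condition (ii): PE only
        "--outFileNamePrefix", out_prefix,
    ]


if __name__ == "__main__":
    cmd = build_star_dedup_command("Aligned.sortedByCoord.out.bam", "dedup_")
    print(" ".join(cmd))
```

The list form can be passed to `subprocess.run` once the options have been confirmed for the local installation.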

iaaka commented 2 years ago

Hi @alexdobin, a related question: does STARsolo somehow mark reads that were collapsed to the same UMI? I mean that all but one of the collapsed reads should be considered duplicates. As far as I can see, the 0x400 flag is never set in STARsolo output, and the description of the "bamRemoveDuplicatesType" parameter suggests that it is not applicable to single-cell data (cell barcodes and UMIs are not mentioned). Is there any way to get this information? Cell Ranger seems to use this flag.
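
For reference, a small sketch of how the 0x400 bit can be tested on a SAM FLAG value. In the SAM specification this bit marks a read as a PCR or optical duplicate. The helper name `is_marked_duplicate` is hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"


def is_marked_duplicate(flag):
    """Return True if the duplicate bit is set in a SAM FLAG integer."""
    return bool(flag & DUPLICATE_FLAG)


if __name__ == "__main__":
    # 99 is a typical properly paired first-mate flag; 99 + 1024 adds the duplicate bit.
    for flag in (99, 99 + 1024):
        print(flag, is_marked_duplicate(flag))
```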

alexdobin commented 2 years ago

Hi Pasha,

Presently, STARsolo does not output the "duplication" flag. There have been a few questions about it, so I will implement this feature in the future. Note that there is no principled way to select a representative read from a collection of reads with the same CB/UB/GX tags, so I am not sure how useful such "deduplication" is.

iaaka commented 2 years ago

Thank you! Surely it makes sense only if the purpose the reads are used for is invariant across all reads from the same CB/UB/GX group (that is tricky, I agree). I'm trying to quantify alternative splicing events and want to figure out how to collapse reads to UMIs in this case. For now I collapse them by chr-start-cigar-cb-ub, which is probably enough.
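
A minimal sketch of the chr-start-cigar-cb-ub collapsing described above, assuming each read has already been parsed into a plain dictionary. The field names (`chrom`, `start`, `cigar`, `CB`, `UB`) and the function name `collapse_reads` are hypothetical, and keeping the first read seen per key is an arbitrary choice.

```python
def collapse_reads(reads):
    """Keep one read per (chrom, start, cigar, CB, UB) key.

    The first read encountered for each key is kept; later reads with
    the same key are treated as duplicates and dropped.
    """
    representatives = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["cigar"], read["CB"], read["UB"])
        representatives.setdefault(key, read)  # only stores the first read per key
    return list(representatives.values())


if __name__ == "__main__":
    reads = [
        {"chrom": "chr1", "start": 100, "cigar": "50M200N50M", "CB": "AAAC", "UB": "GGT"},
        {"chrom": "chr1", "start": 100, "cigar": "50M200N50M", "CB": "AAAC", "UB": "GGT"},
        {"chrom": "chr1", "start": 100, "cigar": "100M", "CB": "AAAC", "UB": "GGT"},
    ]
    for read in collapse_reads(reads):
        print(read["start"], read["cigar"])
```

Because the CIGAR string is part of the key, reads from the same UMI with different splicing patterns are kept as separate entries rather than merged.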

alexdobin commented 2 years ago

Hi Pasha,

Reads with the same UMI/CB can map to different splice junctions (or, more generally, to different portions of a gene). If you only consider one "representative" read, you may lose some of the junctions covered by the other reads. If you are interested in splice junctions, STARsolo can output counts of UMIs overlapping junctions with --soloFeatures SJ.
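
To illustrate the point about losing junctions, here is a rough sketch that extracts junctions from CIGAR strings and counts distinct CB/UB pairs per junction. This is not STARsolo's implementation, only an illustration of the idea. The read dictionaries, field names, and function names are hypothetical; positions are assumed to be 1-based leftmost alignment coordinates, as in SAM.

```python
import re
from collections import defaultdict

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions(start, cigar):
    """Return (intron_start, intron_end) for every N operation, 1-based inclusive."""
    pos = start
    found = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            found.append((pos, pos + length - 1))
        if op in "MDN=X":  # operations that consume the reference
            pos += length
    return found


def umi_counts_per_junction(reads):
    """Count distinct (CB, UB) pairs overlapping each junction."""
    umis = defaultdict(set)
    for read in reads:
        for intron_start, intron_end in junctions(read["start"], read["cigar"]):
            umis[(read["chrom"], intron_start, intron_end)].add((read["CB"], read["UB"]))
    return {junction: len(pairs) for junction, pairs in umis.items()}


if __name__ == "__main__":
    reads = [
        # Two reads from the same molecule (same CB/UB) covering different junctions.
        {"chrom": "chr1", "start": 100, "cigar": "50M200N50M", "CB": "AAAC", "UB": "GGT"},
        {"chrom": "chr1", "start": 380, "cigar": "20M100N80M", "CB": "AAAC", "UB": "GGT"},
    ]
    for junction, count in umi_counts_per_junction(reads).items():
        print(junction, count)
```

In this toy input, keeping only one of the two reads as the representative would report a single junction, while counting per junction keeps both, each supported by one UMI.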