GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License
225 stars 30 forks

Support for UMIs #20

Open mjafin opened 8 years ago

mjafin commented 8 years ago

Hi there,

Thanks for samblaster, it's such a great tool. I wonder if you have considered supporting unique molecular identifiers at all? bcl2fastq supports adding them to the read name. So if a standard read name was:

@M01277:231:000000000-A8KH6:1:1101:21912:1442 1:N:0:0

then it might look like this when an ID is added:

@M01277:231:000000000-A8KH6:1:1101:21912:1442:TTTCCT 1:N:0:0

If there are two barcodes, they are separated by a plus, e.g.

@M01277:231:000000000-A8KH6:1:1101:21912:1442:TTTCCT+AACCTT 1:N:0:0

(I think so, at least. I only have data from a one-barcode source.)

The only additional condition would be to check, as part of the read signature, that putative duplicates share the same ID.
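To make the request concrete, here is a minimal Python sketch of the idea. It is not samblaster's code: the `split_umi` and `mark_duplicates` helpers and the `Signature` type are hypothetical, the signature is reduced to a single-end (chromosome, position, strand) triple, and it assumes a standard Illumina read name has 7 colon-separated fields with the UMI appended as an 8th.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


def split_umi(read_name: str) -> Tuple[str, Optional[str]]:
    """Split a read name into (base name, UMI); UMI is None if absent.

    Assumption: 7 colon-separated fields is a plain Illumina name and an
    8th field made of A/C/G/T/N (or two barcodes joined by '+') is the UMI.
    """
    name = read_name.lstrip("@").split()[0]  # drop '@' and the '1:N:0:0' part
    fields = name.split(":")
    last = fields[-1]
    if len(fields) == 8 and last and set(last) <= set("ACGTN+"):
        return ":".join(fields[:-1]), last
    return name, None


class Signature(NamedTuple):
    """Simplified, hypothetical duplicate signature with the UMI added."""
    chrom: str
    pos: int
    strand: str
    umi: Optional[str]


def mark_duplicates(reads: Iterable[Tuple[str, str, int, str]]) -> List[bool]:
    """Return one flag per read: True if an earlier read has the same signature."""
    seen = set()
    flags = []
    for name, chrom, pos, strand in reads:
        _, umi = split_umi(name)
        sig = Signature(chrom, pos, strand, umi)
        flags.append(sig in seen)  # duplicate only if position AND UMI match
        seen.add(sig)
    return flags
```

With three reads at the same position, where the first two carry TTTCCT and the third carries AACCTT, only the second read would be flagged, because the third differs in its UMI.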

Nugen have an algorithm here https://github.com/nugentechnologies/nudup/blob/master/nudup.py but it has so much overhead that it would be great if samblaster supported the IDs directly.

mmterpstra commented 8 years ago

Hi @mjafin, this might be off-topic, but I have DigitalReadGroups, which integrates the random barcodes into the read name and, after sorting and adding read groups, splits the reads into different read groups (for each sample) based on this insertion in the header. Read more about it on DigitalBarcodeReadgroups. Maybe you'll find some helpful suggestions.

Also, may I ask how you run bcl2fastq and configure your sample sheet? I might consider supporting your use case as well.

Best, MM Terpstra

mjafin commented 8 years ago

Thanks @mmterpstra, I'll look into that.

Actually, bcl2fastq doesn't support our case out of the box. It supports UMIs that are at the start of the actual reads and puts those into the read name. It doesn't handle UMIs that are in the index reads well. We can get the UMIs out of the index reads, but then we have the _1 and _3 fastq files for the actual data and the _2 fastq file for the UMIs (if using a single UMI).
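One way to get from that three-file layout to the read-name convention described at the top of the thread is to copy each UMI from the _2 file into the matching read name. The sketch below is only an illustration under stated assumptions: `fastq_records` and `tag_reads_with_umi` are hypothetical helpers, the files are assumed to be in the same read order, and the handles are assumed to be already opened as text (e.g. via `gzip.open(path, "rt")`).

```python
from itertools import islice
from typing import Iterator, List, TextIO


def fastq_records(handle: TextIO) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of 4 lines (header, sequence, '+', qualities)."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:  # end of file (or a truncated final record)
            return
        yield record


def tag_reads_with_umi(reads: TextIO, umis: TextIO) -> Iterator[List[str]]:
    """Append the UMI sequence from the index-read file to each read name."""
    for read, umi in zip(fastq_records(reads), fastq_records(umis)):
        name, _, comment = read[0].partition(" ")
        umi_name = umi[0].split(" ")[0]
        if name != umi_name:
            # Assumption: both files list reads in the same order.
            raise ValueError(f"read name mismatch: {name} vs {umi_name}")
        header = f"{name}:{umi[1]}"
        if comment:
            header += f" {comment}"  # keep the '1:N:0:0' part unchanged
        yield [header, read[1], read[2], read[3]]
```

Running this once for the _1 file and once for the _3 file (each paired with _2) would give both mates the same UMI suffix in their names.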

I asked Illumina if they could start supporting UMIs in indices out of the box but they just said that they "can't provide a timeline for such a request". Bummer.

mmterpstra commented 8 years ago

OK, same experience here. The README.md of DigitalBarcodeReadgroups describes that as well. I hope my notes are readable.

Also note that I trim by probe alignment location instead of by probe sequence: when the read still aligns to the reference after the probe sequence, it is likely to be real sequence rather than artificial sequence. For now it only works with single-end sequencing (RNA/DNA). I hope to add paired-end support soonish.

egafni commented 6 years ago

+1

bwlang commented 6 years ago

+1. UMIs can easily be added to the BAM stream just before samblaster:

# The first sambamba step could be skipped if samblaster could take BAM input.
# -h keeps the SAM header, which samblaster needs; -o names the final BAM file.
seqtk mergepe L2_Sigma_Set2_567_733UID.1.fastq.gz L2_Sigma_Set2_567_733UID.3.fastq.gz | \
 bwa mem -p -t 4 -R "@RG\tID:test\tSM:test" genome.fa \
   /dev/stdin 2> test.log.bwamem | \
 fgbio AnnotateBamWithUmis \
   -i /dev/stdin -f test.2.fastq.gz -o /dev/stdout 2> test.annotate_bam.log | \
 sambamba view -t 2 -l 0 -h -f sam /dev/stdin | \
 samblaster 2> test.dedup_log | \
 sambamba view -t 2 -S -f bam -o test.bam /dev/stdin

dawe commented 5 years ago

+1. UMIs and cellular barcodes may be part of a SAM file as custom tags (such as CB, UB and so on). In that case it may be easy to use the tags to build the read hash used to mark duplicates. More generally, it would be useful if one could specify a set of tags to be used in duplicate detection.
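A rough Python sketch of what a tag-aware key could look like follows. It is an illustration only, not samblaster's hashing: `sam_tags` and `duplicate_key` are hypothetical names, the default tag set of CB and UB is just an example, and the key uses the leftmost POS from the SAM record rather than the clipping-adjusted 5' position a real duplicate marker would use.

```python
from typing import Dict, Optional, Sequence, Tuple


def sam_tags(sam_line: str) -> Dict[str, str]:
    """Return the optional TAG:TYPE:VALUE fields of a SAM record as {TAG: VALUE}."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 SAM columns are mandatory
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def duplicate_key(
    sam_line: str, tag_names: Sequence[str] = ("CB", "UB")
) -> Tuple[Optional[object], ...]:
    """Build a hashable key from position, strand and user-selected tags."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    strand = "-" if flag & 0x10 else "+"  # 0x10 = read is reverse-complemented
    tags = sam_tags(sam_line)
    # Missing tags become None, so untagged reads still compare by position.
    tag_values = tuple(tags.get(name) for name in tag_names)
    return (fields[2], int(fields[3]), strand) + tag_values
```

Two records at the same position then count as duplicate candidates only if every selected tag also matches, and passing an empty `tag_names` falls back to position and strand alone.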