Closed worker000000 closed 3 years ago
umi-tools get 8 times variants thanumi-tools
Not sure I follow that? Which one produces more?
Differences between fgbio will depend on how you've used each tool.
fgbio GroupReadsByUMI
is similar to umi-tools group
- that is, it doesn't do any deduplication - it just assigns a tag to all reads that are judged to have come from the same template. As far as I'm aware, fgbio doesn't have a tool for actaully deduplicating the BAM file. It does have CallMolecularConsensusReads
, which creates a consensus sequence for reads in the same UMI group. UMI-tools dedup
doesn't do this - instead, for each group, it selects the highest quality read from the group and outputs that as a representative.
fgbio and umi-tools have different default UMI deduplication algorithms (at least in their default configurations). UMI-tools uses the directional deduplication algorithm that corrects for PCR/sequencing errors in UMI sequences. This algorithm can be selected in fgbio, but I don't believe it is the default. Even where the same algorithm is selected, we've never actaully benchmarked if the fgbio implementation produces the same results as the UMI-tools implementation.
@IanSudbery Thanks a lot for your comprehensive reply.
with vardict caller.
for a target sequencing of 100k panel, umitool get 1105 get variants, fgbio get 257 varians
#
for a target sequencing of 2000k panel, umitool get nearly 80000 variants, fgbio get nearly 10000 varians
with fgbio, it does not specify the deduplication algorithm that corrects for PCR/sequencing errors in UMI sequences. # http://fulcrumgenomics.github.io/fgbio/tools/latest/GroupReadsByUmi.html
Dear @IanSudbery , do you have some time to have a look at this, I have tested about 10 samples(target panel dna sequencing of ngs tumor paired sample), is there any inner reason of this, thanks a lot
1 I am doing somatic calling of paired samples of PE sequencing, I have Umi in both reads,
2 I call variants after do consensus 3 there are so many variants here, I do not have truth dataset, but both tools may have done this before I guess. 4 strategy are you using in GroupReadsByUmi java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/fgbio-1.1.0.jar GroupReadsByUmi --input=K123.UMI.merged.bam --output=K123.UMI.group.bam --strategy=paired --min-map-q=30 --edits=1 --raw-tag=RX
5 umitools do like such umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I T_out.R1_TMP.fq.gz -S C_out.R1_TMP_umitools.fq.gz --read2-in=T_out.R2_TMP.fq.gz --read2-out=C_out.R2_TMP_umitools.fq.gz
umi_tools dedup -I ${dir1}_T.sorted.bam --output-stats=deduplicated -S ${dir1}_T_deduplicated.bam
Hi @worker000000 - Are you running CallMolecluarConsensusReads after GroupReadsByUmi for the fgbio pipeline? Because otherwise GroupReadsByUmi doesn't actaully do any deduplication. If you are running CallMoleclularConsensusReads then there is a difference here in these two pipelines - fgbio is calling consensus reads, wherease UMI-tools is selecting a representative reads.
The other difference is that fgbio does quite a large amount of filtering for read quality. UMI-tools doesn't do that, leaving it to the user to do the filtering as they see fit.
Thanks a lot. @IanSudbery the following command is a full command of fgbio we used
we run CallMolecluarConsensusReads after GroupReadsByUmi for the fgbio pipeline the filter method do as the following a representative reads and a consensus reads? does the representative not come from consensus?
I am not understanding the difference between
java -Djava.io.tmpdir=TMP -jar bin/picard.jar **FastqToSam** F1=WQ124/WQ124_1.fq.gz F2=WQ124/WQ124_2.fq.gz O=WQ124/WQ124.ubam READ_GROUP_NAME=WQ124 SAMPLE_NAME=WQ124 LIBRARY_NAME=WQ124 PLATFORM_UNIT=HiseqX110 PLATFORM=illumina
java -Djava.io.tmpdir=TMP -jar bin/fgbio-1.1.0.jar **ExtractUmisFromBam** --input=WQ124/WQ124.ubam --output=WQ124/WQ124.UMI.ubam --read-structure=1S3M1S+T 1S3M1S+T --single-tag=RX --molecular-index-tags=ZA ZB
bin/samtools **fastq** WQ124/WQ124.UMI.ubam | bin/bwa **mem** -t 4 -p database/bwa/hg19_chr.fa /dev/stdin | bin/samtools **view** -b >WQ124/WQ124.UMI.bam && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/picard.jar **MergeBamAlignment** ORIENTATIONS=FR R=database/bwa/hg19_chr.fa UNMAPPED_BAM=WQ124/WQ124.UMI.ubam ALIGNED_BAM=WQ124/WQ124.UMI.bam O=WQ124/WQ124.UMI.merged.bam CREATE_INDEX=true MAX_GAPS=-1 ALIGNER_PROPER_PAIR_FLAGS=true VALIDATION_STRINGENCY=SILENT SO=coordinate ATTRIBUTES_TO_RETAIN=XS
java -Djava.io.tmpdir=TMP -jar bin/fgbio-1.1.0.jar **GroupReadsByUmi** --input=WQ124/WQ124.UMI.merged.bam --output=WQ124/WQ124.UMI.group.bam --strategy=paired --min-map-q=30 --edits=1 --raw-tag=RX && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/fgbio-1.1.0.jar **CallMolecularConsensusReads** --input=WQ124/WQ124.UMI.group.bam --output=WQ124/WQ124.UMI.con.ubam -M 1 --min-input-base-quality=20 && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/fgbio-1.1.0.jar **FilterConsensusReads** --input=WQ124/WQ124.UMI.con.ubam --output=WQ124/WQ124.UMI.con.filter.ubam --ref=database/bwa/hg19_chr.fa --min-reads=2 1 1 --max-read-error-rate=0.05 --max-base-error-rate=0.1 --min-base-quality=20 --max-no-call-fraction=0.1 && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/picard.jar **SamToFastq** I=WQ124/WQ124.UMI.con.filter.ubam F=WQ124/WQ124.UMI.con.filter.fq INTERLEAVE=true && bin/bwa mem -t 4 -p database/bwa/hg19_chr.fa WQ124/WQ124.UMI.con.filter.fq | bin/samtools **view** -b >WQ124/WQ124.UMI.con.filter.bam && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/picard.jar **SortSam** O=WQ124/WQ124.UMI.con.filter.sort.ubam I=WQ124/WQ124.UMI.con.filter.ubam SO=queryname && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/picard.jar SortSam O=WQ124/WQ124.UMI.con.filter.sort.bam I=WQ124/WQ124.UMI.con.filter.bam SO=queryname
java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/picard.jar **MergeBamAlignment** ORIENTATIONS=FR UNMAPPED=WQ124/WQ124.UMI.con.filter.sort.ubam ALIGNED=WQ124/WQ124.UMI.con.filter.sort.bam Oalign/WQ124.UMI.con.bam R=database/bwa/hg19_chr.fa SO=coordinate ALIGNER_PROPER_PAIR_FLAGS=true MAX_GAPS=-1 VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true ATTRIBUTES_TO_RETAIN=XS`
java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/fgbio-1.1.0.jar **ClipBam** --input=align/WQ124.UMI.con.bam --output=align/WQ124._**UMI.con.clipped.bam**_ --ref=database/bwa/hg19_chr.fa --clip-overlapping-reads=true
UMI.con.clipped.bam is the bam for vardict variant calling
The primary difference between your UMI-tools pipeline and the fgbio pipeline here is that fgbio computes a consensus read for each UMI group, while UMI-tools selects a representative read. Thus the number of reads going into the vardict variant calling should be similar (or the same if fgbio really have correctly implemented the same alogrithmn as us), but the sequences of those reads may well be different.
@IanSudbery thanks a lot but I still do not know the difference of a representative read and a consensus read
but the sequences of those reads may well be different.
does here mean very different? why this happen, which one is more accurate?
When UMI-tools find a group of reads it believes come from the same original molecule, it selects one of those reads (with the highest qualities and the lowest number of mapping locations) to output and discards the rest.
When CallMoecularConsensusReads finds a group of reads it believes come from the same original molecule, it aligns all the reads together, looks at each position, and creates a consensus call. I don't know the precise algorithm used, but for example, if 10 of the reads had A at position one, 5 T and 2 C, the consensus read would have A. I suspect the actual algorithmn used is more sophisticated than that.
thanks a lot@ IanSudbery so umitools does not make the consensus reads, does this may not make full use of UMI, because the origin purpose of umi is to get the consensus reads, I just suspect, different tools may have different design
I'm not sure about "original purpose": UMIs were first proposed as a method of distinguishing PCR duplicates in RNA quantification in 2003 (Hug and Schuler, 2003), and as far as I can tell, the first use of them in high-throughput sequencing was iCLIP (Koing et al 2010), rapidly followed by uses in RNA-seq, particularly single cell RNAseq, ChIP-seq, ChIP-exo, 4C. In these applications the sequence of the reads is not important, only their alignment location and number. It was these applications we had in mind when we first created UMI-tools (indeed, the original prototype was created to analyze an iCLIP dataset).
The use of UMIs for constructing molecular consensus reads in amplicon sequencing didn't come until later, and the fgbio tools for UMIs were created with this application in mind.
As for which is most accurate - we've never really done any comparisons. fgbio says that they implement the directional adjacency approach, but we've never checked how similar the results are (I'm not sure many fgbio users use that mode). In theory calling consensus reads should be more accurate than just selecting one read, but it does depend on how the consensus is created, and its not clear to me how that works in fgbio.
The other big difference is that fgbio does a lot of filtering automatically, where as we leave any filtering to the user. If I had to guess I'd guess that this filtering is more likely to be causing the excess of variants in the UMI-tools data than the consensus read calling.
@IanSudbery Thanks a lot, I tried to find many filters in umi_tools, but just find such --edit-distance-threshold --mapping-quality=MAPPING_QUALITY Minimum mapping quality for a read to be retained [default=0] --unmapped-reads=UNMAPPED_READS How to handle unmapped reads. Options are 'discard', 'use' or 'correct' [default=discard] --chimeric-pairs=CHIMERIC_PAIRS How to handle chimeric read pairs. Options are 'discard', 'use' or 'correct' [default=use] --unpaired-reads=UNPAIRED_READS
We don't really implement filters - we take the Unix philosophy of doing one thing. You'll need to use picard or samtools.
We don't really implement filters - we take the Unix philosophy of doing one thing. You'll need to use picard or samtools.
thanks a lot for your kind help
I treat data and fgbio and umi-tools, and called variant with vardict, umi-tools get 8 times variants thanumi-tools, is there any inner reason, they both dealed with R1 and R2 umis
# thanks a lot