CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
481 stars 190 forks source link

fgbio and umi-tools, and called variant with vardict, umi-tools get 8 times variants than fgbio #478

Closed worker000000 closed 3 years ago

worker000000 commented 3 years ago

I treat data and fgbio and umi-tools, and called variant with vardict, umi-tools get 8 times variants thanumi-tools, is there any inner reason, they both dealed with R1 and R2 umis

# thanks a lot

IanSudbery commented 3 years ago

umi-tools get 8 times variants thanumi-tools

Not sure I follow that? Which one produces more?

Differences between fgbio will depend on how you've used each tool.

fgbio GroupReadsByUMI is similar to umi-tools group - that is, it doesn't do any deduplication - it just assigns a tag to all reads that are judged to have come from the same template. As far as I'm aware, fgbio doesn't have a tool for actaully deduplicating the BAM file. It does have CallMolecularConsensusReads, which creates a consensus sequence for reads in the same UMI group. UMI-tools dedup doesn't do this - instead, for each group, it selects the highest quality read from the group and outputs that as a representative.

fgbio and umi-tools have different default UMI deduplication algorithms (at least in their default configurations). UMI-tools uses the directional deduplication algorithm that corrects for PCR/sequencing errors in UMI sequences. This algorithm can be selected in fgbio, but I don't believe it is the default. Even where the same algorithm is selected, we've never actaully benchmarked if the fgbio implementation produces the same results as the UMI-tools implementation.

worker000000 commented 3 years ago

@IanSudbery Thanks a lot for your comprehensive reply.

with vardict caller.

for a target sequencing of 100k panel, umitool get 1105 get variants, fgbio get 257 varians

#

for a target sequencing of 2000k panel, umitool get nearly 80000 variants, fgbio get nearly 10000 varians

with fgbio, it does not specify the deduplication algorithm that corrects for PCR/sequencing errors in UMI sequences. # http://fulcrumgenomics.github.io/fgbio/tools/latest/GroupReadsByUmi.html

worker000000 commented 3 years ago

Dear @IanSudbery , do you have some time to have a look at this, I have tested about 10 samples(target panel dna sequencing of ngs tumor paired sample), is there any inner reason of this, thanks a lot

worker000000 commented 3 years ago

1 I am doing somatic calling of paired samples of PE sequencing, I have Umi in both reads,

2 I call variants after do consensus 3 there are so many variants here, I do not have truth dataset, but both tools may have done this before I guess. 4 strategy are you using in GroupReadsByUmi java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/fgbio-1.1.0.jar GroupReadsByUmi --input=K123.UMI.merged.bam --output=K123.UMI.group.bam --strategy=paired --min-map-q=30 --edits=1 --raw-tag=RX

5 umitools do like such umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I T_out.R1_TMP.fq.gz -S C_out.R1_TMP_umitools.fq.gz --read2-in=T_out.R2_TMP.fq.gz --read2-out=C_out.R2_TMP_umitools.fq.gz

then do bwa and samtools

umi_tools dedup -I ${dir1}_T.sorted.bam --output-stats=deduplicated -S ${dir1}_T_deduplicated.bam

IanSudbery commented 3 years ago

Hi @worker000000 - Are you running CallMolecluarConsensusReads after GroupReadsByUmi for the fgbio pipeline? Because otherwise GroupReadsByUmi doesn't actaully do any deduplication. If you are running CallMoleclularConsensusReads then there is a difference here in these two pipelines - fgbio is calling consensus reads, wherease UMI-tools is selecting a representative reads.

The other difference is that fgbio does quite a large amount of filtering for read quality. UMI-tools doesn't do that, leaving it to the user to do the filtering as they see fit.

worker000000 commented 3 years ago

Thanks a lot. @IanSudbery the following command is a full command of fgbio we used

we run CallMolecluarConsensusReads after GroupReadsByUmi for the fgbio pipeline the filter method do as the following a representative reads and a consensus reads? does the representative not come from consensus?

I am not understanding the difference between

java -Djava.io.tmpdir=TMP -jar bin/picard.jar **FastqToSam**    F1=WQ124/WQ124_1.fq.gz  F2=WQ124/WQ124_2.fq.gz  O=WQ124/WQ124.ubam  READ_GROUP_NAME=WQ124   SAMPLE_NAME=WQ124   LIBRARY_NAME=WQ124  PLATFORM_UNIT=HiseqX110  PLATFORM=illumina
java -Djava.io.tmpdir=TMP -jar bin/fgbio-1.1.0.jar **ExtractUmisFromBam**     --input=WQ124/WQ124.ubam      --output=WQ124/WQ124.UMI.ubam     --read-structure=1S3M1S+T 1S3M1S+T     --single-tag=RX     --molecular-index-tags=ZA ZB
bin/samtools **fastq** WQ124/WQ124.UMI.ubam | bin/bwa **mem** -t 4 -p database/bwa/hg19_chr.fa /dev/stdin | bin/samtools **view** -b >WQ124/WQ124.UMI.bam && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/picard.jar **MergeBamAlignment**   ORIENTATIONS=FR     R=database/bwa/hg19_chr.fa  UNMAPPED_BAM=WQ124/WQ124.UMI.ubam   ALIGNED_BAM=WQ124/WQ124.UMI.bam     O=WQ124/WQ124.UMI.merged.bam    CREATE_INDEX=true   MAX_GAPS=-1     ALIGNER_PROPER_PAIR_FLAGS=true  VALIDATION_STRINGENCY=SILENT    SO=coordinate   ATTRIBUTES_TO_RETAIN=XS

java -Djava.io.tmpdir=TMP -jar bin/fgbio-1.1.0.jar **GroupReadsByUmi**   --input=WQ124/WQ124.UMI.merged.bam   --output=WQ124/WQ124.UMI.group.bam   --strategy=paired --min-map-q=30 --edits=1 --raw-tag=RX && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/fgbio-1.1.0.jar **CallMolecularConsensusReads**   --input=WQ124/WQ124.UMI.group.bam   --output=WQ124/WQ124.UMI.con.ubam   -M 1   --min-input-base-quality=20 && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/fgbio-1.1.0.jar **FilterConsensusReads**   --input=WQ124/WQ124.UMI.con.ubam   --output=WQ124/WQ124.UMI.con.filter.ubam   --ref=database/bwa/hg19_chr.fa   --min-reads=2 1 1   --max-read-error-rate=0.05   --max-base-error-rate=0.1   --min-base-quality=20   --max-no-call-fraction=0.1 && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/picard.jar **SamToFastq**   I=WQ124/WQ124.UMI.con.filter.ubam   F=WQ124/WQ124.UMI.con.filter.fq   INTERLEAVE=true && bin/bwa mem -t 4 -p   database/bwa/hg19_chr.fa   WQ124/WQ124.UMI.con.filter.fq | bin/samtools **view** -b >WQ124/WQ124.UMI.con.filter.bam && java   -Djava.io.tmpdir=TMP -Xmx100G -jar bin/picard.jar **SortSam**   O=WQ124/WQ124.UMI.con.filter.sort.ubam   I=WQ124/WQ124.UMI.con.filter.ubam   SO=queryname && java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/picard.jar SortSam   O=WQ124/WQ124.UMI.con.filter.sort.bam   I=WQ124/WQ124.UMI.con.filter.bam   SO=queryname
java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/picard.jar **MergeBamAlignment**    ORIENTATIONS=FR         UNMAPPED=WQ124/WQ124.UMI.con.filter.sort.ubam         ALIGNED=WQ124/WQ124.UMI.con.filter.sort.bam         Oalign/WQ124.UMI.con.bam         R=database/bwa/hg19_chr.fa         SO=coordinate         ALIGNER_PROPER_PAIR_FLAGS=true         MAX_GAPS=-1         VALIDATION_STRINGENCY=SILENT         CREATE_INDEX=true   ATTRIBUTES_TO_RETAIN=XS`

java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/fgbio-1.1.0.jar **ClipBam**   --input=align/WQ124.UMI.con.bam   --output=align/WQ124._**UMI.con.clipped.bam**_   --ref=database/bwa/hg19_chr.fa   --clip-overlapping-reads=true

UMI.con.clipped.bam is the bam for vardict variant calling

IanSudbery commented 3 years ago

The primary difference between your UMI-tools pipeline and the fgbio pipeline here is that fgbio computes a consensus read for each UMI group, while UMI-tools selects a representative read. Thus the number of reads going into the vardict variant calling should be similar (or the same if fgbio really have correctly implemented the same alogrithmn as us), but the sequences of those reads may well be different.

worker000000 commented 3 years ago

@IanSudbery thanks a lot but I still do not know the difference of a representative read and a consensus read

but the sequences of those reads may well be different.

does here mean very different? why this happen, which one is more accurate?

IanSudbery commented 3 years ago

When UMI-tools find a group of reads it believes come from the same original molecule, it selects one of those reads (with the highest qualities and the lowest number of mapping locations) to output and discards the rest.

When CallMoecularConsensusReads finds a group of reads it believes come from the same original molecule, it aligns all the reads together, looks at each position, and creates a consensus call. I don't know the precise algorithm used, but for example, if 10 of the reads had A at position one, 5 T and 2 C, the consensus read would have A. I suspect the actual algorithmn used is more sophisticated than that.

worker000000 commented 3 years ago

thanks a lot@ IanSudbery so umitools does not make the consensus reads, does this may not make full use of UMI, because the origin purpose of umi is to get the consensus reads, I just suspect, different tools may have different design

IanSudbery commented 3 years ago

I'm not sure about "original purpose": UMIs were first proposed as a method of distinguishing PCR duplicates in RNA quantification in 2003 (Hug and Schuler, 2003), and as far as I can tell, the first use of them in high-throughput sequencing was iCLIP (Koing et al 2010), rapidly followed by uses in RNA-seq, particularly single cell RNAseq, ChIP-seq, ChIP-exo, 4C. In these applications the sequence of the reads is not important, only their alignment location and number. It was these applications we had in mind when we first created UMI-tools (indeed, the original prototype was created to analyze an iCLIP dataset).

The use of UMIs for constructing molecular consensus reads in amplicon sequencing didn't come until later, and the fgbio tools for UMIs were created with this application in mind.

IanSudbery commented 3 years ago

As for which is most accurate - we've never really done any comparisons. fgbio says that they implement the directional adjacency approach, but we've never checked how similar the results are (I'm not sure many fgbio users use that mode). In theory calling consensus reads should be more accurate than just selecting one read, but it does depend on how the consensus is created, and its not clear to me how that works in fgbio.

The other big difference is that fgbio does a lot of filtering automatically, where as we leave any filtering to the user. If I had to guess I'd guess that this filtering is more likely to be causing the excess of variants in the UMI-tools data than the consensus read calling.

worker000000 commented 3 years ago

@IanSudbery Thanks a lot, I tried to find many filters in umi_tools, but just find such --edit-distance-threshold --mapping-quality=MAPPING_QUALITY Minimum mapping quality for a read to be retained [default=0] --unmapped-reads=UNMAPPED_READS How to handle unmapped reads. Options are 'discard', 'use' or 'correct' [default=discard] --chimeric-pairs=CHIMERIC_PAIRS How to handle chimeric read pairs. Options are 'discard', 'use' or 'correct' [default=use] --unpaired-reads=UNPAIRED_READS

IanSudbery commented 3 years ago

We don't really implement filters - we take the Unix philosophy of doing one thing. You'll need to use picard or samtools.

worker000000 commented 3 years ago

We don't really implement filters - we take the Unix philosophy of doing one thing. You'll need to use picard or samtools.

thanks a lot for your kind help