fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License

fgbio vs. umi-tools with variants called by VarDict: umi-tools yields 8 times as many variants as fgbio #683

Closed: worker000000 closed 3 years ago

worker000000 commented 3 years ago

Thanks a lot for your comprehensive software.

I am calling variants with the VarDict caller.

For targeted sequencing with a 100k panel, umi-tools yields 1105 variants and fgbio yields 257 variants.

For targeted sequencing with a 2000k panel, umi-tools yields nearly 80000 variants and fgbio yields nearly 10000 variants.

The fgbio documentation does not specify which deduplication algorithm is used to correct for PCR/sequencing errors in UMI sequences: http://fulcrumgenomics.github.io/fgbio/tools/latest/GroupReadsByUmi.html

Any advice/troubleshooting tips are appreciated!

fleharty commented 3 years ago

@worker000000 I'm not an fgbio developer, but I thought I would weigh in.

It's impossible to know here which tool is doing better than the other. It's important that you identify which variants are true positives and which are false positives. If you can't do that with your current data, it may be necessary to construct and sequence libraries for which you have good truth information.

From there, you can determine which tools are doing best, and then identify error modes from each of the tools.
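Once a truth set exists, one way to make this comparison concrete is to score each tool's calls against it. The sketch below is only illustrative: the function name `score_calls`, the tuple representation of a variant, and the toy variants are all assumptions made for this example and are not part of fgbio, umi-tools, or VarDict.

```python
def score_calls(called, truth):
    """Compare called variants to a truth set.

    Each variant is a (chrom, pos, ref, alt) tuple. This representation is an
    assumption for the example, not a format used by any of the tools.
    """
    called, truth = set(called), set(truth)
    true_pos = called & truth    # called and present in the truth set
    false_pos = called - truth   # called but absent from the truth set
    false_neg = truth - called   # in the truth set but not called
    precision = len(true_pos) / len(called) if called else 0.0
    recall = len(true_pos) / len(truth) if truth else 0.0
    return {
        "tp": len(true_pos),
        "fp": len(false_pos),
        "fn": len(false_neg),
        "precision": precision,
        "recall": recall,
    }


# Toy, made-up variants purely for illustration.
truth = [("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 50, "C", "T")]
tool_a = [("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr3", 10, "T", "G")]
tool_b = [("chr1", 100, "A", "T")]

for name, calls in [("tool_a", tool_a), ("tool_b", tool_b)]:
    print(name, score_calls(calls, truth))
```

A tool that reports more variants can still score worse on precision, which is why the raw counts alone do not say which pipeline is doing better.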

nh13 commented 3 years ago

@worker000000 there are a lot of details missing from your description of what you're trying to do. Is there one UMI, two UMIs, duplex sequencing? What strategy are you using in GroupReadsByUmi? Are you calling consensus reads after grouping or calling variants directly? Is this a germline sample, somatic sample, other? Do you have a truth sample with known variants (allele frequencies for somatic samples)? If so, can you verify the concordance? Have you looked at the overlap of called variants, and the ones unique to each method?

I think you need to go off and answer these questions and come back with data to support issues with the differences you're seeing, versus us doing this for you. Hopefully the above makes sense.
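For the overlap question, a minimal sketch of one way to split two call sets into shared and method-specific variants is shown here. It assumes plain-text VCF lines with the standard tab-separated CHROM, POS, ID, REF, ALT columns. The helper names `load_variant_keys` and `compare_call_sets` and the example records are hypothetical and made up for illustration.

```python
def load_variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from an iterable of VCF text lines."""
    keys = set()
    for line in vcf_lines:
        # Skip blank lines and header lines (both '##' metadata and '#CHROM').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Multi-allelic records list several ALT alleles separated by commas.
        for alt in alts.split(","):
            keys.add((chrom, pos, ref, alt))
    return keys


def compare_call_sets(calls_a, calls_b):
    """Split two sets of variant keys into shared and method-specific calls."""
    return {
        "shared": calls_a & calls_b,
        "only_a": calls_a - calls_b,
        "only_b": calls_b - calls_a,
    }


# Made-up records standing in for the VCFs produced from each pipeline.
fgbio_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "chr1\t200\t.\tG\tC,T\t.\tPASS\t.\n",
]
umitools_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "chr1\t200\t.\tG\tC\t.\tPASS\t.\n",
    "chr2\t300\t.\tC\tA\t.\tPASS\t.\n",
]

result = compare_call_sets(load_variant_keys(fgbio_vcf), load_variant_keys(umitools_vcf))
for label, variants in result.items():
    print(label, sorted(variants))
```

With real files, an open file handle can be passed in place of the lists. Equivalent indels written differently will not match under this exact-key comparison, so the VCFs may need to be normalized first.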

worker000000 commented 3 years ago

Thanks a lot for your kind and helpful suggestions, @nh13 @fleharty.

1. I am doing somatic calling on paired samples with paired-end (PE) sequencing, and I have UMIs in both reads.

2. I call variants after calling consensus reads.

3. There are a great many variants here and I do not have a truth dataset, but I guess the authors of both tools may have done this before.

4. The strategy I am using in GroupReadsByUmi:

    java -Djava.io.tmpdir=TMP -Xmx100G -jar bin/fgbio-1.1.0.jar GroupReadsByUmi --input=K123.UMI.merged.bam --output=K123.UMI.group.bam --strategy=paired --min-map-q=30 --edits=1 --raw-tag=RX

5. With umi-tools I run the following:

    umi_tools extract --bc-pattern=CNNNC --bc-pattern2=CNNNC --log=processed.log -I T_out.R1_TMP.fq.gz -S C_out.R1_TMP_umitools.fq.gz --read2-in=T_out.R2_TMP.fq.gz --read2-out=C_out.R2_TMP_umitools.fq.gz
    # then do bwa and samtools
    umi_tools dedup -I ${dir1}_T.sorted.bam --output-stats=deduplicated -S ${dir1}_T_deduplicated.bam