dkoboldt / varscan

Variant calling and somatic mutation/CNV detection for next-generation sequencing data
152 stars 34 forks source link

varscan fpfilter input #10

Open serge2016 opened 8 years ago

serge2016 commented 8 years ago

Hello! What for do we have to use bam-readcount? If I am right, varscan already gets this info from input BAMs and stores it in it's VCF!

serge2016 commented 8 years ago

Could you provide a detailed instruction of all steps needed to run fpfilter (VarScan false-positive filter)?

According to Dan's post1 and my own thoughts, I have to run next steps:

  1. Convert BAMs into mpileup. -B - disables BAQ -- recommended on VarScan site samtools mpileup -f ref.fa -B -o tumor.mpileup tumor.bam samtools mpileup -f ref.fa -B -o normal.mpileup normal.bam
  2. Run VarScan somatic and get snp.vcf & indel.vcf. java -jar VarScan.v2.4.2.jar somatic normal.mpileup tumor.mpileup --output-snp snp.vcf --output-indel indel.vcf --min-coverage 3 --min-var-freq 0.08 --p-value 0.10 --somatic-p-value 0.05 --strand-filter 0 --output-vcf 1
  3. Merge & sort them and get all.vcf.
  4. Run VarScan processSomatic and get (isolate) high-confidence calls: all.Somatic.hc.vcf, all.Germline.hc.vcf & all.LOH.hc.vcf all both with SNPs and InDels. java -jar VarScan.v2.4.2.jar processSomatic all.vcf --min-tumor-freq 0.08
  5. Get mutation coordinates (var, NOT BED) for each of 3 .hc.vcf files using awk. The difference is in the VCF's as START in BED and here, see Dan's post2. Moreover, in the same post we see that we have to use <col2+1> as START and STOP for deletions! So: awk 'BEGIN {OFS="\t"} {if (!/^#/) { isDel=(length($4) > length($5)) ? 1 : 0; print $1,($2+isDel),($2+isDel); }}' all.Somatic.hc.vcf > all.Somatic.hc.var and so on. The Dan's phrase "And if you want multiple positions reported, use multiple lines:" still remains unclear for me. Do I have to gerenate several lines for each insertion and deletion of more than 1 bp?
  6. Run bam-readcounter and get 3 .hc.readcount files. For somatic mutations, use readcounts from the tumor BAM. For LOH and germline variants, use readcounts from the normal BAM (see Dan's post3 and ppt2: -q INT - minimum mapping quality of reads used for counting [0] -b INT - minimum base quality at a position to use the read for counting [0] bam-readcount -q 1 -b 20 -l all.Somatic.hc.var -f ref.fa tumor.bam > all.Somatic.hc.readcount bam-readcount -q 1 -b 20 -l all.Germline.hc.var -f ref.fa normal.bam > all.Germline.hc.readcount bam-readcount -q 1 -b 20 -l all.LOH.hc.var -f ref.fa normal.bam > all.LOH.hc.readcount -b 15 may be found here.
  7. Run VarScan fpfilter for 3 .hc.vcf files. java -jar VarScan.v2.4.2.jar fpfilter all.Somatic.hc.vcf all.Somatic.hc.readcount --output-file all.Somatic.hc.filtered.vcf --dream3-settings and so on. Flags --min-ref-avgrl NN --min-var-avgrl NN are necessary if (trimmed) read length is less than default 90?
  8. Merge & sort to get all.hc.filtered.vcf.

Other mentions of var file creation:

VCF -> BED from site3 for SNVs & InDels: awk 'BEGIN {OFS="\t"} {if (!/^#/) { InsDelLen=(length($4) > length($5)) ? length($4) : length($5); print $1,($2-1),($2-1+InsDelLen); }}' all.vcf > all.bed, where we use from VCF as START and <col2-1+max(len(col4),len(col5))> as STOP position

iranmdl commented 7 years ago

Any news here? Is the workflow suggested by @serge2016 the correct/recommended one?

Neato-Nick commented 7 years ago

I didn't think that processSomatic and other subsequent commands would work on VCF's - the original documentation says to use VarScan's file format. Anyone else have updates? I'll be giving this workflow a shot.

serge2016 commented 7 years ago

@Neato-Nick, I use exactly this workflow. And VarScan works with VCF file format. Do you have any evidence that it doesn't?

Neato-Nick commented 7 years ago

Oh, awesome! It's just not mentioned in the documentation that subsequent scripts after the initial calling from Pileups would take VCF input. Good to know you've been using this flow for a while!

On Aug 14, 2017 9:49 PM, "Sergey Mitrfanov" notifications@github.com wrote:

@Neato-Nick https://github.com/neato-nick, I use exactly this workflow. And VarScan works with VCF file format. Do you have any evidence that it doesn't?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dkoboldt/varscan/issues/10#issuecomment-322376653, or mute the thread https://github.com/notifications/unsubscribe-auth/AFlB0j7kk95FfZ5RLL8WR3frmX8LtWuYks5sYSNWgaJpZM4JSFIb .

nileshgardi2688 commented 6 years ago

Is the workflow suggested by @serge2016 correct? Should I used this workflow for my analysis or any new correction need to be done? Kindly give answer to the issue raised by @serge2016 regarding indel positions in point no. 5.

Which one is correct for indels?

  1. awk 'BEGIN {OFS="\t"} {if (!/^#/) { isDel=(length($4) > length($5)) ? 1 : 0; print $1,($2+isDel),($2+isDel); }}' all.Somatic.hc.vcf > all.Somatic.hc.var OR
  2. awk 'BEGIN {OFS="\t"} {if (!/^#/) { InsDelLen=(length($4) > length($5)) ? length($4) : length($5); print $1,($2-1),($2-1+InsDelLen); }}' all.vcf > all.bed
serge2016 commented 6 years ago

Here is one more guide, but right now it describes rather old versions: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/#varscan

The key difference is in the very first step, samtools mpileup. GDC (TCGA) suggests to process both normal and tumor BAMs together and this provides different results.

Second thing, that I have figured out, is that VarScan's 'defaults' are different in the software description/help and in it's code. So we should specify values far all (even default) parameters.

Third thing is standalone FPFilter tool: https://github.com/ucscCancer/fpfilter-tool. It may be useful, but, as I understood and checked, it does the same as my point 5, 6 and partly 7.

nileshgardi2688 commented 6 years ago

@serge2016 Thanks for prompt reply. Firstly, I ran samtools mpileup using Tumor and Normal together as suggested by you. Second, I need to check for the defaults parameters and parameters in script to better understanding.

qins commented 6 years ago

Nice post, the tool is lack of documents.

nileshgardi2688 commented 6 years ago

Hello, I have basic question in varscan2 mutation calling. varscan2 divide total SNPs into germline, somatic, LOH and unknown based on allele frequency. My question is if all 3 genomes (i.e. hg19, blood and tumor) have different nucleotide (at same position), where do that SNP should be classify?

For e.g. 10th position of Chromosome 1 have "A" in hg19, Blood sample have "G" and tumor have "T". Coverage is adequate in both blood and tumor sample. Where do I classify this type of mutation?