etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
540 stars 164 forks source link

snipfilter.sh: question #656

Open sotaro-kanematsu opened 3 years ago

sotaro-kanematsu commented 3 years ago

I tried to run the snpfilter.sh to obtain the vcf file for uncovering the allele-specific copy number alternations in the tumor samples. I prepared tumor bam file, common dbSNP vcf file, and hg38.fa, and run the shell, however sites.txt via bcftools view were null because _snp.vcf had the only header. my input vcf file is as follows. What is wrong?

fileformat=VCFv4.0

fileDate=20170710

source=dbSNP

dbSNP_BUILD_ID=150

reference=GRCh38.p7

phasing=partial

variationPropertyDocumentationUrl=ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=SubSNP->Batch.link_out">

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=5% minor allele frequency in each and all populations">

INFO=5% minor allele frequency in 1+ populations">

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

FILTER=

INFO=

INFO== 1% and for which 2 or more founders contribute to that minor allele frequency.">

INFO=

CHROM POS ID REF ALT QUAL FILTER INFO

chr1 10177 rs367896724 A AC . . RS=367896724;RSPOS=10177;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020005170026000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP;VLD;G5A;G5;KGPhase3;CAF=0.5747,0.4253;COMMON=1 chr1 10352 rs555500075 T TA . . RS=555500075;RSPOS=10352;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000020005170026000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP;VLD;G5A;G5;KGPhase3;CAF=0.5625,0.4375;COMMON=1 chr1 10352 rs145072688 T TA . . RS=145072688;RSPOS=10353;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP;CAF=0.5625,0.4375;COMMON=1 chr1 10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C . . RS=376342519;RSPOS=10617;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000020005040026000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP;VLD;KGPhase3;CAF=0.006989,0

tetedange13 commented 3 years ago

Hi @sotaro-kanematsu ,

Not an author of CNVkit, plus I do not know a lot about this script => But my wild guess would be: have you checked if chromosome names match between your input FASTA, BAM and VCF files ?

This note in the script code warns about the fact that this name check is not (yet) made automatically => You are talking about using "hg38.fa" when you VCF header indeed mentions "##reference=GRCh38.p7" (5th line) and they are supposed to have 2 different chromosome naming conventions ("chr1" vs "1") => Also snpfilter.sh is supposed to work without an input FASTA file, so maybe 1st try to use it with BAM+VCF only? Then add a FASTA if it is OK ?

Hope this helps. Have a nice day. Felix.

sotaro-kanematsu commented 3 years ago

Hi @tetedange13 Thank you for your kindly advice! I will check reference verson, and try again!

sotaro-kanematsu commented 2 years ago

Hi @tetedange13 I checked the reference version and tried again. I understood how to generate input vcf file when I call allele-specific copy number alternation using CNVkit. All I know is that CNVkit uses vcf file that contatained GT, AD, and DP. GT value is indispensable for the extraction of mutations in germline- heterozygous positions. AD and DP are indispensable for calculation of b-allele frequency. So, here is the main part I want to discuss, I am wondering whether snpfilter script works correctly. The script uses samtools mpileup, and I tried as script shows, however, the call vcf file is somewhat strange. It looks like bcf file ( so, I can read only when the file format is converted by bcftools view), and does not contain information about genotype. (samtools version: 1.11 (using htslib 1.11), bcftools version: 1.8 (using htslib1.8))

I use bcftools mpileup as follows. So now, bcftools run soooooo slow, however it works well, and the results CNVkit called are comparable with Sequenza, or Freec.

  1. gzip -dc normal.vcf.gz| grep -w "0/1|0|1"| cut -f1,2 >tmp; cat tmp|awk {'printf ("%s\t%s\t%s\n", $1,$2-1,$2)'} >sites.bed #extract the germline-heterozygous position
  2. time bcftools mpileup --ignore-RG --skip-indels --annotate AD,DP --count-orphans -f GRCh38.d1.vd1.fa -R sites.bed normal.bam tumor.bam |bcftools call -mv -Oz > normal_tumor.vcf.gz #joint variant calling
  3. cnvkit.py call tumor.cns -y --sample-id tumor.bam --normal-id normal.bam -v normal_tumor.vcf --purity X -m clonal --ploidy Y --filter ci -o tuomr.call_tmp.cns #addjust copy number using purity and ploidy value, and call allele-specific copy number alternations
  4. cnvkit.py segmetrics tumor.cnr -s tumor.call.vcf_tmp.cns --t-test -o tumor.recall.vcf.cns # calculate p-value