BGI-shenzhen / VCF2Dis

VCF2Dis: A new simple and efficient software to calculate p-distance matrix and construct population phylogeny based on Variant Call Format
MIT License

some issue about nan #4

Open shenhaizhongdechanrao opened 7 months ago

shenhaizhongdechanrao commented 7 months ago

Hello, I am trying to construct a tree from a VCF file produced by merging with GATK and bcftools. I ran the VCF2Dis command to generate a .mat file, but the result contains many '-nan' values. Can you give me some suggestions?

hewm2008 commented 7 months ago

You have too few VCF sites and too many missing genotypes. I recommend generating the VCF directly by merging gVCFs instead of merging with bcftools, because bcftools merging causes many sites to be missing.
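To check how many genotypes are missing per sample, something like the following could be run on the merged VCF. This is only a minimal sketch: `per_sample_missing` is a hypothetical helper, not part of VCF2Dis, and it assumes an uncompressed, tab-separated VCF in which GT is the first FORMAT subfield and any genotype containing `.` counts as missing.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of VCF2Dis): report missing genotypes per sample.
# Output columns: sample name, missing genotypes, total sites, missing fraction.
per_sample_missing() {
    awk -F'\t' '
        /^##/ { next }                          # skip meta-information lines
        /^#CHROM/ {                             # header line: samples start at column 10
            for (i = 10; i <= NF; i++) name[i] = $i
            ncol = NF
            next
        }
        {
            sites++
            for (i = 10; i <= ncol; i++) {
                split($i, field, ":")           # GT is assumed to be the first subfield
                if (field[1] ~ /\./) miss[i]++  # "./.", ".|." or "." is treated as missing
            }
        }
        END {
            for (i = 10; i <= ncol; i++)
                printf "%s\t%d\t%d\t%.4f\n", name[i], miss[i], sites, (sites ? miss[i] / sites : 0)
        }
    ' "$1"
}
```

Usage would be `per_sample_missing merged.vcf`, where `merged.vcf` is a placeholder file name. A sample whose missing fraction is close to 1 shares almost no called sites with the others, which is one plausible way a distance could end up undefined.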

chanity256 commented 1 month ago

Hello, I have run into the same problem. The generated .mat file contains many nan values, so I cannot build the tree. Here is my code:

1. Merge all gVCFs and run joint calling

# Find all GVCF files
gvcf_files=$(find "$gvcfgz_dir" -type f -name "*.gvcf.gz")

# Build the input file list and run the Sentieon command
$SENTIEON_INSTALL_DIR/bin/sentieon driver -t $nt -r $reference \
--algo GVCFtyper \
$(for file in $gvcf_files; do echo -n "-v $file "; done) \
$output_vcf/${name_merged}.vcf

2. Run SelectVariants on the merged VCF to extract SNPs

gatk --java-options "-Xmx50g" SelectVariants -R $reference -select-type-to-include SNP -V $output_vcf/${name_merged}.vcf -O $output_vcf/${name_merged}.snp.vcf

3. Hard-filter SNPs with VariantFiltration and remove low-quality SNPs (that is, the rows marked SNP_Filter)

gatk --java-options "-Xmx50g" VariantFiltration -V $output_vcf/${name_merged}.snp.vcf --filter-expression 'QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' --filter-name 'SNP_Filter' -O $output_vcf/${name_merged}.snp.filtering.vcf

less $output_vcf/${name_merged}.snp.filtering.vcf | grep -v "SNP_Filter" > $output_vcf/${name_merged}.snp.filtered.vcf

4. Filter again with vcftools

vcftools --vcf $output_vcf/${name_merged}.snp.filtered.vcf --max-missing 0.2 --minQ 30 --remove-indels --min-alleles 2 --max-alleles 2 --maf 0.05 --recode --recode-INFO-all --out $output_vcf/${name_merged}.snp.filtered.miss0.2maf0.05.vcf

Is the problem caused by my vcftools filter settings?

hewm2008 commented 1 month ago

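In this case the problem is in your data. Most likely one sample in your VCF has severe missingness: either its sequencing depth is far too low, or the sample is an outgroup so distant that it barely aligns to the reference. I recommend filtering out samples in which fewer than 50% of reads have mapping quality > 10.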
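To find which sample is responsible, the distance matrix itself can be scanned for nan entries. The sketch below is hedged: `nan_per_sample` is a hypothetical helper, and it assumes the .mat file is laid out PHYLIP-style (first line is the sample count, then one whitespace-separated row per sample starting with the sample name). Check the actual file before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of VCF2Dis): count nan entries in each row of a
# distance matrix, assuming a PHYLIP-style layout, and list the worst rows first.
nan_per_sample() {
    awk '
        NR == 1 { next }                        # first line: number of samples (assumed)
        {
            n = 0
            for (i = 2; i <= NF; i++)           # column 1 is the sample name
                if (tolower($i) ~ /nan/) n++    # matches "nan" and "-nan"
            print $1 "\t" n
        }
    ' "$1" | sort -k2,2nr
}
```

Usage would be `nan_per_sample p_dis.mat`, where `p_dis.mat` is a placeholder file name. If one sample's row is almost entirely nan, removing that sample (for example with `vcftools --remove-indv`) and recomputing the matrix is a reasonable first test of the diagnosis above.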