Closed ckandoth closed 5 years ago
malva-geno produced valid VCF output when I used your example/chr20.vcf
. But it produced empty VCF when used with fingerprint_snps.vcf
- a subset of the ExAC VCF downloaded from ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/subsets/. It includes an INFO/AF
tag, so I'm not sure what caused this issue. Worth debugging. I'll leave this issue open if you want to investigate. In the meantime, I will use a subset of SNPs from the 1000genome VCF as you do.
My subset of SNPs from the 1000genome VCF also produced an empty output VCF. This might have something to do with the way I subset SNPs for genotyping.
Here is how to reproduce:
# Fetch the 1000genomes phase3 VCF, and its tabix index:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz{.tbi,}
# Shortlist SNPs appropriate for genotyping (commonly heterozygous within subpopulations):
bcftools view --include 'EAS_AF>0.4 & EAS_AF<0.6 & EUR_AF>0.4 & EUR_AF<0.6 & AFR_AF>0.4 & AFR_AF<0.6 & AMR_AF>0.4 & AMR_AF<0.6 & SAS_AF>0.4 & SAS_AF<0.6' --exclude-types indels,mnps,ref,bnd,other ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz | bcftools annotate --remove ^INF/AF,INF/EAS_AF,INF/EUR_AF,INF/AFR_AF,INF/AMR_AF,INF/SAS_AF --output fingerprint_snps.vcf
# Limit the VCF to chr20 for use with the malva test data:
grep -E "^#|^20" fingerprint_snps.vcf > fingerprint_snps_chr20.vcf
# Download malva test dataset and build k-mer reference:
wget https://github.com/AlgoLab/malva/raw/master/example/data.tar.gz
tar -zxf data.tar.gz
mkdir kmc_tmp
kmc -m4 -k43 -fm chr20.sample.fa kmc.out kmc_tmp
# Run malva-geno to genotype the fingerprint SNPs shortlisted earlier:
malva-geno -k35 -r43 -b1 -f EUR_AF chr20.fa fingerprint_snps_chr20.vcf kmc.out > fingerprint_snps_chr20.genotyped.vcf
The fingerprint_snps_chr20.genotyped.vcf
above will contain only VCF header lines, and no variants listed. When I allow the input VCF to contain rare alleles, I seem to get variants in my output.
Hi, thanks for opening the issue. We'll look into it and we'll let you know asap.
Luca
@ldenti I was able to get it working by adding fake FORMAT and sample columns with GT
populated as 0|0
for all variants. The sample ID in my input VCF does not seem to matter since your output VCF will replace it with "DONOR".
perl -i -a -F'\t' -pe 'chomp; $s="\tFORMAT\tSAMPLE_ID" if(/^#CHROM/); $s="\tGT\t0|0" unless(/^#/); $_.="$s\n"' fingerprint_snps_chr20.vcf
malva-geno -k35 -r43 -b1 -f EUR_AF chr20.fa fingerprint_snps_chr20.vcf kmc.out > fingerprint_snps_chr20.genotyped.vcf
I see. Yes, the problem is that there must be the sample columns in the input VCF. Indeed, malva needs these columns to characterize each variant by computing its signatures.
Therefore if no sample columns are present in the input VCF or if you use only 0|0 as dummy column, then malva would not work properly and the output VCF would be meaningless.
So, if you want to use the VCF provided by the 1000genomesproject, then you have to download the ALL.chr*.
files and concatenate them with bcftools
(see this script).
From your paper: "For each variant allele, assign a signature in the form of a set of k-mers". You can do this with the ALT
in the VCF, and there is no need for GT
. You could technically use GT
to build a genotype likelihood model i.e. whether each variant is more commonly heterozygous/homozygous across individuals, but I don't see that described in the paper. Can you roughly describe how malva-geno uses GT
from the input VCF?
I really appreciate your insight. We intend to use malva to genotype the >350k pairs of FASTQs archived at MSKCC - and use that to identify new samples. If successful, it would become a critical QC step in our infrastructure. But most of our data is targeted sequencing or RNA, so we would benefit from using SNPs from ExAC instead of 1000genomes.
Yes, it is right: we assign a signature, that is a set of kmers, to each variant.
But actually we assign multiple signatures to the same variant. For example, let us assume to have two single-allelic SNPs that occur within k/2 bases. Then, to characterize each allele of each variant, we should use 2 different signature: one taking into account the ref allele of the close variant and the other taking into account the alternate allele. This is just an example but in same cases, there could be more than 2 variants that are close and they could also be multi-allelic variants. To avoid to make all the possible combinations (that can be exponential in k), we base our computation only on those combinations that actually occur in the population used to produce the VCF.
I hope this is clear, if not, let me know and I'll try to explain it in a clearer way. In any case, you can find a more detailed explanation in the supplemental material: see definition 1 and definition 2.
Maybe if the GT fields are not available we can just make a single signature without taking in consideration close variants. It should be easy to do, but I think that it could lower the global accuracy of the tool. If your files do not have GT fields, I could try to implement and test that.
Very clear. Thanks Luca. I'll close this ticket, because my original problem is resolved.
Nice work with malva. I tried this on several R1 FASTQs of paired-end 150x2 whole-genome sequencing datasets, and end up with empty VCFs in all cases. Can you help debug? Here are the commands used:
Logs from kmc reported:
Logs from malva-geno reported: