Empty VCF in output - Githubissues

ckandoth commented 5 years ago

Nice work with malva. I tried this on several R1 FASTQs of paired-end 150x2 whole-genome sequencing datasets, and end up with empty VCFs in all cases. Can you help debug? Here are the commands used:

mkdir sample_name
kmc -t16 -m64 -k43 -f /path/to/fastqs/fastq_name_S4_R1_001.fastq.gz sample_name/sample_name.res sample_name
malva-geno -k35 -r43 -b16 /path/to/reference/b37.fasta fingerprint_snps.vcf sample_name/sample_name.res > sample_name/sample_name.fingerprint.vcf

Logs from kmc reported:

Stats:
   No. of k-mers below min. threshold :    956248069
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :   1888206968
   No. of unique counted k-mers       :    931958899
   Total no. of k-mers                :  12689712777
   Total no. of reads                 :    116433462
   Total no. of super-k-mers          :    792031281

Logs from malva-geno reported:

[malva-geno/Reference parsing] Time elapsed 0s
[malva-geno/Reference processed] Time elapsed 6s
[malva-geno/VCF parsing (Bloom Filter construction)] Time elapsed 6s
[malva-geno/Processed 5000 variants] Time elapsed 18s
[malva-geno/Processed 6246 variants] Time elapsed 18s
[malva-geno/BF creation complete] Time elapsed 27s
[malva-geno/Reference BF construction] Time elapsed 27s
[malva-geno/Reference BF creation complete] Time elapsed 80s
[malva-geno/KMC output processing] Time elapsed 94s

ckandoth commented 5 years ago

malva-geno produced valid VCF output when I used your example/chr20.vcf. But it produced empty VCF when used with fingerprint_snps.vcf - a subset of the ExAC VCF downloaded from ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/subsets/. It includes an INFO/AF tag, so I'm not sure what caused this issue. Worth debugging. I'll leave this issue open if you want to investigate. In the meantime, I will use a subset of SNPs from the 1000genome VCF as you do.

ckandoth commented 5 years ago

My subset of SNPs from the 1000genome VCF also produced an empty output VCF. This might have something to do with the way I subset SNPs for genotyping.

Here is how to reproduce:

# Fetch the 1000genomes phase3 VCF, and its tabix index:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz{.tbi,}

# Shortlist SNPs appropriate for genotyping (commonly heterozygous within subpopulations):
bcftools view --include 'EAS_AF>0.4 & EAS_AF<0.6 & EUR_AF>0.4 & EUR_AF<0.6 & AFR_AF>0.4 & AFR_AF<0.6 & AMR_AF>0.4 & AMR_AF<0.6 & SAS_AF>0.4 & SAS_AF<0.6' --exclude-types indels,mnps,ref,bnd,other ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz | bcftools annotate --remove ^INF/AF,INF/EAS_AF,INF/EUR_AF,INF/AFR_AF,INF/AMR_AF,INF/SAS_AF --output fingerprint_snps.vcf

# Limit the VCF to chr20 for use with the malva test data:
grep -E "^#|^20" fingerprint_snps.vcf > fingerprint_snps_chr20.vcf

# Download malva test dataset and build k-mer reference:
wget https://github.com/AlgoLab/malva/raw/master/example/data.tar.gz
tar -zxf data.tar.gz
mkdir kmc_tmp
kmc -m4 -k43 -fm chr20.sample.fa kmc.out kmc_tmp

# Run malva-geno to genotype the fingerprint SNPs shortlisted earlier:
malva-geno -k35 -r43 -b1 -f EUR_AF chr20.fa fingerprint_snps_chr20.vcf kmc.out > fingerprint_snps_chr20.genotyped.vcf

The fingerprint_snps_chr20.genotyped.vcf above will contain only VCF header lines, and no variants listed. When I allow the input VCF to contain rare alleles, I seem to get variants in my output.

ldenti commented 5 years ago

Hi, thanks for opening the issue. We'll look into it and we'll let you know asap.

Luca

ckandoth commented 5 years ago

@ldenti I was able to get it working by adding fake FORMAT and sample columns with GT populated as 0|0 for all variants. The sample ID in my input VCF does not seem to matter since your output VCF will replace it with "DONOR".

perl -i -a -F'\t' -pe 'chomp; $s="\tFORMAT\tSAMPLE_ID" if(/^#CHROM/); $s="\tGT\t0|0" unless(/^#/); $_.="$s\n"' fingerprint_snps_chr20.vcf
malva-geno -k35 -r43 -b1 -f EUR_AF chr20.fa fingerprint_snps_chr20.vcf kmc.out > fingerprint_snps_chr20.genotyped.vcf

ldenti commented 5 years ago

I see. Yes, the problem is that there must be the sample columns in the input VCF. Indeed, malva needs these columns to characterize each variant by computing its signatures.

Therefore if no sample columns are present in the input VCF or if you use only 0|0 as dummy column, then malva would not work properly and the output VCF would be meaningless.

So, if you want to use the VCF provided by the 1000genomesproject, then you have to download the ALL.chr*. files and concatenate them with bcftools (see this script).

ckandoth commented 5 years ago

From your paper: "For each variant allele, assign a signature in the form of a set of k-mers". You can do this with the ALT in the VCF, and there is no need for GT. You could technically use GT to build a genotype likelihood model i.e. whether each variant is more commonly heterozygous/homozygous across individuals, but I don't see that described in the paper. Can you roughly describe how malva-geno uses GT from the input VCF?

I really appreciate your insight. We intend to use malva to genotype the >350k pairs of FASTQs archived at MSKCC - and use that to identify new samples. If successful, it would become a critical QC step in our infrastructure. But most of our data is targeted sequencing or RNA, so we would benefit from using SNPs from ExAC instead of 1000genomes.

ldenti commented 5 years ago

Yes, it is right: we assign a signature, that is a set of kmers, to each variant.

But actually we assign multiple signatures to the same variant. For example, let us assume to have two single-allelic SNPs that occur within k/2 bases. Then, to characterize each allele of each variant, we should use 2 different signature: one taking into account the ref allele of the close variant and the other taking into account the alternate allele. This is just an example but in same cases, there could be more than 2 variants that are close and they could also be multi-allelic variants. To avoid to make all the possible combinations (that can be exponential in k), we base our computation only on those combinations that actually occur in the population used to produce the VCF.

I hope this is clear, if not, let me know and I'll try to explain it in a clearer way. In any case, you can find a more detailed explanation in the supplemental material: see definition 1 and definition 2.

Maybe if the GT fields are not available we can just make a single signature without taking in consideration close variants. It should be easy to do, but I think that it could lower the global accuracy of the tool. If your files do not have GT fields, I could try to implement and test that.

ckandoth commented 5 years ago

Very clear. Thanks Luca. I'll close this ticket, because my original problem is resolved.

AlgoLab / malva

Empty VCF in output #1