Zilong-Li / vcfppR

The fastest VCF/BCF parser in R https://doi.org/10.1093/bioinformatics/btae049
https://zilong-li.github.io/vcfppR/
Other
13 stars 3 forks source link

vcfcomp returns NAs #3

Closed kiran-lee closed 7 months ago

kiran-lee commented 7 months ago

Hi Zilong-Li,

Thank you for your software. I found it from your reply here.

I am currently trying to use vcfcomp to test concordance between some imputed genotypes and their original genotypes using the following script: library(vcfppR) res <- vcfcomp(test = "x.vcf.gz", truth = "y.vcf.gz", region="z|z", af="af.tsv", stats = "r2") str(res) write.table(res, file = "concordance", sep = "\t", col.names = T, row.names = T, quote = FALSE)

I used vcftools --freq to output the allele frequency for af.tsv.

However, this returns a dataframe fulls of NAs:

str(res) List of 2 $ samples: chr [1:25] "/fastdata/bop21kgl/RawData/LIMS26076p5/Clean_aligned/418-2054_221014_L003__all_mapped_rehead.bam" "/fastdata/bop21$ $ r2 : logi [1, 1:17] NA NA NA NA NA NA ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL .. ..$ : chr [1:17] "(0,1e-05]" "(1e-05,2e-05]" "(2e-05,5e-05]" "(5e-05,0.0001]" ... write.table(res, file = "concordance", sep = "\t", col.names = T, row.names = T, quote = FALSE)

proc.time() user system elapsed 190.199 5.955 196.606

Do you have a suggestion of where this is going wrong?

Zilong-Li commented 7 months ago

Hi Kiran,

I think the input region is not valid, which should be in the form of chr:start-end, or just chr.

I'll add validators for the input in next release. Thanks for the feedback.

Zilong-Li commented 7 months ago

Also make sure the af.tsv is valid, which should have a header line and five columns. If the af is calculated from the truth file, you can just make vcfcomp doing the job by ignoring the af option ie af=NULL

kiran-lee commented 7 months ago

Thank you so much for our email discussion troubleshooting my NAs.

We realised this was due to missing genotypes in the "truth" vcf file. Removing missing genotypes in the "truth" dataset beforehand produced a sensible r2 value. This was the script used:

library(vcfppR)

truth <- vcftable("x.vcf.gz", setid = T, info = F) test <- vcftable("y.vcf.gz", setid = T, format = "DS", info = F)

N <- 22 ### number of samples is 22 truth$gt <- truth$gt[,1:N] truth$samples <- test$samples[1:N]

remove sites with genotypes being NA in truth

w <- which(!is.na(rowSums(truth$gt))) truth$gt <- truth$gt[w,] truth[2:9] <- lapply(truth[2:9], function(a) a[w]) saveRDS(truth, file = "truth.rds")

names <- truth$samples[1:22] samples <- paste0(names, collapse = ",")

res <- vcfcomp(test = "y.vcf.gz" , truth = "truth.rds", stats = "r2", names = names, samples = samples, bins = c(0,1))

str(res) res