*Closed: @Euphrasiologist closed this issue 3 weeks ago.*
Hi @Euphrasiologist, we did have that functionality, but it went away with the rewrite to use File-Backed Matrices. Have a look at the branch `fbm`. I have switched from using `snpBIN` objects, as they were simply too slow to scale. I now use `bigSNP` objects from the package `bigsnpr` to store the data on disk. That brings us close to PLINK performance, easily handling a thousand individuals with a million SNPs.

I have a few minor things to clean up, but you can have a look at the vignette in that branch as a walk-through of how things work now (sorry, I did ask Jason to let you know about the big changes, but the message must not have reached you in time).
Now, a few thoughts on how vcf reading should work.
The simplest option would be to convert the VCF to a BED. I could not find anything in R that does that, which is surprising. The right approach would be to read the VCF in chunks, extract the biallelic SNPs, and write the BED with `genio`. It has an append option which allows you to expand your BED; that way, we could read massive VCF files.
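A rough, untested sketch of that chunked VCF-to-BED idea, assuming `genio::write_bed()`'s append option works as described above and using `vcfR` for the chunked reading (the dosage conversion here is a naive placeholder):

```r
library(vcfR)
library(genio)

vcf_to_bed_chunked <- function(vcf_path, bed_path, chunk_size = 1000) {
  skip <- 0
  repeat {
    chunk <- read.vcfR(vcf_path, nrows = chunk_size, skip = skip, verbose = FALSE)
    if (nrow(chunk@fix) == 0) break
    chunk <- chunk[is.biallelic(chunk), ]       # keep biallelic SNPs only
    gt <- extract.gt(chunk)                     # genotype strings, loci x individuals
    # naive ALT-allele count per genotype; missing genotypes become NA
    dosage <- apply(gt, 2, function(x) {
      sapply(strsplit(x, "[/|]"), function(a) sum(suppressWarnings(as.integer(a))))
    })
    write_bed(bed_path, dosage, append = (skip > 0), verbose = FALSE)
    skip <- skip + chunk_size
  }
}
```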
Writing directly to an FBM is a bit trickier. We would have to first inspect the VCF, figure out the total number of biallelic sites (the columns in the matrix), and then initialise an FBM object (with an appropriate backing file). Then we would do the same as above: read the VCF in chunks and fill in the columns of the FBM.
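In outline, the two-pass FBM approach could look like this (a sketch, not tested; the counts are placeholders you would get from inspecting the VCF first):

```r
library(bigstatsr)

n_individuals <- 46      # e.g. from the #CHROM header line of the VCF
n_biallelic   <- 100000  # from a first pass over the file

# individuals are rows, loci are columns (as in bigsnpr)
fbm <- FBM(n_individuals, n_biallelic,
           type = "unsigned short",
           backingfile = tempfile("genotypes_"))

# then, for each chunk of the VCF:
# dosage_chunk is an individuals x loci matrix covering columns `cols`
# fbm[, cols] <- dosage_chunk
```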
Brilliant, thanks for the info. I'll take a look and give it a whirl.
Actually, you just inspired me to rescue a bit of the old `as_gen_tibble`. So, if you do a pull now on the `fbm` branch, you can do:
```r
vcf_path <- system.file("/extdata/anolis/punctatus_t70_s10_n46_filtered.recode.vcf.gz",
                        package = "tidypopgen")
bed_path <- gt_vcf_to_bed(vcf_path, bed_path = tempfile("anolis_"))
test_gt <- gen_tibble(bed_path)
```
But it really only works for smallish vcf files where the whole content can be kept in memory for the conversion.
Does your self assign mean you're happy to do this? I haven't started anything yet!
@Euphrasiologist no, very happy for someone else to have a go. I'll give you write access; just branch and pull request. If you want to be fancy, you could think about also writing directly to an FBM instead of going through BED.
@Euphrasiologist Have a look at the new method I wrote for `gen_tibble`, which allows you to create a `gen_tibble` from a matrix of genotypes (https://github.com/EvolEcolGroup/tidypopgen/blob/fbm/R/gen_tibble.R). I got bored of creating BED files for every test...

In principle, it could be adapted to a vcf, either in one go (as the old `as_gen_tibble` you suggested), or better, reading the vcf in chunks and filling in the FBM (which would allow for much bigger datasets).
I've accidentally committed straight to the `fbm` branch instead of `vcf`, sorry! I'll add a test tomorrow: 411c5e8
@Euphrasiologist No worries. I have moved the code under `gen_tibble`, as vcf files are then treated in the same way as .bed or .rds.
This looks like a very good option to write a function to read in vcf files directly: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10868310/
A sensible strategy is:
1. Open the VCF.
2. Read a chunk of x lines.
3. Extract biallelic SNPs.
4. Create an FBM with the right number of columns (markers) and rows (individuals) for this chunk.
5. Read the next chunk; extract biallelic SNPs.
6. Add columns to the FBM and fill them with the new SNPs.
7. Keep repeating until we get to the end of the file.
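The loop in those steps could be sketched with `bigstatsr`'s `add_columns()` method to grow the FBM one chunk at a time. This is untested, and `read_next_chunk()` is a hypothetical stand-in for the chunked reading and biallelic filtering:

```r
library(bigstatsr)

fbm <- NULL
repeat {
  # hypothetical: returns an individuals x loci dosage matrix, or NULL at EOF
  dosage <- read_next_chunk()
  if (is.null(dosage)) break
  if (is.null(fbm)) {
    # first chunk: create the FBM with the right number of rows
    fbm <- FBM(nrow(dosage), ncol(dosage), backingfile = tempfile("geno_"))
    fbm[] <- dosage
  } else {
    # later chunks: grow by the chunk's column count, then fill
    first_new <- ncol(fbm) + 1
    fbm$add_columns(ncol(dosage))
    fbm[, first_new:ncol(fbm)] <- dosage
  }
}
```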
I have added a simple `count_vcf_variants` in the branch `vcf`. That should facilitate parsing a vcf in chunks.
I'd been working on something like this.
```r
library(vcfR)
library(data.table)
library(bigstatsr)

setDTthreads(threads = 2)

# check dimensions of the SNP matrix using data.table::fread()
# and iterate over the VCF in chunks, combining the number of
# rows to read with the number of rows to skip
vcf_path <- "~/Documents/software/tidypopgen/inst/extdata/anolis/punctatus_t70_s10_n46_filtered.recode.vcf.gz"
v <- fread(vcf_path)  # fread auto-detects the table start past the ## header lines
colnm <- colnames(v)
nrow_ <- nrow(v)

# split nrow_ into chunks of nrows, with the last chunk being the
# remainder (dropped if nrow_ is an exact multiple of nrows)
nrows <- 1000
chunks_vec <- c(rep(nrows, nrow_ %/% nrows), nrow_ %% nrows)
chunks_vec <- chunks_vec[chunks_vec > 0]

# iterate over the chunks, reading the VCF one chunk at a time
for (i in seq_along(chunks_vec)) {
  skip_i <- if (i == 1) 0 else sum(chunks_vec[seq_len(i - 1)])
  temp_vcf <- read.vcfR(vcf_path, nrows = chunks_vec[i], skip = skip_i)
  gt <- vcfR::extract.gt(temp_vcf)
  # ... todo
}
```
Do you think your implementation of reading the VCF would be faster than going over it once with `data.table::fread`?
Related note: for the life of me, I can't see how to append to an FBM (https://search.r-project.org/CRAN/refmans/bigstatsr/html/FBM-class.html). There's a method to add columns, but not rows. I was thinking it's probably easier to write the matrix back to disk, merge, then convert to FBM?
If all you are doing with `fread` is counting lines, then I would try the function I wrote, so that we don't bring in a dependency just to count lines. I think it should be pretty quick.
As for appending: luckily you are adding columns (loci are columns, individuals are rows), so a row in the VCF is a column in the file-backed matrix. Just add the columns and fill in the data.
For the record, you can't add rows easily to an FBM. I did it for `rbind`, and you end up having to transpose the files and append one to the other; pretty messy.
I thought that transposing might be the way, but that's great news. Brilliant. I'll get on that.
It might make sense to merge the `vcf` branch into `ploidy`, so that we can get vcf reading for diploids and polyploids working at the same time. Reading vcf is the last element (#19) (plus a couple more unit tests) for `ploidy` to be ready to be merged into `main`, so that we officially start supporting polyploidy (and thus think about it as we develop new functions).
Merged now (I hope), need to work on the function a bit more...
I added multiple ploidy to the vcf reading code, and I think that reading vcf in chunks is now fully functional and ploidy compatible. Some more testing would be wise, though. Also, we should benchmark the multiple-ploidy parser against the old specialised diploid version, to decide whether we can keep just the generic version or whether an optimised diploid version needs to be brought back (it would be trivial to add an `assume_diploid = TRUE` parameter if needed).
Okay, I'll have a look around for a largeish multiple ploidy VCF.
@Euphrasiologist Have a look at the ploidy issue, there is a good vcf associated with a book chapter I linked to.
Oops forgot about that!
We now have multiple tests for VCF with diploid organisms, checking against bed files. So, diploid vcfs should be good (thanks to @eviecarter33!). For multiple ploidy, we have a basic test in tests/testthat/test_ploidy_vcf.R, but we need to add a bit of depth to that test. The thing to do would be to have a look at that minimalistic VCF and check which individuals are diploid and which tetraploid, and make sure that we have parsed that correctly. Also, hand checking a couple of genotypes would be good. Once we have those tests, I think we could consider closing this issue, as we can then read VCFs of any ploidy.
The only other thing to consider is whether the parsing should be parallelised (we use an apply, but we could think about using a parallel apply to convert genotypes into dosages). I guess we should benchmark how quickly we parse large vcfs and decide whether it is needed.
In parallelising, do we care about the order of the records in the VCF?
Yes, I think we do, as a number of analysis care about order.
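For what it's worth, parallel apply functions such as `parallel::mclapply()` return results in the same order as the input, regardless of which worker finishes first, so ordering need not be a blocker. A sketch (here `split_genotype_columns` and `convert_to_dosage` are hypothetical helpers):

```r
library(parallel)

gt_chunks <- split_genotype_columns(gt)  # hypothetical: list of column blocks
# results come back in input order, so records stay in VCF order
dosage_list <- mclapply(gt_chunks, convert_to_dosage, mc.cores = 2)
dosage <- do.call(cbind, dosage_list)
```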
I have written a C++ parser that is lean and mean and is, in principle, quite a bit faster than the current parsing with `vcfR`. It is in the branch `vcf_fast`. But it needs some testing.
I believe you already have the functionality to do this? Would require a function like this:
Would something like that be sufficient?
Cheers, M