*Closed: @Euphrasiologist closed this issue 3 weeks ago.*
Hi @Euphrasiologist, we did have that functionality, but it went away with the rewrite to use File-Backed Matrices. Have a look at the branch `fbm`. I have switched from using `snpBIN` objects, as they were simply too slow to scale. I now use `bigSNP` objects from the package `bigsnpr` to store the data on disk. That brings us close to PLINK performance, easily handling a thousand individuals with a million SNPs.

I have a few minor things to clean up, but you can have a look at the vignette in that branch as a walk-through of how things work now (sorry, I did ask Jason to let you know about the big changes, but the message must not have reached you in time).
Now, a few thoughts on how vcf reading should work.
The simplest option would be to convert the VCF to a BED. I could not find anything in R that does that, which is surprising. The right approach would be to read the VCF in chunks, extract the biallelic SNPs, and write the BED with `genio`. It has an append option which allows you to expand your BED; that way, we could read massive VCF files.
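A rough, untested sketch of that chunked VCF-to-BED idea, assuming `genio::write_bed()`'s append option works as described above and using `vcfR` for the chunked reading (the dosage conversion here is a naive placeholder):

```r
library(vcfR)
library(genio)

vcf_to_bed_chunked <- function(vcf_path, bed_path, chunk_size = 1000) {
  skip <- 0
  repeat {
    chunk <- read.vcfR(vcf_path, nrows = chunk_size, skip = skip, verbose = FALSE)
    if (nrow(chunk@fix) == 0) break
    chunk <- chunk[is.biallelic(chunk), ]       # keep biallelic SNPs only
    gt <- extract.gt(chunk)                     # genotype strings, loci x individuals
    # naive ALT-allele count per genotype; missing genotypes become NA
    dosage <- apply(gt, 2, function(x) {
      sapply(strsplit(x, "[/|]"), function(a) sum(suppressWarnings(as.integer(a))))
    })
    write_bed(bed_path, dosage, append = (skip > 0), verbose = FALSE)
    skip <- skip + chunk_size
  }
}
```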
Writing directly to an FBM is a bit trickier. We would have to first inspect the VCF, figure out the total number of biallelic sites (the columns in the matrix), and then initialise an FBM object (with an appropriate backing file). Then we would do the same as above: read the VCF in chunks and fill in the columns of the FBM.
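In outline, the two-pass FBM approach could look like this (a sketch, not tested; the counts are placeholders you would get from inspecting the VCF first):

```r
library(bigstatsr)

n_individuals <- 46      # e.g. from the #CHROM header line of the VCF
n_biallelic   <- 100000  # from a first pass over the file

# individuals are rows, loci are columns (as in bigsnpr)
fbm <- FBM(n_individuals, n_biallelic,
           type = "unsigned short",
           backingfile = tempfile("genotypes_"))

# then, for each chunk of the VCF:
# dosage_chunk is an individuals x loci matrix covering columns `cols`
# fbm[, cols] <- dosage_chunk
```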
Brilliant, thanks for the info. I'll take a look and give it a whirl.
Actually, you just inspired me to rescue a bit of the old `as_gen_tibble`. So, if you do a pull now on the `fbm` branch, you can do:
```r
vcf_path <- system.file("/extdata/anolis/punctatus_t70_s10_n46_filtered.recode.vcf.gz",
                        package = "tidypopgen")
bed_path <- gt_vcf_to_bed(vcf_path, bed_path = tempfile("anolis_"))
test_gt <- gen_tibble(bed_path)
```
But it really only works for smallish vcf files where the whole content can be kept in memory for the conversion.
Does your self assign mean you're happy to do this? I haven't started anything yet!
@Euphrasiologist no, very happy for someone else to have a go. I'll give you write access; just branch and pull request. If you want to be fancy, you could think about also writing directly to an FBM instead of going through BED.
@Euphrasiologist Have a look at the new method I wrote for `gen_tibble`, which allows you to create a `gen_tibble` from a matrix of genotypes (https://github.com/EvolEcolGroup/tidypopgen/blob/fbm/R/gen_tibble.R). I got bored of creating BED files for every test...

In principle, it could be adapted to a vcf, either in one go (as the old `as_gen_tibble` you suggested), or better, reading the vcf in chunks and filling in the FBM (which would allow for much bigger datasets).
I've accidentally committed straight to the `fbm` branch instead of `vcf`, sorry! I'll add a test tomorrow: 411c5e8
@Euphrasiologist No worries. I have moved the code under `gen_tibble`, as vcf files are then treated in the same way as .bed or .rds.
This looks like a very good option to write a function to read in vcf files directly: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10868310/
A sensible strategy is:
1. Open the VCF.
2. Read a chunk of x lines.
3. Extract biallelic SNPs.
4. Create an FBM with the right number of columns (markers) and rows (individuals) for this chunk.
5. Read the next chunk; extract biallelic SNPs.
6. Add columns to the FBM and fill them with the new SNPs.
7. Keep repeating until we get to the end of the file.
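The loop in those steps could be sketched with `bigstatsr`'s `add_columns()` method to grow the FBM one chunk at a time. This is untested, and `read_next_chunk()` is a hypothetical stand-in for the chunked reading and biallelic filtering:

```r
library(bigstatsr)

fbm <- NULL
repeat {
  # hypothetical: returns an individuals x loci dosage matrix, or NULL at EOF
  dosage <- read_next_chunk()
  if (is.null(dosage)) break
  if (is.null(fbm)) {
    # first chunk: create the FBM with the right number of rows
    fbm <- FBM(nrow(dosage), ncol(dosage), backingfile = tempfile("geno_"))
    fbm[] <- dosage
  } else {
    # later chunks: grow by the chunk's column count, then fill
    first_new <- ncol(fbm) + 1
    fbm$add_columns(ncol(dosage))
    fbm[, first_new:ncol(fbm)] <- dosage
  }
}
```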
I have added a simple `count_vcf_variants` in the branch `vcf`. That should facilitate parsing a vcf in chunks.
I'd been working on something like this.
```r
library(vcfR)
library(data.table)
library(bigstatsr)

setDTthreads(threads = 2)

# check dimensions of the SNP matrix using data.table::fread()
# and iterate over the VCF in chunks, combining the number of
# rows to read with the number of rows to skip
vcf_path <- "~/Documents/software/tidypopgen/inst/extdata/anolis/punctatus_t70_s10_n46_filtered.recode.vcf.gz"
v <- fread(vcf_path)  # fread auto-detects the table start past the ## header lines
colnm <- colnames(v)
nrow_ <- nrow(v)

# split nrow_ into chunks of nrows, with the last chunk being the
# remainder (dropped if nrow_ is an exact multiple of nrows)
nrows <- 1000
chunks_vec <- c(rep(nrows, nrow_ %/% nrows), nrow_ %% nrows)
chunks_vec <- chunks_vec[chunks_vec > 0]

# iterate over the chunks, reading the VCF one chunk at a time
for (i in seq_along(chunks_vec)) {
  skip_i <- if (i == 1) 0 else sum(chunks_vec[seq_len(i - 1)])
  temp_vcf <- read.vcfR(vcf_path, nrows = chunks_vec[i], skip = skip_i)
  gt <- vcfR::extract.gt(temp_vcf)
  # ... todo
}
```
Do you think your implementation of reading the VCF would be faster than going over it once with `data.table::fread`?
Related note: for the life of me, I can't see how to append to an FBM (https://search.r-project.org/CRAN/refmans/bigstatsr/html/FBM-class.html). There's a method to add columns, but not rows. I was thinking it's probably easier to write the matrix back to disk, merge, then convert to FBM?
If all you are doing with `fread` is counting lines, then I would try the function I wrote, so that we don't bring in a dependency just to count lines. I think it should be pretty quick.
As for appending: luckily you are adding columns (loci are columns, individuals are rows), so a row in the VCF is a column in the file-backed matrix. Just add the columns and fill in the data.
For the record, you can't add rows easily to an FBM. I did it for `rbind`, and you end up having to transpose the files and append one to the other; pretty messy.
I thought that transposing might be the way, but that's great news. Brilliant. I'll get on that.
It might make sense to merge the `vcf` branch into `ploidy`, so that we can get vcf reading for diploids and polyploids working at the same time. Reading vcf is the last element (#19) (plus a couple more unit tests) for `ploidy` to be ready to be merged into `main`, so that we officially start supporting polyploidy (and thus think about it as we develop new functions).
Merged now (I hope), need to work on the function a bit more...
I added multiple ploidy to the vcf reading code, and I think that reading vcf in chunks is now fully functional and ploidy compatible. Some more testing would be wise, though. Also, we should benchmark the multiple-ploidy parser against the old specialised diploid version, to decide whether we can keep just the generic version or whether an optimised diploid version needs to be brought back (it would be trivial to add an `assume_diploid = TRUE` parameter if needed).
Okay, I'll have a look around for a largeish multiple ploidy VCF.
@Euphrasiologist Have a look at the ploidy issue, there is a good vcf associated with a book chapter I linked to.
Oops forgot about that!
We now have multiple tests for VCF with diploid organisms, checking against bed files. So, diploid vcfs should be good (thanks to @eviecarter33!). For multiple ploidy, we have a basic test in tests/testthat/test_ploidy_vcf.R, but we need to add a bit of depth to that test. The thing to do would be to have a look at that minimalistic VCF and check which individuals are diploid and which tetraploid, and make sure that we have parsed that correctly. Also, hand checking a couple of genotypes would be good. Once we have those tests, I think we could consider closing this issue, as we can then read VCFs of any ploidy.
The only other thing to consider is whether the parsing should be parallelised (we use an apply, but we could think about using a parallel apply to convert genotypes into dosages). I guess we should benchmark how quickly we parse large vcfs and decide whether it is needed.
In parallelising, do we care about the order of the records in the VCF?
Yes, I think we do, as a number of analysis care about order.
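For what it's worth, parallel apply functions such as `parallel::mclapply()` return results in the same order as the input, regardless of which worker finishes first, so ordering need not be a blocker. A sketch (here `split_genotype_columns` and `convert_to_dosage` are hypothetical helpers):

```r
library(parallel)

gt_chunks <- split_genotype_columns(gt)  # hypothetical: list of column blocks
# results come back in input order, so records stay in VCF order
dosage_list <- mclapply(gt_chunks, convert_to_dosage, mc.cores = 2)
dosage <- do.call(cbind, dosage_list)
```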
I have written a C++ parser that is lean and mean and is, in principle, quite a bit faster than the current parsing with `vcfR`. It is in the branch `vcf_fast`. But it needs some testing.
I believe you already have the functionality to do this? Would require a function like this:
Would something like that be sufficient?
Cheers, M