Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation

readVcf error: if (NROW(flesh) != flesh_len2) stop("shape of 'skeleton' is not compatible with 'NROW(flesh)'") #71

Closed jr1320 closed 1 year ago

jr1320 commented 1 year ago

When using VariantAnnotation::readVcf(path, "hg19") to read a large VCF file, specifically dbSNP 155 sourced from https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz, the call fails with the error below.

error output:

Error in if (NROW(flesh) != flesh_len2) stop("shape of 'skeleton' is not compatible with 'NROW(flesh)'") :
  missing value where TRUE/FALSE needed
Calls: <Anonymous> ... CharacterList -> <Anonymous> -> .coerceToList -> relist -> relist
In addition: Warning message:
In PartitioningByEnd(from) :
  integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'
Execution halted

R version 4.2.2; bioconductor-variantannotation 1.44.0 (build r42hc0cfd56_0, from bioconda)

thanks

mtmorgan commented 1 year ago

Please provide the output of the R command sessionInfo(), along with the exact command you are using.

readVcf(path, "hg19") on this file implies a very large memory (Linux) machine. Likely what we are seeing is an integer overflow; this should be detected and handled in a more informative way, or fixed.
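A minimal illustration of that failure mode (this just reproduces the warning text at the R level, not the actual readVcf() internals):

cumsum(c(2000000000L, 2000000000L))
# [1] 2000000000         NA
# Warning message:
# In cumsum(c(2000000000L, 2000000000L)) :
#   integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

The NA produced by the overflow is then likely what trips the "missing value where TRUE/FALSE needed" error in the if () condition.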

But I wonder what your scientific question is, and whether it really requires reading such a large file into memory? If you were querying this for specific regions, then one might indexVcf() and then just read the (small number of) regions of interest into R using the param= argument of readVcf(). If one wanted to process the entire file, then one might want to iterate through the file in chunks (e.g., section 6.2 of the GenomicFiles vignette) or filter the overall VCF to just those variants of interest.
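A minimal sketch of the region-query approach (the region below is hypothetical, and dbSNP 155 names its sequences with RefSeq accessions such as NC_000001.11, so adjust to your own regions of interest):

suppressPackageStartupMessages(library(VariantAnnotation))
## compress / index once; indexVcf() returns a VcfFile knowing its .tbi index
vcf_file <- indexVcf("GCF_000001405.40.gz")
## hypothetical region of interest on chromosome 1 (RefSeq naming)
roi <- GRanges("NC_000001.11", IRanges(1000000, 2000000))
## read only the records overlapping that region; genome tag as in the
## original readVcf() call
vcf <- readVcf(vcf_file, "hg19", param = ScanVcfParam(which = roi))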

jr1320 commented 1 year ago

I am using an HPC cluster so I have access to a large amount of memory, and I haven't encountered issues with similar files yet. I am currently using the readVcf() command to read the whole VCF into memory, then using a while loop to convert chunks at a time to a data.frame and pull data from that, as I want summary stats on the file as a whole. I don't think I can give sessionInfo() as such, but I can provide details of the conda env the job is using?

mtmorgan commented 1 year ago

In the script you're using, you could replace readVcf() and all subsequent lines with sessionInfo() and submit the job... this would tell me the version of R, as well as of VariantAnnotation and relevant packages (in particular Rsamtools and Rhtslib; generally one would like BiocManager::valid() to return TRUE), and the operating system (integer overflows, which I suspect are at the root of this problem, are particularly common on 32-bit Windows architectures).
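Concretely, the submitted script could be reduced to something like this (a sketch; the library() call stands in for whatever your script already attaches):

suppressPackageStartupMessages(library(VariantAnnotation))
## report R, package, and OS versions for the batch environment
sessionInfo()
## TRUE when the Bioconductor installation is internally consistent
BiocManager::valid()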

But I would not adopt your strategy of reading the whole file into memory. Instead, I would download the file and its index (with .tbi extension), then in R attach the VariantAnnotation and GenomicFiles packages.

suppressPackageStartupMessages({
    library(VariantAnnotation)
    library(GenomicFiles)
})

I would then open the file so that it reads 'chunks' of, say, a million records at a time

vcf_file <- VcfFile("~/tmp/GCF_000001405.40.gz", yieldSize = 1000000)

I'd then write a function that summarized a chunk (it seems like you've already done that)

summarize_vcf <-
    function(vcf)
{
    message(".") # a little progress indicator...
    ## any computation on this 'chunk'
    data.frame(n = nrow(vcf))
}

and then iterate through the file in chunks of a million records at a time

result <- reduceByYield(vcf_file, readVcf, summarize_vcf, `c`)

This would give a list of data.frame (or whatever summarize_vcf() returns), one for each chunk, that could then be combined with, e.g., do.call(rbind, result).

This approach is not likely to take meaningfully longer than reading the entire file into memory and then performing summaries, since you are doing the same amount of work. It does not require a large-memory machine (I can do it on my laptop, for instance), so you have access to many more resources (all the standard nodes in your HPC), for instance for parallel processing of several files. It also allows incremental development of summarize_vcf(), as sketched below: use open(vcf_file) and then vcf <- readVcf(vcf_file) to read the first million records, and use vcf to develop and test summarize_vcf().
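That interactive development loop might look like this (a sketch; summarize_vcf() and vcf_file are as defined above):

## open the connection so successive readVcf() calls advance through the file
open(vcf_file)
vcf <- readVcf(vcf_file)   # the first million records only
summarize_vcf(vcf)         # develop / test against this one chunk
close(vcf_file)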

mtmorgan commented 1 year ago

@hpages is this warning

In addition: Warning message:
In PartitioningByEnd(from) :
  integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

(which must really be an error...) avoidable in PartitioningByEnd()?

hpages commented 1 year ago

Basically the AtomicList constructors (defined in IRanges) are not cautious enough when the input is too big:

library(IRanges)
x <- RleList(Rle(8, 2e9), Rle(5, 2e9))
# Error in if (NROW(flesh) != flesh_len2) stop("shape of 'skeleton' is not compatible with 'NROW(flesh)'") : 
#   missing value where TRUE/FALSE needed
# In addition: Warning message:
# In PartitioningByEnd(from) :
#   integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

This is a situation where one is supposed to use compress=FALSE:

x <- RleList(Rle(8, 2e9), Rle(5, 2e9), compress=FALSE)
x
# RleList of length 2
# [[1]]
# numeric-Rle of length 2000000000 with 1 run
#   Lengths: 2000000000
#   Values :          8
#
# [[2]]
# numeric-Rle of length 2000000000 with 1 run
#   Lengths: 2000000000
#   Values :          5

I improved this in IRanges 2.33.1 (see https://github.com/Bioconductor/IRanges/commit/d7ea1516af669b6eac080844ebacb8f218619065):

x <- RleList(Rle(8, 2e9), Rle(5, 2e9))
# Error in RleList(Rle(8, 2e+09), Rle(5, 2e+09)) : 
#   input of RleList() is too big for 'compress=TRUE' (the cumulated length
#   of the list elements in the resulting CompressedRleList object would
#   exceed 2^31); please call the constructor with 'compress=FALSE' instead

Maybe we should consider supporting compress=NA (and making it the default): it would do compress=TRUE if the input is not too big, otherwise it would fall back to compress=FALSE.

In the meantime, code that makes use of the AtomicList constructors and wants to support big objects needs to implement its own check before calling the constructor, in order to decide whether to use compress=TRUE or compress=FALSE.
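Such a check might look like this (a sketch; safe_RleList() is a hypothetical wrapper, not part of IRanges):

library(IRanges)

safe_RleList <- function(...) {
    ## cumulated length of the list elements, computed in double precision
    total <- sum(vapply(list(...), function(x) as.numeric(length(x)), numeric(1)))
    ## compress only when the result would stay within the 2^31 - 1 limit
    RleList(..., compress = total <= .Machine$integer.max)
}

x <- safe_RleList(Rle(8, 2e9), Rle(5, 2e9))  # quietly falls back to compress=FALSE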

H.

jr1320 commented 1 year ago

Thanks for the help! It does seem the issue was caused by the large size of the VCF file.