Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
24 stars 20 forks source link

Proposal: String representation/RleMatrix #38

Open jonocarroll opened 4 years ago

jonocarroll commented 4 years ago

I'd like to propose a couple of potential optimisations, particularly useful for cases where many entries are the same (e.g. AD: NA NA for GT: "./."). I'm working with a file which compresses nicely to .rds (~40MB) but is very large when in memory (17GB). This size becomes difficult to work with on a local machine or interactively.

Looping over the individual assays in a CompressedVcf I see that the largest ones are those which are matrices of lists (e.g. AD, AF, MBQ, ...). In theory (I think) they contain the same scale of information as GT, but their representation makes them much larger.

I appreciate that this list format makes working with the data cleaner than storing the entries as delimited strings (as I believe they natively are in the VCF) which likely requires parsing the string and splitting into a list-like structure anyway, but this is a tradeoff between size and usability, and applies to the entire dataset even if a user is only interested in a subset.

Would there be interest in a representation of CollapsedVcf list-matrices as either delimited strings (in which case they could be stored as RleMatrix for an additional compression boost; or RleListMatrix (which doesn't exist, but here's an open issue: https://github.com/Bioconductor/DelayedArray/issues/62)? I'm not well-versed enough in the Rle side to know if Rle provides a benefit when stored in a list, but it's an option.

Below is a comparison of object sizes for a toy example 'assay' constructed as a matrix of list elements and the size savings are potentially very large (in this case 1/62 the size).

## create a matrix of list elements
x <- matrix(
  replicate(
    2e4, 
    c(
      sample(c(NA, 1, 2), 1, prob = c(100, 1, 1)), 
      sample(c(NA, 1, 2), 1, prob = c(100, 1, 1))
    ), 
    simplify = FALSE), 
  ncol = 10
)
head(x)
#>      [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]     
#> [1,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [2,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [3,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [4,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [5,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [6,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#>      [,8]      [,9]      [,10]    
#> [1,] Numeric,2 Numeric,2 Numeric,2
#> [2,] Numeric,2 Numeric,2 Numeric,2
#> [3,] Numeric,2 Numeric,2 Numeric,2
#> [4,] Numeric,2 Numeric,2 Numeric,2
#> [5,] Numeric,2 Numeric,2 Numeric,2
#> [6,] Numeric,2 Numeric,2 Numeric,2

## collapse to singleton strings (credit: @lawremi)
v <- unlist(x)
v[is.na(v)] <- "."
x_str <- matrix(
  S4Vectors::unstrsplit(
    c(relist(v, x)), ","), 
  nrow(x), ncol(x))
head(x_str)
#>      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
#> [1,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [2,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [3,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [4,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [5,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [6,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."

suppressMessages(library(DelayedArray))

## Rle representation of matrix of list doesn't work
as(x, "RleArray")
#> Error in .Call2("Rle_constructor", values, lengths, PACKAGE = "S4Vectors"): Rle of type 'list' is not supported

## Rle representation of string matrix
x_rle <- as(x_str, "RleArray")
x_rle
#> <2000 x 10> matrix of class RleMatrix and type "character":
#>         [,1]  [,2]  [,3]  ... [,9]  [,10]
#> [1,]    ".,." ".,." ".,." .   ".,." ".,."
#> [2,]    ".,." ".,." ".,." .   ".,." ".,."
#> [3,]    ".,." ".,." ".,." .   ".,." ".,."
#> [4,]    ".,." ".,." ".,." .   ".,." ".,."
#> [5,]    ".,." ".,." ".,." .   ".,." ".,."
#> ...     .     .     .     .   .     .    
#> [1996,] ".,." ".,." ".,1" .   ".,." ".,."
#> [1997,] ".,." ".,." ".,." .   "2,." ".,."
#> [1998,] ".,." ".,." ".,." .   ".,." ".,."
#> [1999,] "1,." ".,." "1,." .   ".,." ".,."
#> [2000,] "2,." ".,." ".,." .   ".,." ".,."

## compare sizes
pryr::object_size(x)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> 1.44 MB
pryr::object_size(x_str)
#> 161 kB
pryr::object_size(x_rle)
#> 23 kB

Created on 2020-02-28 by the reprex package (v0.3.0)

I'm not across this package enough to have any insights into implementation issues or how this might affect other aspects of the inner workings (or user-side workings) but this approach reduces a 1.71GB CollapsedVcf into a 20MB object of the same class without destroying any information. Credit to @lawremi for guidance towards optimising the conversion and on structural advice.

lawremi commented 4 years ago

Ideally we would use a light weight representation (like the proposed RleListMatrix) that is more computable than a delimited string.

jonocarroll commented 4 years ago

I'm not sure whether or not the Rle benefit will work across lists, but if it can then that may solve all of the problems.

lawremi commented 4 years ago

I think it would in this case, because of how the Rle compression helps the character case.

jonocarroll commented 4 years ago

Does it work in that context because a matrix is just a vector with a dim so you can actually compress long runs of sequential elements? I'm not sure I see how that manifests for the 'vector of lists' case but perhaps something clever can be done in the unlisting/relisting sense to take advantage of that.

lawremi commented 4 years ago

It will be efficient as long as the RleList is a CompressedList, meaning that internally there is just a flat vector that is lazily partitioned into list elements.

jonocarroll commented 4 years ago

In that case it may even be slightly more performant than the string version as there would be even fewer unique values over which a run may compress. Nice!