Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

merging VCF objects #3

Open lawremi opened 6 years ago

lawremi commented 6 years ago

It would be nice to have an efficient utility to merge two VCF objects that have differing (but often overlapping) sets of variants and samples. I guess the assay matrices would be filled with NAs. Maybe assume that the rowData() and colData() have the same columns, for now. Also assume that only the ranges and alt need to match (like VRanges matching). Currently we are using bcftools for this, but it would be nice to move to a more integrated workflow.

I guess a simple implementation would be to coerce the two (or more) VCF objects to VRanges objects, c() them together and then coerce back to VCF, preserving the info() and geno() distinction.

vobencha commented 6 years ago

Sounds like a good idea and you have some specific criteria in mind. Please feel free to implement or submit a pull request.

lawremi commented 6 years ago

Basically put this here to record it. Not to imply that anyone in particular should work on it.

baderd commented 5 years ago

I was also searching for a VCF merge functionality in R. This might be helpful for windows users who have a more difficult access to the Linux bash bcftools merge command.

bschilder commented 2 years ago

I would also find this functionality helpful. In my case, I'm simply trying to concatenate a list of VCFs that are actually just one VCF that's been chunked.

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") 
vcf <- VariantAnnotation::readVcf(fl)

vcf1 <- vcf[1:nrow(vcf)/2,]
vcf2 <- vcf[(nrow(vcf)/2+1):nrow(vcf),]
vcf_list <- list(vcf1, vcf2)

# Something like this?
## vcf_reconstructed <- do.call("rbind",vcf_list)

Thanks!, Brian

Session info

```` R version 4.2.0 (2022-04-22) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.3.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] MungeSumstats_1.4.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 matrixStats_0.62.0 [3] fs_1.5.2 bit64_4.0.5 [5] filelock_1.0.2 progress_1.2.2 [7] httr_1.4.3 rprojroot_2.0.3 [9] GenomeInfoDb_1.32.1 googleAuthR_2.0.0 [11] GenomicFiles_1.32.0 tools_4.2.0 [13] utf8_1.2.2 R6_2.5.1 [15] DBI_1.1.2 BiocGenerics_0.42.0 [17] withr_2.5.0 tidyselect_1.1.2 [19] prettyunits_1.1.1 bit_4.0.4 [21] curl_4.3.2 compiler_4.2.0 [23] cli_3.3.0 Biobase_2.56.0 [25] xml2_1.3.3 desc_1.4.1 [27] DelayedArray_0.22.0 rtracklayer_1.56.0 [29] rappdirs_0.3.3 stringr_1.4.0 [31] digest_0.6.29 Rsamtools_2.12.0 [33] R.utils_2.11.0 XVector_0.36.0 [35] pkgconfig_2.0.3 sessioninfo_1.2.2 [37] MatrixGenerics_1.8.0 dbplyr_2.1.1 [39] fastmap_1.1.0 BSgenome_1.64.0 [41] rlang_1.0.2 rstudioapi_0.13 [43] RSQLite_2.2.13 BiocIO_1.6.0 [45] generics_0.1.2 jsonlite_1.8.0 [47] BiocParallel_1.30.0 dplyr_1.0.9 [49] R.oo_1.24.0 VariantAnnotation_1.42.0 [51] RCurl_1.98-1.6 magrittr_2.0.3 [53] GenomeInfoDbData_1.2.8 Matrix_1.4-1 [55] Rcpp_1.0.8.3 S4Vectors_0.34.0 [57] fansi_1.0.3 lifecycle_1.0.1 [59] R.methodsS3_1.8.1 stringi_1.7.6 [61] yaml_2.3.5 SummarizedExperiment_1.26.1 [63] zlibbioc_1.42.0 brio_1.1.3 [65] BiocFileCache_2.4.0 grid_4.2.0 [67] blob_1.2.3 parallel_4.2.0 [69] crayon_1.5.1 lattice_0.20-45 [71] Biostrings_2.64.0 GenomicFeatures_1.48.0 [73] hms_1.1.1 KEGGREST_1.36.0 [75] pillar_1.7.0 GenomicRanges_1.48.0 [77] rjson_0.2.21 biomaRt_2.52.0 [79] stats4_4.2.0 pkgload_1.2.4 [81] XML_3.99-0.9 glue_1.6.2 [83] data.table_1.14.2 png_0.1-7 [85] vctrs_0.4.1 testthat_3.1.4 [87] purrr_0.3.4 assertthat_0.2.1 [89] cachem_1.0.6 restfulr_0.0.13 [91] gargle_1.2.0 tibble_3.1.7 [93] GenomicAlignments_1.32.0 AnnotationDbi_1.58.0 [95] memoise_2.0.1 IRanges_2.30.0 [97] ellipsis_0.3.2 ```
bschilder commented 2 years ago

From the docs I can see:

In the following code x is a VCF object, and ... is a list of VCF objects.

rbind combines objects with different ranges (rowRanges) and the same subjects (columns in assays). Columns with duplicate names in colData must contain the same data. The ‘Samples’ columns in colData (created by readVcf) are renamed with a numeric extension ordered as they were input to rbind e.g., “Samples.1, Samples.2, ...” etc.

But not quite sure how this is supposed to work in practice.

For example

rbind(vcf1,vcf2)

Yields:

Error in validObject(ans) : invalid class “CollapsedVCF” object: 
'elementMetadata' slot must contain a zero-column DataFrame at all time

Running VariantAnnotation::expand() on each subset doesn't seem to help either (same error).

vjcitn commented 2 years ago

After

library(VariantAnnotation)
example(VCF)
vcf

we have

> vcf
class: CollapsedVCF 
dim: 4 0 
rowRanges(vcf):
  GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 0 columns: 
geno(vcf):
  List of length 0: 
> rbind(vcf,vcf)
class: CollapsedVCF 
dim: 8 0 
rowRanges(vcf):
  GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 0 columns: 
geno(vcf):
  List of length 0: 

which seems correct. I can understand that there may be exceptions in certain circumstances. Please give a reproducible example, with sessionInfo:

> sessionInfo()
R version 4.2.0 RC (2022-04-15 r82206)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] VariantAnnotation_1.42.0    Rsamtools_2.12.0           
 [3] Biostrings_2.64.0           XVector_0.36.0             
 [5] SummarizedExperiment_1.26.1 Biobase_2.56.0             
 [7] GenomicRanges_1.48.0        GenomeInfoDb_1.32.1        
 [9] IRanges_2.30.0              S4Vectors_0.34.0           
[11] MatrixGenerics_1.8.0        matrixStats_0.62.0         
[13] BiocGenerics_0.42.0         rmarkdown_2.14             

loaded via a namespace (and not attached):
 [1] httr_1.4.3               bit64_4.0.5              assertthat_0.2.1        
 [4] BiocFileCache_2.4.0      blob_1.2.3               BSgenome_1.64.0         
 [7] GenomeInfoDbData_1.2.8   yaml_2.3.5               progress_1.2.2          
[10] pillar_1.7.0             RSQLite_2.2.13           lattice_0.20-45         
[13] glue_1.6.2               digest_0.6.29            htmltools_0.5.2         
[16] Matrix_1.4-1             XML_3.99-0.9             pkgconfig_2.0.3         
[19] REDCapR_1.0.0            biomaRt_2.52.0           zlibbioc_1.42.0         
[22] purrr_0.3.4              BiocParallel_1.30.0      tibble_3.1.7            
[25] KEGGREST_1.36.0          generics_0.1.2           ellipsis_0.3.2          
[28] cachem_1.0.6             GenomicFeatures_1.48.0   cli_3.3.0               
[31] magrittr_2.0.3           crayon_1.5.1             memoise_2.0.1           
[34] evaluate_0.15            fansi_1.0.3              xml2_1.3.3              
[37] tools_4.2.0              prettyunits_1.1.1        hms_1.1.1               
[40] BiocIO_1.6.0             lifecycle_1.0.1          stringr_1.4.0           
[43] DelayedArray_0.22.0      AnnotationDbi_1.58.0     compiler_4.2.0          
[46] rlang_1.0.2              grid_4.2.0               RCurl_1.98-1.6          
[49] rjson_0.2.21             rappdirs_0.3.3           startup_0.17.0          
[52] bitops_1.0-7             restfulr_0.0.13          DBI_1.1.2               
[55] curl_4.3.2               R6_2.5.1                 GenomicAlignments_1.32.0
[58] knitr_1.39               dplyr_1.0.9              rtracklayer_1.56.0      
[61] fastmap_1.1.0            bit_4.0.4                utf8_1.2.2              
[64] filelock_1.0.2           stringi_1.7.6            parallel_4.2.0          
[67] Rcpp_1.0.8.3             vctrs_0.4.1              png_0.1-7               
[70] dbplyr_2.1.1             tidyselect_1.1.2         xfun_0.30   
bschilder commented 2 years ago

Thanks , @vjcitn I think you may have missed my first post with the reprex: https://github.com/Bioconductor/VariantAnnotation/issues/3#issuecomment-1119501289

Just added the session info to it too.

vjcitn commented 2 years ago

I think you don't have the searchlist set up properly.

> vcf <- VariantAnnotation::readVcf(fl)
> 
> vcf1 <- vcf[1:nrow(vcf)/2,]
> vcf2 <- vcf[(nrow(vcf)/2+1):nrow(vcf),]
> vcf_list <- list(vcf1, vcf2)
> 
> do.call(rbind, vcf_list)
class: CollapsedVCF 
dim: 6 3 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 6 columns: NS, DP, AF, AA, DB, H2
info(header(vcf)):
      Number Type    Description                
   NS 1      Integer Number of Samples With Data
   DP 1      Integer Total Depth                
   AF A      Float   Allele Frequency           
   AA 1      String  Ancestral Allele           
   DB 0      Flag    dbSNP membership, build 129
   H2 0      Flag    HapMap2 membership         
geno(vcf):
  List of length 4: GT, GQ, DP, HQ
geno(header(vcf)):
      Number Type    Description      
   GT 1      String  Genotype         
   GQ 1      Integer Genotype Quality 
   DP 1      Integer Read Depth       
   HQ 2      Integer Haplotype Quality
> rbind(vcf1, vcf2)
class: CollapsedVCF 
dim: 6 3 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 6 columns: NS, DP, AF, AA, DB, H2
info(header(vcf)):
      Number Type    Description                
   NS 1      Integer Number of Samples With Data
   DP 1      Integer Total Depth                
   AF A      Float   Allele Frequency           
   AA 1      String  Ancestral Allele           
   DB 0      Flag    dbSNP membership, build 129
   H2 0      Flag    HapMap2 membership         
geno(vcf):
  List of length 4: GT, GQ, DP, HQ
geno(header(vcf)):
      Number Type    Description      
   GT 1      String  Genotype         
   GQ 1      Integer Genotype Quality 
   DP 1      Integer Read Depth       
   HQ 2      Integer Haplotype Quality
> 

without the library call you don't have the necessary methods.

bschilder commented 2 years ago

rbind

I think you don't have the searchlist set up properly.

Could you clarify what you mean by "searchlist"? Do you just mean the list (vcf_list)?

without the library call you don't have the necessary methods.

Ah, I see. From a developers standpoint, I try to avoid importing entire libraries whenever possible, instead favoring @importFrom Roxygen syntax. So rather than loading the entire library, I would opt to use:

vcf_concat <- VariantAnnotation::rbind(vcf1, vcf2)
vcf_concat <-do.call( VariantAnnotation::rbind, list(vcf1, vcf2))

I can confirm that both these approaches work without any issues.

I think what was throwing me off is that usually you would get an error saying something like, "there is no method 'rbind' for class 'VCFCollapsed'". But because rbind is also in base R, it attempts to use base::rbind to bind the VCFs. Not sure if there's ways to avoid this and let users know the issue difference automatically.

For reference, the method is defined here: https://github.com/Bioconductor/VariantAnnotation/blob/e966a1b3cc3cede22c4d65fcada24be05206a65a/R/methods-VCF-class.R#L380

merge

Regarding the original question posted by @lawremi, it seems the MRC IEU team has implemented a version of this. VariantAnnotation might be able to integrate this directly into the package. https://github.com/MRCIEU/gwasvcf/blob/00337445ae67133517adf45aa2502800cf376283/R/manipulate.r#L87