Open lawremi opened 6 years ago
Sounds like a good idea and you have some specific criteria in mind. Please feel free to implement or submit a pull request.
Basically put this here to record it. Not to imply that anyone in particular should work on it.
I was also searching for a VCF merge functionality in R. This might be helpful for windows users who have a more difficult access to the Linux bash bcftools merge
command.
I would also find this functionality helpful. In my case, I'm simply trying to concatenate a list of VCFs that are actually just one VCF that's been chunked.
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- VariantAnnotation::readVcf(fl)
vcf1 <- vcf[1:nrow(vcf)/2,]
vcf2 <- vcf[(nrow(vcf)/2+1):nrow(vcf),]
vcf_list <- list(vcf1, vcf2)
# Something like this?
## vcf_reconstructed <- do.call("rbind",vcf_list)
Thanks!, Brian
From the docs I can see:
In the following code x is a VCF object, and ... is a list of VCF objects.
rbind combines objects with different ranges (rowRanges) and the same subjects (columns in assays). Columns with duplicate names in colData must contain the same data. The ‘Samples’ columns in colData (created by readVcf) are renamed with a numeric extension ordered as they were input to rbind e.g., “Samples.1, Samples.2, ...” etc.
But not quite sure how this is supposed to work in practice.
For example
rbind(vcf1,vcf2)
Yields:
Error in validObject(ans) : invalid class “CollapsedVCF” object:
'elementMetadata' slot must contain a zero-column DataFrame at all time
Running VariantAnnotation::expand()
on each subset doesn't seem to help either (same error).
After
library(VariantAnnotation)
example(VCF)
vcf
we have
> vcf
class: CollapsedVCF
dim: 4 0
rowRanges(vcf):
GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 0 columns:
geno(vcf):
List of length 0:
> rbind(vcf,vcf)
class: CollapsedVCF
dim: 8 0
rowRanges(vcf):
GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 0 columns:
geno(vcf):
List of length 0:
which seems correct. I can understand that there may be exceptions in certain circumstances. Please give a reproducible example, with sessionInfo:
> sessionInfo()
R version 4.2.0 RC (2022-04-15 r82206)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] VariantAnnotation_1.42.0 Rsamtools_2.12.0
[3] Biostrings_2.64.0 XVector_0.36.0
[5] SummarizedExperiment_1.26.1 Biobase_2.56.0
[7] GenomicRanges_1.48.0 GenomeInfoDb_1.32.1
[9] IRanges_2.30.0 S4Vectors_0.34.0
[11] MatrixGenerics_1.8.0 matrixStats_0.62.0
[13] BiocGenerics_0.42.0 rmarkdown_2.14
loaded via a namespace (and not attached):
[1] httr_1.4.3 bit64_4.0.5 assertthat_0.2.1
[4] BiocFileCache_2.4.0 blob_1.2.3 BSgenome_1.64.0
[7] GenomeInfoDbData_1.2.8 yaml_2.3.5 progress_1.2.2
[10] pillar_1.7.0 RSQLite_2.2.13 lattice_0.20-45
[13] glue_1.6.2 digest_0.6.29 htmltools_0.5.2
[16] Matrix_1.4-1 XML_3.99-0.9 pkgconfig_2.0.3
[19] REDCapR_1.0.0 biomaRt_2.52.0 zlibbioc_1.42.0
[22] purrr_0.3.4 BiocParallel_1.30.0 tibble_3.1.7
[25] KEGGREST_1.36.0 generics_0.1.2 ellipsis_0.3.2
[28] cachem_1.0.6 GenomicFeatures_1.48.0 cli_3.3.0
[31] magrittr_2.0.3 crayon_1.5.1 memoise_2.0.1
[34] evaluate_0.15 fansi_1.0.3 xml2_1.3.3
[37] tools_4.2.0 prettyunits_1.1.1 hms_1.1.1
[40] BiocIO_1.6.0 lifecycle_1.0.1 stringr_1.4.0
[43] DelayedArray_0.22.0 AnnotationDbi_1.58.0 compiler_4.2.0
[46] rlang_1.0.2 grid_4.2.0 RCurl_1.98-1.6
[49] rjson_0.2.21 rappdirs_0.3.3 startup_0.17.0
[52] bitops_1.0-7 restfulr_0.0.13 DBI_1.1.2
[55] curl_4.3.2 R6_2.5.1 GenomicAlignments_1.32.0
[58] knitr_1.39 dplyr_1.0.9 rtracklayer_1.56.0
[61] fastmap_1.1.0 bit_4.0.4 utf8_1.2.2
[64] filelock_1.0.2 stringi_1.7.6 parallel_4.2.0
[67] Rcpp_1.0.8.3 vctrs_0.4.1 png_0.1-7
[70] dbplyr_2.1.1 tidyselect_1.1.2 xfun_0.30
Thanks , @vjcitn I think you may have missed my first post with the reprex: https://github.com/Bioconductor/VariantAnnotation/issues/3#issuecomment-1119501289
Just added the session info to it too.
I think you don't have the searchlist set up properly.
> vcf <- VariantAnnotation::readVcf(fl)
>
> vcf1 <- vcf[1:nrow(vcf)/2,]
> vcf2 <- vcf[(nrow(vcf)/2+1):nrow(vcf),]
> vcf_list <- list(vcf1, vcf2)
>
> do.call(rbind, vcf_list)
class: CollapsedVCF
dim: 6 3
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 6 columns: NS, DP, AF, AA, DB, H2
info(header(vcf)):
Number Type Description
NS 1 Integer Number of Samples With Data
DP 1 Integer Total Depth
AF A Float Allele Frequency
AA 1 String Ancestral Allele
DB 0 Flag dbSNP membership, build 129
H2 0 Flag HapMap2 membership
geno(vcf):
List of length 4: GT, GQ, DP, HQ
geno(header(vcf)):
Number Type Description
GT 1 String Genotype
GQ 1 Integer Genotype Quality
DP 1 Integer Read Depth
HQ 2 Integer Haplotype Quality
> rbind(vcf1, vcf2)
class: CollapsedVCF
dim: 6 3
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 6 columns: NS, DP, AF, AA, DB, H2
info(header(vcf)):
Number Type Description
NS 1 Integer Number of Samples With Data
DP 1 Integer Total Depth
AF A Float Allele Frequency
AA 1 String Ancestral Allele
DB 0 Flag dbSNP membership, build 129
H2 0 Flag HapMap2 membership
geno(vcf):
List of length 4: GT, GQ, DP, HQ
geno(header(vcf)):
Number Type Description
GT 1 String Genotype
GQ 1 Integer Genotype Quality
DP 1 Integer Read Depth
HQ 2 Integer Haplotype Quality
>
without the library call you don't have the necessary methods.
I think you don't have the searchlist set up properly.
Could you clarify what you mean by "searchlist"? Do you just mean the list (vcf_list
)?
without the library call you don't have the necessary methods.
Ah, I see.
From a developers standpoint, I try to avoid importing entire libraries whenever possible, instead favoring @importFrom
Roxygen syntax.
So rather than loading the entire library, I would opt to use:
vcf_concat <- VariantAnnotation::rbind(vcf1, vcf2)
vcf_concat <-do.call( VariantAnnotation::rbind, list(vcf1, vcf2))
I can confirm that both these approaches work without any issues.
I think what was throwing me off is that usually you would get an error saying something like, "there is no method 'rbind' for class 'VCFCollapsed'". But because rbind
is also in base R, it attempts to use base::rbind
to bind the VCFs. Not sure if there's ways to avoid this and let users know the issue difference automatically.
For reference, the method is defined here: https://github.com/Bioconductor/VariantAnnotation/blob/e966a1b3cc3cede22c4d65fcada24be05206a65a/R/methods-VCF-class.R#L380
Regarding the original question posted by @lawremi, it seems the MRC IEU team has implemented a version of this. VariantAnnotation
might be able to integrate this directly into the package.
https://github.com/MRCIEU/gwasvcf/blob/00337445ae67133517adf45aa2502800cf376283/R/manipulate.r#L87
It would be nice to have an efficient utility to merge two VCF objects that have differing (but often overlapping) sets of variants and samples. I guess the assay matrices would be filled with
NA
s. Maybe assume that therowData()
andcolData()
have the same columns, for now. Also assume that only the ranges and alt need to match (like VRanges matching). Currently we are using bcftools for this, but it would be nice to move to a more integrated workflow.I guess a simple implementation would be to coerce the two (or more) VCF objects to VRanges objects,
c()
them together and then coerce back to VCF, preserving theinfo()
andgeno()
distinction.