Closed stephenturner closed 1 year ago
Just about to start a small repo for the motifbreakR variant scoring pipeline, so I'll add the short-term workaround there for now. Repo here: https://github.com/colossal-compsci/motifbreakR-pipeline
FWIW I ended up skipping snps.from.file()
entirely. In the end, all motifbreakR needs at minimum is a GRanges object with metadata SNP_id
, REF
, and ALT
and attributes(vcf)$genome.package
to be set to the BSGenome string. So, I used the Bioc VariantAnnotation
library to load the VCF and made sure the metadata are compatible. Added bonus of filtering variants in R rather than generating files outside R with bcftools.
Also, in case it is helpful, seqlevelsStyle(vcf) <- "NCBI"
and seqlevelsStyle(vcf) <- "UCSC"
can be used to swap between no "chr" and "chr" prefixes of GRange objects easily and IIRC it doesn't prepend "chr" to inappropriate contig names like how motifbreakR coded it.
Late to this thread, but is this what you're doing?
library(GenomicFeatures)
library(VariantAnnotation)
v <- readVcf("/path/to/my.vcf.gz") |> rowRanges()
mcols(v)$SNP_id <- rownames(mcols(v))
attributes(v)$genome.package <- "BSgenome.Whatever"
Yes, as well as
mcols(v)$ALT = unlist(mcols(v)$ALT)
because motifbreakR takes issue with the mcol type for ALT allowing multiallelic options (DNAStringList
) which seems might be default from readVcf for ALT field
There is a bug in motifbreakR (Simon-Coetzee/motifBreakR#52) which adds a "chr" prefix in cases where it probably shouldn't, making the VCF being processed incompatible with the BSGenome because seqnames differ.
A temporary workaround for this is to add "chr" prefixes to the bsgenome object, then removing the "chr" prefixes after reading in the VCF.
Long-term fixes:
snps.from.file()
function inside ofread_vcf()
and remove this offending line of code (preferable, but will take time).Cc @tabbzi @greggedman