colossal-compsci / tfboot

R package for bootstrapping motifbreakR results
https://colossal-compsci.github.io/tfboot/
4 stars 3 forks source link

motifbreakR bug: "chr" prefix added #10

Closed stephenturner closed 1 year ago

stephenturner commented 1 year ago

There is a bug in motifbreakR (Simon-Coetzee/motifBreakR#52) which adds a "chr" prefix in cases where it probably shouldn't, making the VCF being processed incompatible with the BSGenome because seqnames differ.

A temporary workaround for this is to add "chr" prefixes to the bsgenome object, then removing the "chr" prefixes after reading in the VCF.

bsg <- BSgenome.Myfav.Ensembl.mFavGenome42

seqlevels(bsg)
if (!any(grepl("chr", seqlevels(bsg)))) {
  seqlevels(bsg) <- paste0("chr", seqlevels(bsg))
}
bsg

snps <- read_vcf(file="my.vcf.gz", bsgenome=bsg)

if (any(grepl("chr", seqlevels(snps)))) {
  seqlevels(snps) <- gsub("^chr", "", seqlevels(snps))
}

Long-term fixes:

Cc @tabbzi @greggedman

tabbzi commented 1 year ago

Just about to start a small repo for the motifbreakR variant scoring pipeline, so I'll add the short-term workaround there for now. Repo here: https://github.com/colossal-compsci/motifbreakR-pipeline

tabbzi commented 1 year ago

FWIW I ended up skipping snps.from.file() entirely. In the end, all motifbreakR needs at minimum is a GRanges object with metadata SNP_id, REF, and ALT and attributes(vcf)$genome.package to be set to the BSGenome string. So, I used the Bioc VariantAnnotation library to load the VCF and made sure the metadata are compatible. Added bonus of filtering variants in R rather than generating files outside R with bcftools.

Also, in case it is helpful, seqlevelsStyle(vcf) <- "NCBI" and seqlevelsStyle(vcf) <- "UCSC" can be used to swap between no "chr" and "chr" prefixes of GRange objects easily and IIRC it doesn't prepend "chr" to inappropriate contig names like how motifbreakR coded it.

stephenturner commented 1 year ago

Late to this thread, but is this what you're doing?

library(GenomicFeatures)
library(VariantAnnotation)
v <- readVcf("/path/to/my.vcf.gz") |> rowRanges()
mcols(v)$SNP_id <- rownames(mcols(v))
attributes(v)$genome.package <- "BSgenome.Whatever"
tabbzi commented 1 year ago

Yes, as well as mcols(v)$ALT = unlist(mcols(v)$ALT) because motifbreakR takes issue with the mcol type for ALT allowing multiallelic options (DNAStringList) which seems might be default from readVcf for ALT field