Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb
31 stars 13 forks source link

'new2old' not supported in 'seqinfo<-,DNAStringSet-method' #34

Closed susheelbhanu closed 2 years ago

susheelbhanu commented 2 years ago

Hi,

I'm trying to use the package in workflow, with the following code:

len <- 91
dna <- readDNAStringSet("results/genomepy/GRCh38.p13/GRCh38.p13.fa")

names(dna) <- sapply(strsplit(names(dna), " "), .subset, 1)

get_velocity_files(
        X = "results/genomepy/GRCh38.p13/GRCh38.p13.annotation.gtf",
        L = len,
        Genome = dna,
        out_path = "results/busparse/get_velocity_files/GRCh38.p13",
        style = "Ensembl",
        transcript_version = NULL,
        gene_version = NULL
    )

And I get the following error:

Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges
Loading required package: XVector
Loading required package: GenomeInfoDb

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Assuming that all chromosomes are linenar.
Error: 'new2old' not supported in 'seqinfo<-,DNAStringSet-method'
Execution halted

Any help would be appreciated. Thank you, Susheel

LiNk-NY commented 2 years ago

Hi Susheel, @susheelbhanu

Can you please provide a minimally reproducible example? Are you using one of our workflow packages?

Best, Marcel

hpages commented 2 years ago

Here's an MRE:

library(Biostrings)
dna <- DNAStringSet(c(chr1="AAA", chr2="TTTTT", chrM="GG"))
seqlevelsStyle(dna)
# [1] "UCSC"
seqlevelsStyle(dna) <- "Ensembl"
# Error: 'new2old' not supported in 'seqinfo<-,DNAStringSet-method'
LiNk-NY commented 2 years ago

Thanks Hervé for the MRE. While we work on a long term solution, perhaps we can add a seqlevelsStyle setter method to the class?

setReplaceMethod("seqlevelsStyle", "DNAStringSet",
    function(x, value) {
        seqs <- seqinfo(x)
        seqlevelsStyle(seqs) <- value
        names(x) <- names(seqs)
        x
    }
)
hpages commented 2 years ago

I think a better approach is to make the seqinfo() setter a little bit smarter. There is no reason why it wouldn't accept a new2old value that is set to seq_along() of the supplied Seqinfo. I'm working on that.

hpages commented 2 years ago

I ended up doing a substantial refactoring of the seqinfo() getter and setter for DNAStringSet objects. Only the circularity flags and genome info are now stored in the metadata() of the object. The seqnames() and seqlengths() are just inferred from the names() and width().

The new behavior of the setter now allows it to alter everything (i.e. names/seqnames and/or circularity flags and/or genome) except the widths/seqlengths of the sequences. Also now new2old can be set to seq_along(x).

We still have the long-time problem that subsetting the DNAStringSet object breaks its seqinfo() but that's for another day.

See https://github.com/Bioconductor/Biostrings/commit/76279b5b32c146425a73f593fa40b2b790b81565 for the details.