Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings

Misleading show() method for XStringSet objects #25

Open hpages opened 5 years ago

hpages commented 5 years ago

This is a follow-up to https://support.bioconductor.org/p/122340/#122400

The show() method for XStringSet objects currently suggests the existence of a seq() getter for these objects:

library(Biostrings)
library(drosophila2probe)
dna <- DNAStringSet(drosophila2probe)
dna
#   A DNAStringSet instance of length 265400
#          width seq
#      [1]    25 CCTGAATCCTGGCAATGTCATCATC
#      [2]    25 ATCCTGGCAATGTCATCATCAATGG
#      [3]    25 ATCAGTTGTCAACGGCTAATACGCG
#      [4]    25 ATCAATGGCGATTGCCGCGTCTGCA
#      [5]    25 CCGCGTCTGCAATGTGAGGGCCTAA
#      ...   ... ...
# [265396]    25 TACTACTTGAGCCACAACCATCTGA
# [265397]    25 AGGGACTAAAGAGGCCCCATGCTCT
# [265398]    25 CATGCTCTGTCTGGTGTCAGCGCTA
# [265399]    25 GTCAGCGCTACATGGTCCAGGACAA
# [265400]    25 CCAGGACAAGTATGGACTTCCCCAC

but there is no such getter.

Same issue with the show() method for XString objects:

dna[[1]]
#  25-letter "DNAString" instance
# seq: CCTGAATCCTGGCAATGTCATCATC
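
For readers wondering how to get at the sequences given that there is no seq() getter, here is a minimal sketch. It reuses the first two probes shown above as toy data (so it does not need drosophila2probe) and relies only on width() and as.character(), both of which work on XStringSet and XString objects:

```r
library(Biostrings)

# Toy stand-in for the probe set above: the first two sequences from the output.
dna <- DNAStringSet(c("CCTGAATCCTGGCAATGTCATCATC",
                      "ATCCTGGCAATGTCATCATCAATGG"))

# The "width" column header does correspond to a real accessor.
w <- width(dna)

# The "seq" column header does not. Coerce instead to get a character vector.
seqs <- as.character(dna)

# Same idea for a single XString element.
first <- as.character(dna[[1]])
```

Here `w`, `seqs`, and `first` are just illustrative variable names.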

Also, it would be good to make these show() methods more consistent with other show() methods in S4Vectors/IRanges/GenomicRanges:

library(IRanges)
IRanges(1:3, 10, names=LETTERS[1:3], score=runif(3))
# IRanges object with 3 ranges and 1 metadata column:
#         start       end     width |             score
#     <integer> <integer> <integer> |         <numeric>
#   A         1        10        10 | 0.267148569226265
#   B         2        10         9 | 0.106218574102968
#   C         3        10         8 | 0.649568639695644

In particular, the names on a DNAStringSet object should be displayed on the left. Its metadata columns should also be displayed (right now they are not):

dna2 <- dna[1:3]
names(dna2) <- LETTERS[1:3]
mcols(dna2)$score <- runif(3)
dna2
#   A DNAStringSet instance of length 3
#     width seq                                               names               
# [1]    25 CCTGAATCCTGGCAATGTCATCATC                         A
# [2]    25 ATCCTGGCAATGTCATCATCAATGG                         B
# [3]    25 ATCAGTTGTCAACGGCTAATACGCG                         C
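
Until show() does this, one possible workaround is to assemble an ordinary data.frame with the names as row names and the metadata columns appended. The helper name `as_display_table` is hypothetical and not part of Biostrings. This is only a sketch of the desired layout, not a proposed implementation of the show() method:

```r
library(Biostrings)

# Rebuild the three-element example above from literal sequences.
dna2 <- DNAStringSet(c(A = "CCTGAATCCTGGCAATGTCATCATC",
                       B = "ATCCTGGCAATGTCATCATCAATGG",
                       C = "ATCAGTTGTCAACGGCTAATACGCG"))
mcols(dna2)$score <- runif(3)

# Hypothetical helper: names on the left (as row names), then width, seq,
# and any metadata columns on the right.
as_display_table <- function(x) {
  tbl <- data.frame(width = width(x),
                    seq = as.character(x),
                    row.names = names(x),
                    stringsAsFactors = FALSE)
  mc <- mcols(x)
  if (!is.null(mc)) {
    tbl <- cbind(tbl, as.data.frame(mc))
  }
  tbl
}

tbl <- as_display_table(dna2)
tbl
```

Coercing to character is fine for a handful of short sequences but would be wasteful on a large set, so this is for inspection only.
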

mtmorgan commented 5 years ago

Somewhat related is the initial value returned by mcols():

> mcols(DNAStringSet())
NULL
> mcols(GRanges())
DataFrame with 0 rows and 0 columns

hpages commented 5 years ago

This has little to do with the show() method. It comes from the fact that mcols() is allowed to be NULL for some Vector derivatives (Hits, Rle, IRanges, DNAStringSet, etc.), whereas for others (GRanges, GRangesList, SummarizedExperiment, etc.) mcols() is forced to be a DataFrame. This is an inconsistency that we should discuss in a separate issue if we think it should be addressed.
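
In the meantime, code that needs to handle both kinds of objects can normalize the NULL case itself. The sketch below uses a hypothetical helper, `mcols_or_empty`, which is not part of S4Vectors or Biostrings. It assumes an S4Vectors version recent enough to provide make_zero_col_DFrame():

```r
library(Biostrings)

# Hypothetical helper: always return a DataFrame with one row per element,
# substituting a zero-column DataFrame when mcols() is NULL.
mcols_or_empty <- function(x) {
  mc <- mcols(x)
  if (is.null(mc)) {
    # Assumes make_zero_col_DFrame() is available in the installed S4Vectors.
    mc <- S4Vectors::make_zero_col_DFrame(length(x))
  }
  mc
}

mc <- mcols_or_empty(DNAStringSet(c("ACGT", "GGCC")))
dim(mc)
```

The helper works the same way whether mcols() returns NULL or already returns a DataFrame.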

FelixErnst commented 5 years ago

There are also some other inconsistencies in how element names are shown. Long names seem to be treated differently, probably for historical reasons related to where the names are positioned (left vs. right).

library(Biostrings)
library(GenomicRanges)
seq <- RNAStringSet(c("UAUCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAUUAAGCCAUGCAUGUCUAAGUAUAAGCAAUUUAUACAGUGAAACUGCGAAUGGCUCA",
                      "CCGAGAGGUCUUGGUAAUCUUGUGAAACUCCGUCGUGCUGGGGAUAGAGCAUUGUAAUUAUUGCUCUUCAACGAGGAAUUCCUAGUAAGCGCAAGUCAUCA"))
names(seq) <- c("TheFirstVeryLongNameAndItIsGettingEvenLongerByTheLetter",
                "TheSecondVeryLongNameAndItIsGettingEvenLongerByTheLetter")
gr <- GRanges(c("chr1:5-10:+","chr1:6-10:+"))
names(gr) <- names(seq)
seq
#>   A RNAStringSet instance of length 2
#>     width seq                                          names               
#> [1]   100 UAUCUGGUUGAUCCUGCCAGU...GUGAAACUGCGAAUGGCUCA TheFirstVeryLongN...
#> [2]   101 CCGAGAGGUCUUGGUAAUCUU...CUAGUAAGCGCAAGUCAUCA TheSecondVeryLong...
gr
#> GRanges object with 2 ranges and 0 metadata columns:
#>                                                            seqnames
#>                                                               <Rle>
#>    TheFirstVeryLongNameAndItIsGettingEvenLongerByTheLetter     chr1
#>   TheSecondVeryLongNameAndItIsGettingEvenLongerByTheLetter     chr1
#>                                                               ranges
#>                                                            <IRanges>
#>    TheFirstVeryLongNameAndItIsGettingEvenLongerByTheLetter      5-10
#>   TheSecondVeryLongNameAndItIsGettingEvenLongerByTheLetter      6-10
#>                                                            strand
#>                                                             <Rle>
#>    TheFirstVeryLongNameAndItIsGettingEvenLongerByTheLetter      +
#>   TheSecondVeryLongNameAndItIsGettingEvenLongerByTheLetter      +
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

hpages commented 5 years ago

Right, long names are truncated. But maybe that's a good thing and we should keep that when we move them to the left. I don't know.
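
For reference, the truncation seen in the RNAStringSet output above can be mimicked in base R. This is only a sketch that reproduces the displayed result (a 20-character field ending in "..."). The function name `truncate_name` and the `width` argument are hypothetical, and this is not the code Biostrings uses internally:

```r
# Hypothetical helper: shorten names longer than `width` characters and mark
# the cut with a trailing "...", leaving shorter names untouched.
truncate_name <- function(x, width = 20L) {
  too_long <- nchar(x) > width
  x[too_long] <- paste0(substr(x[too_long], 1L, width - 3L), "...")
  x
}

nms <- c("TheFirstVeryLongNameAndItIsGettingEvenLongerByTheLetter",
         "TheSecondVeryLongNameAndItIsGettingEvenLongerByTheLetter",
         "A")
short <- truncate_name(nms)
short
# [1] "TheFirstVeryLongN..." "TheSecondVeryLong..." "A"
```

If the names move to the left, the same cap would keep the sequence column from being pushed off the right edge of the console.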

Yeah, these things predate GRanges. The show() methods for XStringSet, XStringViews, and XString objects are actually my first show() methods ever. I implemented them more than 13 years ago when I took over the refactoring and maintenance of Biostrings. At that time we didn't have any of the IRanges, GenomicRanges, or S4Vectors packages yet.