gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

Which functions should preserve objects' attributes? #59

Open gagolews opened 10 years ago

gagolews commented 10 years ago

Currently, the only function that preserves a selected subset of the input object's attributes is stri_sort (see #63).

Which other functions should preserve the attributes? Which attributes should be preserved (names, ...)? If there is more than one parameter, what should the attribute selection strategy be?

gagolews commented 10 years ago

dim, names and dimnames? see mostattributes in ?attributes
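
For reference, a minimal base-R sketch of what the `mostattributes<-` replacement function does. The vectors and the `note` attribute are made up for illustration, and `as.vector()` only stands in for a function that returns a bare character vector.

```r
# Made-up input for illustration only.
x <- structure(c(a = "Foo", b = "Bar"), note = "custom")

# as.vector() strips every attribute from an atomic vector.
y <- as.vector(toupper(x))
mostattributes(y) <- attributes(x)
names(y)         # "a" "b": lengths match, so names are assigned
attr(y, "note")  # "custom"

# When the result has a different length, names are skipped,
# but the other attributes are still copied.
z <- "FOO"
mostattributes(z) <- attributes(x)
names(z)         # NULL
attr(z, "note")  # "custom"
```

So dim, names and dimnames are assigned only when they fit the target object, which is one possible answer to the "which attributes" question.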

hadley commented 9 years ago

It feels like stri_replace_* should definitely keep all attributes, since it's reasonable to think of it as modifying the contents of an existing vector.
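
Until that happens, the "modify the contents in place" reading can be had on the caller's side with an empty-index assignment. This is only a sketch: the input vector and its `source` attribute are made up, and the idiom works the same whether or not the stringi function keeps attributes itself.

```r
library(stringi)

# Made-up input: a named vector carrying an extra attribute.
x <- structure(c(a = "foo bar", b = "bar baz"), source = "example")

# Assigning into x[] overwrites the elements but leaves the names and
# other attributes of x untouched, whatever the right-hand side returns.
x[] <- stri_replace_all_fixed(x, "bar", "qux")

names(x)           # "a" "b"
attr(x, "source")  # "example"
```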

stri_sort() is harder. In base R:

a <- matrix(1:6, nrow = 3)
sort(a)
#> [1] 1 2 3 4 5 6
b <- c(x = 2, y = 1)
sort(b)
#> y x 
#> 1 2

And ?sort has:

All attributes are removed from the return value (see Becker et al, 1988, p.146) except names, which are sorted. (If partial is specified even the names are removed.) Note that this means that the returned value has no class, except for factors and ordered factors (which are treated specially and whose result is transformed back to the original class).

hadley commented 9 years ago

Maybe the only attribute that should be systematically preserved is names? The trickiest case is *_all(simplify = TRUE) where the names would become row names.
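
To make the simplify = TRUE case concrete, here is a base-R sketch of names turning into row names. `to_matrix_keep_names` is a hypothetical helper, not a stringi function, and `strsplit()` merely stands in for a `*_all` function that returns one character vector per input string.

```r
# Hypothetical helper: pad a list of character vectors into a matrix,
# one row per input string, using the supplied names as row names.
to_matrix_keep_names <- function(pieces, nms = names(pieces)) {
  n <- max(lengths(pieces), 0L)
  out <- matrix(NA_character_, nrow = length(pieces), ncol = n,
                dimnames = list(nms, NULL))
  for (i in seq_along(pieces)) {
    out[i, seq_along(pieces[[i]])] <- pieces[[i]]
  }
  out
}

x <- c(first = "a,b,c", second = "d,e")
pieces <- strsplit(x, ",", fixed = TRUE)  # list named like x
m <- to_matrix_keep_names(pieces, names(x))
rownames(m)  # "first" "second"
dim(m)       # 2 3; the missing third piece of "second" is NA
```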

t-kalinowski commented 8 years ago

Should stri_trim (and friends) preserve matrices (like base::trimws does)?

I recently had this type of use case:

library(readr)
library(stringr)
read_lines(readr_example("mtcars.csv")) %>%
  stringr::str_split_fixed(",", n = 11) %>%  # returns a matrix
  # stringr::str_trim()  # returns a vector, not wanted
  base::trimws()  # returns a matrix
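
A possible stopgap for this use case, sketched under the assumption that the trimming function may return a plain vector: `trim_keep_dim` is a hypothetical wrapper (not part of stringi or stringr) that puts dim and dimnames back after trimming, and the matrix is made up.

```r
library(stringi)

# Hypothetical wrapper: trim whitespace, then restore the matrix shape.
trim_keep_dim <- function(m) {
  out <- stri_trim_both(m)
  dim(out) <- dim(m)            # a no-op for plain vectors (dim is NULL)
  dimnames(out) <- dimnames(m)
  out
}

m <- matrix(c(" a", "b ", " c ", "d"), nrow = 2,
            dimnames = list(NULL, c("x", "y")))
trimmed <- trim_keep_dim(m)
dim(trimmed)       # 2 2
colnames(trimmed)  # "x" "y"
```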

Tazinho commented 7 years ago

Here is another use case where it makes sense to preserve names: https://github.com/Tazinho/snakecase/issues/93

I decided to preserve them within the snakecase package now, but noticed that this is not consistent with stringr::str_to_lower etc. and the underlying stringi functions. So I'd like to suggest this change at least for stringi::stri_trans_tolower(), stringi::stri_trans_totitle(), stringi::stri_trans_toupper().
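
For the case-mapping functions, the requested behaviour can be approximated on the caller's side. The sketch below uses `with_names`, a hypothetical helper that is not part of stringi or stringr, and assumes the wrapped function returns one element per input element.

```r
library(stringi)

# Hypothetical helper: apply a length-preserving function f and
# put the names of x back on the result.
with_names <- function(x, f, ...) {
  out <- f(x, ...)
  stopifnot(length(out) == length(x))
  stats::setNames(out, names(x))
}

x <- c(first = "Hello World", second = "FOO bar")
lower <- with_names(x, stri_trans_tolower)
upper <- with_names(x, stri_trans_toupper)
names(lower)  # "first" "second"
```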

econandrew commented 6 years ago

Came here from the issue referenced above, and agree with @hadley's suggestion to at least preserve names. I hit this when using a named vector as input to labels in a ggplot2 scale, which will match by name if the vector is named. I find this a pretty useful feature in general.

However, when I decided to stringr::str_wrap the labels, it drops the names and fails silently 😭 as scale_* falls back to vector order, matching the wrong labels to the data.

hadley commented 6 years ago

I think the only attributes you need to preserve are names. You could choose to preserve dims, but they are rarely used (and would require more thought). You don't know how any other attributes relate to the data, so it's best to leave them to the class author to handle with S3 dispatch (i.e. fixing this for general objects would require stringi functions to become S3 generics, which (IMO) is outside the scope of this issue).

stri_sort() and stri_subset() could use [, but don't. I think that's a reasonable design choice (favouring performance over S3 dispatch), so you could just document that if you want to preserve class/attributes, you should combine stri_order() and stri_detect() with [ yourself.
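
A short sketch of that combination, using a made-up named vector. Because the subsetting goes through `[`, names are kept here, and for classed objects the class's own `[` method (if it has one) would be the one dispatched.

```r
library(stringi)

# Made-up input for illustration only.
x <- c(b = "pear", a = "apple", c = "fig")

# Sort via stri_order() + `[` instead of stri_sort().
sorted <- x[stri_order(x)]
names(sorted)  # "a" "c" "b"

# Subset via stri_detect_*() + `[` instead of stri_subset_*().
has_p <- x[stri_detect_fixed(x, "p")]
names(has_p)   # "b" "a"
```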