JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Feature request: Get encoding of `AbstractString` #52704

Open jakobnissen opened 11 months ago

jakobnissen commented 11 months ago

I do a lot of high-performance programming with strings. In that setting, it's often more efficient to work on the underlying bytes. Luckily, Julia enables that with codeunits.
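As a small illustration of what byte-level work looks like, here is a minimal sketch. The helper name count_byte is hypothetical, and the code assumes the string's code units are UInt8, which holds for String:

```julia
# Hypothetical helper: count occurrences of a single byte in a string.
# Assumes the code units are UInt8 (true for String and SubString{String}).
function count_byte(s::AbstractString, byte::UInt8)
    n = 0
    for b in codeunits(s)    # iterates raw code units, no character decoding
        n += (b == byte)
    end
    return n
end

count_byte("a,b,c", UInt8(','))   # 2
```

Whether this is actually faster than iterating characters depends on the string type and the workload, so it is worth benchmarking before relying on it.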

However, there is no generic way of knowing, given some AbstractString, what encoding it uses, i.e., what the result of codeunits means. That makes it difficult (impossible?) to write code using codeunits for AbstractString. Most implementations of AbstractString use UTF-8, such as String, SubString{String}, StringView (of StringViews.jl), the various types in InlineStrings.jl, and more. But this is not generally true.
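To make the ambiguity concrete, the same text yields different code units under different encodings. The sketch below uses transcode to produce UTF-16 code units as a stand-in for what a hypothetical UTF-16 string type might return from codeunits:

```julia
s = "\u00e9"                      # the single character 'é' (U+00E9)

utf8  = collect(codeunits(s))     # String stores UTF-8: two UInt8 code units
utf16 = transcode(UInt16, s)      # UTF-16: one UInt16 code unit

utf8 == [0xc3, 0xa9]              # true
utf16 == [0x00e9]                 # true
```

Code that scans for a particular byte value would therefore give different answers depending on which of these it was handed.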

I propose to include a trait function encoding(::Type{<:AbstractString})::Symbol. In Base, the default implementations should be:

encoding(::Type{<:AbstractString}) = :unknown
encoding(::Type{String}) = :utf8
encoding(::Type{<:SubString{T}}) where T = encoding(T)
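
For illustration, here is how a consumer might use such a trait. This is only a sketch: encoding does not exist in Base, so it is defined locally, and all_ascii is a hypothetical function name.

```julia
# The proposed trait, defined locally because it is NOT part of Base.
encoding(::Type{<:AbstractString}) = :unknown
encoding(::Type{String}) = :utf8
encoding(::Type{<:SubString{T}}) where {T} = encoding(T)

# Hypothetical consumer: take a byte-level fast path only when the
# encoding is known to be UTF-8, and fall back to characters otherwise.
function all_ascii(s::AbstractString)
    if encoding(typeof(s)) === :utf8
        # In UTF-8, a string is all-ASCII iff every byte is below 0x80.
        return all(<(0x80), codeunits(s))
    else
        return all(isascii, s)    # generic path, decodes each character
    end
end

all_ascii("abc")                      # true
all_ascii(SubString("a\u00e9", 1, 1)) # true, the substring is just "a"
```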

nsajko commented 11 months ago

  1. A more descriptive function name than encoding would be better IMO, maybe string_encoding?

  2. The return type should be a singleton type, not a Symbol. So, e.g., struct StringEncodingUTF8 end.

  3. Perhaps we could also have another trait that would say whether the encoding is known-valid (if no, it may need further validation/parsing).
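
A rough sketch of what points 1 to 3 could look like together. Every name here (StringEncoding, StringEncodingUnknown, string_encoding, is_known_valid, uses_utf8) is hypothetical apart from StringEncodingUTF8, which is the example name from point 2; none of them exist in Base.

```julia
# Hypothetical singleton types for the trait's return value.
abstract type StringEncoding end
struct StringEncodingUnknown <: StringEncoding end
struct StringEncodingUTF8 <: StringEncoding end

# Hypothetical trait function returning a singleton instead of a Symbol.
string_encoding(::Type{<:AbstractString}) = StringEncodingUnknown()
string_encoding(::Type{String}) = StringEncodingUTF8()
string_encoding(::Type{<:SubString{T}}) where {T} = string_encoding(T)

# Hypothetical second trait: is the content guaranteed to be valid?
# String can hold invalid UTF-8, so the conservative default is false.
is_known_valid(::Type{<:AbstractString}) = false

# With singleton types, consumers can dispatch instead of branching.
uses_utf8(s::AbstractString) = uses_utf8(string_encoding(typeof(s)))
uses_utf8(::StringEncodingUTF8) = true
uses_utf8(::StringEncoding) = false

uses_utf8("abc")    # true
```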

jariji commented 11 months ago

Bikeshedding a bit - If String can be "UTF-8" without being valid UTF-8, then encoding is a little strong as a name. It seems like the idea is "ostensible/purported/assumed encoding", not actual encoding.
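
A quick demonstration of that gap, using only Base functions: a String can hold bytes that are not valid UTF-8.

```julia
s = "\xff"            # a String containing the single byte 0xff

isvalid(s)            # false: 0xff can never appear in valid UTF-8
ncodeunits(s)         # 1
isvalid("abc")        # true
```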

camilogarciabotero commented 11 months ago

I found some issues probably related:

And a discourse discussion from one of them:

https://discourse.julialang.org/t/what-is-the-interface-of-abstractstring/8937