colega / unexpected-go

Unexpected Golang behaviors
MIT License

Might len(string) be a candidate? #4

Open joanlopez opened 4 years ago

joanlopez commented 4 years ago

Even though it's properly documented (see here), IMHO the behaviour of the len built-in function when its argument is a string is a candidate for this repository.

I assume there are good reasons why this behaviour was chosen, but I'd say that reporting the number of bytes for strings is quite confusing.

Example:

len("si") // 2
len("sí") // 3
len("世界") // 6

So, as discussed here, the usual way to get the number of characters (runes) in a given string is len([]rune(string)):

len([]rune("si")) // 2
len([]rune("sí")) // 2
len([]rune("世界")) // 2

Additionally, I'd say it could be interesting to open a new Go proposal to add a wrapper function to the strings package. If Go's spirit is keeping things simple and backwards compatible, I'd keep the len behaviour but add that function, as the rune hack is not simple at all.

PS: I'm not really sure whether the proposed function already exists; I only did a quick lookup 😇
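On the question of whether such a function already exists: the standard library's unicode/utf8 package provides utf8.RuneCountInString, which counts runes without the []rune conversion. The sketch below is only an illustration; charCount is a hypothetical wrapper name, not something in the strings package. Note that it counts Unicode code points, which is not always the same as user-perceived characters (combining sequences count as several runes).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charCount is a hypothetical wrapper (not part of the standard library)
// around utf8.RuneCountInString, which counts the runes in a string.
func charCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"si", "sí", "世界"} {
		// len reports bytes; charCount reports runes.
		fmt.Printf("%q bytes=%d runes=%d\n", s, len(s), charCount(s))
	}
}
```

For the three strings from the example above, the byte counts are 2, 3 and 6, while the rune count is 2 in every case.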

colega commented 4 years ago

Hi, sorry for the late response. I think that len(string) doesn't fit well here.

Although it might be confusing for people coming from other programming languages like Python (where unicode runes are counted by default), it is consistent in Golang, and I think that these paragraphs from the official docs make it clear:

In Go, a string is in effect a read-only slice of bytes. If you're at all uncertain about what a slice of bytes is or how it works, please read the previous blog post; we'll assume here that you have.

It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

Here is a string literal (more about those soon) that uses the \xNN notation to define a string constant holding some peculiar byte values. (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)

So strings are just slices of bytes, and they know nothing about unicode, thus their behaviour is consistent with byte slices.

Also note that len([]rune(...)) is not just a different syntax: it's an operation whose cost increases from O(1) to O(n), so making that explicit in the code is always good.
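A small sketch of the "arbitrary bytes" point, assuming nothing beyond the standard library: a string built from \xNN escapes is a perfectly legal Go string even though it is not valid UTF-8. The helper name inspect is hypothetical and only exists for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// inspect is a hypothetical helper that reports the byte length of s
// and whether its bytes form valid UTF-8.
func inspect(s string) (byteLen int, validUTF8 bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	raw := "\xff\xfe\xfd" // three bytes that are not valid UTF-8
	n, ok := inspect(raw)
	fmt.Println(n, ok) // 3 false

	n, ok = inspect("世界")
	fmt.Println(n, ok) // 6 true
}
```

In both cases len simply reports how many bytes the string holds, regardless of whether they decode as text.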
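To make the cost difference concrete, here is a hand-written rune counter, given only as a sketch of why counting runes is O(n): UTF-8 is a variable-width encoding, so every byte has to be visited to find the rune boundaries, whereas len(s) just reads the stored length. The function name countRunes is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunes is a hypothetical, hand-rolled rune counter. It walks the
// string one encoded rune at a time, so its cost grows with len(s).
func countRunes(s string) int {
	n := 0
	for i := 0; i < len(s); {
		// size is the number of bytes used by the rune starting at s[i];
		// it is 1 for an invalid byte, so the loop always advances.
		_, size := utf8.DecodeRuneInString(s[i:])
		i += size
		n++
	}
	return n
}

func main() {
	s := "世界"
	fmt.Println(len(s))        // constant time: reads the stored byte length
	fmt.Println(countRunes(s)) // linear time: scans every byte
}
```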

joanlopez commented 4 years ago

Sure, fair enough! Thanks for your time 🙏 Happy to keep learning every day 😇

colega commented 3 years ago

I'm reopening this as I feel we can have a page for strings behaviour.

While len(unicodeString) is not unexpected enough, IMO, the whole unicode-bytes duality of strings can definitely be documented. In particular, as @joanlopez pointed out in a private conversation at some point, a for range loop over a string iterates over runes but yields byte indexes: https://play.golang.org/p/lEYcSV4Btgh
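A minimal sketch of that behaviour (not a copy of the playground link above, whose contents are not reproduced here): ranging over a string decodes one rune per iteration, but the index is the byte offset where that rune starts, so the indexes skip ahead after multi-byte runes. The helper name runeOffsets is hypothetical.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper that collects the index yielded
// by each iteration of a for range loop over s.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "sí!"
	for i, r := range s {
		// i is a byte offset into s, r is the decoded rune.
		fmt.Printf("index=%d rune=%c\n", i, r)
	}
	fmt.Println(runeOffsets(s)) // [0 1 3]: 'í' occupies bytes 1 and 2
}
```

Using such an index as if it were a character position (for example, comparing it against a rune count) is an easy way to introduce an off-by-some bug.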

joanlopez commented 3 years ago

Additional context here.

colega commented 4 months ago

Just as a heads-up, I just hit a bug because of this.