erlang / eep

Erlang Enhancement Proposals
http://www.erlang.org/erlang-enhancement-proposals/

Create eep-0063.md: Lightweight UTF-8 binary string literals and patterns #46

Status: Closed (TD5 closed 11 months ago)

josevalim commented 1 year ago

Btw, have you considered using u (for utf-8) instead of b? Thoughts?

okeuday commented 1 year ago

@TD5 It is possible to create a bytestring type as:

-type nonempty_bytestring() :: nonempty_list(byte()).
-type bytestring() :: list(byte()).

Adding a bytestring type into Erlang/OTP would be helpful, as part of this. If the compiler knew it was UTF-8, it could have a special type separate from a bytestring, but similar (like utf8string).

TD5 commented 1 year ago

> Btw, have you considered using u (for utf-8) instead of b? Thoughts?

I hadn't considered it, but it sounds reasonable. I don't myself have a preference. Either could be fine in my view. Why might u be better? Because it implies the specific encoding? Because it aligns with the syntax of other languages with similar features?

josevalim commented 1 year ago

No particular reason in isolation but I think it matters when it comes to concepts like interpolation, because you need a stronger indicator to know if you are interpolating a list of bytes or a list of characters and I believe the u"..." sigil makes the latter clear.
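To illustrate the distinction between a list of bytes and a list of characters, here is a small sketch in Python (not Erlang, and purely illustrative; the variable names are made up for this example). The same text yields different lists depending on whether it is viewed as code points or as UTF-8 bytes:

```python
# Illustrative only: the same text viewed as characters and as UTF-8 bytes.
text = "é"  # U+00E9, a single character

code_points = [ord(ch) for ch in text]   # list of characters (code points)
utf8_bytes = list(text.encode("utf-8"))  # list of bytes in the UTF-8 encoding

print(code_points)  # [233]
print(utf8_bytes)   # [195, 169]
```

An interpolation mechanism has to know which of these two views it is splicing in, which is the ambiguity a `u` prefix would help resolve.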

jchristgit commented 1 year ago

> I hadn't considered it, but it sounds reasonable. I don't myself have a preference. Either could be fine in my view. Why might u be better? Because it implies the specific encoding? Because it aligns with the syntax of other languages with similar features?

Not sure how relevant Python is for Erlang here, but its Unicode string literals (when introduced) are written u"like this" while byte string literals are written b"like this" (in Python 3, u more or less became the default):

>>> type("foo")
<class 'str'>
>>> type(b"foo")
<class 'bytes'>
>>> type(u"foo")
<class 'str'>

For Erlang in theory I think both would fit - it is UTF-8 and the binary type - but when I think "bytes", I think some binary data that goes over the wire - when I think "UTF-8", I think some user-facing string. So as a small outside voice I'd vote for u here 🙂 Maybe b could be used for plain binary "strings" without /utf8.

paulnice commented 11 months ago

To me, bytes could contain any binary string, while utf8 must contain a valid UTF-8 string/bytestring. It is possible to have a valid bytestring that is, at the same time, an invalid UTF-8 string.

So I'd prefer to have a u literal for UTF-8.
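As a rough illustration of the point that not every bytestring is valid UTF-8, here is a minimal Python sketch. The helper name `is_valid_utf8` is hypothetical and only wraps the standard `bytes.decode` call:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8("héllo".encode("utf-8")))  # True
print(is_valid_utf8(b"\xff\xfe"))              # False: 0xFF never occurs in UTF-8
```

Both inputs are perfectly good bytestrings; only the first is also a UTF-8 string.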

RaimoNiskanen commented 11 months ago

Has EEP 66 (now PR #55) obsoleted this PR?

TD5 commented 11 months ago

> Has EEP 66 (now PR #55) obsoleted this PR?

I believe so 🙂

TD5 commented 11 months ago

Actually, I am now sure that covers patterns, only literals?

RaimoNiskanen commented 11 months ago

> Actually, I am now sure that covers patterns, only literals?

Did you mean "not sure"?

It isn't stated in EEP 66, but sigils are syntactic sugar (a transformation) that is applied before the parser tries to figure out what is a pattern.

In general the parser may transform a sigil into any expression; for string interpolation, for instance, it could produce a call to a formatter. Subsequent compilation steps would then see that such an expression cannot appear in a pattern. But for the suggested ~b, ~B, ~s, ~S and ~ sigil prefixes, the content is just transformed into another literal, which is allowed in a pattern.

~I can clarify this in EEP 66.~ Edit: I have clarified this in EEP 66, or rather PR #55.

TD5 commented 11 months ago

Yep, I meant "not sure", but I'm glad to hear this is handled now 🙂