Closed TD5 closed 11 months ago
@TD5 It is possible to create a bytestring type as:
-type nonempty_bytestring() :: nonempty_list(byte()).
-type bytestring() :: list(byte()).
Adding a bytestring type to Erlang/OTP would be helpful as part of this. If the compiler knew it was UTF-8, it could have a special type that is separate from a bytestring but similar (like `utf8string`).
Btw, have you considered using `u` (for UTF-8) instead of `b`? Thoughts?
I hadn't considered it, but it sounds reasonable. I don't have a preference myself. Either could be fine in my view. Why might `u` be better? Because it implies the specific encoding? Because it aligns with the syntax of other languages with similar features?
No particular reason in isolation, but I think it matters when it comes to concepts like interpolation, because you need a stronger indicator to know whether you are interpolating a list of bytes or a list of characters, and I believe the `u"..."` sigil makes the latter clear.
Not sure how relevant Python is for Erlang here, but its Unicode string literals (when introduced) are written `u"like this"` while binary string literals are written `b"like this"` (in Python 3, `u` more or less became the default):
>>> type("foo")
<class 'str'>
>>> type(b"foo")
<class 'bytes'>
>>> type(u"foo")
<class 'str'>
For Erlang, in theory I think both would fit (it is UTF-8 and the binary type), but when I think "bytes", I think of binary data that goes over the wire; when I think "UTF-8", I think of a user-facing string. So as a small outside voice I'd vote for `u` here 🙂 Maybe `b` could be used for plain binary "strings" without `/utf8`.
For myself, `bytes` could contain any binary string, while `utf8` must contain a valid UTF-8 string/bytestring.
It is possible to have a valid bytestring that is, at the same time, an invalid UTF-8 string.
So I'd prefer to have a `u` literal for UTF-8.
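To make the distinction concrete, here is a minimal sketch in Python (following the Python comparison earlier in the thread, not Erlang syntax). The helper name `is_valid_utf8` is hypothetical and only for illustration; it shows that every UTF-8 string is a bytestring, but not every bytestring is valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
        return True
    except UnicodeDecodeError:
        return False


# Plain ASCII is a subset of UTF-8, so this is both a bytestring and valid UTF-8.
ascii_bytes = b"hello"

# A non-ASCII string encoded as UTF-8 is still just a sequence of bytes.
utf8_bytes = "héllo".encode("utf-8")

# The byte 0xFF never appears in well-formed UTF-8, so this is a perfectly
# valid bytestring that is not a valid UTF-8 string.
invalid_bytes = b"\xff\xfe"

for sample in (ascii_bytes, utf8_bytes, invalid_bytes):
    print(sample, is_valid_utf8(sample))
```

A `u`-style literal could, in principle, reject the third case at compile time, while a `b`-style literal would accept all three.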
Has EEP 66 (now PR #55) obsoleted this PR?
I believe so 🙂
Actually, I am now sure that covers patterns, only literals?
Did you mean "not sure"?
It isn't stated in EEP 66, but sigils are syntactic sugar (a transformation) that is applied before the parser tries to figure out what is a pattern.
In general the parser may transform a sigil into any expression; for instance, for string interpolation it could call a formatter. Subsequent compilation steps would then see that such an expression cannot appear in a pattern. But for the suggested `~b`, `~B`, `~s`, `~S` and `~` sigil prefixes, the content is just transformed into another literal, which is allowed in a pattern.
~I can clarify this in EEP 66.~ Edit: I have clarified this in EEP 66, or rather PR #55.
Yep, I meant "not sure", but I'm glad to hear this is handled now 🙂