Closed heri16 closed 2 years ago
It's the nature of Erlang. Here is my explanation to the same question https://github.com/halturin/ergo/issues/44#issuecomment-829329054
Yes, but other libraries like https://github.com/kbrw/node_erlastic or https://github.com/rusterlium/rustler allow configuring how they are decoded to suit Elixir semantics/conventions?
I would say it's really bad practice.
Does it mean that this library will not have built-in support for UTF-8 strings from Elixir?
It's not related to UTF support. Strings in Golang are immutable. So if you got a binary and you want to treat it as a string just cast it to the string type.
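A minimal sketch of that cast, using only the standard library. The helper name `binaryToString` is hypothetical and is not part of Ergo; the UTF-8 check is added only to show that the conversion itself never validates the content.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// binaryToString converts a byte slice (for example, a decoded Erlang binary)
// into a Go string. The conversion copies the bytes and cannot fail; the
// boolean only reports whether the bytes happen to be valid UTF-8.
// Hypothetical helper for illustration, not an Ergo function.
func binaryToString(b []byte) (string, bool) {
	return string(b), utf8.Valid(b)
}

func main() {
	// UTF-8 bytes of "hi 日本語 🚀".
	b := []byte{104, 105, 32, 230, 151, 165, 230, 156, 172, 232, 170, 158, 32, 240, 159, 154, 128}
	s, ok := binaryToString(b)
	fmt.Println(s, ok)
}
```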
The below seems to result in two different encodings on the receiver side.
This sends {[104, 105, 32, 230, 151, 165, 230, 156, 172, 32, 240, 159, 154, 128]} over the wire:
reply := etf.Term(etf.Tuple{etf.Atom("error"), etf.Atom("unknown_request")})
reply = etf.Tuple{"hi 日本語 🚀"}
return "reply", reply, state
This sends {"hi 日本語 🚀"} over the wire:
reply := etf.Term(etf.Tuple{etf.Atom("error"), etf.Atom("unknown_request")})
reply = etf.Tuple{[]byte("hi 日本語 🚀")}
return "reply", reply, state
Does every nested, possibly UTF-8 string in reply need to be cast to []byte(str)?
Golang strings are natively UTF-8, so sending them as charlists (that are encoded incorrectly) seems counter-intuitive.
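Until the encoder handles this, one possible workaround is a helper that walks the reply and casts every nested string once. This is only a sketch: `Term`, `Atom`, `Tuple`, `List` and `Map` below are local stand-ins that mirror the shape of the etf types so the example compiles on its own, and `stringsToBinaries` is a hypothetical name, not an Ergo function.

```go
package main

import "fmt"

// Local stand-ins for the etf types (assumption: same shape as in Ergo).
type Term interface{}
type Atom string
type Tuple []Term
type List []Term
type Map map[Term]Term

// stringsToBinaries returns a copy of t in which every plain Go string
// value is replaced by a []byte. Atoms are a distinct type and stay as-is.
func stringsToBinaries(t Term) Term {
	switch v := t.(type) {
	case string:
		return []byte(v)
	case Tuple:
		out := make(Tuple, len(v))
		for i, e := range v {
			out[i] = stringsToBinaries(e)
		}
		return out
	case List:
		out := make(List, len(v))
		for i, e := range v {
			out[i] = stringsToBinaries(e)
		}
		return out
	case Map:
		out := make(Map, len(v))
		for k, e := range v {
			// Keys are left untouched: a []byte is not comparable,
			// so it cannot be used as a Go map key.
			out[k] = stringsToBinaries(e)
		}
		return out
	default:
		return v
	}
}

func main() {
	reply := Tuple{Atom("ok"), "hi 日本語 🚀", List{"nested"}}
	fmt.Printf("%v\n", stringsToBinaries(reply))
}
```

The copy costs one allocation per container, which is usually negligible next to the network round trip.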
It's not a problem of Ergo :). Go is a strongly, statically typed language. It means the Ergo encoder knows for sure what exact type of data it is trying to encode. Any string will be encoded as a string type. On the Erlang side - there is magic in the air :).
Golang string type are natively UTF-8, sending them as charlists (that are encoded incorrectly)
not sure if I follow you here
See https://blog.golang.org/strings
string = readonly []byte
Strings in golang are semantically equivalent to binaries in Elixir/Erlang.
The ergo etf encoder encodes them as charlists, which produces invalid output (that cannot be decoded).
Any good encoder should consider the semantics and underlying layout of the golang environment, yes?
See how unicode is handled in Erlang:
https://erlang.org/doc/apps/stdlib/unicode_usage.html#the-interactive-shell
As the UTF-8 encoding is widely spread and provides some backward compatibility in the 7-bit ASCII range, it is selected as the standard encoding for Unicode characters in binaries for Erlang.
ergo tries too hard to encode a Go string into a charlist, when it should just encode the string into a binary (like most other libraries in the golang ecosystem, written by experienced teams that appreciate the modern semantics and underlying layout of the golang types).
The actual charlist encoded by ergo etf is...
iex> to_string([104, 105, 32, 230, 151, 165, 230, 156, 172, 32, 240, 159, 154, 128])
<<104, 105, 32, 195, 166, 194, 151, 194, 165, 195, 166, 194, 156, 194, 172, 32,
195, 176, 194, 159, 194, 154, 194, 128>>
The correct encoding marshalled by ergo etf should have been...
iex> to_string([104, 105, 32, 26085, 26412, 35486, 32, 128640])
"hi 日本語 🚀"
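The difference between the two iex sessions can be reproduced in plain Go. The sketch below is standard library only and the function names are made up for illustration: `byteList` yields the raw UTF-8 bytes, `codePointList` yields the Unicode code points, and `reencode` does what `to_string/1` does with a charlist, treating each integer as a code point and encoding it as UTF-8.

```go
package main

import "fmt"

// byteList returns the UTF-8 bytes of s as integers.
func byteList(s string) []int {
	out := make([]int, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, int(s[i]))
	}
	return out
}

// codePointList returns the Unicode code points of s as integers.
func codePointList(s string) []int {
	out := []int{}
	for _, r := range s {
		out = append(out, int(r))
	}
	return out
}

// reencode interprets every integer as a code point and encodes it as UTF-8.
func reencode(points []int) []byte {
	var out []byte
	for _, p := range points {
		out = append(out, []byte(string(rune(p)))...)
	}
	return out
}

func main() {
	s := "hi 日本語 🚀"
	fmt.Println(byteList(s))                // bytes: each one is in 0-255
	fmt.Println(codePointList(s))           // code points: 26085, 26412, ...
	fmt.Println(reencode(byteList(s)))      // bytes encoded a second time
	fmt.Println(string(reencode(codePointList(s))) == s)
}
```

Feeding the byte list through `reencode` grows every non-ASCII byte into two bytes, which is the double-encoded binary shown above; feeding the code point list through it round-trips to the original string.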
Ultimately there should be a way to configure or disable this "Heuristic String Detection" when handling lists from Erlang/Elixir.
And maybe a way to also configure or disable this "Heuristic List Encoding" when handling unicode strings from Go.
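To make the request concrete, here is a rough model of what such a switch could look like on the decoding side. It is an assumption-laden sketch, not Ergo's decoder: `decodeIntList` and the `detectStrings` flag are hypothetical names, and a real decoder works on wire bytes rather than a `[]int`.

```go
package main

import "fmt"

// decodeIntList models the choice a decoder faces for a list of integers.
// With detectStrings enabled, a list whose elements all fit in 0-255 is
// returned as a Go string; otherwise the list is returned unchanged.
// Note that an empty list becomes "" when detection is on, which is one
// of the ambiguities this heuristic introduces.
func decodeIntList(items []int, detectStrings bool) interface{} {
	if !detectStrings {
		return items
	}
	buf := make([]byte, len(items))
	for i, n := range items {
		if n < 0 || n > 255 {
			return items // not byte-sized, so it cannot be a string
		}
		buf[i] = byte(n)
	}
	return string(buf)
}

func main() {
	ids := []int{104, 105}
	fmt.Printf("%#v\n", decodeIntList(ids, true))  // guessed to be text
	fmt.Printf("%#v\n", decodeIntList(ids, false)) // kept as a list
}
```

The receiver cannot tell whether `[104, 105]` was meant as text or as two integers, which is why a per-call or per-type opt-out is being asked for.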
Erlang side - there is magic in the air :).
Regrettably, magic doesn't really work out for us. 🥇
I finally got the point :) sorry for the misunderstanding. Working on an improvement to the handling of Erlang/Elixir charlist strings (there will be a struct tag "charlist").
done. pushed to the master. please, let me know if you find any issue with that.
To send a charlist from Ergo to Erlang, mark the field explicitly with the struct tag 'charlist':
type Struct struct {
A string `etf:"fieldA charlist"`
}
On the Erlang side, it will be a map like this:
#{'fieldA' => "Hello World! 🚀"}
To handle a term received from the Erlang side, use the TermMapIntoStruct function:
a := Struct{}
TermMapIntoStruct(term, &a)
or, for a Tuple value such as
{ "Hello World! 🚀"}
use TermIntoStruct:
a := Struct{}
TermIntoStruct(term, &a)
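For readers unfamiliar with struct tags, the sketch below shows how a tag such as `etf:"fieldA charlist"` can be read with the standard `reflect` package. It only illustrates the mechanism; it is not Ergo's implementation, and `Message`, `fieldOptions` and `charlistFields` are hypothetical names.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Message is an example type; only field A carries the charlist option.
type Message struct {
	A string `etf:"fieldA charlist"`
	B string `etf:"fieldB"`
}

// fieldOptions parses an etf tag of the form "name [charlist]".
// Without a tag, the Go field name is used and charlist is false.
func fieldOptions(f reflect.StructField) (name string, charlist bool) {
	parts := strings.Fields(f.Tag.Get("etf"))
	if len(parts) == 0 {
		return f.Name, false
	}
	for _, p := range parts[1:] {
		if p == "charlist" {
			charlist = true
		}
	}
	return parts[0], charlist
}

// charlistFields lists the wire names of all fields marked as charlist.
// v must be a struct value.
func charlistFields(v interface{}) []string {
	t := reflect.TypeOf(v)
	var out []string
	for i := 0; i < t.NumField(); i++ {
		if name, ok := fieldOptions(t.Field(i)); ok {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	fmt.Println(charlistFields(Message{}))
}
```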
Thanks for the commit! I've upgraded to latest version (v1.2.5-0.20210731234859-3217bf775f6e) from master branch.
However, the below still sends a charlist instead of a binary (https://erlang.org/doc/apps/erts/erl_ext_dist.html#bit_binary_ext)
reply = etf.Tuple{"hi 日本語 🚀"}
return "reply", reply, state
iex> GenServer.call({ :example, :'demo@127.0.0.1' }, :hello)
{[104, 105, 32, 230, 151, 165, 230, 156, 172, 232, 170, 158, 32, 240, 159, 154, 128]}
Is there a way to configure this, when sending data from golang to erlang, considering that golang strings are just []byte ?
charlist - is a struct tag :) it must be applied to the struct field
I think there is a mix-up between this issue and the other one: https://github.com/halturin/ergo/issues/58
This issue is about disabling "Heuristic String Detection" in ergo that results in native strings in golang being encoded into etf List, while the other issue #58 is about Structs annotations/tags.
As mentioned by Erlang documentation itself: "String does not have a corresponding Erlang representation"
A golang string is not equal to an ETF string.
A golang string is equal to an ETF bitstring (binary).
See: https://blog.golang.org/strings - "It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes."
See: https://erlang.org/doc/apps/erts/erl_ext_dist.html#binary_ext - "This term represents a bitstring whose length in bits have to be a multiple of 8 bits."
We cannot just enable or disable the conversion of 'charlist' to string and back, because it would affect the whole node. The only way to do this for specific data is a struct tag. That means that to send a 'charlist' from the Golang side you should use a struct with the 'charlist' tag:
type Struct struct {
A string `etf:"fieldA charlist"`
}
reply := Struct{"hi 日本語 🚀"}
return "reply", reply, state
The intention is to send all golang strings as an ETF bitstring / binary, instead of a charlist.
The node logic could use etf.String instead of golang string.
Add etf.String to etf.go:
type Atom string
type String string
I would suggest making this way for the charlist as well
type Charlist string // encodes as a List
type String string // encodes as a binary
TermToStruct/TermMapToStruct will be updated accordingly - detect destination type and convert from List to the Charlist string or binary to the String
and no tags anymore.
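A sketch of how an encoder could dispatch on these two types, assuming the semantics proposed above (Charlist becomes a list of code points, String becomes a binary). The types are redeclared locally so the example stands alone, and `wireForm` is a hypothetical name.

```go
package main

import "fmt"

type Charlist string // would be sent as a list of code points
type String string   // would be sent as a binary

// wireForm reports which representation a value would take and returns
// the payload for it. Values of any other type are passed through.
func wireForm(v interface{}) (kind string, payload interface{}) {
	switch s := v.(type) {
	case Charlist:
		points := []int{}
		for _, r := range string(s) {
			points = append(points, int(r))
		}
		return "list", points
	case String:
		return "binary", []byte(s)
	default:
		return "other", v
	}
}

func main() {
	for _, v := range []interface{}{Charlist("日本"), String("日本"), 42} {
		kind, payload := wireForm(v)
		fmt.Println(kind, payload)
	}
}
```

Because the decision is carried by the Go type, it applies per value and does not need a node-wide setting or a struct tag.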
This could be how types are mapped:
Golang Type | Erlang/Elixir Type |
---|---|
etf.String | string/list (list of integers 0-255) |
etf.Charlist | list (list of integers with valid codepoints) |
string | binary |
The above follows the semantics and memory layout of each platform.
can't agree with that
string -> binary
For the case Ergo<->Ergo we should be able to work with native types. That's why Ergo encodes string https://github.com/halturin/ergo/blob/master/etf/encode.go#L399 as STRING_EXT https://erlang.org/doc/apps/erts/erl_ext_dist.html#string_ext
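For reference, the two wire layouts under discussion differ only in the tag and the width of the length prefix. The sketch below follows the External Term Format documentation (STRING_EXT is tag 107 with a 2-byte big-endian length, BINARY_EXT is tag 109 with a 4-byte big-endian length); the function names are hypothetical and this is not Ergo's encoder.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	tagStringExt = 107 // STRING_EXT: 2-byte length, then one byte per character
	tagBinaryExt = 109 // BINARY_EXT: 4-byte length, then the raw bytes
)

// encodeStringExt writes the bytes of s as STRING_EXT. Erlang decodes
// this term as a list of integers 0-255, one per byte.
func encodeStringExt(s string) ([]byte, error) {
	if len(s) > 0xFFFF {
		return nil, errors.New("STRING_EXT cannot hold more than 65535 bytes")
	}
	out := []byte{tagStringExt, 0, 0}
	binary.BigEndian.PutUint16(out[1:], uint16(len(s)))
	return append(out, s...), nil
}

// encodeBinaryExt writes the bytes of s as BINARY_EXT. Erlang decodes
// this term as a binary, which Elixir prints as a string if it is UTF-8.
func encodeBinaryExt(s string) []byte {
	out := []byte{tagBinaryExt, 0, 0, 0, 0}
	binary.BigEndian.PutUint32(out[1:], uint32(len(s)))
	return append(out, s...)
}

func main() {
	str, _ := encodeStringExt("hi 日本語 🚀")
	fmt.Println(str)
	fmt.Println(encodeBinaryExt("hi 日本語 🚀"))
}
```

The payload bytes are identical in both cases; only the tag tells the receiver whether to build a list of small integers or a binary.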
For the case Ergo<->Ergo, there should be no problem:
Ergo | External Term Format | Ergo |
---|---|---|
string | binary | string |
etf.String | string | etf.String |
In its current state, Ergo works poorly with compliant implementations of OTP such as Elixir, which suggests Ergo's implementation of ETF/OTP might need another look.
Sending a UTF8 string between Ergo <-> Ergo is currently not idiomatic OTP on the wire.
The main idea of Ergo is to bring the cool stuff from Erlang to the Golang world. It has never been a "driver" for "idiomatic" access to an Erlang cluster. So native types for Ergo<->Ergo interaction have higher priority, and smooth access to Erlang data types is a bonus.
Sending a UTF8 string between Ergo <-> Ergo is currently not idiomatic OTP on the wire. Not sure what the reasoning behind this is.
Especially if we understand that Golang strings can contain more than just range 0-255, which https://erlang.org/doc/apps/erts/erl_ext_dist.html#string_ext obviously cannot.
A standard idiomatic golang approach to this would be to create a subtype that restricts golang strings to only 0-255, or throw an error. That could be etf.String
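A minimal sketch of such a restricted subtype, using only the standard library. The names `Latin1String` and `NewLatin1String` are hypothetical; the proposal above would call the type etf.String.

```go
package main

import "fmt"

// Latin1String is a string whose code points are all in the range 0-255.
// The invariant only holds for values built with NewLatin1String.
type Latin1String string

// NewLatin1String validates s and returns an error for the first code
// point above 255. Invalid UTF-8 decodes to U+FFFD and is rejected too.
func NewLatin1String(s string) (Latin1String, error) {
	for i, r := range s {
		if r > 255 {
			return "", fmt.Errorf("code point %U at byte offset %d is outside 0-255", r, i)
		}
	}
	return Latin1String(s), nil
}

func main() {
	if _, err := NewLatin1String("héllo"); err == nil {
		fmt.Println("héllo is accepted")
	}
	if _, err := NewLatin1String("hi 日本語"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```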
So having native types for the Ergo<->Ergo interaction is more prioritized
This seems like a type mismatch during mapping.
The proposed solution is more "native" yes (because of what golang strings are) ?
Sending a UTF8 string between Ergo <-> Ergo is currently not idiomatic OTP.
For Ergo<->Ergo I have no idea why I should care about it. UTF-8 is about how a set of bytes is represented. In Golang you can easily cast a string to []byte or []rune.
That's because in the Golang ecosystem and stdlib, strings are expected to contain UTF-8 as a first-class concept.
Any usage of values returned by golang libraries (other than ergo) would mean that we got to check if the string contains UTF-8 and do the casting to []byte. Values from common golang libraries may contain deeply nested strings. This is nasty User Experience.
Sending a UTF8 string between Ergo <-> Ergo is currently not idiomatic OTP.
Especially if we understand that Golang strings can contain more than just range 0-255, which https://erlang.org/doc/apps/erts/erl_ext_dist.html#string_ext obviously cannot.
A standard idiomatic golang approach to this would be to create a subtype that restricts golang strings to only 0-255, or throw an error. That is etf.String
I think this is enough said. Other more senior members of the Erlang / Elixir community may chime in in the future and offer their views (on the ergo implementation).
Maybe it's best for our team here to maintain a fork of this library and name it ergo2.
up to you )
Was expecting a more open community here, that is open to feedback, but alas...
"Binary sharing occurs whenever binaries are taken apart. This is the fundamental reason why binaries are fast, decomposition can always be done with O(1) complexity." -Erlang Team (who designed the ETF encoding format)
"It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes." -Rob Pike (23 October 2013)
That's why Ergo uses STRING_EXT for encoding strings; it's convenient for the Ergo<->Ergo case.
Erlang handles it as a string if it contains only numbers 0-255 (non-UTF in terms of Erlang data types) and treats it as a byte list otherwise. Using etf.String to encode as a binary and etf.Charlist to send a list of numbers would be enough to solve this issue.
OTP has a lot of good ideas but not all of them are good enough. Ergo has its own way :)
PS: To be an "open community" doesn't mean accept everything from anyone. It's an open-source project with MIT license. Nobody pays me for this work. You are welcome :)
Forgot to mention... If this feature is pretty important we could discuss a private repo for your company.
just released 2.0.0 with support of Erlang/Elixir strings.
List of non negative integers sent from Elixir is decoded as string by ergo.
There should be a way to configure or disable this "Heuristic String Detection"?