ergo-services / ergo

An actor-based Framework with network transparency for creating event-driven architecture in Golang. Inspired by Erlang. Zero dependencies.
https://docs.ergo.services
MIT License

Disable Heuristic String Detection in ergo #59

Closed heri16 closed 2 years ago

heri16 commented 3 years ago

A list of non-negative integers sent from Elixir is decoded as a string by ergo.

Could there be a way to configure or disable this "Heuristic String Detection"?

halturin commented 3 years ago

It's the nature of Erlang. Here is my explanation to the same question https://github.com/halturin/ergo/issues/44#issuecomment-829329054

heri16 commented 3 years ago

Yes, but other libraries like https://github.com/kbrw/node_erlastic or https://github.com/rusterlium/rustler allow you to configure how such lists are decoded, to suit Elixir semantics/conventions?


halturin commented 3 years ago

I would say it's really bad practice.

heri16 commented 3 years ago

Does it mean that this library will not have built-in support for UTF-8 strings from Elixir?

halturin commented 3 years ago

It's not related to UTF support. Strings in Golang are immutable. So if you got a binary and you want to treat it as a string just cast it to the string type.
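
The cast described above can be sketched in plain Go. This is a minimal, standard-library-only illustration; binaryToString is a hypothetical helper name and is not part of Ergo, and the byte values are the UTF-8 encoding of "hi 日本".

```go
package main

import "fmt"

// binaryToString reinterprets a received binary as a Go string.
// The conversion copies the bytes and does not validate UTF-8.
// (Hypothetical helper for illustration; not an Ergo function.)
func binaryToString(b []byte) string {
	return string(b)
}

func main() {
	// UTF-8 bytes of "hi 日本", as they would arrive in an Elixir binary.
	received := []byte{104, 105, 32, 230, 151, 165, 230, 156, 172}

	s := binaryToString(received)
	fmt.Println(s)                      // hi 日本
	fmt.Println(len(s), len([]rune(s))) // 9 5 (bytes, code points)
}
```

The cast only helps when the data really arrives as a binary; a list of integers needs different handling.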

heri16 commented 3 years ago

The below seems to result in two different encodings on the receiver side.

This sends {[104, 105, 32, 230, 151, 165, 230, 156, 172, 32, 240, 159, 154, 128]} over the wire.

reply := etf.Term(etf.Tuple{etf.Atom("error"), etf.Atom("unknown_request")})
reply = etf.Tuple{"hi 日本語 🚀"}
return "reply", reply, state

This sends {"hi 日本語 🚀"} over the wire.

reply := etf.Term(etf.Tuple{etf.Atom("error"), etf.Atom("unknown_request")})
reply = etf.Tuple{[]byte("hi 日本語 🚀")}
return "reply", reply, state

Does every nested, possibly UTF-8, string in reply need to be cast with []byte(str)?

Golang strings are natively UTF-8, so sending them as charlists (that are encoded incorrectly) seems counter-intuitive.

halturin commented 3 years ago

It's not a problem of Ergo :). Go is a strongly, statically typed language. It means the Ergo encoder knows for sure what exact type of data it is trying to encode. Any string will be encoded as a string type. On the Erlang side - there is magic in the air :).

halturin commented 3 years ago

Golang strings are natively UTF-8, so sending them as charlists (that are encoded incorrectly)

not sure if I follow you here

heri16 commented 3 years ago

See https://blog.golang.org/strings

string = readonly []byte

Strings in golang are semantically equivalent to binaries in Elixir/Erlang.

The ergo etf encoder encodes them as charlists, which produces invalid output (that cannot be decoded).

heri16 commented 3 years ago

Any good encoder should consider the semantics and underlying layout of the golang environment, yes?

heri16 commented 3 years ago

See how unicode is handled in Erlang:

https://erlang.org/doc/apps/stdlib/unicode_usage.html#the-interactive-shell

As the UTF-8 encoding is widely spread and provides some backward compatibility in the 7-bit ASCII range, it is selected as the standard encoding for Unicode characters in binaries for Erlang.

heri16 commented 3 years ago

ergo tries too hard to encode a Go string into a charlist, when it should just be encoding the string into a binary (like most other libraries in the golang ecosystem, written by experienced teams that appreciate the modern semantics and underlying layout of the golang types).

The actual charlist encoded by ergo etf is...

iex> to_string([104, 105, 32, 230, 151, 165, 230, 156, 172, 32, 240, 159, 154, 128])
<<104, 105, 32, 195, 166, 194, 151, 194, 165, 195, 166, 194, 156, 194, 172, 32,
  195, 176, 194, 159, 194, 154, 194, 128>>

The correct encoding marshalled by ergo etf should have been...

iex> to_string([104, 105, 32, 26085, 26412, 35486, 32, 128640])
"hi 日本語 🚀"

heri16 commented 3 years ago

Ultimately there should be a way to configure or disable this "Heuristic String Detection" when handling lists from Erlang/Elixir.

And maybe a way to also configure or disable this "Heuristic List Encoding" when handling unicode strings from Go.
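
The difference between the two charlists compared earlier in this thread can be reproduced in plain Go. This is a minimal sketch using only the standard library; bytesAsCodePoints and codePoints are hypothetical helper names, not Ergo functions. It shows that a list of raw UTF-8 bytes, once read as one code point per element, re-encodes to the doubled byte sequence seen in the to_string output.

```go
package main

import "fmt"

// bytesAsCodePoints treats every UTF-8 byte of s as its own code point,
// which is how a receiver reads a list of byte values as a charlist.
func bytesAsCodePoints(s string) []rune {
	b := []byte(s)
	out := make([]rune, 0, len(b))
	for _, c := range b {
		out = append(out, rune(c))
	}
	return out
}

// codePoints returns the Unicode code points of s, which is what a
// charlist is expected to contain.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	s := "hi 日本語 🚀"

	fmt.Println(bytesAsCodePoints(s)) // one element per UTF-8 byte
	fmt.Println(codePoints(s))        // one element per code point

	// Re-encoding the byte-valued list as UTF-8 produces the garbled bytes.
	fmt.Println([]byte(string(bytesAsCodePoints(s))))
}
```

For a single character such as 日, the first form yields 230, 151, 165 and the second yields 26085.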

heri16 commented 3 years ago

Erlang side - there is magic in the air :).

Regrettably, magic doesn't really work out for us. 🥇

halturin commented 3 years ago

I finally got the point ) sorry for the misunderstanding. Working on improving the handling of Erlang/Elixir charlist strings (there will be a struct tag "charlist").

halturin commented 3 years ago

Done, pushed to master. Please let me know if you find any issues with it.

To send a charlist from Ergo to Erlang, the field must be explicitly marked with the struct tag 'charlist':

type Struct struct {
   A string `etf:"fieldA charlist"`
}

On the Erlang side, it will be a map like this:

#{'fieldA' => "Hello World! 🚀"}

To handle a "term" received from the Erlang side, use the TermMapIntoStruct function:

a := Struct{}
TermMapIntoStruct(term, &a)

or for the Tuple value

{ "Hello World! 🚀"}

use TermIntoStruct:

a := Struct{}
TermIntoStruct(term, &a)

heri16 commented 3 years ago

Thanks for the commit! I've upgraded to latest version (v1.2.5-0.20210731234859-3217bf775f6e) from master branch.

However, the below still sends a charlist instead of a binary (https://erlang.org/doc/apps/erts/erl_ext_dist.html#bit_binary_ext)

reply = etf.Tuple{"hi 日本語 🚀"}
return "reply", reply, state

iex> GenServer.call({ :example, :'demo@127.0.0.1' }, :hello)
{[104, 105, 32, 230, 151, 165, 230, 156, 172, 232, 170, 158, 32, 240, 159, 154, 128]}

Is there a way to configure this, when sending data from golang to erlang, considering that golang strings are just []byte ?

halturin commented 3 years ago

charlist - is a struct tag :) it must be applied to the struct field

heri16 commented 3 years ago

I think there is a mix-up between this issue and the other one: https://github.com/halturin/ergo/issues/58

This issue is about disabling "Heuristic String Detection" in ergo that results in native strings in golang being encoded into etf List, while the other issue #58 is about Structs annotations/tags.

As mentioned by Erlang documentation itself: "String does not have a corresponding Erlang representation"

A golang string is not equal to an ETF string.

A golang string is equal to an ETF bitstring (binary).

See: https://blog.golang.org/strings - "It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes."

See: https://erlang.org/doc/apps/erts/erl_ext_dist.html#binary_ext - "This term represents a bitstring whose length in bits have to be a multiple of 8 bits."

halturin commented 3 years ago

We cannot just enable/disable the conversion of 'charlist' to string and back, as it affects the whole node. The only way to do this for specific data is a struct tag. That means that to send a 'charlist' from the golang side you should use a struct with the 'charlist' tag:

type Struct struct {
   A string `etf:"fieldA charlist"`
}

reply := Struct{"hi 日本語 🚀"}
return "reply", reply, state

heri16 commented 3 years ago

See: https://blog.golang.org/strings - "It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes."

The intention is to send all golang strings as an ETF bitstring / binary, instead of a charlist.

heri16 commented 3 years ago

The node logic could use etf.String instead of golang string.

Add etf.String to etf.go:

type Atom string
type String string

halturin commented 3 years ago

I would suggest doing it this way for the charlist as well:

type Charlist string // encodes as a List
type String string // encodes as a binary

TermIntoStruct/TermMapIntoStruct will be updated accordingly: they will detect the destination type and convert from a List to the Charlist string, or from a binary to the String

and no tags anymore.
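
The type-based approach proposed above can be sketched as follows. This is only an illustration under stated assumptions: Charlist and String are declared locally as stand-ins for the proposed etf types, and describe is a hypothetical function that reports which wire shape each type would map to; it is not the Ergo encoder.

```go
package main

import "fmt"

// Stand-ins for the types proposed in this thread (assumption: the real
// definitions would live in the etf package).
type Charlist string // would be encoded as a list of code points
type String string   // would be encoded as a binary

// describe reports which shape a term would take, chosen purely by its
// Go type, so no struct tags are needed.
func describe(term interface{}) string {
	switch v := term.(type) {
	case Charlist:
		return fmt.Sprintf("list %v", []rune(v))
	case String:
		return fmt.Sprintf("binary %v", []byte(v))
	default:
		return fmt.Sprintf("unhandled %T", v)
	}
}

func main() {
	fmt.Println(describe(Charlist("日"))) // list [26085]
	fmt.Println(describe(String("日")))   // binary [230 151 165]
}
```

Because the decision is made by a type switch, the same value can be sent either way by converting it to the desired type at the call site.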

heri16 commented 3 years ago

This could be how types are mapped:

| Golang Type | Erlang/Elixir Type |
| --- | --- |
| etf.String | string/list (list of integers 0-255) |
| etf.Charlist | list (list of integers with valid codepoints) |
| string | binary |

The above follows the semantics and memory layout of each platform.

halturin commented 3 years ago

can't agree with that

string -> binary

For the case Ergo<->Ergo we should be able to work with native types. That's why Ergo encodes string https://github.com/halturin/ergo/blob/master/etf/encode.go#L399 as STRING_EXT https://erlang.org/doc/apps/erts/erl_ext_dist.html#string_ext
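
For reference, the two wire layouts being debated differ only in their tag and length field. The sketch below follows the External Term Format description (STRING_EXT: tag 107 with a 2-byte big-endian length; BINARY_EXT: tag 109 with a 4-byte big-endian length). The function names are hypothetical, the leading version byte 131 is omitted, and this is not a copy of Ergo's encoder.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	tagStringExt = 107 // STRING_EXT: list of bytes, 16-bit length
	tagBinaryExt = 109 // BINARY_EXT: binary, 32-bit length
)

// encodeStringExt writes tag, 2-byte big-endian length, then the bytes.
func encodeStringExt(data []byte) ([]byte, error) {
	if len(data) > 0xFFFF {
		return nil, errors.New("STRING_EXT length does not fit in 16 bits")
	}
	out := make([]byte, 3, 3+len(data))
	out[0] = tagStringExt
	binary.BigEndian.PutUint16(out[1:3], uint16(len(data)))
	return append(out, data...), nil
}

// encodeBinaryExt writes tag, 4-byte big-endian length, then the bytes.
func encodeBinaryExt(data []byte) []byte {
	out := make([]byte, 5, 5+len(data))
	out[0] = tagBinaryExt
	binary.BigEndian.PutUint32(out[1:5], uint32(len(data)))
	return append(out, data...)
}

func main() {
	payload := []byte("hi")

	s, err := encodeStringExt(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)                        // [107 0 2 104 105]
	fmt.Println(encodeBinaryExt(payload)) // [109 0 0 0 2 104 105]
}
```

The payload bytes are identical in both cases; what changes is how the receiver interprets them, as a list of integers 0-255 or as a binary.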

heri16 commented 3 years ago

For the case Ergo<->Ergo, there should be no problem:

| Ergo | External Term Format | Ergo |
| --- | --- | --- |
| string | binary | string |
| etf.String | string | etf.String |

Ergo in its current state works poorly with compliant implementations of OTP such as Elixir, which suggests that ergo's implementation of ETF/OTP might need another look.

Sending a UTF8 string between Ergo <-> Ergo is currently not idiomatic OTP on the wire.

halturin commented 3 years ago

The main idea of Ergo is to bring the cool stuff from Erlang to the Golang world. It has never been a "driver" for "idiomatic" access to the Erlang cluster. So having native types for Ergo<->Ergo interaction has higher priority, and having smooth access to the Erlang data types is a bonus.

heri16 commented 3 years ago

Sending a UTF8 string between Ergo <-> Ergo is currently not idiomatic OTP on the wire. Not sure what the reasoning behind this is.

Especially if we understand that Golang strings can contain more than just range 0-255, which https://erlang.org/doc/apps/erts/erl_ext_dist.html#string_ext obviously cannot.

A standard idiomatic golang approach to this would be to create a subtype that restricts golang strings to only 0-255, or throw an error. That could be etf.String
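
The restricted subtype suggested above could look roughly like this. It is a minimal sketch: Latin1String and NewLatin1String are hypothetical names chosen for illustration (the thread proposes calling the type etf.String), and the check simply rejects any code point above 255.

```go
package main

import "fmt"

// Latin1String holds text whose code points all fit in the range 0-255.
// (Hypothetical stand-in for the etf.String idea from this thread.)
type Latin1String string

// NewLatin1String validates s and returns an error for the first code
// point above 255. Invalid UTF-8 decodes to U+FFFD and is rejected too.
func NewLatin1String(s string) (Latin1String, error) {
	for i, r := range s {
		if r > 255 {
			return "", fmt.Errorf("code point %d at byte offset %d is outside 0-255", r, i)
		}
	}
	return Latin1String(s), nil
}

func main() {
	if v, err := NewLatin1String("hi é"); err == nil {
		fmt.Println("accepted:", v)
	}
	if _, err := NewLatin1String("hi 日本語"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Validating at construction time means the encoder can rely on the type alone and never has to inspect the contents again.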

heri16 commented 3 years ago

So having native types for the Ergo<->Ergo interaction is more prioritized

This seems like a type mismatch during mapping.

The proposed solution is more "native" yes (because of what golang strings are) ?

halturin commented 3 years ago

Sending a UTF8 string between Ergo <-> Ergo is currently not idiomatic OTP.

For the Ergo-Ergo case I have no idea why I should care about it. UTF-8 is just a way of interpreting a set of bytes. In Golang you can easily cast a string to []byte or []rune.

heri16 commented 3 years ago

That's because in the Golang ecosystem and stdlib, strings are expected to contain UTF-8 as a first-class concept.

heri16 commented 3 years ago

Any use of values returned by golang libraries (other than ergo) would mean that we have to check whether each string contains UTF-8 and cast it to []byte. Values from common golang libraries may contain deeply nested strings. This is a nasty user experience.
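
The per-value casting described above would amount to a recursive walk like the one below. This is a sketch under stated assumptions: stringsToBinaries is a hypothetical helper, and it only handles generic []interface{} and map[string]interface{} containers, not Ergo's own term types.

```go
package main

import "fmt"

// stringsToBinaries walks a nested value and replaces every string it
// finds with the equivalent []byte. Map keys are left unchanged.
// (Hypothetical helper for illustration; not an Ergo function.)
func stringsToBinaries(v interface{}) interface{} {
	switch t := v.(type) {
	case string:
		return []byte(t)
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, e := range t {
			out[i] = stringsToBinaries(e)
		}
		return out
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, e := range t {
			out[k] = stringsToBinaries(e)
		}
		return out
	default:
		return v
	}
}

func main() {
	nested := []interface{}{"hi 日本語", 42, []interface{}{"🚀"}}
	fmt.Println(stringsToBinaries(nested))
}
```

Having to run every outgoing value through such a walk is the overhead the comment above objects to.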

halturin commented 3 years ago

Sending a UTF8 string between Ergo <-> Ergo is currently not idiomatic OTP.

Especially if we understand that Golang strings can contain more than just range 0-255, which https://erlang.org/doc/apps/erts/erl_ext_dist.html#string_ext obviously cannot.

A standard idiomatic golang approach to this would be to create a subtype that restricts golang strings to only 0-255, or throw an error. That is etf.String

I think this is enough said. Other more senior members of the Erlang / Elixir community may chime in in the future and offer their views (on the ergo implementation).

Maybe it's best for our team here to maintain a fork of this library and name it ergo2.

up to you )

heri16 commented 3 years ago

Was expecting a more open community here, that is open to feedback, but alas...

"Binary sharing occurs whenever binaries are taken apart. This is the fundamental reason why binaries are fast, decomposition can always be done with O(1) complexity." -Erlang Team (who designed the ETF encoding format)

"It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes." -Rob Pike (23 October 2013)

halturin commented 3 years ago

That's why Ergo uses STRING_EXT for encoding strings, and it's a convenient approach for the Ergo-Ergo case.

Erlang handles it as a string if it has numbers 0-255 only (non UTF in terms of Erlang data types) and treats it as a byte list otherwise. Using etf.String for the encoding as a binary and etf.Charlist for the sending as a list of numbers would be enough to solve this issue.

OTP has a lot of good ideas but not all of them are good enough. Ergo has its own way :)

PS: Being an "open community" doesn't mean accepting everything from anyone. It's an open-source project with an MIT license. Nobody pays me for this work. You are welcome :)

halturin commented 3 years ago

Forgot to mention... If this feature is pretty important we could discuss a private repo for your company.

halturin commented 2 years ago

Just released 2.0.0 with support for Erlang/Elixir strings.