Closed lukebakken closed 8 years ago
iolist_to_binary
was initially chosen (by me) for two reasons: (a) it accepts the output of io_lib:format
and, more importantly, (b) it does not throw if the resulting binary turns out to be invalid UTF-8. Because table names are expected to be well-formed unicode strings elsewhere in riak_kv, it would make sense to detect malformed strings at port of entry, which now makes me regret the easier way I have chosen.
A proper solution would be, then, to use characters_to_unicode
, but also catch and report errors, right away.
@hmmr - what about queries? I assume they should be valid unicode as well?
@macintux - thoughts on all this?
@hmmr - where in riak_kv
is unicode expected rather than just any set of bytes?
Here is one instance where table name needs to be a valid unicode: https://github.com/basho/riak_kv/blob/riak_ts-develop/src/riak_kv_ts_svc.erl#L584.
Queries are received and handled as strings. The parser, however, only accepts ASCII: https://github.com/basho/riak_ql/blob/develop/src/riak_ql_lexer.xrl#L179-L181. This means, accepting unicode table names is unsafe as they cannot be queried.
(Interestingly, data of type varchar
can be anything. They can be inserted by riakc_ts:put
and fetched by riakc_ts:get
, but can only appear in riakc_ts:query
if they are plain ASCII.)
Update: To clarify: ts_cluster_unicode uses unicode in queries, for string literals but not for identifier names.
Here's what I gather from this discussion so far:
unicode:characters_to_binary
should be the preferred way to convert strings to binaries in the Erlang client for table names and queries.ASCII
so the above is moot at this time.Questions:
My opinion is to enforce unicode for all entities (table and field names) passing through the Erlang (and all other) clients. Even though essential operations in riak_kv are unicode-agnostic, whenever it comes to producing human-readable output, things may break depending on who does the printing, for no perceptible benefit. For example, I expect riak_shell to run into issues.
Fixed, or closed via GitHub issues.
[posted via JIRA by Alexander Moore]
In PR #305 I added a
query()
overload that converts the string to a binary usingunicode:characters_to_binary
.I noticed that in other places,
iolist_to_binary
is used for converting things like table names.So, I want to see which function we should use to do these conversions.
Also, see this:
https://github.com/klacke/yaws/issues/212