basho / riak-erlang-client

The Riak client for Erlang.
Apache License 2.0
311 stars 188 forks source link

How to convert from strings to binaries for Riak TS [JIRA: CLIENTS-945] #306

Closed lukebakken closed 8 years ago

lukebakken commented 8 years ago

In PR #305 I added a query() overload that converts the string to a binary using unicode:characters_to_binary.

I noticed that in other places, iolist_to_binary is used for converting things like table names.

So, I want to see which function we should use to do these conversions.

Also, see this:

https://github.com/klacke/yaws/issues/212

hmmr commented 8 years ago

iolist_to_binary was initially chosen (by me) for two reasons: (a) it accepts the output of io_lib:format and, more importantly, (b) it does not throw if the resulting binary turns out to be invalid UTF-8. Because table names are expected to be well-formed unicode strings elsewhere in riak_kv, it would make sense to detect malformed strings at port of entry, which now makes me regret the easier way I have chosen.

A proper solution would be, then, to use characters_to_unicode, but also catch and report errors, right away.

lukebakken commented 8 years ago

@hmmr - what about queries? I assume they should be valid unicode as well?

@macintux - thoughts on all this?

lukebakken commented 8 years ago

@hmmr - where in riak_kv is unicode expected rather than just any set of bytes?

hmmr commented 8 years ago

Here is one instance where table name needs to be a valid unicode: https://github.com/basho/riak_kv/blob/riak_ts-develop/src/riak_kv_ts_svc.erl#L584.

Queries are received and handled as strings. The parser, however, only accepts ASCII: https://github.com/basho/riak_ql/blob/develop/src/riak_ql_lexer.xrl#L179-L181. This means, accepting unicode table names is unsafe as they cannot be queried.

(Interestingly, data of type varchar can be anything. They can be inserted by riakc_ts:put and fetched by riakc_ts:get, but can only appear in riakc_ts:query if they are plain ASCII.)

Update: To clarify: ts_cluster_unicode uses unicode in queries, for string literals but not for identifier names.

lukebakken commented 8 years ago

Here's what I gather from this discussion so far:

Questions:

hmmr commented 8 years ago

My opinion is to enforce unicode for all entities (table and field names) passing through the Erlang (and all other) clients. Even though essential operations in riak_kv are unicode-agnostic, whenever it comes to producing human-readable output, things may break depending on who does the printing, for no perceptible benefit. For example, I expect riak_shell to run into issues.

Basho-JIRA commented 7 years ago

Fixed, or closed via GitHub issues.

[posted via JIRA by Alexander Moore]