colinmarc / impala-ruby

an impala client for ruby
MIT License
34 stars 22 forks source link

enable to set "hadoop_user" and "configraton" to a query #7

Closed komamitsu closed 10 years ago

komamitsu commented 10 years ago

I want to set "hadoop_user" and "configuration" to a query and created this pull request. Please see it :)

Thanks

komamitsu commented 10 years ago

ping

colinmarc commented 10 years ago

sorry, I totally forgot! What are these used for? I've never seen either option.

komamitsu commented 10 years ago

Thanks for the reply.

What are these used for? I've never seen either option.

When using impala-shell.sh, "hadoop_user" and "configuration" are passed to Impala implicitly.

In Impala log:

I1031 10:13:02.235416  6460 impala-beeswax-server.cc:133] query(): query=select count(1) from sample
I1031 10:13:02.235528  6460 impala-beeswax-server.cc:447] query: Query {
  01: query (string) = "select count(1) from sample",
  03: configuration (list) = list<string>[0] {
  },
  04: hadoop_user (string) = "cloudera",
}

Regarding "hadoop_user", it's used as an username in metastore.

For "configuration", it's used as a query option like this:

Impala-shell:

[localhost.localdomain:21000] > set mem_limit=700000000000000000;
MEM_LIMIT set to 700000000000000000
[localhost.localdomain:21000] > select * from sample;
   :

log:

I1031 10:26:10.191804  6461 impala-beeswax-server.cc:133] query(): query=select * from sample
I1031 10:26:10.191920  6461 impala-beeswax-server.cc:447] query: Query {
  01: query (string) = "select * from sample",
  03: configuration (list) = list<string>[1] {
    [0] = "MEM_LIMIT=700000000000000000",
  },
  04: hadoop_user (string) = "cloudera",
}
I1031 10:26:10.191988  6461 impala-beeswax-server.cc:460] TClientRequest.queryOptions: TQueryOptions {
  01: abort_on_error (bool) = false,
  02: max_errors (i32) = 0,
  03: disable_codegen (bool) = false,
  04: batch_size (i32) = 0,
  05: num_nodes (i32) = 0,
  06: max_scan_range_length (i64) = 0,
  07: num_scanner_threads (i32) = 0,
  08: max_io_buffers (i32) = 0,
  09: allow_unsupported_formats (bool) = false,
  10: default_order_by_limit (i64) = -1,
  11: debug_action (string) = "",
  12: mem_limit (i64) = 700000000000000000,
  13: abort_on_default_limit_exceeded (bool) = false,
  14: parquet_compression_codec (i32) = 5,
  15: hbase_caching (i32) = 0,
  16: hbase_cache_blocks (bool) = false,
}

Sometimes, I need to change QueryOption and User with impala-ruby library.

Thanks

colinmarc commented 10 years ago

Oh, fascinating. How does it work with just a list<string>?

I'll leave some comments on the diff - also, would you mind adding some tests for the interface and one or two that actually hit the server?

colinmarc commented 10 years ago

Ohhh, I see why it's a list<string>. I think I'd like to expose a better interface. Maybe any key other than hadoop_user could be passed into config? so something like:

query.hadoop_user = options.delete(:hadoop_user) if options[:hadoop_user]
query.configuration = options.map do |key, value|
  "#{key}=#{value}"
end

Also, it'd be really nice if we could verify against some list of valid options, but I don't think we can do that without asking the server.

komamitsu commented 10 years ago
query.configuration = options.map do |key, value|
  "#{key}=#{value}"
end

looks good. I'll change it like this.

query.hadoop_user = options.delete(:hadoop_user) if options[:hadoop_user]

I think we might as well separate hadoop_user and configuration according to beeswax.thrift.

komamitsu commented 10 years ago

also, would you mind adding some tests for the interface and one or two that actually hit the server?

okay. I'll do it this weekend.

colinmarc commented 10 years ago

I'd rather not have configuration be a separate param - I know it is separate in the thrift, but it makes a bad interface.

komamitsu commented 10 years ago

Hi @colinmarc , just refactored my pull request based on your comments. Can you look at it?

colinmarc commented 10 years ago

Awesome - this is looking great. Can you write one or two connected tests that actually hit the server, so we can be sure we're setting this correctly? If it's hard to check the result of some configuration, don't worry about doing that, but I'd like to at least make sure everything still works if you pass in configuration.

komamitsu commented 10 years ago

I added a test for query options into test/test_impala_connected.rb and saw that Impalad received the query options properly as follows:

I1105 09:35:36.021066 17227 impala-beeswax-server.cc:133] query(): query=select "foo" AS foo
I1105 09:35:36.021133 17227 impala-beeswax-server.cc:447] query: Query {
  01: query (string) = "select \"foo\" AS foo",
  03: configuration (list) = list<string>[2] {
    [0] = "MEM_LIMIT=1234567890",
    [1] = "MAX_SCAN_RANGE_LENGTH=1073741824",
  },
  04: hadoop_user (string) = "someoneelse",
}
I1105 09:35:36.021191 17227 impala-beeswax-server.cc:460] TClientRequest.queryOptions: TQueryOptions {
  01: abort_on_error (bool) = false,
  02: max_errors (i32) = 0,
  03: disable_codegen (bool) = false,
  04: batch_size (i32) = 0,
  05: num_nodes (i32) = 0,
  06: max_scan_range_length (i64) = 1073741824,
  07: num_scanner_threads (i32) = 0,
  08: max_io_buffers (i32) = 0,
  09: allow_unsupported_formats (bool) = false,
  10: default_order_by_limit (i64) = -1,
  11: debug_action (string) = "",
  12: mem_limit (i64) = 1234567890,
  13: abort_on_default_limit_exceeded (bool) = false,
  14: parquet_compression_codec (i32) = 5,
  15: hbase_caching (i32) = 0,
  16: hbase_cache_blocks (bool) = false,
}

But I think it's impossible to check if Impalad received the expected query options in the test code.

colinmarc commented 10 years ago

Yeah, that was my suspicion. It's still a useful test, though, because at some point in the future Impala could choke or the thrift classes could change or something, and we want to know if everything works smoothly.

komamitsu commented 10 years ago

Well, I think so, too. Then can you merge the branch? Actually, I tied up now and it's not easy to take time for this branch right now. Thanks.

colinmarc commented 10 years ago

Sorry that I keep dropping the ball on this. This looks great! Thanks so much.

komamitsu commented 10 years ago

Thanks!