cloudera / impyla

Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)
Apache License 2.0
727 stars 249 forks source link

SHOW PARTITIONS queries return None for partitions. #31

Closed dacjames closed 9 years ago

dacjames commented 9 years ago

My team has noticed a bug in processing SHOW PARTITIONS queries.

When executing the query in impala-shell, these queries work as expected:

[our.impala.server.url:21000] > show partitions dcollins.table_name;
Query: show partitions dcollins.table_name
+-------+-------+-----+-------+--------+--------+--------------+---------+
| year  | month | day | #Rows | #Files | Size   | Bytes Cached | Format  |
+-------+-------+-----+-------+--------+--------+--------------+---------+
| 2014  | 9     | 25  | -1    | 1      | 3.94MB | NOT CACHED   | PARQUET |
| Total |       |     | -1    | 1      | 3.94MB | 0B           |         |
+-------+-------+-----+-------+--------+--------+--------------+---------+

However, executing the same query through the impyla API, the partition columns (year, month, day) all come back as None.

conn = impala.dbapi.connect(host=our.impala.server.url, port=21050)
cursor = conn.cursor()
cursor.execute('show partitions dcollins.table_name')
for row in cursor:
    print row

# prints:
# (None, None, None, -1, 1, '3.94MB', 'NOT CACHED', 'PARQUET')
# (None, None, None, -1, 1, '3.94MB', '0B', '')
# should print:
# (2014, 9, 25, -1, 1, '3.94MB', 'NOT CACHED', 'PARQUET')
# ('Total', None, None, -1, 1, '3.94MB', '0B', '')
# or something like that.

From digging through the code, it appears that the data is already broken when coming back in the Thrift response so I have been unable to fix this issue myself. This is a problem for our team because we have not been able to figure out how to reliably determine what partitions have been added to Impala.

We see the same behavior running SHOW PARTITIONS on any of our tables but for reference the Impala Table in this example looks like (sensative names removed):

+----------------------------------------------------------------------------+
| result                                                                     |
+----------------------------------------------------------------------------+
| CREATE TABLE dcollins.table_name (                                         |
|   column_1 TIMESTAMP,                                                      |
|   column_2 STRING,                                                         |
|   column_3 STRING,                                                         |
|   column_4 STRING                                                          |
| )                                                                          |
| PARTITIONED BY (                                                           |
|   year INT,                                                                |
|   month INT,                                                               |
|   day INT                                                                  |
| )                                                                          |
| STORED AS PARQUET                                                          |
| LOCATION 'hdfs://our.namenode.url:8020/warehouse/dcollins.db/table_name'   |
| TBLPROPERTIES ('transient_lastDdlTime'='1412106677')                       |
+----------------------------------------------------------------------------+

EDIT: I have tried both the release 0.8.1 version and the latest git master.

laserson commented 9 years ago

Thanks for reporting this. I can reproduce the issue as well. I'll update soon.

laserson commented 9 years ago

The Impala shell uses beeswax, while the default for impyla is HiveServer2. This is probably a problem with the server-side implementation of HiveServer2, which I'll look into more and/or file a Jira.

As a temporary workaround, if you're willing to use impyla's latest git master, then you can choose whether to use HiveServer2 or beeswax, e.g.,

conn = connect(host='impalad.host', port=21000, protocol='beeswax')

Note that the port must be set to the beeswax service (same port as the impala shell). Also note that beeswax is (currently) considerably faster than HiveServer2. However, it hasn't been tested as much, and beeswax is not as feature-rich as HS2, so it's possible that some things might break. Either way, that may be a suitable solution for the moment.

laserson commented 9 years ago

Tracking this here: https://issues.cloudera.org/browse/IMPALA-1330

dacjames commented 9 years ago

Thank you, @laserson. The workaround you suggested works for now and we'll follow the impala jira for a more permanent fix.

henryr commented 9 years ago

I fixed this for Impala 2.0, which will be available pretty soon.

laserson commented 9 years ago

Thanks, @henryr! I'll close this issue.