BlazingDB / blazingsql

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
https://blazingsql.com
Apache License 2.0

[BUG] Error when reading table with hive cursor that does not happen with hdfs #1563

Open lucharo opened 3 years ago

Describe the bug

I get the following error when creating a table with a pyhive cursor:

from getpass import getuser

from blazingsql import BlazingContext
from pyhive import hive

# Open a Kerberos-authenticated Hive connection and get a cursor from it.
cursor = hive.Connection(
    host="{hive_edge_node_url}",
    username=getuser(),
    auth="KERBEROS",
    kerberos_service_name="hive",
    configuration={"hive.execution.engine": "tez", "tez.queue.name": "group1"},
).cursor()

bc = BlazingContext()

bc.create_table('bliblu',
                cursor, 
                hive_table_name = 'transuk2m2019_mini',
                hive_database_name = 'chavesrl')

Error:

ERROR: Could not get partition values for file: hdfs://anahnn/visa/user/chavesrl/chavesrl.db/transuk2m2019_mini/000000_0
ERROR: Could not get partition values for file: hdfs://anahnn/visa/user/chavesrl/chavesrl.db/transuk2m2019_mini/000001_0
ERROR: Could not get partition values for file: hdfs://anahnn/visa/user/chavesrl/chavesrl.db/transuk2m2019_mini/000002_0
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-edf72ca4fd46> in <module>
      4     hive_table_name = 'transuk2m2019_mini',
      5     hive_database_name = 'chavesrl',
----> 6     file_format = 'parquet'
      7 )

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/pyblazing/apiv2/context.py in create_table(self, table_name, input, **kwargs)
   2458             ):
   2459                 parsedMetadata = self._parseMetadata(
-> 2460                     file_format_hint, table.slices, parsedSchema, kwargs
   2461                 )
   2462 

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/pyblazing/apiv2/context.py in _parseMetadata(self, file_format_hint, currentTableNodes, schema, kwargs)
   2714         schema["names"] = [i.encode() for i in schema["names"]]
   2715         if "names" in kwargs:
-> 2716             kwargs["names"] = [i.encode() for i in kwargs["names"]]
   2717 
   2718         if self.dask_client:

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/pyblazing/apiv2/context.py in <listcomp>(.0)
   2714         schema["names"] = [i.encode() for i in schema["names"]]
   2715         if "names" in kwargs:
-> 2716             kwargs["names"] = [i.encode() for i in kwargs["names"]]
   2717 
   2718         if self.dask_client:

AttributeError: 'bytes' object has no attribute 'encode'

The table I am trying to read is stored as Parquet, but passing file_format = 'parquet' does not help either. Stepping through with the debugger, I found that i.encode() is called on an i that is already a byte string.
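
The same AttributeError can be reproduced outside BlazingSQL. The snippet below is a minimal sketch, assuming the column names reach the list comprehension already encoded (for example because they were encoded once before). The column names are made up for illustration and it does not call any pyblazing code.

```python
# Hypothetical column names, standing in for the Hive table's schema.
names = ["card_id", "amount"]

# First pass: str -> bytes works as intended.
encoded = [n.encode() for n in names]

# Second pass over the already-encoded list fails, because bytes
# objects have decode() but no encode() in Python 3.
try:
    [n.encode() for n in encoded]
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```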

Expected behavior

Column names are read properly. Perhaps pyblazing could detect that the names are already encoded and skip encoding them again.
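
One way the expected behavior could look is sketched below. The helper name to_bytes is hypothetical and is not part of pyblazing. It only illustrates encoding str values while passing bytes values through unchanged.

```python
def to_bytes(name, encoding="utf-8"):
    """Return name as bytes, encoding it only if it is still a str."""
    if isinstance(name, bytes):
        return name  # already encoded, leave untouched
    if isinstance(name, str):
        return name.encode(encoding)
    raise TypeError(f"expected str or bytes, got {type(name).__name__}")


# Mixed input: one plain string and one already-encoded name.
mixed_names = ["card_id", b"amount"]
normalized = [to_bytes(n) for n in mixed_names]
```

With a guard like this, a list comprehension such as the one at context.py line 2716 would tolerate names that were already encoded. Whether that is the right fix, or whether the names should not be encoded twice in the first place, is for the maintainers to decide.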

Environment overview

BlazingSQL version (git hash): ff4ece0366a4d76bf533baeb03dd03bdfc5232be
BlazingSQL branch name: HEAD
BlazingSQL branch tag: v0.19.0
BlazingSQL build id: 0
BlazingSQL compiler version: GNU /usr/bin/c++ 7.5.0
BlazingSQL cuda flags: -Xcompiler -Wno-parentheses -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 --expt-extended-lambda --expt-relaxed-constexpr -Werror=cross-execution-space-call -Xcompiler -Wall,-Wno-error=deprecated-declarations --default-stream=per-thread -DHT_DEFAULT_ALLOCATOR
BlazingSQL Operating system kernel: Linux-5.4.0-1038-aws
BlazingSQL Operating system architecture: x86_64
BlazingSQL Linux Operating system release: NAME=Ubuntu|VERSION=16.04.7 LTS (Xenial Xerus)|ID=ubuntu|ID_LIKE=debian|PRETTY_NAME=Ubuntu 16.04.7 LTS|VERSION_ID=16.04|HOME_URL=http://www.ubuntu.com/|SUPPORT_URL=http://help.ubuntu.com/|BUG_REPORT_URL=http://bugs.launchpad.net/ubuntu/|VERSION_CODENAME=xenial|UBUNTU_CODENAME=xenial
None
