dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/

`TypeError: data type 'varchar(3)' not understood` #197

Open · lucharo opened this issue 3 years ago

lucharo commented 3 years ago

I am unable to load an ORC table with varchar columns, even though varchar is supported (or that is my understanding from here). The table I am trying to read is stored in ORC format, and the query I am running is a simple `select * from table`. I am using a pyhive cursor to read the data; I can read text tables from HDFS without problems, but I am struggling with ORC tables. Find the full stack trace below:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-341970b7c111> in <module>
----> 1 query_dask('select * from vca_europe.tcaf_iss_cs')

~/Internal/multiverse/visamultiverse/dasksql.py in query_dask(query, c, output_type, verbose, **kwargs)
    136         ## 2. Next, get table locations and create tables
    137         for table in tables:
--> 138             create_dask_table(table,c = c, verbose = verbose)
    139 
    140         ## 3. Replace . by _ in table definitions in query

~/Internal/multiverse/visamultiverse/dasksql.py in create_dask_table(table, c, verbose, **kwargs)
    103         if verbose: print(f'Creating table {table} in Dask-SQL Context...')
    104 
--> 105         c.create_table(c_table, cursor, hive_table_name = table, **kwargs)
    106 
    107     def query_dask(query,

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/dask_sql/context.py in create_table(self, table_name, input_table, format, persist, **kwargs)
    198             format=format,
    199             persist=persist,
--> 200             **kwargs,
    201         )
    202         self.tables[table_name.lower()] = dc

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/dask_sql/input_utils/convert.py in to_dc(cls, input_item, table_name, format, persist, **kwargs)
     52             table = dd.concat([filled_get_dask_dataframe(item) for item in input_item])
     53         else:
---> 54             table = filled_get_dask_dataframe(input_item)
     55 
     56         if persist:

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/dask_sql/input_utils/convert.py in <lambda>(*args)
     46         """
     47         filled_get_dask_dataframe = lambda *args: cls._get_dask_dataframe(
---> 48             *args, table_name=table_name, format=format, **kwargs,
     49         )
     50 

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/dask_sql/input_utils/convert.py in _get_dask_dataframe(cls, input_item, table_name, format, **kwargs)
     70             ):
     71                 return plugin.to_dc(
---> 72                     input_item, table_name=table_name, format=format, **kwargs
     73                 )
     74 

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/dask_sql/input_utils/hive.py in to_dc(self, input_item, table_name, format, **kwargs)
    133                 location = partition_table_information["Location"]
    134                 table = wrapped_read_function(
--> 135                     location, partition_column_information, **kwargs
    136                 )
    137 

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/dask_sql/input_utils/hive.py in wrapped_read_function(location, column_information, **kwargs)
    109 
    110             for col, expected_type in column_information.items():
--> 111                 df = cast_column_type(df, col, expected_type)
    112 
    113             return df

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/dask_sql/mappings.py in cast_column_type(df, column_name, expected_type)
    271     )
    272 
--> 273     casted_column = cast_column_to_type(df[column_name], expected_type)
    274 
    275     if casted_column is not None:

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/dask_sql/mappings.py in cast_column_to_type(col, expected_type)
    298 
    299     logger.debug(f"Need to cast from {current_type} to {expected_type}")
--> 300     return col.astype(expected_type)

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/dask/dataframe/core.py in astype(self, dtype)
   2666             meta = self._meta_nonempty.astype(dtype)
   2667         else:
-> 2668             meta = self._meta.astype(dtype)
   2669         if hasattr(dtype, "items"):
   2670             set_unknown = [

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5875         else:
   5876             # else, only a single dtype is given
-> 5877             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5878             return self._constructor(new_data).__finalize__(self, method="astype")
   5879 

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    629         self, dtype, copy: bool = False, errors: str = "raise"
    630     ) -> "BlockManager":
--> 631         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    632 
    633     def convert(

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    425                     applied = b.apply(f, **kwargs)
    426                 else:
--> 427                     applied = getattr(b, f)(**kwargs)
    428             except (TypeError, NotImplementedError):
    429                 if not ignore_failures:

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    624 
    625         if dtype is not None:
--> 626             dtype = pandas_dtype(dtype)
    627 
    628         # may need to convert to categorical

/projects/gds/chavesrl/condapv/envs/visaverse-gpu/lib/python3.7/site-packages/pandas/core/dtypes/common.py in pandas_dtype(dtype)
   1797     # raise a consistent TypeError if failed
   1798     try:
-> 1799         npdtype = np.dtype(dtype)
   1800     except SyntaxError as err:
   1801         # np.dtype uses `eval` which can raise SyntaxError

TypeError: data type 'varchar(3)' not understood
```
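For what it's worth, the final `TypeError` can be reproduced without Hive or ORC at all: pandas/numpy simply do not accept Hive's `varchar(3)` as a dtype string. A minimal sketch (plain pandas, nothing dask-sql specific):

```python
import pandas as pd

s = pd.Series(["abc", "def"])

# Hive reports the column type as varchar(3); passing that string
# straight to astype ends up in np.dtype(...), which rejects it.
try:
    s.astype("varchar(3)")
except TypeError as err:
    print(err)  # data type 'varchar(3)' not understood

# Casting to a plain string/object dtype works fine instead.
s.astype("object")
```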

Note: I am using a wrapper I have written around dask-sql
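Roughly, the relevant part of the wrapper boils down to something like the following (a simplified sketch; the connection details are placeholders, the table name is the one from the traceback):

```python
from dask_sql import Context
from pyhive import hive

c = Context()

# Placeholder connection details for the Hive server.
cursor = hive.connect("hive-server-host").cursor()

# Register the Hive table in the dask-sql context; dask-sql reads the
# table metadata (including the varchar(3) column type) from the cursor.
c.create_table("tcaf_iss_cs", cursor, hive_table_name="vca_europe.tcaf_iss_cs")

# The failing query is then just a plain SELECT *.
df = c.sql("SELECT * FROM tcaf_iss_cs")
```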

nils-braun commented 3 years ago

Hi @lucharo! Thanks also for this issue. Do you maybe have an example file for me to try out? I am not an expert in ORC files, but I do not see a reason why it should not work (other than a bug), so this is definitely a valid bug report. I will try to put together a test setup myself, but that might take a while.

(Of course I would be very happy to hear about your wrapper ;-))

nils-braun commented 3 years ago

Sorry, one additional question: which version of dask-sql are you using?

lucharo commented 3 years ago

Hello @nils-braun, thanks for the swift reply :) I am running dask-sql 0.3.6 installed via pip. Let me see if I can create a toy ORC file example and send it over to you!
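For the toy file, something along these lines should work (a sketch assuming a pyarrow build with ORC write support; the `varchar(3)` column definition itself would still have to come from the Hive table DDL):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc

# A small table with a short string column that Hive could declare as varchar(3).
df = pd.DataFrame({"code": ["ABC", "DEF", "GHI"], "value": [1, 2, 3]})

# Write it out as ORC (requires a pyarrow version with ORC write support).
orc.write_table(pa.Table.from_pandas(df), "toy.orc")

# Sanity check: read it back.
print(orc.ORCFile("toy.orc").read().to_pandas())
```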

nils-braun commented 3 years ago

@lucharo - did you succeed in creating a test ORC file?